Given that there are millions of single-nucleotide polymorphisms (SNPs) in the entire human genome, a major difficulty faced by scientists in planning costly population-based genotyping is to choose target SNPs that are most likely to affect phenotypic functions and ultimately contribute to disease development. Although it is widely accepted that sequences with important functionality tend to be less variable across species because of selective pressure, to what extent evolutionary conservation is mirrored by epidemiological outcome has never been demonstrated. In this study, we surveyed odds ratios detected for 46 SNPs in 39 different cancer-related genes from 166 molecular epidemiological studies. The conservation levels of the amino acids affected by these SNPs were calculated as a tolerance index by comparing sequences from different species. Our results provide evidence of a significant relationship between the detected odds ratios associated with cancer risk and the conservation levels of the SNP-affected amino acids (P = 0.002; R² = 0.06). Tolerance indices were further calculated for 355 nonsynonymous SNPs identified in 90 human DNA repair genes, of which 103 caused amino acid changes in highly conserved positions. Our findings support the concept that SNPs altering the conserved amino acids are more likely to be associated with cancer susceptibility. Using such a molecular evolutionary approach may hold great promise for prioritizing SNPs to be genotyped in future molecular epidemiological studies.
In this post-genomic era, single-nucleotide polymorphisms (SNPs) are being intensively studied to understand the biological basis of complex traits and diseases. Besides numerous ongoing efforts to identify millions of these SNPs, there is now also a focus on studying associations between disease risk and these genetic variations using a molecular epidemiological approach. Indeed, it has been estimated that each person has about 24,000–40,000 nonsynonymous SNPs and that there are 100,000–300,000 nonsynonymous SNPs constituting about 1% of the total SNPs in the entire human genome (1). This plethora of SNPs points out a major difficulty faced by scientists in planning costly population-based genotyping, which is to choose target SNPs that are most likely to affect phenotypic functions and ultimately contribute to disease development. Currently, most molecular epidemiological studies are focusing on SNPs located in coding and regulatory regions, yet many of these studies have been unable to detect significant associations between SNPs and disease susceptibility.
To develop a coherent approach for prioritizing SNPs for genotyping in molecular epidemiological studies, we applied an evolutionary perspective to SNP screening. We correlated findings from molecular epidemiological studies of cancer with the evolutionary conservation levels of nonsynonymous SNPs using a sequence homology-based tool. Our hypothesis was that amino acids conserved across species are more likely to be functionally significant; therefore, SNPs that change these amino acids might be more likely to be associated with cancer susceptibility.
MATERIALS AND METHODS
For this analysis, we used the keyword combinations “polymorphism + cancer” and “genotype + cancer” limited to the field of title/abstract to search the literature for epidemiological studies investigating SNPs in the PubMed database. 1 The studies selected for our analysis had to meet the following criteria: (a) be case-control studies; (b) examine SNPs in the coding region that altered amino acids; and (c) investigate the association between SNPs and cancer risk. If a study reported ORs for heterozygous and homozygous variant alleles separately, we chose the overall adjusted odds ratio (OR) for the homozygous variant allele.
The conservation level of a particular position in a protein was determined by using a sequence homology-based tool, SIFT, which sorts intolerant from tolerant amino acid substitutions (2). Briefly, SIFT searches for similar protein sequences from different species in the database, obtains the multiple alignments of these sequences, and then calculates from the alignment the tolerance index (from 0 to 1) for all possible substitutions at each position. The sequences in the alignment are restricted to those homologous sequences that are available in the protein database; therefore, the resulting alignment information is expected to vary from protein to protein. The higher a tolerance index, the less functional impact a particular amino acid substitution is likely to have. We then correlated the tolerance index of an amino acid change caused by a SNP to the overall OR of the corresponding variant genotype detected in molecular epidemiological studies of cancer. If a SNP had been found to be protective (OR < 1) in a study, we re-expressed the OR in terms of the risk genotypes (OR > 1) to facilitate the comparisons.
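The re-expression of protective associations can be illustrated with a minimal sketch. It assumes that re-expression means taking the reciprocal of the OR, which is consistent with the example given later in the text (0.3 converted to 3.3); the function name `risk_oriented_or` is hypothetical and not part of SIFT or of the original analysis.

```python
def risk_oriented_or(odds_ratio: float) -> float:
    """Re-express an OR so that it refers to the risk genotype (OR >= 1).

    Assumption: a protective OR is inverted by taking its reciprocal.
    """
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    # Protective associations (OR < 1) are inverted; others are unchanged.
    return 1.0 / odds_ratio if odds_ratio < 1 else odds_ratio


print(round(risk_oriented_or(0.3), 1))  # 3.3
print(risk_oriented_or(1.8))            # 1.8
```

After this step every OR is at least 1, so a larger value always means a stronger association regardless of which allele was originally treated as the reference.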
We also estimated tolerance indices for SNPs in human DNA repair and repair-related genes included in two popular databases 2 that contain up-to-date SNP information for the DNA repair and repair-related genes being sequenced. The definition of conserved SNPs was based on the tolerance index, a score representing the normalized probability that the amino acid change is tolerated. SIFT predicts substitutions with scores of <0.05 as deleterious. However, we included SNPs with scores ≤ 0.1 in this study to provide more SNPs for exploration.
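The two cutoffs described above can be sketched as a simple classification rule. The function name, the category labels, and the example scores are illustrative assumptions; only the thresholds (<0.05 and ≤ 0.1) come from the text.

```python
SIFT_DELETERIOUS_CUTOFF = 0.05  # SIFT: scores below this are predicted deleterious
STUDY_CONSERVED_CUTOFF = 0.1    # this study: scores at or below this count as conserved


def classify_tolerance_index(score: float) -> str:
    """Label a tolerance index (0 to 1) using the two cutoffs from the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("a tolerance index must lie between 0 and 1")
    if score < SIFT_DELETERIOUS_CUTOFF:
        return "conserved (deleterious by SIFT cutoff)"
    if score <= STUDY_CONSERVED_CUTOFF:
        return "conserved (study cutoff only)"
    return "tolerated"


# Hypothetical scores, chosen only to exercise each branch.
for score in (0.01, 0.08, 0.45):
    print(score, classify_tolerance_index(score))
```

The middle category covers the SNPs that the relaxed cutoff adds for exploration beyond those SIFT itself would flag.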
Spearman’s rank correlation coefficient was calculated to assess the correlation between ORs and tolerance indices. Linear regression analysis was performed to further quantify the relationship between these two variables. Because small studies tend to report larger ORs due to publication bias, we applied weighted linear regression by specifying sample size as a weighting variable in the model. All statistical analyses were done using the SAS system (version 8.0). All tests were two-sided with a significance level of 0.05.
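The two statistics named above can be sketched in plain Python. This is not the SAS code used in the study: it assumes that weighting by sample size corresponds to ordinary weighted least squares (minimizing the weighted sum of squared residuals), and all function names and data values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def weighted_linear_fit(x, y, w):
    """Weighted least-squares line y = intercept + slope * x with weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data: tolerance index, OR, and study sample size.
tolerance = [0.0, 0.05, 0.3, 0.6, 1.0]
odds_ratios = [2.5, 1.9, 1.6, 1.2, 1.1]
sample_sizes = [100, 250, 400, 900, 3000]

print(spearman(tolerance, odds_ratios))
print(weighted_linear_fit(tolerance, odds_ratios, sample_sizes))
```

Weighting by sample size lets large studies dominate the fitted slope, which is the stated reason for using it against inflated ORs from small studies.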
RESULTS
By using both keywords “polymorphism” and “cancer” limited to the field of title and abstract, we identified 3386 articles in the database from the literature search. A similar search using the keywords “genotype” and “cancer” gave us 1836 articles. Following the selection criteria, we chose 165 case-control studies and 1 cohort study with a nested case-control component on SNPs that resulted in amino acid changes in our analysis (Table 1). These studies examined a total of 46 SNPs in 39 different genes for risk estimates at 16 different cancer sites. All of these SNPs were located in cancer-related genes, such as those involved in DNA repair, metabolism, and cell cycle checkpoints.
There was considerable variation in sample sizes of these studies, ranging from 50 cases and 50 controls to 1534 cases and 1504 controls. The majority of these studies investigated SNPs in a specific ethnicity, including Asian, Caucasian, African American, Jewish, Brazilian, Portuguese, and Egyptian, whereas a few others included ethnically heterogeneous populations. The 16 cancer types included lung, esophageal, colon, breast, ovarian, head and neck, gastric, oral, prostate, endometrial, hepatocellular, stomach, cervical, bladder, skin, and non-Hodgkin’s lymphoma. The overall ORs retrieved from these studies were adjusted by different factors across studies. However, age and gender were the two most common factors adjusted, whereas smoking status was adjusted for in most of the smoking-related cancers. There are a variety of other confounding factors adjusted in some specific studies, such as family history of cancer, family income, menopausal status, and so forth.
Overall, there was a significant inverse correlation between the ORs and tolerance indices (Spearman rank correlation coefficient r = −0.247; P = 0.001). Further analysis revealed a significant linear relationship between these two variables, with a regression coefficient of −0.763 (R² = 0.06; P = 0.002). The OR showed a negative correlation with sample size (r = −0.27; P = 0.0004), which might be due to publication bias, with the smaller studies reporting higher ORs. The distribution of ORs was skewed, and log2 transformation resulted in an approximately normal distribution. Specifically, the skewness and kurtosis were 4.81 and 35.47 for the untransformed ORs and 1.29 and 2.47 for the log-transformed ORs, respectively. We therefore also performed linear regression analysis of the log-transformed ORs and tolerance index weighted by sample size, an approach that gives the most weight to the largest and therefore most reliable studies. One cohort study and one case-control study with an unspecified sample size were not included in the weighted regression analyses. Again, we found a highly significant result (regression coefficient = −0.448; R² = 0.08; P = 0.0002). Moreover, the weighted mean ORs from different studies examining the same SNPs were calculated and correlated with the corresponding tolerance indices. This approach revealed similar results.
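The effect of the log2 transformation on the shape of the OR distribution can be sketched as follows. The sketch uses simple moment-based estimators of skewness and excess kurtosis; SAS reports bias-corrected versions, so its values would differ somewhat on the same data. The function names and the OR values are hypothetical.

```python
from math import log2


def central_moment(values, order):
    """Mean of (value - mean) ** order over the sample."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness; 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical right-skewed ORs: most near 1, one large outlier.
odds_ratios = [1.0, 1.1, 1.2, 1.5, 2.0, 8.0]
log_ors = [log2(value) for value in odds_ratios]

print("raw :", skewness(odds_ratios), excess_kurtosis(odds_ratios))
print("log2:", skewness(log_ors), excess_kurtosis(log_ors))
```

Because the logarithm compresses large values more than small ones, the outlying OR pulls the transformed distribution much less, which is why the regression was repeated on the log scale.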
Fig. 1 shows the linear relationship between the log2 ORs and the tolerance index. When we included ethnicity as a covariate using three coding groups (white, Asian, and other), we found no significant effect of ethnicity and again a highly significant correlation between the log-transformed OR and the tolerance index.
To apply our finding in prioritizing SNPs for future genotyping in molecular cancer epidemiological studies, we also estimated tolerance indices for SNPs in human DNA repair and repair-related genes. In 90 repair-related genes included in the above two SNP databases, we found 355 SNPs that caused amino acid changes. After calculating a tolerance index for each of them, we identified 103 conserved SNPs (with tolerance index ≤ 0.1; Table 2). After categorizing these genes by different repair pathways, we found 16 SNPs located in nucleotide excision repair genes, 26 located in base excision repair genes, 5 located in nonhomologous end-joining genes, 5 located in homologous recombination genes, 9 located in mismatch repair genes, 14 located in DNA polymerase genes, and 28 located in other genes involved in repair pathways.
DISCUSSION
Information from interspecific alignments of homologous genes has long been recognized as an important factor for understanding contemporary deleterious genetic variation in human disease. Silent mutations are frequently observed and randomly distributed among species. In contrast, a certain fraction of amino acids in a given set of homologous genes is conserved even among distantly related species that diverged hundreds of millions of years ago. Variations that arose at such conserved positions have evidently been under strong selective constraints and eliminated from populations throughout long-term evolutionary history, suggesting critical roles for the existing invariant sequences in biological functions. In a recent study, SIFT was used to predict the damaging effects of SNPs in several human disease-related genes. The damaging effects of several SNPs in melanocyte-stimulating hormone receptor and methylenetetrahydrofolate reductase genes have been supported by evidence showing an association with human cancers (2). Therefore, such information obtained from homologous sequence alignments may be used to predict amino acid residues in gene products that have a negative impact on protein function and are likely to cause disease if mutated in humans.
Although sequences with important functionality tend to be less variable across species because of selective pressure, to what extent evolutionary conservation is mirrored by epidemiological outcome has never been demonstrated. Results from our analyses suggest that the degree of cancer risk influenced by a particular SNP in a molecular epidemiological study was significantly associated with the conservation level of the amino acid it affected. It should be noted that some of the SNP genotypes included in Table 1 are associated with decreased cancer risks (ORs < 1). This indicates that not all SNPs are harmful, and some of them may be beneficial. However, when the 13 studies with inverted ORs were removed from the analysis, the overall result remained unchanged. Therefore, the inversion of ORs for protective associations (e.g., 0.3 converted to 3.3) in our analysis does not seem to affect the significance of the association between genetic variation and the cancer risk estimate. Our study provides the first evidence demonstrating a linear relationship between the detected ORs associated with cancer risk and the conservation levels of the SNP-affected amino acids. These findings confirm earlier observations that substitutions observed frequently during evolution are likely to be neutral, whereas those observed rarely are likely to be deleterious from an evolutionary perspective. A similar approach was recently used to identify potentially functional missense changes in the breast cancer susceptibility gene BRCA1 by aligning sequences from 57 eutherian mammals. It was found that most conserved residues occur in a region with the highest concentration of protein-interacting domains. In addition, a total of 38 of the 139 missense changes in the conserved region are likely to affect protein function and might be targets of further investigation in studying breast cancer susceptibility (3).
It is becoming clear that application of the molecular evolutionary approach may be a powerful tool for prioritizing SNPs to be genotyped in future molecular epidemiological studies.
This approach further allows us to identify some target SNPs in DNA repair pathways that may be worth examining in future molecular epidemiological studies. DNA repair is one of the frontline defenses against human cancer, and more than 125 genes are thought to be involved in different DNA repair systems (4). Some of the SNPs selected based on our analysis have been detected as genetic risk factors in cancer case-control studies (Table 1), whereas most of them have not been investigated before. Therefore, our analysis will provide useful information in selecting SNPs that are likely to have potential functional impact and ultimately contribute to an individual’s cancer susceptibility.
However, there are several limitations to our analysis, even though the correlation we observed is readily apparent. Because some DNA repair and metabolic genes only trigger disease in the presence of certain environmental exposures such as tobacco smoke carcinogens, the overall ORs without stratification or adjustment for important environmental factors used in the analysis may not reflect a true association between some conserved SNPs with functional impact and the diseases. Next, the accuracy with which a SNP predicts cancer risk depends on the alignment obtained. In our analysis, the amino acid sequences from different species in the alignment were restricted to those homologous sequences that were available in the protein database; therefore, the resulting alignment information varied from protein to protein. In our analysis, the number of homologous sequences used in the SIFT alignment ranged from 15 to over 100. As protein databases grow with data from sequencing whole genomes of more organisms, a larger number of homologous sequences will become available, and SIFT prediction will become more accurate. The final limitation is that our analysis did not include many SNPs that are located outside the coding regions of genes. Those SNPs located in the regulatory regions will be constrained to some extent under selective pressure and may be equally important in disease risk estimates. However, given our limited knowledge of regulatory signals in the genome, it is difficult to recognize such SNPs from the large pool of noncoding SNPs.
Nevertheless, our finding does reveal a close association between conservation levels and risk estimates, despite variations among protein families attributable to different evolutionary pressures and the heterogeneous set of sequence alignments used. Given that hundreds of thousands of SNPs are estimated to exist in the human population, the number of SNPs screened for association with disease can be greatly reduced by identifying those most likely to alter gene function. Our current analysis focuses on SNPs in the coding regions, and our findings could explain a significant fraction of the cancer risk that has been detected. This approach might also be applied to a relationship between SNP conservation levels and epidemiological studies of diseases other than cancer. More importantly, this study builds a bridge from evolutionary biology to molecular epidemiology, which may further our understanding of disease-related SNPs and ultimately facilitate SNP genotyping in future studies.
Grant support: National Cancer Institute Grants CA 74880, CA 91846, CA 85576, CA 86390 and HG 02275
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Y. Zhu is currently at the Department of Epidemiology and Public Health, Yale University, New Haven, Connecticut.
Requests for reprints: Xifeng Wu, Department of Epidemiology, Unit 189, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, Texas 77030. Phone: (713) 745-2485; Fax: (713) 792-0807; E-mail:
2 http://www.genome.utah.edu/genesnps and http://greengenes.llnl.gov/dpublic/secure/reseq/bin/gene_info.
- Received September 4, 2003.
- Revision received December 1, 2003.
- Accepted January 13, 2004.
- ©2004 American Association for Cancer Research.