Priority Reports |
1 Cancer Prevention Studies Branch, 2 Laboratory of Population Genetics, and 3 Laboratory of Pathology, Center for Cancer Research; 4 Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland; 5 Information Management Service, Inc., Silver Spring, Maryland; and 6 Shanxi Cancer Hospital, Taiyuan, Shanxi, People's Republic of China
Requests for reprints: Maxwell P. Lee, Laboratory of Population Genetics, Center for Cancer Research, National Cancer Institute, Building 41, Room D702, 41 Library Drive, Bethesda, MD 20892. Phone: 301-435-8956; Fax: 301-435-8963; E-mail: leemax{at}mail.nih.gov or Philip R. Taylor, Cancer Prevention Studies Branch, Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892. Phone: 301-594-2932; Fax: 301-435-8645; E-mail: ptaylor{at}mail.nih.gov.
| Abstract |
|---|
Key Words: SNP, esophageal cancer, high-risk population, control
| Introduction |
|---|
Two approaches, linkage analysis and association studies, are commonly used to identify susceptibility genes involved in tumorigenesis. Linkage analysis involves genotyping individuals from affected families, whereas association studies use subjects from population-based or family studies. In one example of such an association study, Sun et al. found that polymorphisms in the apoptosis pathway genes Fas and FasL were associated with increased risk of developing esophageal squamous cell carcinoma (ESCC; ref. 11). However, most studies have examined only a few single nucleotide polymorphisms (SNP; refs. 12-14). It is estimated that SNPs occur about once in every 1,000 bp. Several chromosome-wide genotyping studies using high-density SNPs have already been reported (15, 16). Recently, the GeneChip Mapping 10K Array for whole-genome SNP analysis became available (Affymetrix, Inc., Santa Clara, CA), and a few initial reports of allelic imbalance or loss in cancers and cancer cell lines using the 10K SNP array have been published (17-23).
Here, we report the results of a pilot ESCC case-control study using the 10K SNP array. We had two primary aims and one secondary aim in this study. Our primary aims were to identify SNPs and genes that are associated with ESCC and to develop initial approaches appropriate for the analysis and interpretation of genome-wide association studies, including describing the limitations and applications of such studies. Our secondary aim was to begin development of a classification method that combines multiple genotypes and environmental factors to predict susceptibility to ESCC.
| Materials and Methods |
|---|
ESCC patients selected. Patients diagnosed with ESCC between 1998 and 2000 in the Shanxi Cancer Hospital in Taiyuan, Shanxi Province, People's Republic of China and considered candidates for curative surgical resection were identified and recruited to participate in this study. None of the patients had prior therapy and Shanxi was the ancestral home for all. After obtaining informed consent, patients were interviewed to obtain information on demographic and lifestyle cancer risk factors (smoking, alcohol drinking, and family history of cancer) and clinical data. We selected 50 males by identifying the first 25 with a positive family history of esophageal cancer and the first 25 without a family history of esophageal cancer from our roster ordered by study identification number.
Controls. Age-, sex-, and neighborhood-matched controls were selected and evaluated within 6 months of the case being diagnosed. The "neighborhood" in China refers to the residence blocks within communities. The ancestral home for all controls was also in Shanxi Province.
Biological Specimen Collection and Processing
Venous blood (10 mL) was taken from patients before surgery and from controls after interview. Germ line DNA was extracted and purified using standard methods.
GeneChip Mapping 10K Array
The 10K SNP array provides comprehensive coverage of the genome for genotyping studies. Each array contained 11,555 biallelic polymorphic sequences randomly distributed throughout the genome, except for the Y chromosome. The median physical distance between SNPs is 105 kb and the mean distance between SNPs is 210 kb. The average heterozygosity for these SNPs is 0.37, with an average minor allele frequency of 0.25. The algorithm used for making genotype calls was described previously by Affymetrix (24, 25).
Target preparation. DNA samples, including two control DNA samples from Affymetrix, were assayed according to the protocol (GeneChip Mapping Assay manual) supplied by Affymetrix. The procedure was similar to the one described previously (24). Briefly, a total of 250 ng germ line DNA was digested with XbaI and then ligated to the XbaI adaptor before PCR amplification. All of these steps were carried out in the pre-PCR clean room. Cycling was conducted as follows: 95°C for 3 minutes followed by 35 cycles of 95°C for 20 seconds, 59°C for 15 seconds, and 72°C for 15 seconds. Final extension was done at 72°C for 7 minutes (DNA Engine Tetrad PTC-225, MJ Research, Waltham, MA). To evaluate PCR products, 3 µL of each PCR product was mixed with 3 µL of 2x gel loading dye, loaded on a 2% Tris-borate EDTA gel, and run at 120 V for 1 hour to check for the expected product (bands) between 250 and 1,000 bp. After purification and elution of the PCR products using Qiagen MinElute 96 (Qiagen, Valencia, CA), the purified PCR product was quantified by spectrophotometric analysis. A final 20 µg of PCR product was fragmented with DNase I. An aliquot of the fragmented PCR product was run on a 4% Tris-borate EDTA gel at 120 V for 30 minutes to 1 hour. Successful fragmentation was confirmed by the presence of a smear with the darkest region corresponding to 50 to 100 bp. The fragmented PCR product was end labeled with biotin and hybridized to the array. Arrays were incubated at 48°C for 18 hours in the Affymetrix GeneChip system hybridization oven. Microarrays were washed and stained in the GeneChip Fluidics Station 450 (Affymetrix) following the manufacturer's instructions.
Scanning and genotype generation. The 10K SNP arrays were scanned with the Affymetrix GeneChip Scanner 3000 using GeneChip Operating System 1.0 (Affymetrix). Data files were generated automatically. Genotype assignments (i.e., calls) were made automatically by GeneChip DNA Analysis Software 2.0 (Affymetrix). The genetic map used in the analysis was obtained from GeneChip Mapping 10K library files: Mapping10K_Xba131. "Signal Detection Rate" is the percentage of SNPs that pass the discrimination filter. "Call Rate" is the percentage of SNPs called on the array. The genotype calls are defined as AA, AB, or BB; "no call" means the SNP does not pass the discrimination filter.
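The call rate defined above can be computed from one array's genotype calls as in the following minimal sketch. The `NoCall` label and the example calls are hypothetical and are not taken from the GeneChip software output.

```python
from collections import Counter

NO_CALL = "NoCall"  # hypothetical label for a SNP that fails the discrimination filter


def call_rate(calls):
    """Return the percentage of SNPs on one array that received a genotype call."""
    if not calls:
        raise ValueError("no SNPs supplied")
    counts = Counter(calls)
    called = counts["AA"] + counts["AB"] + counts["BB"]
    return 100.0 * called / len(calls)


# Illustrative calls for ten SNPs on a single array (made-up data).
example_calls = ["AA", "AB", "BB", NO_CALL, "AA", "AB", "AA", "BB", "AB", "AA"]
print(f"Call rate: {call_rate(example_calls):.1f}%")
```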
Statistical analyses. All statistical analyses were done using R and S-PLUS packages. We applied the generalized linear model (GLM), as implemented in the function glm, to evaluate the risk associated with each SNP that satisfied Hardy-Weinberg equilibrium (P > 0.01). Three numerical coding schemes were used to represent genotypes: (a) (AA, AB, BB) = (1, 0, 0), (b) (AA, AB, BB) = (1, 1, 0), and (c) (AA, AB, BB) = (1, 0.5, 0). The first scheme corresponds to the assumption that allele A is recessive (equivalently, allele B is dominant), the second scheme assumes that allele A is dominant (equivalently, allele B is recessive), and the third scheme assumes a continuous mode.
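The three coding schemes and the Hardy-Weinberg filter can be illustrated with the short Python sketch below. The text does not say which Hardy-Weinberg test was used, so the 1-df χ² test here is an assumption, and the function names and the `NoCall` label are hypothetical.

```python
import math

# Numerical codes for (AA, AB, BB) under the three schemes described in the text.
CODING = {
    "recessive": {"AA": 1.0, "AB": 0.0, "BB": 0.0},   # allele A recessive
    "dominant": {"AA": 1.0, "AB": 1.0, "BB": 0.0},    # allele A dominant
    "continuous": {"AA": 1.0, "AB": 0.5, "BB": 0.0},  # continuous mode
}


def encode(calls, scheme):
    """Map genotype calls to numbers; any other value (a no-call) becomes None."""
    table = CODING[scheme]
    return [table.get(call) for call in calls]


def hwe_p_value(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test of Hardy-Weinberg proportions (assumed test)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no departure can be tested
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


print(encode(["AA", "AB", "BB", "NoCall"], "continuous"))
print(hwe_p_value(25, 50, 25) > 0.01)  # genotype counts in exact HWE proportions pass
```

A SNP would be kept for modeling only when `hwe_p_value` exceeds 0.01, which matches the filter stated in the text.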
GLM was applied to model the probability of being a case based on each SNP plus five potential explanatory variables, including x1 (family history positive, yes/no), x2 (alcohol use, yes/no), x3 (tobacco use, yes/no), x4 (pickled vegetable consumption, yes/no), and x5 (age, continuous):
[Equation images missing in source]
Models were evaluated with a χ² goodness-of-fit test, in which the χ² statistic is D0 - D1 with 3 df. To account for multiple comparisons, we used the Bonferroni-adjusted significance level to select our GLMs.
We used principal components analysis (PCA) to visualize similarity and variability among individuals. We applied PCA to each of the three numerical genotype coding schemes for all 100 case/control samples. The 100 samples were projected into the space defined by the first and second principal components. When case and control samples form two clusters in this space, one or two principal components can be used to construct a classifier that separates cases from controls. The classifier was based on the genotypes of selected SNPs, and its performance was evaluated by accuracy = (Tp + Tn) / 100, sensitivity = Tp / (Tp + Fn), and specificity = Tn / (Fp + Tn), where Tp and Tn are the numbers of true positives and true negatives and Fp and Fn are the numbers of false positives and false negatives. The odds ratio of the classifier is defined as (Tp × Tn) / [(50 - Tp) × (50 - Tn)]. Although developing and testing predictors using the same data is acknowledged to result in upward bias of predictor estimates (i.e., sensitivity, specificity, and accuracy), we calculated these values as a frame of reference only and not for clinical application without further confirmation (26).
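The performance measures defined above can be written as a small function. This is a minimal sketch: the function name is hypothetical, and the counts in the example are illustrative values, not results from this study.

```python
def classifier_metrics(tp, tn, n_cases=50, n_controls=50):
    """Return accuracy, sensitivity, specificity, and odds ratio for a classifier.

    tp: cases correctly classified as cases.
    tn: controls correctly classified as controls.
    """
    fn = n_cases - tp      # cases classified as controls
    fp = n_controls - tn   # controls classified as cases
    accuracy = (tp + tn) / (n_cases + n_controls)
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    # Odds ratio Tp * Tn / [(50 - Tp) * (50 - Tn)]; infinite when either error count is 0.
    odds_ratio = (tp * tn) / (fn * fp) if fn and fp else float("inf")
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "odds_ratio": odds_ratio,
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classifier_metrics(tp=40, tn=45)
print(metrics)
```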
| Results and Discussion |
|---|
We first compared cases and controls for each of the 10,264 SNPs individually using multivariate analyses in the GLM, assuming each of the three modes of transmission described above (i.e., recessive, dominant, and continuous). Potential explanatory variables that might influence the analysis were adjusted for in the GLM. Because 10,264 separate analyses were done, multiple comparisons were a major concern. We corrected for multiple comparisons using Bonferroni-adjusted significance levels, which, for 10,264 analyses, means that we accepted as significant only P values < 4.87187e-06 (which corresponds to a single test with an α level of 0.05). Using multivariate GLMs with Bonferroni adjustment as described, we identified 37 statistically significant SNPs under the recessive transmission mode assumption, 48 SNPs for the dominant mode, and 53 SNPs assuming a continuous mode.
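The Bonferroni adjustment amounts to dividing the single-test α by the number of tests. In the sketch below, the function names and the toy P values are hypothetical. Dividing 0.05 by 10,264 gives approximately 4.87e-06, close to the threshold reported above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests


def significant_snps(p_values, alpha=0.05):
    """Return the SNP identifiers whose P value is below the adjusted threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [snp for snp, p in p_values.items() if p < threshold]


# Threshold for the number of SNPs analysed in the text.
threshold_10k = bonferroni_threshold(0.05, 10264)
print(f"{threshold_10k:.3e}")

# Toy example with four made-up SNPs; the adjusted threshold is 0.05 / 4 = 0.0125.
toy_p_values = {"snp_a": 0.001, "snp_b": 0.02, "snp_c": 0.0124, "snp_d": 0.5}
print(significant_snps(toy_p_values))
```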
A secondary aim of this study is to eventually develop a method to predict individual risk of ESCC based on genotypes and explanatory variables. To begin approaching this aim, we combined the 37 SNPs selected from the recessive mode GLM to classify samples using PCA (Fig. 1A). With few exceptions, the cases and controls were clearly separated into two different clusters. As a comparison, we also did a PCA using all available SNPs for which there were no missing genotype data (n = 3,369 SNPs; Fig. 1B). The PCA using all available SNPs showed no segregation between cases and controls, which indicates that cases and controls came from the same population and that there were no major genotype differences between cases and controls at the population level. Given the good separation between cases and controls in the PCA using the 37 SNPs identified from the GLM in the recessive mode, we developed a classifier to predict individual risk of esophageal cancer. Our classifier was defined by the first principal component (PC1), which contains weighted combinations of genotypes from these 37 SNPs. A person was classified as a case if PC1 was ≤0 or a control if PC1 was >0. Using PC1, we were able to correctly classify 46 of 50 cases and 47 of 50 controls. The accuracy, sensitivity, and specificity for this PCA classification were 0.93, 0.94, and 0.92, respectively (Table 2), and the odds ratio for being a case was 180.2. Similar results were also obtained when SNPs selected from the dominant or continuous mode GLMs were used (Table 2). We also did PCA loading analyses to assess discrimination when smaller numbers of SNPs were used for classification. This analysis indicated that we could predict individual cancer risk using just 10 SNPs with an overall accuracy of 80%, sensitivity of 76%, and specificity of 84%; the odds ratio for these 10 SNPs was 16.6 (Table 2). We also did permutation tests (1,000 tests) using a randomly selected two thirds of the samples for training and the remaining one third for testing in the PCA analysis. The permutation tests indicated that our PCA classification can be generalized. Hierarchical cluster analysis using the 37 SNPs selected from the GLMs in recessive mode was also able to classify cases and controls with similar performance (data not shown).
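A PC1-based classifier with a two-thirds/one-third split can be sketched as follows. This is an illustration on simulated genotype codes, not the authors' implementation: the genotype frequencies, function names, and single random split are assumptions, and power iteration stands in for whichever PCA routine was actually used. Because the sign of a principal component is arbitrary, the sketch orients PC1 on the training set so that cases tend to score lower.

```python
import random


def first_pc(rows, rng, n_iter=100):
    """Return (column means, unit loadings) of the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for _ in range(n_iter):
        # One power-iteration step on X^T X: v <- X^T (X v), then normalise.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(m)]
        norm = sum(w * w for w in v) ** 0.5
        if norm == 0.0:
            raise ValueError("data have no variance")
        v = [w / norm for w in v]
    return means, v


def pc1_scores(rows, means, loadings):
    """Project rows onto PC1 after centering with the training means."""
    return [sum((x - mu) * w for x, mu, w in zip(row, means, loadings)) for row in rows]


def simulate(n_per_group, n_snps, rng):
    """Simulated recessive-coded genotypes; the frequencies 0.8 and 0.2 are made up."""
    rows, labels = [], []
    for label, freq in (("case", 0.8), ("control", 0.2)):
        for _ in range(n_per_group):
            rows.append([1.0 if rng.random() < freq else 0.0 for _ in range(n_snps)])
            labels.append(label)
    return rows, labels


rng = random.Random(0)
rows, labels = simulate(50, 20, rng)

# Random split: two thirds of the samples for training, one third for testing.
order = list(range(len(rows)))
rng.shuffle(order)
cut = 2 * len(order) // 3
train, test = order[:cut], order[cut:]

means, loadings = first_pc([rows[i] for i in train], rng)

# Orient PC1 so that training cases have the lower mean score.
train_scores = pc1_scores([rows[i] for i in train], means, loadings)
case_scores = [s for s, i in zip(train_scores, train) if labels[i] == "case"]
control_scores = [s for s, i in zip(train_scores, train) if labels[i] == "control"]
if sum(case_scores) / len(case_scores) > sum(control_scores) / len(control_scores):
    loadings = [-w for w in loadings]

# Classify held-out samples: case if PC1 <= 0, control if PC1 > 0.
test_scores = pc1_scores([rows[i] for i in test], means, loadings)
predicted = ["case" if s <= 0 else "control" for s in test_scores]
test_accuracy = sum(p == labels[i] for p, i in zip(predicted, test)) / len(test)
print(f"Held-out accuracy on simulated data: {test_accuracy:.2f}")
```

Repeating the split many times with different seeds and averaging the held-out accuracy would approximate the repeated-split procedure described in the text.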
| Acknowledgments |
|---|
We thank Jenny Kelley for critical reading of the article.
Received 9/7/04. Revised 11/17/04. Accepted 1/18/05.
| References |
|---|
gene that are associated with susceptibility to myocardial infarction. Nat Genet 2002;32:650-4.