DNA pooling in combination with high-throughput sequencing was done as a part of the Sequenom-Genefinder project. In the pilot study, we tested 83,715 single nucleotide polymorphisms (SNP), located primarily in gene-based regions, to identify polymorphic susceptibility variants for lung cancer. For this pilot study, 369 male cases and 287 controls of both sexes (white Europeans of Southern German origin) were analyzed. The study identified a candidate region in 22q12.2 that contained numerous SNPs showing significant case-control differences and that coincides with a region that was shown previously to be frequently deleted in lung cancer cell lines. The candidate region overlies the seizure 6-like (SEZ6L) gene. The pilot study identified a polymorphic Met430Ile substitution in the SEZ6L gene (SNP rs663048) as the top candidate for a variant modulating risk of lung cancer. Two replication studies were conducted to assess the association of SNP rs663048 with lung cancer risk. The M. D. Anderson Cancer Center study included 289 cases and 291 controls matched for gender, age, and smoking status. The Liverpool Lung Project (a United Kingdom study) included 248 cases and 233 controls. Both replication studies showed an association of the rs663048 with lung cancer risk. The homozygotes for the variant allele had more than a 3-fold risk compared with the wild-type homozygotes [combined odds ratio (OR), 3.32; 95% confidence interval (95% CI), 1.81–7.21]. Heterozygotes also had a significantly elevated risk of lung cancer from the combined replication studies with an OR of 1.15 (95% CI, 1.04–1.59). The effect remained significant after adjusting for age, gender, and pack-years of tobacco smoke. We also compared expression of SEZ6L in normal human bronchial epithelial cells (n = 7), non–small cell lung cancer (NSCLC; n = 52), and small cell lung cancer (SCLC; n = 22) cell lines by using Affymetrix HG-U133A and HG-U133B GeneChips. 
We found that the average expression level of SEZ6L in NSCLC cell lines was almost two times higher and in SCLC cell lines more than six times higher when compared with normal lung epithelial cell lines. Using the National Center for Biotechnology Information Gene Expression Omnibus database, we found a ∼2-fold elevated and statistically significant (P = 0.004) level of SEZ6L expression in tumor samples compared with normal lung tissues. In conclusion, the results of these studies representing 906 cases compared with 811 controls indicate a role of the SEZ6L Met430Ile polymorphic variant in increasing lung cancer risk. [Cancer Res 2007;67(17):8406–11]
Although up to 90% of lung cancers are attributable to smoking, only a small fraction of smokers develop lung cancer over their lifetimes (1), suggesting that genetic variation may contribute to lung cancer susceptibility (2). Results of segregation analyses suggest that rare autosomal dominant polymorphisms may explain susceptibility to early-onset lung cancer; however, only a minority of lung cancer cases can be explained by the presence of such variants (3–6). There is also a rapidly expanding body of literature on the association of common, low-penetrance genes with lung cancer risk (5–8). According to the latest update of the Cancer Genetics Web database, more than 40 genes may be involved in susceptibility to, and progression of, lung cancer.
Identification of genetic factors modulating lung cancer risk requires a combination of effective genotyping technologies with an appropriate and efficient study design. Sequenom (San Diego, CA) has developed a DNA analysis platform, capable of high-throughput genotyping with pooled DNA allele frequency analysis. Using this approach, Sequenom implemented a Genetics Discovery platform with dense genome-wide single nucleotide polymorphism (SNP) markers ( 7, 8). A hypothesis-free approach using allele frequency estimates of many thousands (for lung cancer, 83,715 SNPs) of SNPs was used as a first step (pilot study) in identifying potentially relevant genetic variants. Significant SNPs identified in this first step were then individually genotyped and validated in replication studies using independent samples. The efficiency of this strategy has been shown by the rediscovery of genes shown previously to be involved in several common diseases ( 8– 10). The purpose of the current study was to implement this strategy to identify genetic variation modulating lung cancer risk.
Materials and Methods
Sequenom-Genefinder pilot study (Southern Germany). Lung cancer cases for the pilot study were recruited from the Departments for Respiratory Medicine and Thoracic Surgery, Schillerhöhe Specialist Hospital (Stuttgart-Gerlingen, Germany) and the Department for Respiratory Medicine, Asklepios Specialist Hospitals (Munich-Gauting, Germany). Controls were sampled from patients with nonmalignant disease at the same hospitals. The final sample consisted of 369 male cases and 287 controls of both genders, all with a positive history of tobacco smoke exposure. A total of 83,715 SNPs, mostly in gene-based regions, were used in the analysis. Epidemiologic data were collected by personal interview. Table 1 provides a description of cases and controls used in the pilot study.
The M. D. Anderson Cancer Center replication study (United States). Cases and controls for the M. D. Anderson Cancer Center (MDACC) study were recruited from an ongoing lung cancer case-control study enrolling patients with newly diagnosed, untreated lung cancer at The University of Texas MDACC (Houston, TX). The study has been described in detail previously (11). Control subjects were recruited from the largest private multispecialty physician group in the Houston metropolitan area. The controls did not have a previous diagnosis of any type of cancer and were frequency matched to the cases on age (±5 years), sex, ethnicity, and smoking status (current, former, and never). There were no never smokers (defined as persons who had smoked fewer than 100 cigarettes in their lifetime) in this subgroup. A former smoker was defined as a person who had quit smoking at least 1 year before the interview. Pack-years were defined as the number of cigarettes smoked per day divided by 20 and then multiplied by the number of years smoked. Epidemiologic data were collected by personal interview. In total, 289 lung cancer cases and 291 controls were included in the analysis. Table 1 provides a description of cases and controls used in the MDACC study.
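The pack-year definition above can be written as a one-line calculation. The following is a minimal sketch; the function name and the example smoker are hypothetical and are not taken from the study data.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Pack-years = (cigarettes per day / 20) * years smoked."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    # One pack is taken to be 20 cigarettes, as in the definition in the text.
    return cigarettes_per_day / 20.0 * years_smoked


# Hypothetical example: 30 cigarettes per day for 20 years.
print(pack_years(30, 20))  # 30.0
```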
The Liverpool Lung Project replication study (United Kingdom). The lung cancer case-control data were derived from the Liverpool Lung Project (LLP), an ongoing molecular epidemiologic study of lung cancer in Liverpool, United Kingdom (12). Histologically or cytologically confirmed lung cancer cases with primary tumors were recruited from participating chest clinics. Population controls were selected from registers of General Practitioners in Liverpool to ensure similar age-sex distributions to the cases. In all studies, a standardized questionnaire was used to determine basic demographic characteristics in addition to details on smoking history, lifetime residence and occupation, history of lung diseases, family history of cancer in first-degree relatives, and exposure to environmental tobacco smoke. Smoking status was defined as in the U.S. study. In total, 248 lung cancer cases and 233 controls were included in this analysis. Table 1 provides a description of cases and controls used in the LLP study.
SNP markers and genotyping. Genomic DNA was extracted from peripheral blood leukocytes by using the Qiagen DNA blood mini kit (Qiagen) according to the manufacturer's instructions. DNA pools were formed by combining equimolar amounts of individual samples as described elsewhere (13). For the pilot study, one pool of 369 cases and one pool of 287 controls were constructed. For assays carried out on sample pools, 25 ng of a 5 ng/μL pool were used for PCRs. All PCR and MassEXTEND reactions were conducted using standard conditions. Relative allele frequency estimates were derived from calculations based on the area under the peak of mass spectrometry measurements from four analyte aliquots (14). Tests of association between disease status and each SNP were carried out as previously discussed (15). When three or more replicate measurements of a SNP were available, the corresponding variance component was estimated from the data. Otherwise, the following historical laboratory averages were used to calculate sources of variability: pool formation, 5.0 × 10⁻⁵; PCR/mass extension, 1.7 × 10⁻⁴; and chip measurement, 1.0 × 10⁻⁴. The same procedure was used for individual genotyping except that 2.5 ng of DNA was used and only one mass spectrometry measurement was taken. The following gene-specific primers were used to genotype rs663048: the forward PCR primer was 5′-TGGGCTATGAGCTCCAGGG-3′; the reverse PCR primer was 5′-TGCGGCTTGGAGGCATTGAT-3′; and the extend primer was 5′-GAGCTCCAGGGCGCTAAGAT-3′.
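To illustrate how the variance components listed above might enter a pooled test of association, the sketch below compares allele frequencies between a case pool and a control pool with a two-sided z-test. The exact test is given in ref. 15 and is not reproduced in this article, so the variance model here is an assumption: for each pool, binomial sampling variance over 2n chromosomes plus one copy of each of the three historical measurement components. The function name and the example frequencies are hypothetical.

```python
import math

# Historical laboratory variance components quoted in the text.
VAR_POOL_FORMATION = 5.0e-5
VAR_PCR_EXTENSION = 1.7e-4
VAR_CHIP_MEASUREMENT = 1.0e-4


def pooled_allele_test(p_case, p_control, n_case, n_control):
    """Two-sided z-test for an allele-frequency difference between two DNA pools.

    Assumed variance model (not taken from the paper): per pool, binomial
    sampling variance p(1 - p) / (2n) plus the three measurement components.
    Returns (z, two-sided P value).
    """
    measurement = VAR_POOL_FORMATION + VAR_PCR_EXTENSION + VAR_CHIP_MEASUREMENT
    var_case = p_case * (1.0 - p_case) / (2 * n_case) + measurement
    var_control = p_control * (1.0 - p_control) / (2 * n_control) + measurement
    z = (p_case - p_control) / math.sqrt(var_case + var_control)
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical allele frequencies; pool sizes are those of the pilot study.
z, p = pooled_allele_test(0.28, 0.22, n_case=369, n_control=287)
print(f"z = {z:.2f}, P = {p:.3f}")
```

Because the measurement terms do not shrink with pool size, they can dominate the sampling variance for large pools, which is one reason replicate measurements matter in a pooling design.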
The Sequenom-Genefinder pilot study included 83,715 SNPs selected based on their location within a gene region (including the coding region plus an additional 10 kb at both ends) and minor allele frequency (MAF) from a total of 125,799 experimentally validated polymorphic variations (7, 8, 10). In the first step, one PCR and primer extension reaction was carried out for 83,715 SNPs on each pool (case and control). In the second step, 4,293 SNPs (∼5%) with the most statistically significant associations were remeasured in triplicate on each DNA pool. In the third step, the 301 most significant SNPs (∼7%) from step two were individually genotyped in each sample. A total of 160 SNP markers were identified with statistically significant differences between cases and controls (P < 0.05) after individual genotyping in the German pilot study and were then genotyped in the MDACC and LLP replication samples.
Expression of the seizure 6-like gene. Besides analyzing the effect of the Met430Ile variant on lung cancer risk, we also compared the seizure 6-like (SEZ6L) expression level in normal versus cancer cell lines (our data) and in primary tumors versus normal lung tissues [data from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database]. Affymetrix HG-U133A and HG-U133B high-density oligonucleotide microarrays were used to evaluate expression of the SEZ6L gene in cell lines. Gene expression profiling was done on a panel of 52 non–small cell lung cancer (NSCLC) and 22 small cell lung cancer (SCLC) cell lines. As a control, we used seven normal human bronchial epithelial cell lines immortalized with cyclin-dependent kinase 4 and hTert or with E6/E7 with or without hTert (16), and two unimmortalized lung cell lines (NHBEC and SAEC). The list of the cell lines used in the study can be found in Supplementary Table S1. Four probes were used for SEZ6L. Signals were median normalized and log2 transformed. The signal was averaged across these probes to estimate signal intensity for each cell line as well as for controls. Expression of SEZ6L was detected in all cell lines.
To compare SEZ6L expression in normal tissues and primary lung tumors, we used the GDS619 data set with microarray gene expression data in SCLC (19 samples), non–small cell lung cancer (12 samples), and normal lung tissue (20 samples). Tumor samples were obtained from patients undergoing surgery at the Cancer Institute Hospital (Tokyo, Japan). The control samples were obtained by bronchial brushes from unrelated healthy individuals. Normalized and log-transformed gene expression data were downloaded and the SEZ6L expression in normal and tumor tissues was compared (see description for GDS619 data set from the GEO database).
Statistical analysis. The distributions of the demographic variables were compared between cases and controls. For categorical variables (sex, ethnicity, and smoking status), the χ2 test was used; for continuous variables, the two-sample Student's t test was used. A goodness-of-fit χ2 test was used to determine whether the polymorphisms were in Hardy-Weinberg equilibrium. To estimate the effect of SNPs on lung cancer risk, adjusted odds ratios (OR) were calculated using multiple logistic regression to control for age, sex, and intensity of smoking (pack-years). To estimate the overall ORs from the replication studies, we used the Mantel-Haenszel test (17). All statistical analyses were done using STATISTICA (StatSoft, Inc.).
Results
The Sequenom-Genefinder pilot study. The pilot study identified a candidate region located at 22q12.2. The size of the candidate region was ∼37 kb. The region contains a cluster of 17 of 25 SNPs significantly associated with lung cancer risk (P < 0.05). Most of the SNPs in the region were in linkage disequilibrium (minimum r2 ≥ 0.8). Based on moderate or low r2 ≤ 0.6 levels of linkage disequilibrium, six significant SNPs were chosen for genotyping in two replication sample sets.
We found that the position and size of the candidate region coincides well with the position and size of the SEZ6L gene. The candidate region occupied the distal part of the 428-kb region frequently deleted in lung cancer cell lines ( 18). The deletion contains two genes: SEZ6L and MYO18B ( 18). No significant SNPs were detected in the MYO18B region ( Fig. 1 ).
The MDACC replication study. SNP markers with allelic frequencies found to be significantly different between cases and controls in the pilot study were validated in the MDACC replication study. The total number of SNP markers genotyped in the MDACC study was 147. Thirteen SNPs exhibited significant (P < 0.01) deviation from Hardy-Weinberg equilibrium in controls and were excluded from the analysis. The remaining 133 SNPs were analyzed to estimate their association with lung cancer. Analysis of all histologic types together enabled an identification of six cancer-associated SNP markers. In NSCLC, the same SNP markers were identified as showing significant associations with lung cancer as well as two additional SNPs ( Table 2 ).
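The Hardy-Weinberg filter applied to controls can be sketched as a goodness-of-fit χ2 test with 1 degree of freedom. This is a minimal illustration of the general technique, not the STATISTICA procedure used in the study; the function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Goodness-of-fit chi-square test for Hardy-Weinberg equilibrium.

    Arguments are observed genotype counts (two homozygote classes and the
    heterozygote class). Returns (chi2, P value) for 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical control genotype counts (G/G, G/T, T/T).
chi2, p = hwe_chi_square(174, 105, 12)
# The text excludes SNPs that deviate from equilibrium at P < 0.01.
print(f"chi2 = {chi2:.2f}, P = {p:.3f}, exclude = {p < 0.01}")
```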
To further prioritize the significant SNPs listed in Table 2, we annotated the SNPs by using available data that potentially can help to identify causal SNPs ( Table 3 ). The SNP rs663048 was the only nonsynonymous (Met430Ile) SNP in the list. Sorting Intolerant from Tolerant (SIFT; ref. 19) and Polymorphism Phenotyping (PolyPhen; ref. 20) software were used to predict the functional effect of the Met430Ile amino acid substitution. Both algorithms predict that the Met430Ile is a functional (protein disturbing) variant; therefore, our further analysis was concentrated on rs663048 polymorphism.
To estimate the relative risk associated with the significant SNPs, we used a logistic regression model. Table 4 shows the predicted OR adjusted for age, intensity of smoking (pack-years), and gender. There was no significant deviation from Hardy-Weinberg equilibrium in either replication sample, in controls or in cases. Homozygotes for the more frequent allele were used as reference groups. OR values ranged from 1.29 to 1.72, with four of them being statistically significant. The OR for the SNP rs663048 was 1.48 (P = 0.03) with rare genotypes being associated with increased risk for NSCLC.
The LLP replication study (United Kingdom). The LLP replication sample was used to further validate the association between rs663048 and lung cancer risk. In this study, we also found a significant association between the Met430Ile polymorphism and lung cancer, with cases having a higher frequency of the variant T/T genotype (8.9 ± 1.9%) than the controls [3.0 ± 1.1%; χ2 = 7.4; degrees of freedom (df) = 1; P = 0.006]. We observed a 3.8-fold increased risk of lung cancer in individuals who carried the rs663048 variant genotype T/T compared with those who carried the more common (wild-type) genotype G/G (95% CI, 1.40–10.42) after adjusting for age, sex, and smoking (Table 4).
Expression of the SEZ6L in normal and lung tumor cell lines. Supplementary Fig. S1 shows the expression level of SEZ6L in 9 normal lung cell lines, 54 NSCLC, and 22 SCLC cell lines. The average expression signal in controls was 6.2 (n = 9) compared with the average expression signal in the NSCLC cell lines of 7 (n = 54) and in SCLC cell lines of 8.8 (n = 22). A nonparametric Mann-Whitney U test was significant for all pairwise comparisons. The smallest (but still very significant) difference was found between the normal lung cell lines and NSCLC cell lines (Z = −3.3; P = 0.001). We also found significant differences in expression level of GATA2, with the expression being higher in cancer lines compared with controls (Mann-Whitney U test, Z = −3.9; P = 0.0005). Other candidate genes listed in Table 3 did not show differential expression between lung cancer and normal lung cell lines. SNP genotypes were not available for this analysis.
Expression of the SEZ6L in primary tumor and normal lung tissues. We used the NCBI GEO database containing microarray data on the genome-wide assessment of gene expression. Using the key words “lung AND cancer OR tumor,” we identified several entries with data on gene expression. The GDS619 data set (platform GPL962: CHUGAI41K) was most appropriate for our goal to compare SEZ6L expression in normal and tumor tissues. This data set contains data on gene expression in SCLC, adenocarcinomas, and normal tissues. We found that the average log2-transformed SEZ6L expression value was significantly higher in adenocarcinoma compared with normal lung tissues (0.34 ± 0.13 versus −0.38 ± 0.15; two-sided t test = 3.1; df = 29; P = 0.004). The expression of SEZ6L was also higher in SCLC compared with normal tissues, although the difference was not significant (Student's t test = 1.4; df = 43; P = 0.15). The variance of expression values in the SCLC samples was significantly higher compared with the normal tissues (Var = 0.07 among controls versus Var = 0.42 among SCLCs; F = 6.2; P = 0.001). This result together with our analysis of the SEZ6L expression in cell lines suggests an increased level of the SEZ6L expression in lung tumor compared with normal lung tissues.
Discussion
There are two major approaches used to identify cancer-related genes: (a) candidate gene approach and (b) genome-wide scan approach. The candidate gene approach is based on prior data indicating that a gene(s) is involved in cancer. For example, it has been shown that increased lung cancer risk correlates with decreased nucleotide excision repair (NER) capacity ( 21– 23). Targeting genes involved in NER has identified several polymorphisms in NER genes associated with elevated lung cancer risk ( 24– 26). The advantage of the candidate gene approach is that the number of candidate genes is limited; therefore, the number of false positives among significant associations is also expected to be low. A disadvantage of the approach is that genes without prior data on associations will not be included in the analysis and, therefore, cannot be identified by the candidate gene approach. In the genome-wide scan approach, potentially all genes in the genome are targeted, allowing the identification of novel cancer-related genes. A disadvantage of the genome-wide scan is the large number of independent tests so that a large sample size is required to distinguish between true- and false-positive associations.
In our study, we have combined elements of both approaches. At the first step, more than 83,000 SNP markers covering the coding region of the whole genome were analyzed. This analysis yielded many candidate SNPs showing significant associations with lung cancer. Significant SNPs identified in the pilot study are a mix of false positives and true associations. For selected SNPs in SEZ6L candidate region, two independent replication studies were then conducted to identify true associations. The MDACC replication study yielded eight SNPs showing associations with lung cancer risk. We then used additional SNP-related information to identify most promising genes. As a result of this analysis, the SEZ6L gene emerged as a top candidate gene to be associated with lung cancer. It is interesting that based only on the significance of the χ2 test, this gene was number five in the list. The LLP replication study further validated the association of SNP rs663048 in SEZ6L with lung cancer risk.
An analysis of the expression of the SEZ6L gene showed different expression of SEZ6L in normal, NSCLC, and SCLC cell lines. Interestingly, we found that the two histologic types of lung cancer had different levels of expression of SEZ6L. The average expression signal in NSCLC was 7.0 ± 0.1 and in SCLC 8.8 ± 0.2 (Mann-Whitney U test, Z = 6.0; P < 0.001). It needs to be noted that there is an apparent inconsistency between the results of the analysis of the expression of SEZ6L and the results of the association studies. In the MDACC and United Kingdom samples, the variant allele (that was predicted to be protein disturbing based on analysis of the protein structure and evolutionary conservation) was associated with increased risk for lung cancer, suggesting that loss of normal SEZ6L function may be a risk factor for lung cancer. This is consistent with the finding that the SEZ6L region is often deleted in lung cancer cell lines (18). On the other hand, we found that expression of SEZ6L is elevated in lung cancer cell lines. One explanation might be that SEZ6L is both a tumor marker and a variant affecting lung cancer susceptibility. Under this explanation, loss of normal SEZ6L function is a risk factor for development of lung cancer; however, when lung cancer is caused by factors other than loss of SEZ6L function, expression of SEZ6L is adaptively up-regulated to suppress tumorigenesis.
Several lines of evidence support the hypothesis that SEZ6L might modulate lung cancer risk. First, frequent allelic losses on 22q in NSCLCs have been reported, indicating the presence of tumor suppressor gene(s) on that chromosome arm (18). Cloning of the breakpoints revealed a 400-kb deletion containing the SEZ6L and MYO18B genes (18, 27). A study conducted by Suzuki et al. (28) suggests that the SEZ6L gene may also influence development and progression of colorectal cancer. The authors found that SEZ6L was one of the few genes highly hypermethylated in primary colorectal tumors.
In the pilot study, we populated the candidate region with 35 SNPs and found that markers located in the SEZ6L gene region show a strong association with lung cancer; however, no significant associations were found in the neighboring MYO18B gene. Applying a sliding window of five neighboring SNPs revealed a peak of −log10 (P values) that coincides with the position of the SEZ6L gene. We found that the principal contributor to the peak was rs663048. The association of this SNP with lung cancer risk was verified in two independent replication studies. The rs663048 SNP is a Met430Ile amino acid substitution that has been predicted to be functional by both SIFT and PolyPhen, suggesting that this amino acid substitution is protein disturbing.
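The sliding-window scan described above can be sketched as follows. The text does not state how the five −log10(P) values in each window were summarized, so this sketch assumes a simple mean; the function names and the P values are hypothetical and serve only to show the mechanics.

```python
import math


def sliding_window_scores(p_values, window=5):
    """Mean -log10(P) over each run of `window` consecutive SNPs.

    `p_values` must be ordered by chromosomal position. Using the mean as the
    window summary is an assumption; the text does not specify the statistic.
    """
    if window < 1 or window > len(p_values):
        raise ValueError("window must be between 1 and the number of SNPs")
    scores = [-math.log10(p) for p in p_values]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def peak_window(p_values, window=5):
    """Return (start index, mean score) of the highest-scoring window."""
    means = sliding_window_scores(p_values, window)
    start = max(range(len(means)), key=means.__getitem__)
    return start, means[start]


# Hypothetical P values for nine SNPs ordered along the region.
p_values = [0.5, 0.4, 0.01, 0.001, 0.002, 0.01, 0.03, 0.6, 0.7]
start, score = peak_window(p_values)
print(f"peak window covers SNPs {start}..{start + 4}, mean -log10(P) = {score:.2f}")
```

Averaging over neighboring markers smooths out isolated low P values, so a peak reflects a cluster of associated SNPs rather than a single marker.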
Nishioka et al. ( 18) found that 95% (43 of 45) of primary tumor samples carry the Met430Ile mutation. The authors did not estimate the frequency of the variant in controls. We found that 38% of controls and 48% of cases carry at least one variant allele. This suggests that ∼40% of tumors may carry a somatic Met430Ile mutation. If we consider that according to the HapMap the frequency of the Met430Ile polymorphism is lower in Japanese than in Caucasians, the percentage of accumulated somatic Met430Ile may actually be higher.
We found that homozygotes for the variant allele had 3-fold higher lung cancer risk compared with the wild-type homozygotes. Lung cancer risk was also significantly elevated in heterozygotes. According to our estimates, the frequency of the variant allele for rs663048 is 22%, which is very similar to the 20% reported for Caucasians by the HapMap database. We found that 36% of Caucasian controls are heterozygotes and ∼4% are homozygotes for the risk allele, making the portion of Caucasians having at least one risk allele ∼40%. Results from the combined Mantel-Haenszel analysis yielded ORs of 1.15 [95% confidence interval (95% CI), 1.04–1.59] for heterozygotes and 3.32 (95% CI, 1.81–7.21) for homozygotes. The population attributable risk percentage [PAR% = (OR − 1) × P / [(OR − 1) × P + 1] × 100, where P is the risk genotype frequency in the controls] was 7.5 for homozygotes and 8.3 for heterozygotes, suggesting ∼16% of excess risk in lung cancer cases is due to the presence of the variant allele.
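The two calculations in this paragraph, a Mantel-Haenszel pooled OR across studies and the PAR% formula, can be sketched as below. This is an illustration of the standard estimators, not the STATISTICA output; the function names and the 2 × 2 counts are hypothetical, and the 3.5% genotype frequency is an assumed input (the text gives ∼4%).

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata of 2x2 tables.

    Each stratum is (a, b, c, d): cases with the risk genotype, cases with the
    reference genotype, controls with the risk genotype, and controls with the
    reference genotype.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these tables")
    return numerator / denominator


def par_percent(odds_ratio, risk_genotype_freq):
    """PAR% = (OR - 1) * P / [(OR - 1) * P + 1] * 100, as defined in the text."""
    excess = (odds_ratio - 1.0) * risk_genotype_freq
    return excess / (excess + 1.0) * 100.0


# Hypothetical counts for two replication studies (not the study data).
studies = [(10, 20, 5, 40), (30, 10, 20, 20)]
print(f"pooled OR = {mantel_haenszel_or(studies):.2f}")

# OR for homozygotes from the text, with an assumed genotype frequency of 3.5%.
print(f"PAR% = {par_percent(3.32, 0.035):.1f}")
```

With OR = 3.32 and an assumed homozygote frequency of 3.5% in controls, the formula gives a PAR% of about 7.5.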
In conclusion, our data together with published studies suggest that the Met430Ile variant might be a causal variant affecting risk of lung cancer. Although the strongest evidence from our study points to this SNP, it is possible that another closely located SNP plays a dominant role in promoting lung cancer risk and that the Met430Ile variant is a marker in linkage disequilibrium with the underlying causal variant. Further studies, especially those implementing functional assays, are warranted to provide more conclusive evidence of a causal association between Met430Ile and lung cancer risk.
Grant support: National Cancer Institute grants R01 CA55769 and CA 70907, Specialized Programs of Research Excellence grant P50CA70907, and Flight Attendant Medical Research Institute.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/) and at Cancer Genetics Web database: http://www.cancerindex.org/geneweb/X1501.htm.
I.P. Gorlov, P. Meyer, R. Dierkesmann, J.K. Field, and C.I. Amos contributed equally to this work.
- Received December 27, 2006.
- Revision received April 27, 2007.
- Accepted June 20, 2007.
- ©2007 American Association for Cancer Research.