| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Experimental Therapeutics, Molecular Targets, and Chemical Biology
1 Division of Life and Pharmaceutical Sciences, Ewha Womans University; 2 Department of Thoracic Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine; 3 Cancer Research Division, Center for Clinical Research, Samsung Biomedical Research Institute; and 4 Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
Requests for reprints: Sanghyuk Lee, Division of Life and Pharmaceutical Sciences, Ewha Womans University, 11-1 Daehyun-dong, Seodaemun-gu, Seoul 120-750, Korea. Phone: 82-2-3277-2888; Fax: 82-2-3277-3760; E-mail: sanghyuk{at}ewha.ac.kr or Kwhanmien Kim, Department of Thoracic Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50 Ilwon-dong, Gangnam-gu, Seoul 135-710, Korea. Phone: 82-2-3410-3485; Fax: 82-2-3410-0089; E-mail: kwhanmien.kim{at}samsung.com.
| Abstract |
|---|
| Introduction |
|---|
The emergence of high-throughput molecular tools such as microarrays and large-scale databases of serial analysis of gene expression (SAGE) and expressed sequence tags (EST) brought a new paradigm for biomarker discovery. Since these technologies were introduced, numerous candidate biomarkers have been reported for risk assessment, screening, diagnosis, prognosis, and the selection and monitoring of therapies (3). For example, microarrays have been successfully applied to find new classes of diseases, to predict prognosis, and to identify diagnostic markers for early detection (4). Typically, these studies have used classifiers that consist of tens or hundreds of genes. However, it is still not clear whether such molecular signatures would be more effective than a few biomarkers of high sensitivity and specificity.
SAGE is a sequencing-based technique for quantitatively profiling gene expression. Tag counts provide an estimate of the expression levels. The merit of the technique is that it does not require any prior knowledge of the gene sequence, thus generating unbiased profiles even for unknown genes (5). Meta-analysis is relatively straightforward because of the simple nature of the data. Two research groups have used the SAGE technique to find differentially expressed genes as biomarkers for lung cancer (6, 7).
EST sequencing is a relatively low-throughput technique compared with the microarray or SAGE techniques. However, the vast amount of public data in dbEST makes it a valuable information source for studying gene expression patterns, as can be seen in the case of the Cancer Genome Anatomy Project (8). In addition, several studies have reported the successful application of EST data to find tissue-specific and/or cancer-specific genes (9).
SAGE and EST data are often complementary in their library characteristics. Gene expression from EST is not quantitative because preparation of many libraries has included normalization or subtraction steps. SAGE provides a quantitative profile, but the number of public libraries is rather small compared with the EST cDNA libraries (300 versus 8,600 for human). Furthermore, the tag-to-gene assignment is not a trivial task in SAGE analysis. Bioinformatics methods that integrate these public data, taking merits and shortcomings of each into consideration, would significantly facilitate the discovery of biomarkers.
A major obstacle in finding biomarkers using public expression data is the problem of sample heterogeneity. Many cDNA libraries from the Cancer Genome Anatomy Project are from bulk samples and cell lines. The coverage of SAGE libraries is quite limited and, consequently, lacks the statistical power needed to establish valid biomarkers. For example, only six SAGE libraries are publicly available for lung tissues, and their preparation protocols vary significantly from one another. Thus, it is essential to critically validate the predicted candidate markers using a number of well-defined clinical samples. Investigating the molecular properties of the candidates and the pathways they are involved in may also help in deciding the value of the genes as drug targets in addition to biomarkers.
We describe here a bioinformatics method and clinical validation to identify diagnostic marker genes for lung cancer. The SAGE and EST data were meta-analyzed to produce a list of differentially expressed genes in lung cancers. A systematic examination of the annotated gene properties led to 20 genes which were subjected to experimental validation using clinical specimens from lung cancer patients. Semiquantitative reverse transcriptase-PCR (RT-PCR) followed by extensive statistical analyses established seven genes (CBLC, CYP24A1, ALDH3A1, AKR1B10, S100P, PLUNC, and LOC147166) as the potential diagnostic markers for lung cancer. Quantitative real-time PCR experiments were carried out with additional samples for the seven identified markers as the final validation step. We further describe the molecular properties of these genes, especially their relationship to lung cancer and regulatory signaling pathways, to examine their value beyond diagnostic or prognostic biomarkers.
| Materials and Methods |
|---|
The clustering algorithm is more conservative in terms of selecting valid alignments and splice site sharing. Approximately 6% to 7% of ESTs in UniGene do not satisfy our conservative alignment criteria. An additional 5% or so of ESTs belong to different EST clusters, most of which turned out to be neighboring clusters. We found that such ESTs aligned with the opposite strand or in the intronic regions without sharing any splice sites. This difference of approximately 11% to 12% in EST members leads to a substantial difference in the final result. A schematic overview of our approach is given in Fig. 1.
EST analysis. We used the EST clusters from ECgene version 1.2 based on National Center for Biotechnology Information (NCBI) human genome build 35. Approximately 7 million human EST sequences from more than 8,600 cDNA libraries are currently available in GenBank. The total number of lung ESTs from 304 cDNA libraries was 401,672 as of August 2006. The diverse source of samples is one of the main advantages for meta-analysis of EST data.
Each EST cluster was tested for its differential expression in the lung tissue and in the cancerous tissue. Tissue specificity was tested separately from cancer specificity using Fisher's exact test. This approach has the advantage of retaining clusters with limited numbers of lung cancer ESTs that actually have a number of ESTs from normal lung tissue and other types of cancer. Gene expression is often deregulated with tumor progression, so it is reasonable to assess cancer specificity regardless of tissue origin.
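As an illustration of the kind of test described above, the following is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table of EST counts. The table layout, the function name `fisher_exact_greater`, and the counts are assumptions made for this example; the text does not state whether a one-sided or two-sided test was used, and real dbEST-scale counts would call for a statistics library rather than exact integer arithmetic.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability
    of seeing at least `a` counts in the top-left cell given fixed margins.
    """
    row1 = a + b          # e.g. ESTs belonging to the cluster of interest
    col1 = a + c          # e.g. ESTs coming from lung libraries
    n = a + b + c + d     # all ESTs considered
    denom = comb(n, col1)
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(row1, x) * comb(n - row1, col1 - x) / denom
    return p


# Hypothetical counts (not from the study):
#                 lung   other tissues
# this cluster      8         2
# other clusters   20        70
p_lung = fisher_exact_greater(8, 2, 20, 70)
print(f"one-sided P for lung enrichment: {p_lung:.3g}")
```

The same function could be applied a second time to a cancer versus normal table, mirroring the separate tissue-specificity and cancer-specificity tests described in the text.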
Sample description. The primary lung tumor tissues were obtained from non–small-cell lung cancer patients who had undergone curative surgery. Our study was approved by the institutional review board, and written informed consent was obtained from all patients (Institutional Review Board no. 2004-10-18). Lung cancer tissues were obtained from nonnecrotic tumor areas. In each case, the normal parenchyma of the same lobe not continuous with the tumor was selected as "pathologically normal" tissue. All the specimens were soaked in liquid nitrogen immediately after resection and stored at –70°C.
Our clinical samples initially consisted of 11 adenocarcinoma, 11 squamous cell carcinoma, and 10 lung tissues from benign lung diseases. The detailed characteristics of patients and the pathology are provided in Table 1. Paired samples were obtained from primary lung cancer and the adjacent normal tissues for validation tests. Semiquantitative RT-PCR experiments were carried out for initial screening of biomarkers from 20 candidate genes using these 22 paired samples. The 10 lung tissues from benign lung diseases were used as the control for noncancerous disease states. In the final validation step, real-time quantitative RT-PCR was done for 7 biomarkers on 36 paired samples, which included 14 additional paired samples (7 adenocarcinoma and 7 squamous cell carcinoma) and the 22 original paired samples.
RT-PCR experiment. Total RNA was isolated from 20-µm-thick cryostat sections. Reverse transcription was carried out with 1 µg of RNA, 50 µmol/L oligo(dT)20, 250 µmol/L deoxynucleotide triphosphate (dNTP), 10 units of Moloney murine leukemia virus reverse transcriptase III (Invitrogen), and 5x reverse transcriptase buffer in a total volume of 50 µL. The PCR reactions were done using 1 µL (20 ng) of each cDNA, 250 nmol/L dNTP, 10 pmol/L of each primer, and 2 units of i-StarTaq polymerase (Intron Biotechnology, Inc.). The sequences of oligonucleotide primers used in the experiment are listed in Supplementary Table S1. The results were expressed as the ratio of the relative levels.
Quantitative real-time RT-PCR experiment. The cDNA used for quantitative real-time RT-PCR was prepared in the same manner as in the semiquantitative RT-PCR experiment. We used FAM dye–labeled TaqMan MGB probes (Applied Biosystems). Each probe was designed to be specific to one of the seven final candidate genes (see above). In addition, a TATA box binding protein–specific probe was used as the internal control. The PCR reaction mixture consisted of the reverse transcription product, TaqMan 2x Universal PCR Master Mix, and the appropriate 20x TaqMan Gene Expression assay mix containing primers and probe for the gene of interest. Cycle variables for the PCR reaction were 50°C for 2 min (UNG activation) and then 95°C for 10 min, followed by 40 cycles of a denaturing step at 95°C for 15 s and an annealing/extension step at 60°C for 60 s. All reactions were run in triplicate. The relative expression value of each gene with respect to the internal control gene was calculated as $2^{-\mathrm{dCT}}$, where $\mathrm{dCT} = \mathrm{CT}_{\text{target gene}} - \mathrm{CT}_{\text{internal control gene}}$ (ref. 13).
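The relative expression formula above can be worked through in a few lines. This is a minimal sketch, assuming that the triplicate CT values are averaged before subtraction (the text says only that reactions were run in triplicate) and that the tumor-to-normal fold change is the ratio of the two relative expression values; the function names and CT values are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target, ct_control):
    """Return 2^(-dCT), with dCT = mean CT(target) - mean CT(internal control).

    Averaging the replicate CT values is an assumption of this sketch.
    """
    d_ct = mean(ct_target) - mean(ct_control)
    return 2.0 ** (-d_ct)


def fold_change(rel_tumor, rel_normal):
    """Ratio of tumor to paired-normal relative expression."""
    return rel_tumor / rel_normal


# Hypothetical triplicate CT values for one patient (not from the study).
tumor = relative_expression([26.1, 26.0, 25.9], [23.0, 23.1, 22.9])
normal = relative_expression([29.0, 29.2, 28.8], [23.0, 22.9, 23.1])
print(f"tumor: {tumor:.4f}, normal: {normal:.4f}, "
      f"fold change: {fold_change(tumor, normal):.2f}")
```

Because CT is on a log2 scale, a dCT difference of one cycle corresponds to a twofold difference in relative expression.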
Statistical analyses for biomarker evaluation. We measured the DNA band intensity using a densitometer (Bio-Rad). ANOVA with the Tukey-Kramer multiple comparison method was used to find the differentially expressed genes between the benign disease tissues and cancer tissues. The paired t test was used to identify the differentially expressed genes between the pathologically normal and cancer tissues from each of the patients. Differences were considered significant at P < 0.05.
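For readers unfamiliar with the paired t test, the following minimal sketch computes the test statistic from matched tumor and normal measurements. The function name and the intensity values are hypothetical, and the sketch stops at the statistic and degrees of freedom; obtaining the P value requires the t distribution, which a statistics package would normally supply.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(tumor, normal):
    """Return (t, degrees of freedom) for paired tumor/normal measurements."""
    if len(tumor) != len(normal):
        raise ValueError("paired samples must have the same length")
    diffs = [t - n for t, n in zip(tumor, normal)]
    n = len(diffs)
    # Mean within-patient difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1


# Hypothetical normalized band intensities for five patients.
tumor_vals = [0.82, 0.91, 0.40, 0.77, 0.65]
normal_vals = [0.10, 0.35, 0.38, 0.20, 0.05]
t_stat, df = paired_t_statistic(tumor_vals, normal_vals)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

Pairing matters here because each patient serves as their own control, so between-patient variation in baseline expression does not inflate the error term.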
We also carried out the feature selection procedures that are frequently used to identify important features in classification problems. All samples were classified into the normal and cancer classes. The gene expression values of 20 genes were used as input data. We applied three feature selection methods to identify the most important genes for classifying normal and tumor samples. The support vector machine classifier, χ2 test statistics, and gain ratio methods were used as implemented in the WEKA package (14). Parameters were set to the default values in WEKA, and 5-fold cross-validation was done with the seed set to 1. The search method was set as the ranker.
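To make one of these criteria concrete, here is a minimal sketch of the gain ratio for a single gene whose expression has already been reduced to "expressed" versus "not expressed". It is an illustration of the criterion only and does not reproduce WEKA's own implementation or its handling of numeric attributes; the function names and counts are hypothetical.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def gain_ratio(table):
    """Gain ratio of a discrete feature.

    `table` holds one row per feature value; each row lists the class
    counts, e.g. [[cancer, normal] for expressed,
                  [cancer, normal] for not expressed].
    """
    class_totals = [sum(col) for col in zip(*table)]
    n = sum(class_totals)
    h_class = entropy(class_totals)
    h_cond = sum(sum(row) / n * entropy(row) for row in table if sum(row) > 0)
    split_info = entropy([sum(row) for row in table])
    if split_info == 0:
        return 0.0  # feature takes a single value, so it cannot split anything
    return (h_class - h_cond) / split_info


# Hypothetical counts: gene expressed in 18 of 22 tumors and 3 of 22 normals.
score = gain_ratio([[18, 3], [4, 19]])
print(f"gain ratio: {score:.3f}")
```

Genes would then be ranked by this score, which is the role the ranker search method plays in the procedure described above.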
| Results and Discussion |
|---|
The SAGE data yielded the following 10 genes: MUC5AC, TFF3, PLUNC, CYBA, CGI-38, GBA, S100P, s-TIM, s-C20orf85, and CYP24A1 (details given in Supplementary Table S2). We used the following criteria for gene selection: (a) the availability of full-length clones; (b) no ambiguity in the tag-to-gene assignment; (c) the P values and the real tag counts in six libraries; (d) the gene properties; and (e) the literature survey.
Similarly, another set of 10 genes resulted from the EST analyses: ALDH3A1, TRIM16, AKR1B10, T, LOC147166, FOXA2, SCTR, DRD2, CBLC, and GLP2R (Supplementary Table S3). We used the criteria of (a) the number of ESTs and cDNA libraries, (b) multiplicity of exons, (c) the percentage of lung and cancer ESTs and libraries, and (d) the gene properties. In contrast to the SAGE data, the vast number of cDNA libraries makes it possible to use the tissue and cancer specificities as filtering criteria.
We specifically looked for genes that were up-regulated in the cancer samples because most of the known biomarkers for cancer diagnostics are the overexpressed ones (15). The reliability of gene annotation was also taken into account. To identify markers with testable biological functions, the unknown genes and immune-related genes were excluded in spite of their potential as good biomarkers. Full lists of the differentially expressed genes from SAGE and EST data are available in Supplementary Tables S7 and S8.
Comparing the candidate gene lists from SAGE and EST reveals that the overlap is not significant. This is frequently seen when the number of available libraries for one of the two data sets is small. Only six SAGE libraries are publicly available for lung. A close examination of SAGE candidate genes shows that four genes (PLUNC, CGI-38, s-C20orf85, and GBA) would have been in the EST candidates without application of the specificity criteria.
Comparison with microarray data. Results from microarray experiments are the most abundant form of gene expression data. The expression level of the 20 candidate genes was compared with the public microarray data. We simply used the Gene Expression Omnibus (GEO) database from the NCBI (16) and the ONCOMINE database developed by Chinnaiyan's group (17).
The GEO serves as a public repository for a wide variety of high-throughput gene expression data from microarray, SAGE, and proteomic methods. We looked for the data sets whose characteristics are similar to our study design, comparing the normal and tumor tissues from lung cancer patients. The GEO search resulted in a data set (GDS1312) revealing differential expression between paired samples from 10 squamous cell carcinoma patients (18). The data set showed that 3 of the 10 SAGE candidates (CYP24A1, S100P, and PLUNC) and 4 of the 10 EST candidates (CBLC, ALDH3A1, AKR1B10, and TRIM16) were overexpressed in the tumor samples. However, several genes (CGI-38, CYBA, TFF3, GBA, and FOXA2) were down-regulated in the tumor samples contrary to our prediction.
The ONCOMINE database provides a comprehensive interpretation of published microarray experiments including three pioneering works on lung cancer (19–21). Beer et al. compared nonneoplastic lung tissues with lung adenocarcinoma (10 nonneoplastic versus 86 adenocarcinoma tissues). Bhattacharjee et al. compared 17 normal lung tissues with 139 lung adenocarcinoma samples. Analysis of the data showed that three genes (S100P, s-TIM, and TFF3) were overexpressed in lung tumor samples compared with normal samples. CYBA was underexpressed in lung cancer samples in the ONCOMINE database as well.
Comparison with the microarray data indicates that our prediction based on SAGE and EST data agrees to a significant extent with the microarray data. However, substantial differences exist among the three types of high-throughput expression data. This implies that experimental validation using well-defined clinical samples is an essential step for definitively identifying biomarkers.
Experimental Validation of the Candidate Biomarker Genes
The validation procedure consists of two steps. Initially, semiquantitative RT-PCR experiments were done for the 20 candidate genes using 22 paired samples and 10 inflammatory tissues. Seven genes were selected from extensive statistical analyses of RT-PCR results. Subsequently, quantitative real-time RT-PCR experiments were carried out for the seven genes using 36 paired samples, which included the original 22 paired samples and additional 14 paired samples.
RT-PCR results. We carried out semiquantitative RT-PCR experiments for the 20 candidate genes selected from the SAGE and EST analyses. Among the 10 genes from EST analysis, 6 genes (T, FOXA2, SCTR, GLP2R, TRIM16, and DRD2) were immediately excluded from further validation efforts because these genes were not detected in any of the lung tissues. This may reflect the fact that many EST libraries were normalized or subtracted to detect even genes expressed at extremely low levels. Most ESTs for two of the genes (SCTR and GLP2R) are in fact from normalized libraries. The reason for the discrepancy for the other genes is not clear. We finally evaluated 14 genes in 10 inflammatory tissues and 22 paired cancer tissues by RT-PCR (Fig. 2).
The results of the experiment according to the paired t test indicate that five genes—CBLC (P = 0.002), S100P (P = 0.031), CYP24A1 (P = 0.027), AKR1B10 (P = 0.035), and LOC147166 (P = 0.007)—were significantly overexpressed in tumor samples. In terms of the disease type, CBLC was significant in both tumor tissues. S100P was overexpressed in adenocarcinoma tissues, whereas ALDH3A1, AKR1B10, and LOC147166 were overexpressed in squamous cell carcinoma tissues.
Figure 3 shows the box plot analysis of the real-time RT-PCR results. For each gene, median fold change and distribution were analyzed for all samples together and separately for the two different cancer types. Consistent with the t test, CBLC stands out as the most probable biomarker for both types of lung cancers. CYP24A1 also seems to be a viable general biomarker for lung cancers. For adenocarcinoma, PLUNC and S100P also seem to be potential biomarkers, whereas AKR1B10 and ALDH3A1 seem to be likely candidates for squamous cell carcinoma. Importantly, when quantitated separately, the 22 original samples from which the 7 candidate genes were derived and the new 14 samples which could be considered as the validation set showed little difference in terms of fold change.
Critical evaluation of biomarker genes. Inspecting the details of gene expression shown in Table 2 and Fig. 3 provides deeper insights on the biomarker evaluation. The CBLC gene, the most significant marker according to the statistical tests, was expressed in just two normal samples (1 in 10 benign lung disease tissues, 1 in 11 pathologic normal tissues from the adenocarcinoma patients, and none from the squamous cell carcinoma patient tissues). In contrast, most cancer tissues showed a positive expression for this gene (7 of 11 in the adenocarcinoma tissues and 8 of 11 in the squamous cell carcinoma tissues). In 35 of 36 cases, real-time RT-PCR analysis showed elevated expression in tumor samples, which strongly indicates that CBLC is a highly specific and sensitive cancer biomarker.
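The semiquantitative RT-PCR counts quoted above for CBLC can be turned into rough sensitivity and specificity estimates. This is a back-of-the-envelope sketch rather than a figure reported in the study: it assumes that "positive expression" is treated as a positive test result and that the benign disease tissues and the paired normal tissues are pooled into one non-cancer group. The function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


# CBLC counts taken from the text above.
cancer_positive = 7 + 8            # adenocarcinoma + squamous cell carcinoma
cancer_total = 11 + 11
noncancer_positive = 1 + 1 + 0     # benign, adenocarcinoma normals, squamous normals
noncancer_total = 10 + 11 + 11

sens, spec = sensitivity_specificity(
    tp=cancer_positive,
    fn=cancer_total - cancer_positive,
    tn=noncancer_total - noncancer_positive,
    fp=noncancer_positive,
)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Under these assumptions CBLC is detected in 15 of 22 cancer tissues and absent from 30 of 32 non-cancer tissues, which is the pattern behind its description as a specific marker in the semiquantitative data.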
CYP24A1 also showed a similar degree of specificity and sensitivity. Interestingly, two inflammatory tissues from benign lung disease cases showed positive expression although CYP24A1 was not expressed in any pathologically normal tissues from 22 cancer patients. This indicates that CYP24A1 could be an excellent marker for distinguishing normal and tumor tissues in paired samples but that its expression can be induced in other types of lung diseases. Although the result from the real-time RT-PCR experiment, with 4 of 33 cases showing suppressed expression in tumor samples, is not as impressive as the semiquantitative RT-PCR data, CYP24A1 still seems to be a promising biomarker for lung cancers.
Although S100P showed a good statistical correlation, it does not seem to be a good marker because about half of the normal samples showed its expression. Furthermore, S100P scored poorly in feature selection methods. The box plot in Fig. 3 nevertheless indicates that S100P might be a good biomarker for adenocarcinoma subtype of lung cancer. ALDH3A1 seemed to perform much better for its overall ability to distinguish cancer samples. The numbers were even more impressive for the squamous cell carcinoma patients (1 versus 10). Real-time RT-PCR data are consistent with these observations.
Among the three additional genes obtained from the paired sample test, AKR1B10 and PLUNC had a small tendency to be preferentially expressed in cancer samples over normal samples, but their merit as biomarkers was not obvious in the RT-PCR result. However, real-time data strongly support AKR1B10 as a promising candidate. In fact, all 14 additional samples showed increased expression in cancer tissues, whereas the original 22 samples had a mixed tendency. Similarly, PLUNC seems to have a fair potential as a biomarker for adenocarcinoma, with 16 of 18 samples showing elevated expression in cancer tissues.
In summary, we propose that the four genes (CBLC, CYP24A1, AKR1B10, and ALDH3A1) that showed significant differences in both statistical tests and the RT-PCR validations are potential biomarkers for non–small-cell lung cancer patients. Two genes (CBLC and CYP24A1) are particularly promising. With respect to the histopathologic aspects, these genes were expressed in both adenocarcinoma and squamous cell carcinoma, indicating that they are not cancer type–specific markers.
Biological properties of the candidate genes. Biomarker discovery does not necessarily require understanding the biological function and regulatory mechanism of the candidate genes. However, molecular understanding of the biological function could still be worthwhile in that overexpression of these genes may be mechanistically linked to carcinogenesis. We therefore surveyed the literature and the knowledge databases such as Entrez Gene (22), Ingenuity Pathway Analysis (23), and TransPath Professional 7.3 (24) on the two most promising genes.
CBLC is a member of the Cbl family of multidomain signaling proteins with a tyrosine kinase binding domain and a RING finger domain, the latter of which interacts with the E2 ubiquitin conjugating enzymes of the ubiquitin pathway (25). Thus, the Cbl family gene products function as ubiquitin ligases toward activated protein tyrosine kinases such as Src (26) and Lck (27). CBLC is also known to bind to proteins with the Src homology-3 domain. It is recruited to the epidermal growth factor (EGF) receptor (EGFR) on EGF stimulation and increases ubiquitination of EGFR, thereby down-regulating EGFR signaling (28, 29). Mutations in the EGFR gene have been reported in non–small-cell lung cancer patients, especially in patients with adenocarcinoma, women, nonsmokers, and East Asians (30).
CYP24A1 is a member of the cytochrome P450 superfamily of enzymes involved in drug metabolism and synthesis of cholesterol, steroids, and other lipids. This mitochondrial protein initiates the degradation of 1,25-dihydroxyvitamin D3, the physiologically active form of vitamin D3. Albertson et al. (31) reported that gene copy number and expression are increased in breast cancer, and Mimori et al. (32) showed that its overexpression is linked to a poor prognosis for esophageal cancer. At the time of writing, Parise et al. (33) reported up-regulation of CYP24A1 in non–small-cell lung cancer. The promoter region of CYP24A1 contains two vitamin D response elements and an Ets-1 binding site (34). Its gene regulation is a complicated process involving vitamin D response, Ets-1, retinoid X receptor, and various mitogen-activated protein kinases such as extracellular signal–regulated kinase (ERK)-1 and ERK5. A number of studies have reported that Ets-1 is a proto-oncogene in various types of cancer.
| Conclusion |
|---|
It is interesting to note the origin of biomarker genes. Two genes (CYP24A1 and S100P) were derived from SAGE data and others (CBLC, ALDH3A1, AKR1B10, and LOC147166) were from the EST data. This implies different ranges of coverage for the two data sets and the benefits of using both types of data. Our study also shows that candidates from meta-analysis of the public expression data should be carefully tested through validation using clinical samples.
One of the major strengths of our study is the use of multiple clinical samples. Strong statistical support was thus possible, although additional clinical samples should be used for further validation. In addition, we tested only 20 genes in this study, with several hundred candidates remaining to be examined. Biochemical studies of promising biomarkers are also necessary to examine the potential of the candidate genes as drug targets. As additional expression data become available, it would also be interesting to see whether combinations of several differentially regulated genes could function with more sensitivity and specificity in the diagnosis and prognosis of lung cancers.
| Acknowledgments |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
| Footnotes |
|---|
B. Kim and H.J. Lee contributed equally to this work.
Received 1/3/07. Revised 5/6/07. Accepted 5/25/07.
| References |
|---|