Molecular Biology, Pathobiology, and Genetics |
1 Department of Transfusion Medicine, Warren G. Magnuson Clinical Center, 2 Biometrics Research Branch, and 3 Cancer Prevention Studies Branch, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, Maryland; 4 Department of Oncology and Surgical Sciences, Oncology Section, University of Padua, Padua, Italy; 5 Boston Strategic Patterns, Boston, Massachusetts; 6 Institute of Medical Immunology, Martin Luther University Halle-Wittenberg, Halle, Germany; 7 Department of Gynecology and Oncology, The University of Texas, M.D. Anderson Cancer Center, Houston, Texas; and 8 James Graham Brown Cancer Center, University of Louisville, Louisville, Kentucky
Requests for reprints: Ena Wang, NIH, Building 10, Room 1N224B, 10 Center Drive, Bethesda, MD 20892-1184. Phone: 301-451-8501; Fax: 301-402-1360; E-mail: ewang{at}cc.nih.gov.
| Abstract |
|---|
|
|
|---|
| Introduction |
|---|
|
|
|---|
Most studies, however, have restricted the analysis to individual tumor types (1). This approach has limited the usefulness of the identified biomarkers for two reasons: (a) the biomarkers may not be broadly used as standard molecular pathology tools and (b) genes whose expression is irrelevant to the oncogenic process may be included. This makes current biomarkers less useful for accurate pretreatment staging, monitoring of cancer recurrence after primary treatment, and long-term follow-up of cancer patients because their expression is irrelevant to local spread, metastatization, and uncontrolled growth. Indeed, several of the currently used cancer biomarkers are differentiation markers of the tissue from which specific cancers originate, such as tyrosinase in melanoma (4), prostate-specific antigen in prostate cancer (5, 6), carcinoembryonic antigen in epithelial malignancies (7), and CA-125 in ovarian cancer (8). Quantitative assessment of the expression of these markers may help in the identification of cancer cells; however, their usefulness is limited by the propensity of tumor cells to progressively lose their expression (1, 9). We recently observed that the majority of genes that transcriptionally define neoplasia depend on the ontogeny of individual cancers, whereas universal oncogenic processes affect only a minority (9, 10). Therefore, like well-defined tissue differentiation markers, the expression of most genes defining a cancer histotype is likely extinguished during the natural progression of the disease (9).
The identification of cancer biomarkers related to the oncogenic process and therefore ubiquitously expressed by most malignancies could increase the sensitivity and specificity of conventional histopathologic evaluation by targeting genes whose expression is critical for invasion, metastatization, and cell survival. Transcriptional profiling of the NCI-60 cancer cell lines and a limited number of tissue specimens showed that multiclass cancer classification may lead to the identification of biomarkers expressed by different cancer types (11). Thus, the current study was aimed at the identification of common genetic traits associated with aggressiveness, uncontrolled proliferation, and metastatic potential, which could, in turn, be exploited as ubiquitous identifiers of malignancy. Therefore, we searched for genes overexpressed by cancer tissues in 373 archival cDNA microarray samples encompassing a variety of malignant and benign samples. All samples were prepared and processed identically and cohybridized consistently with a differentially labeled reference onto a 17.5K custom-made cDNA array. Novel candidate biomarkers were identified that could define malignancy with high levels of accuracy. We also tested the predictive accuracy of a list of cancer biomarkers proposed by the literature (Supplementary Data 1) and identified 332 genes included as cDNA clones in the same 17.5K array platform.
| Materials and Methods |
|---|
|
|
|---|
|
Total RNA was extracted from frozen material using a Mini or Midi kit (Qiagen, Valencia, CA) after homogenizing the tissue in RLT buffer with freshly added 2-β-mercaptoethanol and was amplified into antisense RNA (13). Although the quantity of total RNA was sufficient in most cases for gene profiling, we have shown repeatedly that high-fidelity RNA amplification yields superior results due to the lack of contaminant rRNA and tRNA (3, 13–15). Quality and quantity of total and amplified RNA were monitored using a Bioanalyzer 2000 (Agilent Technologies, Palo Alto, CA; ref. 14). Poor-quality samples were excluded. Amplified RNA from peripheral blood mononuclear cells pooled from six normal donors served as a constant reference in all experiments (3). Test and reference RNA were labeled with Cy5 (red) and Cy3 (green), respectively, and cohybridized to a custom-made 17.5K cDNA microarray. The array was printed at the Immunogenetics Section, Department of Transfusion Medicine, Warren G. Magnuson Clinical Center, Center for Cancer Research, National Cancer Institute, NIH, in a 32 × 24 × 23 configuration and contained 17,500 elements. Clones used for printing combined the Research Genetics RG_HsKG_031901 8K clone set with 9,000 clones selected from the RG_Hs_seq_ver_070700 40K clone set. The 17,500 spots included 12,072 uniquely named genes, 875 duplicated genes, and approximately 4,000 expressed sequence tags (complete gene list and printing layout are available at http://nciarray.nci.nih.gov/gal_files/index.shtml). Array quality was first validated using an internal reference concordance system based on the expectation that results obtained through the hybridization of the same test and reference material in different experiments should perfectly collimate. The level of concordance was measured by periodically rehybridizing the same arbitrarily selected test sample (A375 melanoma cell line) with the consistent reference sample as described previously (16).
Statistical Analysis
Identification of candidate biomarkers. Archival cDNA array experiments were retrieved from the NCI's microarray database, eliminating those considered of lower quality based on image quality, background, and dye bias. The remaining 502 arrays were collated into the Biometrics Research Branch (BRB) array tool (http://linus.nci.nih.gov/BRB-ArrayTools.html) and further evaluated for quality using M/A plots [$M = \log_2(R/G)$, $A = \log_2\sqrt{RG}$; ref. 17] before and after Lowess smoother normalization. Sixty-nine arrays with skewed M/A plots were excluded from further analysis. The remaining arrays included 33 basal cell carcinomas. These were removed from the analysis because of the ambivalent behavior of these tumors, characterized by an indolent and noninvasive conduct in between that of malignant and benign lesions (18). Finally, only one of the paired bilateral normal samples collected from the same patient (12) was used for analysis, excluding an additional 27 samples. Both basal cell carcinomas and paired normal samples were, however, returned to the data set for display in the figures. In the end, a total of 373 samples were used for the analysis (Table 1). These test samples were subdivided into a training set (201 arrays; 98 from malignant and 103 from benign tissues) and a prediction/validation set (172 arrays; 85 from malignant and 87 from benign samples). Class prediction comparing benign and malignant phenotypes was applied to the resulting data set using different prediction methods [compound covariate predictor, diagonal linear discriminant analysis, k-nearest neighbors for k = 1 and 3, nearest centroid, and support vector machine (SVM)] supported by the BRB array tool. Most of the information reported in this article was derived using SVM and nearest-neighbor algorithms (Table 2) that, as observed by others (19–21), outperformed other approaches when applied to transcriptional profiling. Gene pair identification was based on the Greedy pairs approach (22), which starts by ranking all genes based on their individual t scores on the training set. The procedure selects the best-ranked gene $g_i$ and finds the other gene $g_j$ that together with $g_i$ provides the best discrimination, using as a measure the distance between the centroids of the two classes with regard to the two genes when projected onto the diagonal linear discriminant axis. The two selected genes are removed from the gene set, and the procedure is repeated on the remaining set until the specified number of genes has been selected.
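The Greedy pairs procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the BRB array tool implementation: the function names, the dictionary layout of `expr`, and the toy values are all hypothetical. It also assumes that the "distance between centroids" is scored as the absolute two-sample t score of the samples after projection onto the diagonal linear discriminant axis, which is one plausible reading of the description.

```python
from math import sqrt
from statistics import mean, variance


def split(values, labels):
    """Separate one gene's values into malignant (1) and benign (0) groups."""
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 0]
    return a, b


def pooled_variance(a, b):
    """Pooled within-class variance of two groups."""
    na, nb = len(a), len(b)
    return ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)


def t_score(a, b):
    """Pooled-variance two-sample t statistic."""
    return (mean(a) - mean(b)) / sqrt(pooled_variance(a, b) * (1 / len(a) + 1 / len(b)))


def pair_score(x, y, labels):
    """Assumed pair score: |t| of the samples projected onto the DLDA axis."""
    xa, xb = split(x, labels)
    ya, yb = split(y, labels)
    # Diagonal linear discriminant weights: mean difference over pooled variance.
    wx = (mean(xa) - mean(xb)) / pooled_variance(xa, xb)
    wy = (mean(ya) - mean(yb)) / pooled_variance(ya, yb)
    projected = [wx * u + wy * v for u, v in zip(x, y)]
    return abs(t_score(*split(projected, labels)))


def greedy_pairs(expr, labels, n_pairs):
    """Pick the best single gene, then its best partner; remove both; repeat."""
    remaining = sorted(
        expr, key=lambda g: abs(t_score(*split(expr[g], labels))), reverse=True
    )
    pairs = []
    while len(pairs) < n_pairs and len(remaining) >= 2:
        gi = remaining.pop(0)
        gj = max(remaining, key=lambda g: pair_score(expr[gi], expr[g], labels))
        remaining.remove(gj)
        pairs.append((gi, gj))
    return pairs


# Toy log ratios for four hypothetical genes in three malignant and three benign samples.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [5.0, 5.2, 4.9, 1.0, 1.1, 0.8],
    "g2": [2.0, 2.5, 1.8, 1.9, 2.4, 2.1],
    "g3": [3.0, 3.6, 2.9, 2.0, 2.2, 1.7],
    "g4": [1.0, 1.4, 0.9, 1.2, 1.0, 1.3],
}
print(greedy_pairs(expr, labels, n_pairs=2))
```

Because each selected pair is removed before the next round, a request for seven pairs yields 14 distinct genes, which matches the 14-gene set discussed in the Results.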
|
| Results |
|---|
|
|
|---|
1.5-fold change in either the positive or the negative direction from the median value of the gene were excluded. In addition, genes with >20% data missing were excluded, trimming the final working set to 13,254 genes. Scatter-plot analysis based on the average log ratio of malignant over benign lesions identified 6,264 of 13,254 genes with a fold difference of >1 (defined as genes up-regulated in cancer). Class prediction was done by applying a univariate significance threshold ($P < 1 \times 10^{-3}$ and $P < 1 \times 10^{-7}$) to select genes suitable for LOOCV. This analysis identified 1,516 and 395 (Supplementary Data 2) genes, respectively. LOOCV based on these genes could segregate malignant from benign samples with a maximum predictive accuracy of 90% and 91%, respectively, under the SVM algorithm. The same set of genes showed lower accuracy when nearest-neighbor analysis was applied (Table 2). Individual gene expression patterns were visualized by Eisen's Cluster and TreeView (data not shown), showing that most genes were not exclusively expressed by malignant or benign samples, and significant overlap occurred.
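The combination of univariate gene selection and leave-one-out cross-validation (LOOCV) can be illustrated with a short sketch. It is not the BRB array tool procedure, and it rests on three assumptions that should be read as such: genes are selected by a cutoff on the absolute t score rather than on a P value, the selection is repeated inside every fold so that the left-out sample never influences it, and a 1-nearest-neighbor rule stands in for the classifiers used in the study. All names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_score(a, b):
    """Pooled-variance two-sample t statistic (0.0 if both groups are constant)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def select_genes(samples, labels, t_cutoff):
    """Indices of genes whose |t| between the two classes exceeds t_cutoff."""
    chosen = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == 1]
        b = [s[g] for s, y in zip(samples, labels) if y == 0]
        if abs(t_score(a, b)) > t_cutoff:
            chosen.append(g)
    return chosen


def predict_1nn(train, train_labels, query, genes):
    """Label of the training sample closest to query over the selected genes."""
    def dist(u, v):
        return sqrt(sum((u[g] - v[g]) ** 2 for g in genes))

    nearest = min(range(len(train)), key=lambda i: dist(train[i], query))
    return train_labels[nearest]


def loocv_accuracy(samples, labels, t_cutoff):
    """Fraction of samples predicted correctly when each is left out in turn."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Gene selection uses only the training part of the fold.
        genes = select_genes(train, train_labels, t_cutoff)
        if genes and predict_1nn(train, train_labels, samples[i], genes) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy data: gene 0 separates the classes, genes 1 and 2 are noise.
samples = [
    [2.1, 0.3, 1.0], [1.8, -0.2, 0.7], [2.3, 0.1, 1.2], [1.9, 0.4, 0.9],
    [0.1, 0.2, 1.1], [-0.2, -0.1, 0.8], [0.3, 0.0, 1.0], [0.0, 0.3, 0.9],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(loocv_accuracy(samples, labels, t_cutoff=4.0))
```

Repeating the selection inside each fold is the conservative choice: selecting genes once on all samples and then cross-validating would let the left-out sample shape the gene list and would tend to inflate the estimated accuracy.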
To focus on the best predictors, eliminating genes sporadically coexpressed by benign and malignant lesions, we ranked the 395 genes by statistical significance (Student's t test; P2 comparing malignant versus benign lesions), most significant first, and selected the top 50 (Tables 2 and 3). Eisen's clustering showed a high selectivity in the pattern of expression of these genes, with specific overexpression in most malignant lesions (Fig. 1A). LOOCV based on either the SVM algorithm or 1-nearest neighbor showed a prediction accuracy of 88% when applied to the training set (Table 2). The list of genes was further narrowed to minimize the number of putative biomarkers by restricting the selection to the first 20 cDNA clones (cutoff $P < 1.7 \times 10^{-16}$; Fig. 1B; Tables 2 and 3), representing a total of 16 genes because some spots were duplicates of the same gene (3 MYO10, 2 PON2, and 2 UBE2C). LOOCV based on the 20 cDNA clones showed a prediction accuracy of 90% in the training set.
|
|
The predictive value of the finalist biomarkers was challenged on an independent prediction set (n = 172; Table 1). This was done in a stepwise fashion, separating the prediction set into four independent groups, each including approximately 45 arrays. The data from each prediction group were merged with the training set, basing the prediction algorithm on the latter. Either the 20 most significant genes or the 14 genes identified via gene pairing (Table 2) were used to predict simultaneously the phenotype in the four separate sets. The predictive accuracy of the 14 genes for each of the four subgroups was very consistent and, when combined, resulted in an overall 87% maximum predictive accuracy, similar to the one obtained with the training set. Interestingly, the 20-gene data set performed less accurately under the SVM algorithm, dropping to 85% accuracy from the 90% observed in the training set.
Comparison of the cDNA clones and genes obtained with the two different methods (highest stringency of significance and gene pairing) showed that most of them overlapped with the exception of AB12, CD34, and OACT2 that were present in the list of the 14 genes identified by gene pairing and not in the 20 most significant genes. Table 3 summarizes the various sets of genes and provides references linking their expression to cancer invasion, progression, and/or metastatization.
To further corroborate the accuracy of the biomarkers identified with the training set and confirmed with the stepwise prediction analysis, we applied direct class comparison to the combined data set (n = 379) using the S-PLUS program. The analysis was run at a univariate cutoff of $P < 1 \times 10^{-7}$ and a malignant-over-benign ratio cutoff of >2. This analysis identified 168 genes, of which 11 overlapped the 14 genes identified by gene pairing. The remaining 3 genes matched the significance criterion ($P < 1 \times 10^{-7}$) but were excluded because they were slightly below the geometric mean selected for the distinction between malignant and benign lesions.
The 2 × 2 tables showed that in the majority of cases the limited accuracy was due to false-negative calls by the statistical programs, which decreased overall sensitivity. This was exemplified by the seven gene pair data set, in which false negatives occurred 15% of the time, whereas false positives occurred 10% of the time in the training set (19% and 7% in the prediction set; Fig. 1D). It is of note that the majority of false-positive predictions (benign lesions predicted as malignant) occurred in samples from tissues (normal esophagus, renal epithelium, and ovary) proximal to cancer that were interpreted at pathologic examination as free of cancer cell infiltration. Yet, subliminal contamination might have gone undetected. This hypothesis could not be confirmed, however, by this study because the amount of histologic material was not sufficient for further analysis. In addition, among the lesions labeled as benign, there was a primary carcinoma in situ that was recognized as malignant by the analysis and, in retrospect, should have been placed a priori among the malignant lesions or excluded from the analysis.
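The way false negatives and false positives feed into sensitivity, specificity, and overall accuracy can be made explicit with a small helper. The function name and the counts below are invented for illustration only; they are not taken from Fig. 1D or from any table in this article.

```python
def confusion_summary(tp, fn, fp, tn):
    """Summary rates from a 2 x 2 table of predicted versus true class.

    tp: malignant predicted malignant   fn: malignant predicted benign
    fp: benign predicted malignant      tn: benign predicted benign
    """
    return {
        "sensitivity": tp / (tp + fn),  # lowered by false negatives
        "specificity": tn / (tn + fp),  # lowered by false positives
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts: 100 malignant and 100 benign samples.
print(confusion_summary(tp=85, fn=15, fp=10, tn=90))
```

With these made-up counts, a higher false-negative than false-positive rate translates into a sensitivity below the specificity, which is the pattern the paragraph above describes.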
The set of 14 genes was further validated using ROC curve analysis (23). This method portrays the proportion of true positives identified for any particular proportion of false positives and vice versa, providing a more precise measure of diagnostic accuracy because it is uninfluenced by decision biases and prior probabilities and places the performances of diverse systems on a common scale. Indeed, when ROC curves were calculated for the 14 biomarkers (Fig. 1B), they yielded a 93.6% accuracy, underscoring the superiority of this method in defining decision criteria (Fig. 1E). This level of accuracy was almost identical to that of the 50 most significant biomarkers (94.3%).
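A ROC curve of the kind described above can be traced by sweeping a decision threshold over a classifier score and recording the true-positive and false-positive proportions at each step. The sketch below is illustrative only: the scores and labels are invented, the function names are hypothetical, and it assumes that the accuracy figure quoted for the ROC analysis corresponds to the area under the curve, which the text does not state explicitly.

```python
def roc_points(scores, labels):
    """(false-positive rate, true-positive rate) at every distinct threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy scores: higher means "more likely malignant"; 1 = malignant, 0 = benign.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))  # 8/9, about 0.889
```

The area equals the fraction of malignant-benign sample pairs in which the malignant sample receives the higher score (8 of 9 pairs in the toy data), which is why it does not depend on any single decision threshold or on the proportion of malignant samples.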
An extensive review of the literature and/or commercially promoted cancer biomarkers identified 332 genes present in our array platform (Supplementary Data 1). The predictive value of these genes was tested on the prediction set setting a univariate threshold P < 0.0001. LOOCV identified 56 genes among the 332 with significance under the set threshold. The genes were then clustered using Eisen's cluster and visualized using Treeview (Fig. 2A ). This analysis included genes that were either up-regulated or down-regulated in malignant compared with benign lesions. Class prediction based on these genes showed a maximum 88% accuracy in correctly segregating benign and malignant samples. Gene pairing analysis done on the 332 genes identified seven gene pairs with a maximum predictive accuracy of 85% (Fig. 2B). Among them, only three genes (CYC1, CD34, and ERBB3) had also been identified by the previous analyses (Table 2). Thus, the present study identified novel ubiquitous cancer biomarkers with a prediction performance at least as good as that of the best-known cancer biomarkers. PCA showed that the best degree of separation between benign and malignant lesions could be obtained with the seven gene pairs derived analyzing the 6,264 genes overexpressed in cancer (first component score = 67.4% and second component score = 50.3%; Fig. 2C). PCA based on the seven gene pairs identified by analyzing the 332 known biomarkers also showed good visual separation of benign from malignant lesions (Fig. 2D), but the calculated discrimination was not as strong as with the first component score (56.2%) and the second component score (33.9%). The combined utilization of the 14 gene pairs did not significantly increase the discriminatory power with the first component score (56.1%) and second component score (40.1%; Fig. 2E).
|
| Discussion |
|---|
|
|
|---|
We compared previously the gene expression profile of normal renal epithelium with that of renal cell carcinoma tissue and cancers of other histology (9). In this three-way comparison, we recognized that a small proportion of genes were specifically overexpressed by cancers independently of the lineage derivation. We therefore extended the analysis to a larger array of tissues, including normal peripheral blood mononuclear cells as a marker of systemic infiltration of normal cells; normal hyperplastic lymph nodes draining primary colon cancer areas, which closely relate to the clinical staging of primary disease (see footnote 9); cancer-free peritoneum from patients with ovarian malignancy, which we have shown previously to harbor cancer-related signatures of inflammation that, however, are not related to the oncogenic process (12); and paired normal and cancerous epithelia (renal epithelial cells, esophageal mucosa, and normal ovary) adjacent to primary tumors judged on extensive pathologic examination to be free of cancer cells (9, 10, 16). All these tissues have been treated identically, and individual gene expression was internally controlled by a consistent reference source. We have analyzed previously the robustness, reproducibility, and concordance of this strategy comparing cDNA-based results with those obtainable with other molecular testing techniques (16).
This analysis was focused specifically on genes overexpressed by cancer tissues because these may be most useful when normal tissues are scrutinized for the presence of few, difficult to detect cancer cells. Different statistical approaches achieved rather consistent results. Of the 50 cDNA clones, representative of 45 genes, most significantly up-regulated in cancer using the class prediction BRB array tool, 27 were included among the 168 genes identified by direct class comparison using the S-PLUS program, whereas the remaining genes were excluded because they narrowly missed the empirically set statistical thresholds. As class comparison included a variable descriptive of relative expression levels (a log2 ratio cutoff of 2 between malignant and benign tissues), the simultaneous identification of a large proportion of genes by both analyses supports not only the significance of the selection but also a substantial level of overexpression in cancer tissue.
Basal cell carcinomas display a biological behavior in between that of normal and malignant tissues, with minimal local invasiveness and almost no metastatic potential (18). For this reason, these lesions were kept out of the analysis but were reintroduced in the figures to provide an intermediate biological reference (Figs. 1A-C and 2A and B). Visual inspection suggested that the expression pattern of most biomarkers by basal cell cancers (yellow bar) was closer to that of benign than malignant tissues, suggesting that most of the genes identified by this study are associated with aggressive behavior and metastatic potential, a conjecture also supported by the literature (refs. 11, 33–53; Table 3).
Because the purpose of the analysis was to identify a minimal number of biomarkers with the highest predictive value, we focused our interest on the 14 cDNA clones identified by gene pairing. These candidate biomarkers were validated further on the completely independent prediction set with a consistent predictive accuracy of 87%. This level of accuracy is better than the accuracy of previously reported multiclass tumor classification biomarkers identified through the analysis of cell lines (54) and challenged against a limited number of tissue samples (11, 45, 55).
The uniqueness of the current study resides in the consistency of the platform used, the constant reference, the strict standardization of sample processing, and the stringent quality criteria applied when selecting samples for the analysis (16). On the other hand, the usefulness of the proposed biomarkers still depends on further validation. First, the analysis compared proportional gene expression between benign and malignant tissues rather than absolute copy numbers. Thus, it is not known whether some of the genes are uniquely expressed by tumor tissues or are also expressed in benign conditions, although at a lower level. Second, several important tissues, particularly those involving chronic or acute inflammation, were not available to us. Although we attempted to include as many relevant normal tissues as possible, further work is needed to validate the relevance of these markers in other pathophysiologic conditions. Finally, this study was done only at the transcriptional level. Thus, the proposed genes may serve, for now, as useful molecular tools to complement histopathologic examination.
| Acknowledgments |
|---|
| Footnotes |
|---|
9 K. Zavaglia et al., in preparation.
Received 9/23/05. Revised 11/30/05. Accepted 1/19/06.
| References |
|---|
|
|
|---|
B and JNK. J Immunol 2005;175:1197–205.