Molecular Biology, Pathobiology, and Genetics
1 Rosetta Inpharmatics LLC, (A wholly owned subsidiary of Merck & Co. Inc.) Seattle, Washington; 2 Division of Diagnostic Oncology, Radiotherapy and Molecular Carcinogenesis and Center for Biomedical Genetics, the Netherlands Cancer Institute, Amsterdam, the Netherlands; 3 Merck Research Laboratories, Merck and Co., Inc., West Point, Pennsylvania; and 4 GHC Technologies, Inc., La Jolla, California
Requests for reprints: Stephen Friend, Merck Research Laboratories, Merck and Co. Inc., PO Box 4, WP14-2500, 770 Sumneytown Pike, West Point, PA 19486. Phone: 215-652-7313; Fax: 215-993-4114; E-mail: stephen_friend{at}merck.
| Abstract |
|---|
| Introduction |
|---|
We previously established a 70-gene-based prognostic classifier (3) for breast cancers diagnosed before age 55. This classifier outperformed clinical predictors and showed good potential in selecting good outcome patients and thereby minimizing overtreatment (15). However, the group of patients that were predicted to have a poor outcome did not have uniform outcomes, with many (52%) patients not developing metastases (mean follow-up of 8 years). Moreover, the 70 prognostic genes are involved in a variety of biological processes and thus provided limited insight into biological mechanisms that affect clinical outcome. The uniform gene expression pattern for good outcome patients and heterogeneous patterns for the poor outcome patients in refs. (3, 15) suggest that the biological processes associated with good outcome are more homogeneous than those associated with poor outcome. These observations led to two topics that are the focus of the current study: (a) identifying a subset of patients at high risk of poor outcome and (b) identifying a coherent set of genes that provide biological insight into the mechanisms responsible for poor outcome.
Gene expression alone is likely to identify a subset of patients that are dominated by poor outcome only if the relevant patient groups have a distinctive gene expression pattern. When this is not the case, it may be possible to use clinical measures and existing understanding (even if incomplete) of the disease process to impose specific patient stratification to guide the machine-learning phase of gene expression analysis to develop a prognostic classifier. Such an integrated approach to find optimal prognostic classifiers is the subject of this study.
Specifically, we used the estrogen receptor (ER) level and its variation with age at diagnosis to subdivide the patients. ER status has a marked influence on the gene expression in breast cancer, affecting the expression of >10% of the genes in breast tumors (2, 3, 5, 16, 17), and is generally thought to have an important impact on survival (15, 18–20). Age is also prognostic, with breast cancer in younger patients having a poorer outcome (21). These two variables have been previously used as independent prognostic factors, and interestingly, it has recently been reported that the percentage of ER+ breast carcinomas increases with patient age (22).
The current study shows that using this combination of clinical variables, a subgroup of patients is identified in which expression of proliferation-associated genes is a very strong predictor of outcome. In contrast, proliferation index and tumor grade (a histologic assessment of the aggressiveness of cell growth) have only limited predictive power when used without preselection of patients (see, e.g., refs. 15, 23–27).
| Materials and Methods |
|---|
Data analysis
Estrogen receptor level. ER level was measured by a 60-mer oligonucleotide on our human microarray. Because every individual sample was compared with a pool of all samples, the ratio to pool was used to measure the relative expression level. We used the same threshold of −0.65 on log10(ratio) to separate the ER+ group from the ER− group as previously established in ref. (3).
Classification method. The basic algorithms for classification used here are the same as previously used in ref. (3), except for changes listed below.
Feature selection and performance evaluation. For the prognosis in each group, we started by filtering noninformative genes as described in ref. (3). The second step involved a double loop of leave-one-out cross-validation procedure, with the first loop to select the "training samples" (see section below), and the second loop to evaluate the performance. Prognostic features were selected based on the training samples by their correlation to outcome and were reselected during each step of leave-one-out cross-validation. See Supplementary Information for more details.
Identifying homogeneous patterns and dominant mechanism by iterative training sample selection. We developed a method called "iterative training sample selection", or "homogeneous pattern" in order to reveal the dominant mechanisms. In the first step, only the samples of those patients who had metastases within 5 years or who were metastases-free with more than 5 years of follow-up time were used as the training set. Based on these training samples, a complete leave-one-out cross-validation (including reselecting features) process was done. During this step, the number of features was fixed at 50 genes (the number is chosen to provide a stable classifier by our algorithm). The training samples that were not correctly classified (poor samples correlating more to the average good, or vice versa) by this leave-one-out cross-validation process were further removed from the training set in the second round of leave-one-out cross-validation (see Fig. S6 in Supplementary Information for training samples used for current study). This is the opposite of the "boost" algorithm (28). The boost algorithm increases the weight of the misclassified samples in the training for improving the accuracy of the classifier. The current algorithm focuses on the most common prediction rule (mechanism) within the data set by excluding the "unpredictable" from the training set for robust feature selection. With this method, we selected a very homogeneous group of genes which happened to all be associated with the cell cycle. Due to the homogeneous expression pattern, the classifier accuracy is relatively insensitive to the number of features included in the classifier. Even though improved classifier accuracy is not the objective of this algorithm, it resulted in an improved accuracy in this study, probably due to the identification of a robust feature set.
Error rate and odds ratio, threshold in the final leave-one-out cross-validation. Unless otherwise stated, the error rate is the average error rate from two populations: poor outcome samples misclassified as good divided by total poor samples, and good outcome samples misclassified as poor divided by total good samples. We report two odds ratios for a given threshold: the overall odds ratio and the 5-year odds ratio (the 5-year odds ratio was calculated from those samples that were metastasis-free for more than 5 years or that metastasized within 5 years). The threshold was applied to cor1 − cor2, where "cor1" is the correlation to the "average good profile" in the training set, and "cor2" is the correlation to the "average poor profile" in the training set. The threshold in the final round of leave-one-out cross-validation was determined by a method described in Supplementary Information to avoid overestimating the performance.
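The error rate defined above can be illustrated with a short sketch. The function name, the toy data, and the direction of the threshold (a score above the threshold is read as a "good" prediction) are assumptions made for illustration; the source only states that the threshold is applied to cor1 − cor2.

```python
def balanced_error_rate(scores, is_poor, threshold):
    """Average of the two class-wise error rates.

    scores    : cor1 - cor2 for each sample (correlation to the average
                good profile minus correlation to the average poor profile)
    is_poor   : True where the patient actually had a poor outcome
    threshold : samples with score > threshold are predicted "good"
                (the direction is an assumption, not stated in the source)
    """
    poor_total = sum(is_poor)
    good_total = len(is_poor) - poor_total
    if poor_total == 0 or good_total == 0:
        raise ValueError("need at least one sample of each outcome")
    # Poor-outcome samples that the classifier calls "good"
    poor_as_good = sum(1 for s, p in zip(scores, is_poor) if p and s > threshold)
    # Good-outcome samples that the classifier calls "poor"
    good_as_poor = sum(1 for s, p in zip(scores, is_poor) if not p and s <= threshold)
    return 0.5 * (poor_as_good / poor_total + good_as_poor / good_total)


# Hypothetical toy data, not taken from the study
scores = [0.4, 0.1, -0.2, -0.5, 0.3, -0.1]
is_poor = [False, False, False, True, True, True]
rate = balanced_error_rate(scores, is_poor, threshold=0.0)
```

Averaging the two class-wise rates keeps the measure from being dominated by whichever outcome group is larger, which matters when good and poor outcomes are unevenly represented.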
Correlation calculation. The correlation between each gene's expression log(ratio) and the end point data (final outcome) was calculated using the Pearson correlation coefficient. The correlation between each patient's profile and the average good profile and average poor profile is the cosine product (without mean subtraction).
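The iterative training sample selection and the two correlation measures described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0/1 outcome coding, ranking genes by absolute Pearson correlation, a zero threshold on cor1 − cor2, and the toy data are all hypothetical, and the 5-year eligibility filter on training samples is assumed to have been applied beforehand.

```python
import math


def pearson(x, y):
    """Pearson correlation (with mean subtraction), used for gene vs. outcome."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def cosine(x, y):
    """Cosine similarity (no mean subtraction), used for profile vs. average profile."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0


def select_genes(profiles, outcomes, k):
    """Indices of the k genes most correlated (in absolute value) with outcome."""
    n_genes = len(profiles[0])
    scored = [(abs(pearson([p[g] for p in profiles], outcomes)), g)
              for g in range(n_genes)]
    return [g for _, g in sorted(scored, reverse=True)[:k]]


def mean_profile(profiles, genes):
    """Average expression over the given samples, restricted to selected genes."""
    return [sum(p[g] for p in profiles) / len(profiles) for g in genes]


def loo_misclassified(profiles, outcomes, k):
    """Leave-one-out CV with gene reselection in every fold.

    Returns the indices of samples whose held-out prediction is wrong.
    Assumes every fold still contains both outcome classes.
    """
    wrong = []
    for i in range(len(profiles)):
        train = [(p, o) for j, (p, o) in enumerate(zip(profiles, outcomes)) if j != i]
        genes = select_genes([p for p, _ in train], [o for _, o in train], k)
        good = mean_profile([p for p, o in train if o == 0], genes)
        poor = mean_profile([p for p, o in train if o == 1], genes)
        x = [profiles[i][g] for g in genes]
        # cor1 - cor2 > 0 is read as a "good" prediction (assumed threshold)
        predicted = 0 if cosine(x, good) - cosine(x, poor) > 0 else 1
        if predicted != outcomes[i]:
            wrong.append(i)
    return wrong


def iterative_training_selection(profiles, outcomes, k=50):
    """Drop training samples that the first LOO round misclassifies."""
    wrong = set(loo_misclassified(profiles, outcomes, k))
    return [i for i in range(len(profiles)) if i not in wrong]


# Hypothetical log-ratio profiles (3 genes); outcome 0 = good, 1 = poor.
profiles = [
    [-1.0, 1.0, 0.1], [-0.8, 0.9, -0.1], [-1.1, 0.7, 0.1], [-0.9, 1.1, -0.1],
    [1.0, -0.9, 0.1], [0.9, -1.1, -0.1], [1.2, -0.8, 0.1],
    [-0.9, 0.8, -0.1],  # poor outcome, but expression resembles the good group
]
outcomes = [0, 0, 0, 0, 1, 1, 1, 1]
kept = iterative_training_selection(profiles, outcomes, k=2)
```

In this toy example the last sample is excluded from the training set, so the features selected in a second round would reflect only the dominant expression pattern, which is the stated purpose of the method.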
Kaplan-Meier plot. Only the patients belonging to the original 295 cohort samples were used for the Kaplan-Meier plot. Overall survival was defined by death from any cause. In the analysis of distant metastasis-free probabilities, patients whose first event was distant metastases were counted as failures; all other patients were censored at the date of their last follow-up, nonbreast cancer death, local-regional recurrence or second primary malignancy, including contralateral breast cancer. Time was measured from the date of surgery. Metastasis-free curves were drawn using the method of Kaplan and Meier and compared using the log-rank test.
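As an illustration of the product-limit estimate behind the metastasis-free curves, a minimal sketch is given below. The function name and the toy follow-up data are hypothetical; the log-rank comparison is not covered here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) estimate.

    times  : follow-up time for each patient (e.g. years from surgery)
    events : True if distant metastasis was the first event (a failure),
             False if the patient was censored at that time
    Returns a list of (time, estimated event-free probability) pairs,
    one per distinct time at which at least one failure occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Group all patients sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            failures += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if failures:
            surv *= 1.0 - failures / at_risk
            curve.append((t, surv))
        # Failures and censored patients both leave the risk set after t
        at_risk -= leaving
    return curve


# Hypothetical follow-up data, not taken from the study
times = [2, 3, 3, 5, 8, 10]
events = [True, True, False, True, False, False]
curve = kaplan_meier(times, events)
```

Patients censored at the same time as a failure are counted as still at risk at that time, which is the usual convention.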
| Results |
|---|
Overall outcome is poor in the ER/age high group. The fraction of patients who developed metastases was 43% in the ER/age high group, and 24% in the ER/age low group. The probability of observing such an asymmetry in metastasis rate by chance is 3 × 10⁻¹¹. This drastic difference in metastasis rate provides additional evidence of two subpopulations within the ER+ patients.
Cell cycle genes are strongly prognostic in ER/age high group, but less or nonprognostic in other groups. Within the ER/age high group, we identified a group of 50 prognostic reporter genes that were highly correlated with the outcome (see Materials and Methods and Table S3 in Supplementary Information). Moreover, the expression of these prognostic genes is relatively homogeneous as indicated by high similarity in expression patterns among those genes as shown in Fig. 2A. Leave-one-out cross-validation, including reporter selection, yielded an odds ratio for metastasis of 14.6 [95% confidence interval (CI) 4.7-45.4] and 5-year odds ratio (see Materials and Methods) of 24.0 (95% CI, 6.0-95.5; see Table 1 for summary information). In the group of patients predicted to have a poor outcome, 31 out of 45 (69%) developed metastases (mean follow-up time, 7.1 years). The 10-year metastasis-free probability is only 24% (for Kaplan-Meier plots, the leave-one-out cross-validation was used to predict samples into "good" and "poor" prognosis groups, Fig. 2C). In contrast, in the group predicted to have a good outcome, only 5 out of 38 patients (13%) developed metastases, and the 10-year metastasis-free probability is 85%. It is noteworthy that the overall survival rate at 10 years is only 46% for the poor prognosis group, in comparison with 96% for the good prognosis group (Fig. 2D).
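The overall odds ratio quoted above can be checked against the counts given in the text (31 of 45 patients predicted poor and 5 of 38 predicted good developed metastases). The sketch below assumes a standard Wald interval on the log odds ratio; the source does not state how the confidence interval was computed, and the function name is hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: predicted poor, metastasis      b: predicted poor, no metastasis
    c: predicted good, metastasis      d: predicted good, no metastasis
    z: normal quantile (1.96 for a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Counts from the text: 31/45 predicted-poor and 5/38 predicted-good
# patients developed metastases.
or_, lo, hi = odds_ratio_wald(a=31, b=45 - 31, c=5, d=38 - 5)
print(round(or_, 1), round(lo, 1), round(hi, 1))  # 14.6 4.7 45.4
```

Under this assumption the result agrees with the reported odds ratio of 14.6 (95% CI, 4.7-45.4).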
Overexpression of cell cycle genes is indicative of cell proliferation, which in turn is known to be associated with poor outcome. Patients whose tumors have a high proliferation rate have an increased risk (10-20%) of metastasis or death (see, for example, refs. 32, 33). This relatively small difference in outcome may be due primarily to the fact that cell proliferation has less of an impact on outcome in the ER/age low patients (Fig. 3A) and essentially no impact on outcome in the ER− patients (Fig. 3D). When the same classifier was applied to the ER/age low group (the ER+ patients not included in the ER/age high group), the overall odds ratio for metastasis is 1.59 (95% CI, 0.74-3.41) and the 5-year odds ratio is 3.51 (95% CI, 1.24-10.0). To construct a classifier, a threshold is used to separate poor outcome from good outcome predictions. Even with a threshold reoptimized for the ER/age low group, the overall odds ratio is only 2.79 (95% CI, 1.31-5.95) and the 5-year odds ratio is 5.29 (95% CI, 2.04-13.7), far less than those for the ER/age high group. This limited power is shown in the Kaplan-Meier plots in Fig. 3B and C. With the reoptimized threshold, the separation between the predicted good and poor groups, measured by the metastasis-free probability and overall survival probability, is only approximately 20% at 10 years. In the ER− group (Fig. 3D), almost all of the patients have evidence of high proliferation, yet only 43% of patients develop metastases. The error rate for predicting metastasis is approximately 50% (no predictive value), no matter what threshold is chosen for the classifier. Figure 3E and F show that almost all samples were predicted to have a poor outcome due to the high expression of proliferation genes.
Results are robust to the choice of exact division of patient groups. To gain more confidence that these are truly two distinct subgroups, it is important to examine whether the loss of prognostic power from the ER/age high group to the ER/age low group occurs at a relatively discrete boundary, or is continuous. Thus, we developed a classifier for all ER+ samples. Due to the homogeneous pattern method we used (see Materials and Methods), the prognostic genes are again almost entirely cell cycle-related (see Fig. S7 and "Prediction accuracy versus dividing line position in the ER/age plot" in the Supplementary Information). We then determined the prediction accuracy for the population of patients above the yellow line of Fig. 1C as we moved the yellow line position (in parallel to the line in the figure). As shown in Fig. 5, the error rate increased continuously as the yellow line shifted from left to right, but interestingly, became constant after it passed the position indicated in Fig. 1C (see also Fig. S8 of Supplementary Information for more details). This result suggests that (a) the strong prognostic power of the cell cycle genes in the ER/age high group is robust to the choice of exact division of patient groups, and (b) the ER/age low group is not simply a continuum of the ER/age high group because the error rate did not continue to increase as one moved through the ER/age low group.
| Discussion |
|---|
Although the present study is based on patients less than 55 years of age, our conclusions are unlikely to change when older patients are included. As shown in Fig. 1C, the lack of patients with ages greater than 50 years in the ER/age high group indicates that an additional number of older patients will not affect the performance of proliferation genes in this group. The inclusion of older age patients is unlikely to change the reduced prognostic power in other groups either, because the prognostic value of tumor grade for the entire patient population (the majority are older aged) is not as strong as that for the ER/age high subgroup we observed.
The different degrees of association between cell proliferation and poor outcome in different groups of patients confirm the concept that breast cancer pathogenesis and tumor maintenance are heterogeneous, with different subtypes likely having independent pathways of tumor progression. Previous prognostic factors for metastases are limited when applied to all patients, but can be improved when applied to the right subgroup of patients, as shown in the current paper.
It is worth noting that even though the patients in the ER/age high group are clinically heterogeneous, the incidence of distant metastases is strongly predicted by a biologically uniform set of genes, indicating that proliferation is the prime driving force for disease progression. In contrast, in other breast cancer subgroups, factors in addition to tumor cell proliferation may also be important in determining outcome.
The results revealed by the expression data in the ER/age high group have important clinical implications. In particular, the prognosis of patients in this group may be predicted solely using the combination of particular clinical and histopathologic variables. For example, one can use an immunohistochemical measurement of ER level if it is accurate enough (the immunohistochemical measure of ER correlates with the mRNA level of ER; see, for example, ref. 3). Otherwise, a PCR measure of the mRNA abundance of ER, together with the patient's age at diagnosis, can be used to select the ER/age high patients and to test whether tumor grade has significant prognostic power. If validated, this would have a significant impact on the treatment decisions for these patients.
Biologically, the fact that grouping patients based on ER expression level and age yielded good results might imply that there is an important mechanism governing the relationship between ER expression level and patient age. After seeing good performance with such stratification, we assessed the error rate using various stratifications along the ER axis or age axis independently. None of them did as well as the approach using the ER and age dependence (see Supplementary Information).
From a data mining point of view, combining gene expression with other types of information represents a promising new direction. Gene expression data obtained from clinical samples are generally difficult to interpret because they provide only a snapshot of a complicated disease state. Integrating clinical information with gene expression is crucial for the interpretation of this rich and complicated information. From a model prediction point of view, Pittman et al. (35) made good progress in improving prediction accuracy by including gene expression and clinical variables in a decision tree. In this study, instead of equally mixing clinical data with gene expression data in a machine-learning model, we used clinical variables to stratify the patients.
It is not clear why patients with high ER/age seem to be so biologically distinct. It is possible that tumors in young patients with high ER have a unique propensity to depend on the identified proliferation-associated genes. It is noteworthy that a homogeneous prognostic gene expression pattern was identified in this group, and confirmation in independent populations would support the significance of this unexpected finding.
In conclusion, by combining ER expression level and age, we identified a group of patients with relatively poor outcome. Within this group, a gene expression classifier identifies a subgroup of patients with an almost 70% chance of metastasis. Importantly, this gene expression classifier suggests that cell proliferation is the driving mechanism associated with poor outcome. These results suggest that further refinements of diagnostic predictors may more often be generated by combining different informative clinical and molecular variables. The integrative approach used in this study also shows the value of moving beyond single-variable statistical comparisons when introducing new prognostic markers.
| Acknowledgments |
|---|
We thank Drs. Peter Linsley, Doug Bassett, Vladimir Svetnik, Richard Raubertas, and I-Ming Wang for their critical and fruitful discussions. Dr. Jerald Radich also made valuable comments that helped to improve the manuscript. The cRNA samples of HCT116 provided by Dr. Carolyn Buser-Doepner and processed by Dr. Steven Bartz were used for the Supplementary Information.
| Footnotes |
|---|
Received 11/5/04. Revised 1/18/05. Accepted 2/18/05.
| References |
|---|