To investigate the phenotype associated with estrogen receptor α (ER) expression in breast carcinoma, gene expression profiles of 58 node-negative breast carcinomas discordant for ER status were determined using DNA microarray technology. Using artificial neural networks as well as standard hierarchical clustering techniques, the tumors could be classified according to ER status, and a list of genes which discriminate tumors according to ER status was generated. The artificial neural networks could accurately predict ER status even when excluding top discriminator genes, including ER itself. By reference to the serial analysis of gene expression database, we found that only a small proportion of the 100 most important ER discriminator genes were also regulated by estradiol in MCF-7 cells. The results provide evidence that ER+ and ER− tumors display remarkably different gene-expression phenotypes not solely explained by differences in estrogen responsiveness.
Estrogens are important regulators of growth and differentiation in the normal mammary gland and are also important in the development and progression of breast carcinoma. Estrogens regulate gene expression via ER, 3 however the details of the estrogen effect on downstream gene targets, the role of cofactors, and cross-talk between other signaling pathways are far from fully understood. As approximately two-thirds of all breast cancers are ER+ at the time of diagnosis, the expression of the receptor has important implications for their biology and therapy (1) . Opinions differ as to whether those breast cancers which lack ER expression at diagnosis arise from an ER− compartment within the mammary epithelium or represent evolution from an ER+ to an ER− state (2) .
The cDNA microarray technology allows for parallel analysis of the expression of thousands of genes (3) to address complex questions in tumor biology. Statistical tools are required to analyze the large amount of expression data generated by this methodology. ANNs are computer-based algorithms for pattern recognition that are capable of learning from experience (4) . The diagnosis of myocardial infarcts (5) and heart arrhythmias from electrocardiograms (6) are examples of applications of ANNs in medicine. We have recently demonstrated the utility of ANNs for the diagnostic classification of tumors using cDNA microarray data (7) . In this study, we have applied ANNs as well as conventional methods to analyze cDNA microarray data from a selected group of node-negative breast cancers that differ with respect to their ER status. Here we report that ER+ and ER− tumors display remarkably different phenotypes, which may be attributable to their evolution from distinct cell lineages.
Materials and Methods
Tissues and Cells.
Fifty-eight grossly dissected primary tumors from node-negative breast cancer patients, tumor size 20–50 mm, were collected at the University Hospital, Lund, Sweden. Microscopic examination of touch preparations verified the presence of cancer cells in all samples. To train the classifier described below, 47 tumors, all from two previous randomized studies (Ref. 8 ) 4 were selected so that roughly half, 23, were ER+ (range, 50–1900 fmol/mg protein; median, 160), whereas the remaining 24 were ER− (range, 0–9 fmol/mg protein, median 0.7). In addition, 14 of the patients were premenopausal (5 ER+ and 9 ER−) and 33 were postmenopausal (18 ER+ and 15 ER−). To obtain an independent test set, the remaining 11 of the 58 tumors were selected from an ongoing clinical trial and used here as a blinded test set. Of the 11 blinded samples, 5 were ER+ (range, 40–120 fmol/mg protein; median, 60), 6 were ER− (range, 0–3 fmol/mg protein; median, 1.5), and all were premenopausal. ER protein determinations were performed using standard methods in the routine clinical laboratory (9) . BT-474 cells, obtained from American Type Culture Collection, were maintained in RPMI 1640 supplemented by 10% fetal bovine serum, penicillin, and streptomycin. Cells were harvested at 60–80% confluency and used as a reference in all hybridizations.
RNA Isolation and cDNA Microarrays.
Total RNA was isolated from cell lines using the RNeasy kit (Qiagen, Valencia, CA) with subsequent Trizol (Life Technologies, Inc., Rockville, MD) purification. Total RNA from tumors was isolated using two successive rounds of Trizol. Microarrays were prepared and hybridized as described previously (3 , 10 , 11) and according to standard protocols. 5 Briefly, the arrays were spotted with 6,728 sequence-verified cDNA clones, of which ∼4000 were named human genes and the remaining clones were expressed sequence tags. BT-474 RNA (200 μg) and 65–100 μg of tumor RNA were used to produce labeled cDNA by anchored oligo(dT)-primed reverse transcription using SuperScript II reverse transcriptase (Life Technologies, Inc.) in the presence of either Cy5-dUTP or Cy3-dUTP (Amersham Pharmacia, Piscataway, NJ), respectively. Fluorescence scanning and image analysis with DeArray software were performed as described previously (12 , 13) .
For each gene, the fluorescent intensity of the most intense channel [red (Cy3) or green (Cy5)] for each sample, was averaged over all samples. All genes for which this average exceeded 2,000 fluorescence units (scale 0–65,535 units) were included in the analysis. In addition, we required, for all samples, that the red and green intensities both exceeded 20 fluorescence units and that the union (of the two channels) spot area exceeded 30 pixels. For the 58 (47 + 11) measured samples, these requirements left us with 3,389 of the original 6,728 genes. We used multilayered perceptrons, a class of ANNs, which are powerful and versatile regression models (4) to predict the ER status of the tumors from their gene expression patterns and to determine the genes which were most important for this classification (Fig. 1A) ⇓ . To allow for a supervised regression model with no “over-training” (because we have a large number of genes as compared with the number of samples), the dimensionality (3,389) of the samples was reduced by the PCA (14) . Thus, each sample was represented by 58 numbers, which resulted from a projection of the gene expressions using PCA eigenvectors. The samples were classified in two categories using a 3-fold cross-validation procedure, as follows. The 47 disclosed samples were randomly shuffled and split into three roughly equally sized groups. An ANN model was then calibrated with 8 or 10 PCA components as input variables using two of the groups (training), with the third group reserved for testing predictions (validation). This procedure was repeated three times, each time with a different group used for validation. The random shuffling was redone 200 times, and for each shuffling we analyzed three ANN models. Thus, in total, each disclosed sample belonged to a validation set 200 times, and 600 ANN models were calibrated. We selected the PCA components used as inputs based on the training set. For the ER− and ER+ classification, each ANN model gave an output between 0 (ER−) and 1 (ER+). For each validation sample, the 200 outputs were used as a committee: the average of all of the outputs (a committee vote) was calculated, and a validation sample was classified as ER− or ER+, depending on whether its committee vote was closer to 0 or 1 (the decision threshold was 0.5). All 600 models were used to classify the additional blinded samples. Different choices of the decision threshold correspond to different balances between the sensitivity and the specificity of the classification. All possible thresholds give rise to a so-called ROC curve in the (sensitivity, 1 − specificity)-plane. The area under this curve (ROC area) is a convenient measure of the classification performance. The sensitivity of the classification to individual genes was determined by the absolute value of the partial derivative of the output with respect to the gene expressions, averaged over samples and ANN models. A large sensitivity for a gene implies that changing its expression influences the output significantly. In this way, the genes can be ranked. For comparison with the ANN method, we also analyzed the data and visualized the differences between tumors based on ER status using MDS (10) , hierarchical clustering (15) , and weighted gene list (16) techniques. To test whether genes which discriminate ER+ from ER− tumors demonstrated a response to E2 in MCF-7 cells (17) at either 3 h or 24 h after E2 treatment, we searched the SAGE database 6 using the xProfiler tool with default settings.
Calibration and Validation of the ANN Models.
To calibrate ANN models to classify the tumor samples, we used the gene expression data from cDNA microarrays containing 6728 genes (Fig. 1A) ⇓ . Filtering for a minimal level of expression and spot area reduced the number of genes to 3389. PCA further reduced the dimensionality, and we found that using 10 PCA components/sample as inputs, with one output and five hidden nodes, produced well-calibrated ANN models. The 3-fold cross-validation procedure (see “Materials and Methods”) produced a total of 600 ANN models, and the training and validation was successful. In addition, inspection of the calibration curves showed that there was no sign of overtraining of the models (data not shown).
Optimization of Genes Used for Classification Using ANN Models.
We next determined the contribution of each gene to the classification by the ANN models. This was done by measuring the sensitivity of the classification to a change in the expression level of each gene, using the 600 previously calibrated models (see “Materials and Methods”). In this way, we ranked the genes according to their significance for the classification. The 100 most important genes were then extracted and formed the input for another and final calibration. When using only 100 genes, we found that using eight PCA component inputs and four hidden nodes were sufficient. In this way, all 47 samples were correctly classified in the validation phase. The output of the models generated a number between 0 (ER−) and 1 (ER+), reflecting the crispness of the classification. A plot of the output values from the committee is shown for all 47 training samples (Fig. 1B) ⇓ . The majority of the samples, in both groups, obtain output values close to either 0 or 1, with small variations between the output results from the different models. Thus, the committee members agree in general on the classification of each sample, and the result is a clear separation between ER+ and ER− tumors. The top 50 genes extracted from the ANN models, which significantly contribute to the classification, represent a wide spectrum of cellular functions (Table 1) ⇓ . The ER gene, which, as expected, appears at the top of the ranked genes, is closely followed by GATA-binding protein 3, a transcription factor previously associated with ER+ tumors (18) .
Prediction of ER Status of Blinded Breast Tumor Samples.
To test whether the ANN classifications for ER status were generally applicable, the calibrated ANN models were also used to predict the outcome of 11 (5 ER+ and 6 ER−) blinded test samples from an independent data set of early stage primary breast cancers. These samples were also predicted with 100% accuracy. A plot of the committee output values, shown in Fig. 1B ⇓ , displays a clear separation between the two categories.
Prediction of ER Status when Excluding ER and Other Top Discriminators.
When classifying the samples using the ER gene or the GATA3 gene alone (without PCA), good classifications were obtained, indicating that, as expected, these genes carry sufficient information for successful classifications. An interesting issue is to what extent ER+ and ER− tumors can be separated when not explicitly including the expression values for ER itself. Repeating the ANN cross-validation and gene extraction procedure above, but excluding ER, only 1 ER+ sample, 6582, of the 47 samples was incorrectly classified. Interestingly, when using this calibration while masking the data for ER, again all 11 blinded test samples were correctly classified. Thus, a successful prediction does not occur for the trivial reason that ER mRNA expression is related to ER protein levels. These results led us to examine how far down on the discriminator list we could find genes carrying enough information for an accurate prediction. To test this we performed a series of classifications using different sets of 100 genes, starting from the top of the discriminator list by excluding the top 50 genes and following this by the stepwise exclusion of 50 additional genes for every classification (i.e., excluding the top 50, 100, 150 … to 300 genes, respectively). The number of correctly classified samples and the ROC area for the predictions of both the 47 tumors in the validation set as well as for the 11 blinded test tumors were extracted (Table 2) ⇓ . Although the success of the predictions declined when using genes lower down on the discriminator list, the network performance was still fairly good. This was demonstrated as the 100 genes in positions 301 to 400 on the discriminator list achieved ROC areas of 93.7% and 96.7% for the validation set and the test set, respectively. However, the committee votes for these samples are now closer to the threshold value 0.5 and also display an increased variance (Fig. 1C) ⇓ , indicating that the classification is less stable and conclusive than when using the top 100 genes. Still, the results clearly demonstrate that the classification is not only controlled by a few very strong discriminator genes, but results from a far more complex expression pattern involving a substantial number of genes.
MDS and Hierarchical Clustering.
We used two standard clustering techniques to illustrate further the differences, found by the ANN models, in gene expression profiles between the two tumor categories. An MDS plot was created displaying the position of each tumor sample in a three-dimensional Euclidean space, with the distance between the samples reflecting their approximate degree of correlation (Fig. 2A) ⇓ . This MDS clustering was based on a WGA (16) that generated a set of 113 genes that showed significantly differentiated expression levels between the two tumor categories. There was an ∼50% overlap between the 100 most important discriminatory genes derived from WGA analysis and the ANN models, indicating that a substantial number of important discriminatory genes are revealed independent of the choice of analytical method. As can be seen from the MDS plot, the two categories (ER+ and ER−) are well separated, with the exception of one ER+ tumor, 6582, which clusters with the ER− tumors. This separation, consistent with the ANN analysis above, was confirmed additionally by hierarchical clustering of the 47 tumors and the 113 genes from the WGA (Fig. 2B) ⇓ , which organized the 47 tumor samples along the horizontal axis and created a dendrogram based on their similarities in gene expression profiles. The clustering organized all of the tumors into two separate dendrogram branches corresponding to ER+ and ER− tumors with the exception of two ER+ tumors, 6582 and 5955.
E2-Responsive Genes in the MCF-7 Cell Line.
To examine the relationship between ER function and the genes discriminating ER+ and ER− tumors, we compared our results with SAGE gene expression data reported for E2 stimulated MCF-7 cells 7 (17) . By reference to this data, 61 of the top 100 ER-discriminating genes uncovered by ANN analysis were represented by SAGE tags. Of these, only four genes (CCND1, STC2, SLC7A5, and KRT18) were regulated by E2 in MCF-7 cells, and one of these (KRT18, ranked 59 in the ANN sensitivity list) was regulated in a direction discordant with its relative expression in the tumor specimens (Table 1) ⇓ . This result suggests that the difference in gene expression profile between ER− and ER+ tumors can only in part be explained by the activity of a functional ER pathway in ER+ tumors.
To characterize in more detail the phenotypic characteristics of ER+ and ER− breast cancers, the expression of 6728 genes was investigated in primary tumor tissues from 58 breast cancer patients. Gene expression profiles of breast tumors have been investigated previously (19, 20, 21) but not in a tumor set suited to specifically study the ER+ and ER− classification problem. Here we have identified a homogenous group of node-negative breast cancers, 20–50 mm in diameter, of which about one-half (28) were ER+, whereas the remaining 30 were ER−. We then classified the samples using ANN models. Compared with the majority of other methods used, our approach has the advantage that it takes nonlinear dependencies in the data into account. In addition, because we were interested in classifying two well-known cancer types, rather than discovering new classes, a supervised approach was optimal. The ANN models were successfully trained to recognize gene expression patterns generated by cDNA microarray analysis, inasmuch as they accurately classified both the 47 training samples and the 11 blinded test samples (Fig. 1B) ⇓ using only 100 of the genes most important for the classification. Of note, the microarray analyses of the blinded test samples were performed separately using a different scanner and batch of microarray slides from those used for training, indicating the robustness of the methods involved, with regard to both measurements and analysis. Standard clustering algorithms such as MDS and hierarchical clustering showed similar results (Fig. 2) ⇓ , strengthening the conclusion based on ANN models that ER+ and ER− tumors exhibit distinct patterns of gene expression. We achieved our best classification results using the ANN models, demonstrating the potential to obtain improved results with nonlinear classification methods. This is also important for the extraction of relevant genes, because a nonlinear method may extract important genes that cannot be found by linear methods. However, this relatively small data set does not allow for a rigorous comparison of methods.
In addition to the accurate classification of disclosed as well as blinded samples based on the top 100 discriminatory genes, we found that a fairly good classification could be accomplished using lower-ranked genes. Although the reliability of the classification declined when using genes farther down on the list, the results indicate that information that contributes to the classification is carried by genes deep on this list. This is consistent with the gene expression profiles of ER+ and ER− tumors differing in a complex way, indicating the existence of two phenotypically very distinct groups of tumors. The ER status of breast tumors has been suggested to either reflect tumor progression with ER− tumors evolving from ER+ precursors, or to indicate a distinct origin from different types of epithelial cells in the mammary gland. Metastases from ER+ tumors may be ER− (22) supporting the former view. On the other hand, ER+ tumors have been suggested to exhibit the phenotype of luminal epithelial cells, whereas ER− tumors resemble myoepithelial (basal) cells (19) . Recently, it has been proposed that myoepithelial cells derive from self-renewing luminal cell precursors, an observation which might explain the predominant luminal phenotype of breast cancers (23) . Several of the ER status-discriminator genes are relevant to mammary gland histology. For example, we found that P-cadherin, characteristic of myoepithelial cells (24) , was more highly expressed in ER− tumors. The correlation between the expression of P-cadherin and ER-negativity in tumors has been observed previously (25) . The transcription factor C/EBP β, which has been suggested to control the cell-fate decision in the mammary gland (26) , is more highly expressed in ER− tumors. Of interest, C/EPB β-null mice have a defect in lobuloalveolar development and an abnormally high proportion of cells expressing the progesterone receptor (27) . We also identified lipocalin 2 as a gene associated with ER− tumors, consistent with a previous report (28) . Another gene expressed more highly in ER− tumors, ladinin, though not previously studied in breast cancer, is a basement-membrane protein that may well be associated with the basal/myoepithelial compartment (29) . Perou et al. (19) emphasized varying patterns of cytokeratin expression in breast cancer, and our results are consistent with that report, to the extent that the arrays used in these studies overlapped.
Several genes previously associated with ER positivity or a ductal/luminal localization were also identified as more highly expressed in this group of tumors. Among these were not only GATA3, but also TFF3, belonging to the same family of trefoil factors as pS2, a gene whose expression is regulated by ER. Although TFF3 was not present in the SAGE data, its induction by estrogen has been reported previously in MCF-7 cells (30) . Cyclin D1, a gene that is strongly associated with ER expression in breast cancer in this and other studies (31) , is strongly induced by E2 in MCF7. Carbonic anhydrase XII has very recently been localized to the ductal epithelium where it may promote tumor invasion by modifying the extracellular pH (32) . It is striking, though, that only a few genes on our discriminator list are E2-responsive in cell culture. This observation is consistent with the unique patterns of gene expression being largely explained on the basis of cell lineage, with a component of the ER+ pattern resulting from the function of an ER signaling pathway. In addition, the in vitro response of a single cell line to E2 may not faithfully reproduce the physiological effects of ER signaling in vivo, and the role of genes regulated by the progesterone receptor remains to be explored.
In conclusion, we have found that ER+ and ER− tumors display very different gene expression phenotypes. From examining expression patterns alone, we cannot establish whether the ER+ and ER− phenotypes reflect tumorigenesis from populations which diverged during normal differentiation or represent a phenotypic interconversion during tumorigenesis. Notably, only a small proportion of cells in the normal mammary epithelium express ER (33) , in sharp contrast to the high proportion of ER+ tumors. The underlying biology of the mammary epithelium is complex and the distinct cellular compartments, which give rise to cancers, are not fully defined. The mechanisms, which regulate these distinct gene expression programs, remain to be investigated, and are of importance for future breast cancer research.
We are indebted to participating departments of the South Sweden Breast Cancer Group for providing us with breast cancer samples and to Mattias Ohlsson for valuable discussions regarding the properties of ANNs.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
↵1 Supported in part by the Swedish Research Council and the Knut and Alice Wallenberg Foundation through the SWEGENE consortium (to M. R.) and the Swedish Foundation for Strategic Research (to C. P.). This work was partly supported by grants from the Lund University Medical Faculty, the Swedish Cancer Society, Berta Kamprad’s Foundation, the Gunnar Arvid and Elisabeth Nilsson Foundation, the Hospital of Lund Foundations, the E and F Bergqvist Foundation, and King Gustav V ’s Jubilee Foundation.
↵2 To whom requests for reprints should be addressed, at National Human Genome Research Institute, NIH, 49 Convent Drive, Bethesda, MD 20892-4470. Phone: (301) 594-5283; Fax: (301) 402-3281; E-mail:
↵3 The abbreviations used are: ER, estrogen receptor α; ANN, artificial neural network; E2, estradiol; PCA, principal component analysis; ROC, receiver operating characteristic; MDS, multidimensional scaling; WGA, weighted gene analysis; SAGE, serial analysis of gene expression; GATA3, GATA-binding protein; 3 TFF3, trefoil factor 3.
↵4 Å. Borg, M. Fernö, unpublished results.
↵5 Internet address: http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/protocol.html.
↵6 Internet address: http://www.ncbi.nlm.nih.gov/SAGE/sagexpsetup.cgi.
↵7 Internet address: http://sciencepark.mdanderson.org/ggeg.
- Received April 26, 2001.
- Accepted June 25, 2001.
- ©2001 American Association for Cancer Research.