DNA microarrays allow quick and complete evaluation of a cell’s transcriptional activity. Expression genomics is very powerful in that it can generate expression data for a large number of genes simultaneously across multiple samples. In cancer research, an intriguing application of expression arrays includes assessing the molecular components of the neoplastic process and utilizing the data for cancer classification (Miller LD, et al. Cancer Cell 2002;2:353–61). Classification of human cancers into distinct groups based on their molecular profile rather than their histological appearance may prove to be more relevant to specific cancer diagnoses and cancer treatment regimes. Several attempts to formulate a consensus about classification and treatment of thyroid carcinoma based on standard histopathological analysis have resulted in published guidelines for diagnosis and initial disease management (Sherman SI. Lancet 2003;361:501–11). In the past few decades, no improvement has been made in the differential diagnosis of thyroid tumors by fine needle aspiration biopsy, specifically suspicious or indeterminate thyroid lesions, suggesting that a new approach to this should be explored. Therefore, in this study, we developed a gene expression approach to diagnose benign versus malignant thyroid lesions in 73 patients with thyroid tumors. We successfully built a 10 and 6 gene model able to differentiate benign versus malignant thyroid tumors. Our results support the premise that a molecular classification system for thyroid tumors is possible, and this in turn may provide a more accurate diagnostic tool for the clinician managing patients with suspicious thyroid lesions.
It is well known that cancer results from changes in gene expression patterns that are important for cellular regulatory processes such as growth, differentiation, DNA duplication, mismatch repair, and apoptosis. It is also becoming more apparent that effective treatment and diagnosis of cancer is dependent upon an understanding of these important processes. Classification of human cancers into distinct groups based on their origin and histopathological appearance has historically been the foundation for diagnosis and treatment. This classification is generally based on cellular architecture, certain unique cellular characteristics, and cell-specific antigens only. In contrast, gene expression assays have the potential to identify thousands of unique characteristics for each tumor type (1 , 2) . Elucidating a genome-wide expression pattern for disease states not only could have a enormous impact on our understanding of specific cell biology but could also provide the necessary link between molecular genetics and clinical medicine (3, 4, 5) .
Thyroid carcinoma represents 1% of all malignant diseases but 90% of all neuroendocrine malignancies. It is estimated that 5–10% of the population will develop a clinically significant thyroid nodule during their lifetime (6) . The best available test in the evaluation of a patient with a thyroid nodule is fine needle aspiration biopsy (FNA; Ref. 7 ). Of the malignant FNAs, the majority are from papillary thyroid cancers (PTCs) or its follicular variant (FVPTC). These can be easily diagnosed if they have the classic cytological features, including abundant cellularity and enlarged nuclei containing intranuclear grooves and inclusions (8) . Indeed, one-third of the time these diagnoses are clear on FNA. FNA of thyroid nodules has greatly reduced the need for thyroid surgery and has increased the percentage of malignant tumors among excised nodules (9 , 10) . In addition, the diagnosis of malignant thyroid tumors, combined with effective therapy, has led to a marked decrease in morbidity due to thyroid cancer. Unfortunately, many thyroid FNAs are not definitively benign or malignant, yielding an indeterminate or suspicious diagnosis. The prevalence of indeterminate FNAs varies but typically ranges from 10 to 25% of FNAs (11, 12, 13) . In general, thyroid FNAs are indeterminate because of overlapping or undefined morphological criteria for benign versus malignant lesions or focal nuclear atypia within otherwise benign specimens. Of note, twice as many patients are referred for surgery for a suspicious lesion (10%) than for a malignant lesion (5%), an occurrence that is not widely appreciated because the majority of FNAs are benign. Therefore, when the diagnosis is unclear on FNA, these patients are classified as having a suspicious or indeterminate lesion only. It is well known that frozen section analysis often yields no additional information.
The question then arises: should the surgeon perform a thyroid lobectomy, which is appropriate for benign lesions or a total thyroidectomy, which is appropriate for malignant lesions when the diagnosis is uncertain both preoperatively and intraoperatively? Thyroid lobectomy as the initial procedure for every patient with a suspicious FNA could result in the patient with cancer having to undergo a second operation for completion thyroidectomy. Conversely, total thyroidectomy for all patients with suspicious FNA would result in a majority of patients undergoing an unnecessary surgical procedure, requiring lifelong thyroid hormone replacement and exposure to the inherent risks of surgery (14) .
There is a compelling need to develop more accurate initial diagnostic tests for evaluating a thyroid nodule. Recent studies suggest that gene expression data from cDNA microarray analysis holds promise for improving tumor classification and for predicting response to therapy among cancer patients (15, 16, 17) . No clear consensus exists regarding which computational tool is optimal for the analysis of large gene expression profiling data sets, especially when they are used to predict outcome (18) .
Barden et al. (19) recently showed that in a group of 17 thyroid follicular tumors, 105 genes were found to be significantly different between follicular adenoma (FA) and carcinoma. These 105 genes were informative enough to diagnosis correctly 5 follicular tumors for which the final diagnosis was undisclosed (19) . In this study, we also describe the use of gene expression profiling and supervised machine learning algorithms to construct a molecular classification scheme for thyroid tumors (20) . The gene expression signatures discovered in this study include new tumor-related genes whose encoded proteins may be useful for improving the diagnosis of thyroid tumors.
MATERIALS AND METHODS
Thyroid tissues collected under Johns Hopkins University Hospital Institutional Review Board-approved protocols were snap-frozen in liquid nitrogen and stored at −80°C until use. The specimens were chosen based on their tumor type: PTC (n = 17); FVPTC (n = 15); FA (n = 16); and hyperplastic nodule (HN; n = 15). All diagnoses were made by the Surgical Pathology Department at Johns Hopkins.
Tissue Processing and Isolation of RNA.
Frozen sections of 100–300 mg of tissue were collected in test tubes containing 1 ml of Trizol. Samples were transferred to FastRNA tubes containing mini beads and homogenized in a FastPrep beater (Bio101Savant, Carlsbad, CA) for 1.5 min at speed 6. The lysate was transferred to a new tube, and total RNA was extracted according to the Trizol protocol (Molecular Research Center, Inc., Cincinnati, OH). Approximately 12 μg of total RNA were obtained from each tumor sample. The total RNA was then subjected to two rounds of amplification following the modified Eberwine method (21 , 22) . resulting in ∼42 μg of mRNA. The quality of the extracted RNA was tested by spectrophotometry and by evaluations on minichips (BioAnalyzer; Agilent Technologies, Palo Alto, CA).
Hybridization was performed on 10,000 human cDNA gene microarrays, Hs-UniGem2, produced by the National Cancer Institute/NIH (ATC, Gaithersburg, MD). Comparisons were made for each tumor with the same control which consisted of amplified RNA extracted from normal thyroid tissue and provided by Ambion, Inc. (Austin, TX). Fluorescent marker dyes (Cy5 and Cy3) were used to label the test and control samples, respectively. The respective dyes and samples were also switched to test for any labeling bias. The mixture of the two populations of RNA species was then hybridized to the same microarray and incubated for 16 h at 42°C. cDNA microarrays were then washed and scanned using the GenePix 4000B (Axon Instruments Inc., Union City, CA), and images were analyzed with GenePix software version 3.0. For each sample, a file containing the image of the array and an Excel file containing the expression ratio values for each gene was uploaded onto the MadbArray web site (National Center for Biotechnology Information/NIH) for additional analysis. 5 To accurately compare measurements from different experiments, the data were normalized, and the ratio (signal Cy5/signal Cy3) was calculated so that the median (ratio) was 1.0.
Data from the 73 thyroid tumors was used to build a benign (FA and HN) versus malignant (PTC and FVPTC) expression ratio-based model, capable of predicting the diagnosis (benign versus malignant) of each sample. After normalization, a file containing the gene expression ratio values from all 73 samples was imported into a statistical analysis software package (Partek, Inc., St. Charles, MO).
Samples were divided in two sets: one set (63 samples) was used to train the diagnosis predictor model, and a second set (8) was used as a validation set to test the model. These 10 samples were not previously used to do any other analysis.
As a first step, the data from the 63 samples were subjected to principal component analysis (PCA) to perform an exploratory analysis and to view the overall trend of the data. PCA is an exploratory technique that describes the structure of high-dimensional data by reducing its dimensionality. It is a linear transformation that converts n original variables (gene expression ratio values) into n new variables or PCs, which have three important properties: they (a) are ordered by the amount of variance explained; (b) are uncorrelated; and (c) explain all variation in the data. The new observations (each array) are represented by points in a three-dimensional space. The distance between any pair of points is related to the similarity between the two observations in high-dimensional space. Observations that are near each other are similar for a large number of variables and conversely, the ones that are far apart are different for a large number of variables.
An ANOVA test with Bonferroni correction was then used to identify genes that were statistically different between the two groups. The resulting significant genes were used to build a diagnosis-predictor model. Variable (gene) selection analysis with cross-validation was performed different times, each time testing a different number of gene combinations. For cross-validation, we used the leave-one-out method to estimate the accuracy of the output class prediction rule: the whole data set was divided into different parts, and each part was individually removed from the data set. The remaining data set was used to train a class prediction rule, and the resulting rule was applied to predict the class of the held-out sample.
ANOVA test with Bonferroni correction was used on 9100 genes to identify ones that were statistically different among the four groups. PCA of the 63 samples (Fig. 1) ⇓ using the statistically significant genes (data not shown) showed a clear organization of the samples based on diagnosis. The same analysis (ANOVA test with Bonferroni correction) was performed on the data set organized, this time, by grouping benign (HN-FA) and malignant (PTC-FVPTC). For this analysis, 47 genes were found to be significantly different between the benign and the malignant group (Table 1) ⇓ . PCA also separated the data clearly into two groups (Fig. 2) ⇓ .
For the purpose of this investigation, we focused our attention on the analysis of the data set separating benign from malignant. These 47 genes were used to build a diagnostic predictor model. Variable (gene) selection analysis with cross validation was performed with a different number of gene combinations. After cross-validation the model was 87.1% accurate in predicting benign versus malignant with an error rate of 12.9% (Table 2) ⇓ . This suggested that we were capable of using our data to create a diagnostic predictor model.
The most accurate results were obtained with a combination of 6–10 genes. This combination of genes constituted our predictor model and a validation set of 10 additional thyroid samples was used to confirm the accuracy of our model (Table 3) ⇓ . The pathological diagnosis for each sample was kept blinded to us at the time of the analysis. When the blind was broken, we found that 9 of the samples were diagnosed in concordance with the pathological diagnosis by our model. One sample that was originally diagnosed as a benign tumor by standard histological criteria was diagnosed as malignant by our model. This sample was re-reviewed by the Pathology Department at The Johns Hopkins Hospital and was subsequently found to be a neoplasm of uncertain malignant potential. The diagnosis was changed by pathology after review for clinical reasons, not because of the gene profiling. What is so extraordinary about this is that this was not suspected until the genotyping suggested that the lesion might be malignant and we examined the pathology report a second time. By that time, the report had been amended, and it suggested that the tumor had undetermined malignant potential. Regarding the other tumors, all were examined a second time before array analysis to be certain that the tissue was representative and consistent with the pathology report. Therefore, our model was correct in assigning the diagnosis in all 10 cases.
PCA using only the 6 most informative genes was conducted on all of the samples with and without the 10 unknown samples (Fig. 3, A and B) ⇓ . It is clear from the PCA organization that the 6 genes strongly distinguish benign from malignant. In addition, these same genes also give us a hint of the possible diagnosis with respect to the four subcategories of thyroid lesion. Between the two-predictor models, 11 genes are informative (Table 4) ⇓ . Each gene alone is not informative enough to classify a specific sample (Fig. 4) ⇓ . Each sample has a specific profile given by the combination of six genes.
The identification of markers that can determine a specific type of tumor, predict patient outcome, or predict the tumor response to specific therapies is currently a major focus of cancer research. With the information gleaned from the human genome project and the new high-throughput technologies available, one can begin to derive detailed gene expression profiles for various tumor types. In this study, we have capitalized upon such technology in an attempt to solve a pressing clinical problem in endocrine surgery. Our aim in this study was to use gene expression profiling to build a predictor model able to distinguish a benign thyroid tumor from a malignant one. In theory, such a model, when applied to FNA cytology, could impact greatly the clinical management of patients with suspicious thyroid lesions. To build the predictor model, we used four types of thyroid lesions, PTC, FVPTC, FA, and HN. Taken together, these represent the majority of thyroid lesions that often present as suspicious. The choice of the appropriate control for comparative array experiments is often the subject of much discussion. In this case, in order for us to construct a predictive diagnostic algorithm based on a training set of samples, it was necessary to have a common reference standard to which all individual samples are compared. In this way, differences between each and, in fact, all samples could be analyzed. Had we compared each tumor to the adjacent normal thyroid tissue from the same patient, we could only comment on gene changes within each patient. Although this type of analysis would be appropriate to test certain hypotheses, in the present series of experiments, a common reference pool was required. We chose a source of RNA from normal thyroid tissue because the source was replenishable and could be used for all of our future experiments once the diagnostic predictor algorithm was validated.
It is important to point out that the mRNA extracted from each sample was amplified. We have found that the quality of the arrays and the data derived from them is superior when mRNA has been amplified from total RNA. Our lab, as well as others, have published specific papers in this regard that have demonstrated increased sensitivity without loss of quality or a bias in the results (23) . Of note, all samples and all reference controls were amplified in the same fashion.
Analysis of the overall gene expression profiles revealed that the benign lesions (FA, HN) could be distinguished from the malignant lesions (PTC, FVPTC). Furthermore, although not statistically significant, the four tumor subtypes appeared to have different gene profiles. The use of a powerful statistical analysis program (Partek, Inc.) helped us to discover a group of 11 genes that were informative enough to create a predictor model. Two combinations were created out of these 11 genes, a combination of 6 genes, and a combination of 10 genes. PCA of the 6 most informative genes resulted in a nearly perfect distinction between the two groups (Fig. 3, A and B) ⇓ . In general, PCA describes similarities between samples and is not a commonly used tool for predicting diagnosis. However, in this study, the distinction was so powerful that we were able to visually make a correct diagnosis for each of the 10 unknown samples (Fig. 3, A and B) ⇓ .
However, the predictor model determines the kind of tumor with a specific probability value. We were able to correctly predict the diagnosis of all 10 unknown samples, with a more accurate prediction using the 6 gene combination (Table 3 ⇓ ; see Ps).
It is clear from the graph in Fig. 4 ⇓ how the combination of gene expression values gives a distinctly different profile between the benign and malignant lesions. However, within each tumor group, there are differences among the profiles of the five samples tested. This could be explained by the fact that each tumor, even if of the same type, could be at a different stage of progression.
Of the 11 genes that were informative for the diagnosis, five genes are known genes, and for the other six genes, no functional studies are yet available (Table 4) ⇓ . The genes that were identified are the ones that the model has determined best group the known samples into their correct diagnosis. Those genes identified are the ones that consistently grouped the samples into the categories and subcategories described in the text. This type of pattern assignment is based on the analysis of thousands of genes and the recognition by the computer software that certain patterns are associated repeatedly with certain diagnostic groups. This type of analysis derives it power (and significance) by the number of genes that are analyzed, rather than the degree of up or down-regulation of any particular gene. With respect to the specific genes identified, the computer is not biased by the knowledge of previously identified genes associated with thyroid cancer. The genes it identifies are those that best differentiate the varied diagnoses of the known samples. This occurs during the training phase of establishing the algorithm. Once the computer is trained with data from comparisons of RNA from known diagnoses to a standard reference, unknowns can be tested and fit to the diagnostic groups predicted during the training. For the purposes of such an approach, individual genes are less important. A specific gene, which is found in a univariate study to be associated with thyroid cancer, may not turn out to be the best multivariate predictor of a diagnosis in an analysis such as the one presented here.
The present study is a clear example of how gene expression profiling can provide highly useful diagnostic information. Recently, Barden et al. (19) carried out a similar study and obtained a group of 105 genes significantly different between malignant and benign tumors. It is almost certain that analogous studies will yield similar results for many other types of tumors. It is likely that gene expression profiling will be used in the future for clinical decision making. For this purpose, adequate reporting of DNA microarray data to clinicians will be necessary. Gene expression profiles may be more reproducible and clinically applicable than well established but highly subjective techniques such as histopathology. The small number of genes for which RNA expression levels are diagnostically and prognostically relevant raises strong hope that a robust, affordable, commercially available testing system could soon emerge. We hope that the results presented here will provide a useful method for classifying thyroid nodules as benign or malignant and therefore help facilitate appropriate, and eliminate unnecessary, operations in patients with suspicious thyroid tumors.
We thank Filomeno D. Apor, Sidney L. Craddock, Jr., Cleo Sawyer, Alan Silverio, and Dante Trusty for their dedication in the collection of thyroid specimens.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Requests for reprints: Steven K. Libutti, Senior Investigator, Surgery Branch, National Cancer Institute Building 10, Room 2B07, 10 Center Drive, Bethesda, MD 20892. E-mail:
↵5 Internet address: http://nciarray.nci.nih.gov.
- Received December 5, 2003.
- Revision received February 1, 2004.
- Accepted February 9, 2004.
- ©2004 American Association for Cancer Research.