Abstract
Comparative genomic hybridization (CGH), microsatellite instability (MSI) assays, and expression microarrays were used to molecularly subclassify a common set of gastric tumor samples. We identified a number of novel genomic aberrations associated with gastric cancer and discovered that gastric tumors could be grouped by their expression profiles into three broad classes: “tumorigenic,” “reactive,” and “gastric-like.” Patients with gastric-like tumors exhibited a significantly better overall survival than patients belonging to the other two classes (P < 0.05). A novel supervised learning methodology for multiclass prediction was used to identify optimal predictor gene sets that accurately predicted the class of an unknown tumor sample. These predictor sets may prove useful in the development of new diagnostic applications for gastric cancer staging and prognostication.
INTRODUCTION
Gastric adenocarcinoma is a leading cause of cancer mortality worldwide, surpassed only by lung and breast cancer (1) . A major difficulty in the diagnosis and treatment of gastric cancer is that very few of the currently used classification schemes are strong predictors of clinical behavior. Traditional classifications of gastric cancer on the basis of mucin content, histological architecture, and cellular differentiation are highly subject to interobserver variation and are, thus, neither robust nor clinically meaningful (2) . To date, only tumor staging is a proven prognosticator of gastric cancer and, therapeutically, only surgery has been shown to convey a survival benefit (3) .
Recently, it has been shown that the resolving power of classification schemes based on molecular data can be sufficiently sensitive to detect new disease subtypes that have hitherto eluded traditional light microscopy approaches (4) . In this study, we used various molecular assays such as CGH, 5 MSI studies, and expression microarrays to characterize a common set of gastric tumors. We identified several novel genomic aberrations associated with gastric cancer and discovered that gastric cancers could be divided into three broad molecular subgroups (“tumorigenic,” “reactive,” and “gastric-like”) on the basis of their expression profiles. Patients belonging to one of these subgroups (gastric-like) exhibited a significantly better overall survival than patients belonging to the other groups. Using a recently described novel methodology for multiclass prediction, we defined various optimal predictor gene sets capable of accurately predicting the class of an unknown tumor sample. Our results show that molecular data can provide a useful framework for furthering our understanding of the taxonomy and pathology of gastric cancer.
MATERIALS AND METHODS
Tissue Samples and Histological Review.
Gastric tissue specimens, peripheral blood samples, and clinical records were provided by the National Cancer Center Tissue Repository after approval by the Center’s Ethics Committee. Three surgical samples of normal gastric tissue were also obtained from patients with benign gastric disease. Paraffin sections of the 60 gastric cancer cases in this study were independently reviewed and classified by a single pathologist (S. Y. T.) using established criteria (5) .
CGH.
CGH was performed as described elsewhere (6) . Of the 13 tumors with low CNA, all 13 (100%) had >50% tumor content. Of the 16 tumors with no CNA, 8 (50%) had >60% tumor content, and the remaining 50% had <50% tumor content.
Microsatellite Analysis of Tumors.
Multiplex PCR was performed at five markers (Bat25, Bat26, D5S346, D2S123, D17S250) on tumor DNA and case-matched normal genomic DNA from peripheral blood or histologically verified normal gastric tissues of 59 patients. Microsatellite stability (MSS), MSI-H, and MSI-L were scored by consensus criteria (7) .
Generation of Expression Profiles.
cDNA microarrays of approximately 13K and 18K array targets were produced using established procedures (8) with cDNA clones from commercial vendors (Incyte and Research Genetics). Identities of array targets were confirmed by resequencing the parent clones. Expression profiles were generated from tumor specimens containing >50% cancer cells as assessed by cryosections. Total RNA was extracted from homogenized gastric tissue, and 5 μg was amplified using a single-round T7-polymerase-based linear amplification protocol (9). Each microarray hybridization used 1–2 μg aRNA (amplified RNA) and was compared with a common reference RNA pool (Universal Reference RNA; Stratagene).
Microarray Data Analysis.
Microarray data sets are downloadable. 6 An initial data set was created from array targets that were well measured across 90% of all of the arrays and normalized by median centering each sample (array) and array target (gene). A truncated data set (746 array targets) was then formed by selecting array targets exhibiting an SD of >0.7 across all of the samples. Minor variations on the gene selection filter (e.g., using an SD of 0.6–0.8) did not significantly affect results of the clustering analysis (data not shown). 7 “Semisupervised” clustering was performed using seven distinct expression clusters whose boundaries were visually determined using Treeview software. Supervised clustering was performed using the following algorithms: OVA SVM (10), nearest neighbor correlation analysis (NNCA), and GA/MLHD (Ref. 11; see Supplementary Information 2 for details). Accuracy of the supervised classification methodologies was assessed using LVO CV.
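The filtering and normalization steps described above can be sketched as follows. This is a minimal illustration only, not the pipeline actually used: the function name, the use of `None` for poorly measured spots, the order of column and row centering, and the use of the sample SD are all assumptions made for the example.

```python
from statistics import median, stdev

def filter_and_normalize(matrix, min_present=0.90, min_sd=0.7):
    """Hypothetical helper. `matrix` is a list of rows (array targets),
    each a list of log ratios per sample; None marks a poorly measured spot."""
    n_samples = len(matrix[0])
    # Keep array targets that are well measured on at least 90% of arrays.
    kept = [row for row in matrix
            if sum(v is not None for v in row) >= min_present * n_samples]
    # Median-center each sample (column). Assumed to precede row centering.
    col_medians = [median(row[j] for row in kept if row[j] is not None)
                   for j in range(n_samples)]
    centered = [[None if v is None else v - col_medians[j]
                 for j, v in enumerate(row)] for row in kept]
    # Median-center each array target (row).
    normalized = []
    for row in centered:
        m = median(v for v in row if v is not None)
        normalized.append([None if v is None else v - m for v in row])
    # Retain only array targets whose SD across samples exceeds the cutoff.
    return [row for row in normalized
            if stdev([v for v in row if v is not None]) > min_sd]
```

Changing `min_sd` to 0.6 or 0.8 corresponds to the minor filter variations mentioned in the text.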
Survival Curves.
Kaplan-Meier survival curves were generated using SPSS software. To maximize sample size, clinical data were used from patients whose biological samples are shown in Fig. 3, as well as from four additional patients whose samples could be reliably assigned to a specific tumor class (one tumorigenic, one reactive, and two gastric-like) using two independent classification methodologies [OVA SVM and nearest neighbor correlation analysis (NNCA); see Supplementary Information 2].
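For readers unfamiliar with the Kaplan-Meier (product-limit) estimator, a minimal sketch is given below. The curves in this study were produced with SPSS; the function name and the input layout here are assumptions chosen for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.
    times[i]  : follow-up time of patient i
    events[i] : True if the patient died at times[i], False if censored
    Returns a list of (time, survival probability) at each death time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve
```

Computing one curve per molecular subtype and comparing them (for example with a log-rank test, not shown) corresponds to the kind of comparison reported in the Results.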
RESULTS
Identification of Novel Genomic Aberrations in Gastric Cancer by CGH
We first classified the 60 gastric tumors in our study by conventional clinical and histopathological criteria (Supplementary Information 2) and confirmed that the demographic, anatomical, and histopathological features of the tumors (all adenocarcinomas) in this series were comparable with other published studies and representative of gastric cancers in general. Using CGH, we then determined that 16 (26.7%) of the 60 tumors had no CNAs. The remaining 44 aneuploid tumors were then stratified into high-, intermediate-, and low-frequency CNAs using criteria of ≥10, 5–9, and 1–4 chromosomal gains and/or losses per tumor, respectively, with tumors showing gains or losses of an entire chromosome being scored as having two separate abnormalities (of the p and q arms). On the basis of these criteria, 16 tumors (26.7%) had high CNAs, 15 (25%) had intermediate CNAs, and 13 (21.7%) had low CNAs. Similar to other smaller series, the mean genomic copy number change was 8.8 (range, 1–29) with gains (mean, 6.4/tumor; range, 1–27) exceeding losses (mean, 2.3/tumor; range, 1–10; Ref. 12).
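The stratification rule above is simple enough to state as code. The sketch below uses hypothetical function names; the thresholds and the convention of scoring a whole-chromosome change as two abnormalities are taken from the text.

```python
def count_aberrations(arm_level_changes, whole_chromosome_changes):
    """Total CNA count for one tumor. A gain or loss of an entire
    chromosome is scored as two abnormalities (p arm and q arm)."""
    return arm_level_changes + 2 * whole_chromosome_changes

def cna_category(n_aberrations):
    """Assign a tumor to the no-, low-, intermediate-, or high-CNA group."""
    if n_aberrations < 0:
        raise ValueError("aberration count cannot be negative")
    if n_aberrations == 0:
        return "none"
    if n_aberrations <= 4:
        return "low"
    if n_aberrations <= 9:
        return "intermediate"
    return "high"
```

For example, a tumor with three arm-level changes and one whole-chromosome gain is scored as five abnormalities and falls in the intermediate group.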
In addition to previously reported gains in 20q, 8q, 7p, 13q, 20p, and 17q (Ref. 12; Fig. 1; Supplementary Information 2), several tumors also exhibited novel chromosomal amplifications in 11p, 12p, 14q, 22q, 10q, 17p, 4p, 10p, 16q, 19p, and 4q. Strikingly, ∼13% of tumors exhibited a gain of 16q. High-level amplifications were also identified in 20q11.2-q13 (six cases), 6p21.1-p21.3, 16p12 and 19q12-q13 (four cases each), 8q24.1-q24.2, 11p13-p14, 12p11.2-q12, 12q14, and 17q12-q21 (three cases each). Although deletions on 18q, 4q, 5q, 17p, and 9p have been reported by others, we also observed deletions in 8p, 10q, 11q, and 1q. Chromosomal imbalances in 2p, 4p, and 5p occurred only in the intestinal-type tumors and were absent in all of the diffuse-type tumors in this series.
Fig. 1. Chromosomal gains and losses in gastric cancer specimens. Gains are shown as green lines to the right of chromosomes, and losses as red lines on the left. Thick solid lines, highly amplified regions.
Classification of Tumors by Microsatellite Analysis
A high percentage of tumors (48%) in our series exhibited no or low CNAs. Because it has been shown in colon cancer that many no-CNA or low-CNA tumors often exhibit microlevel genomic instability resulting from defects in DNA mismatch repair components (e.g., MSH2 and MLH1; Ref. 13 ), we then performed MSI studies on the gastric tumors. Seven (12%) of 58 adenocarcinomas in this series were MSI-H, among which six had relatively few (≤ 7; n = 3) or no copy number changes (n = 3) by CGH. Five tumors were MSI-L. Immunostaining for MSH2 and MLH1 showed that very few of the gastric tumors had lost expression of either protein (S. Y. T., data not shown). Signet ring tumors were more likely to be MSI-H (3 of 15) than tubular tumors (4 of 39; P = 0.295 by Fisher’s exact test), and expansive gastric adenocarcinomas were more frequently MSI-H (2 of 10) than infiltrating tumors (5 of 48; P = 0.347).
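The Fisher's exact test comparisons quoted above can be reproduced with a short hypergeometric calculation. The sketch below is an illustration with hypothetical function names, not the software used in the study. Under the assumption that each 2 × 2 table is built as (MSI-H, not MSI-H) for the two histological groups, the upper-tail value for the signet ring versus tubular comparison comes out at about 0.295, which agrees with the figure quoted above and suggests, although the text does not say so, that the reported P values are one-tailed.

```python
from math import comb

def hypergeom_prob(x, row1, row2, col1):
    """Probability of x successes in row 1 given fixed table margins."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)

def fisher_exact(a, b, c, d):
    """Fisher's exact test for the table [[a, b], [c, d]].
    Returns (upper_tail_p, two_tailed_p)."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: hypergeom_prob(x, row1, row2, col1)
             for x in range(lo, hi + 1)}
    # One-tailed: tables with at least as many successes in row 1.
    upper = sum(p for x, p in probs.items() if x >= a)
    # Two-tailed: tables no more likely than the observed one.
    two = sum(p for p in probs.values() if p <= probs[a] * (1 + 1e-9))
    return upper, two

# Signet ring (3 of 15 MSI-H) versus tubular (4 of 39 MSI-H).
signet_vs_tubular = fisher_exact(3, 12, 4, 35)
# Expansive (2 of 10 MSI-H) versus infiltrating (5 of 48 MSI-H).
expansive_vs_infiltrating = fisher_exact(2, 8, 5, 43)
```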
Identification of Biological Expression Signatures Using Unsupervised Clustering
We then used cDNA microarrays to generate expression profiles for the gastric tumors, focusing initially on samples containing >50% tumor cells as determined by cryosections (47 tumors), and also profiling 3 surgical samples of normal gastric mucosae obtained from patients with benign gastric disease. Using various data filters, we defined a set of 746 array targets representing well-measured genes that exhibited considerable transcriptional variation across all of the gastric samples (see “Materials and Methods”). A two-way unsupervised HC algorithm was then used to order the gastric samples and genes on the basis of their similarity to one another (Fig. 2). The gastric tumors segregated into three broad subclasses (discussed in the next section), and the three normal gastric samples exhibited tight cosegregation, indicating that their expression profiles are highly correlated with one another.
Fig. 2. Unsupervised clustering of gastric cancer expression profiles. Two-way HC was used to order samples (columns) and array targets (rows). Samples include 47 tumor specimens and 3 samples of normal gastric tissue (dark blue bar under dendrogram). The relative expression level of an array target in each sample (compared with all other samples) is depicted according to the color scale bar (top right). All of the tumor specimens contained >50% tumor cells as assayed by cryosections.
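As an illustration of how hierarchical clustering orders expression profiles, the sketch below implements a small agglomerative procedure. The distance measure (one minus the Pearson correlation) and the linkage rule (average linkage) are assumptions made for this example, since neither is specified here, and the function names are hypothetical.

```python
from statistics import mean

def correlation_distance(x, y):
    """One minus the Pearson correlation of two profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # a flat profile is treated as uncorrelated
    return 1.0 - sxy / (sxx * syy) ** 0.5

def average_linkage(profiles):
    """Agglomerative clustering of profiles.
    Returns the merges in order as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(correlation_distance(profiles[i], profiles[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

A two-way clustering applies the same procedure twice, once to the sample profiles (columns) and once to the array target profiles (rows, that is, the transposed matrix).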
The unsupervised clustering algorithm successfully grouped the majority of array targets/genes into distinct “expression signatures” based on their relative expression levels in the gastric samples (color bars, left of clustergram in Fig. 2). We identified at least seven specific signatures, each associated with a distinct biological process. In this report, only a few members of each signature are mentioned. 6
Cell Growth and Proliferation.
This expression signature contained several genes involved in different aspects of cell growth, e.g., energy metabolism (adenylate kinase), DNA and protein synthesis (various ribosomal proteins and nucleoside phosphorylase), and cell cycle regulation (cyclin D1). Notably, cyclin D1 overexpression has been reported in a gastric cancer subset (14). Other relevant genes in this signature (not depicted in Fig. 2) included thymidine kinase and replication factor C (data not shown). 7
Intestinal Metaplasia.
The genes in this expression signature were highly expressed in many of the tumor samples but were down-regulated in the samples of normal gastric mucosae. They included markers of intestinal differentiation such as villin-1, trefoil factor 3 (intestinal), the intestinal brush border protein galectin 4, and the intestinal enzyme glutathione peroxidase 2. The cytoskeletal proteins keratin 8 and 18 were also part of this signature, suggesting a specific “crypt”-like intestinal character (15). The presence of these intestinal markers in tumors but not in normal tissue supports the hypothesis that intestinal metaplasia is a predisposing factor in gastric carcinogenesis.
Immunity.
We detected two distinct clusters (Immunity A and B) related to immunological function. Immunity A contained multiple MHC Class I genes (B, C, and G), whereas Immunity B was composed primarily of MHC Class II genes (DO, DP, DQ, and DR) and α2-macroglobulin. Previous reports have suggested that gastric cancer cells with a tendency for peritoneal dissemination are associated with the up-regulation of MHC Class I molecules, whereas gastric cancer cells with a tendency for lymph node metastasis tend to up-regulate MHC Class II genes instead (16 , 17) . Alternatively, it is also possible that the presence of distinct populations of immune cells may contribute to the differential expression of the MHC genes observed in this tumor series.
Tumor-like.
This prominent expression signature contained several genes associated with an active tumorigenic phenotype, such as markers of tumor hypoxia (HIF-1) and reactive angiogenesis (VEGF). Also in this cluster were β1-integrin and matrix metalloproteinase 9 (MMP-9), both having been implicated in gastric tumor invasion and dissemination (18, 19). Tumor markers in this cluster included tumor rejection antigen gp96 and tumor-associated calcium signal transducer 1. Genes involved in protein degradation, e.g., several 26S proteasome subunits and the E1 ubiquitin-activating enzyme, were also prominent in this cluster, as was the transcription factor GATA6, previously shown to be strongly expressed in certain gastric cancer cell lines (20).
Remodeling.
This cluster contained genes such as Mucin 5B and FGFR1, which have been reported to be expressed in a subset of gastric cancers (21 , 22) but not in normal gastric tissue (22 , 23) . However, the most striking feature of this cluster was the presence of numerous genes involved in stromal remodeling and endothelial growth, suggesting the presence of an active desmoplastic reaction, which is frequently observed in gastric cancer. A number of smooth muscle genes (leiomodin 1, calponin 1) were highly up-regulated, as were the pan-endothelial markers hevin, IGFBP4, and matrix Gla protein (24) . Also present were genes such as MMP-2 and COL1A1 that behave as specific markers of tumor endothelium (24) .
Gastric-like.
This final cluster was strongly expressed in the three benign gastric specimens, as well as in several tumor samples, and contained genes associated with gastric epithelia including the digestive proteins pepsinogen C, gastric lipase, and tryptase II/Granzyme K. The tight junction epithelial proteins p55 and desmoplakin were also present, as was the gastric-specific growth hormone ghrelin. The secreted frizzled-related protein hsFRP, shown to be expressed in normal gastric tissues and some gastric cancers (24) , was also in this cluster. It is important to note that the tumor samples in this group were confirmed by histological examination of cryosections to contain a very high percentage (80–100%) of tumor cells. Thus, it is unlikely that the presence of this gastric-like expression signature in these tumor samples arises from the presence of contaminating normal gastric tissue, but instead is reflective of the endogenous tumor expression profile.
Molecular Subtypes of Gastric Cancer have Distinct Clinical Behaviors
The expression signatures detected in the previous analysis suggest that several specific and possibly independent biological subprograms might be operating in the gastric cancer samples. Because these signatures appeared to be differentially regulated across the gastric cancer specimens, we hypothesized that they could be used to divide the gastric cancer samples into various molecular subtypes. To test this hypothesis, we performed a “semisupervised” clustering operation in which the gastric tissue specimens were reclustered on the basis of their expression levels in the seven signatures described in the previous section. This operation, using a combined total of 598 array targets/genes, subdivided the gastric cancer specimens into three broad groups, which we refer to as tumorigenic, reactive, and gastric-like based on the principal expression signature that defines each group (Fig. 3). Because the groupings defined in the purely unsupervised and semisupervised clusterings were highly comparable (Supplementary Information 2), such semisupervised clustering was performed primarily to refine the distribution of specific tumors within each group and to minimize “noise” caused by extraneous genes.
Fig. 3. Molecular subtypes of gastric cancer. A, two-way HC was used to order samples (columns) using the array targets (rows) corresponding to the seven expression signatures defined in Fig. 2 (color codes of expression signatures are depicted in Fig. 2). B, identity of specific tissue specimens belonging to each molecular subtype. Color codes: black, tumorigenic; purple, reactive; and green, gastric-like.
We then attempted to determine whether the subtypes defined by the expression analysis might be associated with any clinical or histopathological criteria. To maximize our sample size, we included clinical data from four additional patients whose samples were not used in the initial expression analysis because they contained 40% tumor cells (as assessed by cryosections). These samples could subsequently be reliably assigned to a specific class using two independent classification methods (Supplementary Information 2). No significant associations were discovered between the three molecular subgroups and age at diagnosis, patient sex, tumor site, Lauren classification (intestinal or diffuse), tumor differentiation status, or clinical stage at diagnosis (Supplementary Information 2). However, when a survival analysis was performed, we discovered that patients with gastric-like tumors exhibited a significantly better overall survival (P < 0.05) than patients belonging to the other two groups (Fig. 4A), suggesting that subtyping gastric cancers by expression profiling might identify clinically relevant features of gastric adenocarcinoma.
Fig. 4. Molecular subtypes of gastric cancer exhibit distinct clinical behaviors. A, Kaplan-Meier survival curves of all gastric cancer patients (n = 51) divided by molecular subtypes (gastric-like versus tumorigenic and reactive). Patients with gastric-like gastric cancers exhibited higher overall survival than patients belonging to the other two groups (P = 0.0496). No significant differences in the survival of patients belonging to the tumorigenic and reactive groups were observed (data not shown). 7 B, survival curves of Stage II and Stage III gastric cancer patients divided by tumor stage (n = 30). No statistically significant differences are observed, possibly because of the small sample size (P = 0.16). C, survival curves of the same 30 Stage II and Stage III gastric cancer patients divided by molecular subtype (tumorigenic: 10 patients, 6 at Stage II and 4 at Stage III; reactive: 12 patients, 3 at Stage II and 9 at Stage III; gastric-like: 8 patients, 4 at Stage II and 4 at Stage III). Patients with gastric-like tumors still exhibit a significantly better overall survival (P = 0.042).
The presenting tumor stages of patients belonging to each of the three molecular subtypes were comparable, and a multivariate analysis confirmed that tumor stage and molecular subtype were not significantly associated (P = 0.58 by χ2, and P = 0.51 by subsequent evaluation using ANOVA). This result suggests that the improved prognosis of patients with gastric-like tumors might be attributable to factors independent of tumor stage, and that knowing the molecular subtype of a gastric tumor might serve as a useful adjunct to traditionally used staging systems for disease prognostication. To explore this possibility, we stratified our patients by tumor stage, and we confirmed that patients presenting at both extremes of the clinical spectrum (Stages I and IV) were associated with statistically significant “good” and “bad” prognoses, respectively (good, Stage I versus II/III/IV, P < 0.001; bad, Stage IV versus II/III, P < 0.05; data not shown). 7 However, although there was an observed tendency for Stage II patients to have a better prognosis than Stage III patients, this difference was not statistically significant, possibly because of the small sample sizes involved (Fig. 4B; P = 0.17). Nevertheless, when these same patients (Stage II and III) were then restratified according to their molecular subtype, we once again found that, despite the reduced sample size, patients with gastric-like tumors still exhibited a significantly better overall survival than patients belonging to the other two groups (Fig. 4C; P < 0.05). These results suggest that, for gastric cancer prognostication, classical tumor staging may play a dominant role for early stage (I) and late stage (IV) patients. However, for patients presenting at intermediate clinical stages (Stage II and III, the majority of gastric cancer cases), prognostication by molecular subtype may prove more clinically useful than classical tumor stage.
Identification of Minimal Predictor Gene Sets for Gastric Cancer Classification
We then attempted to define a minimal predictor gene set that could accurately predict the subtype of an unknown gastric tumor sample. A specific requirement of this gene set was the ability to distinguish between three subgroups (i.e., multiclass prediction). Applying a supervised learning approach to this problem, we first adopted an OVA approach on the initial 746-gene set, reducing the multiclass set to a series of quasibinary class distinctions [i.e., T versus (RG), R versus (TG), and G versus (TR), where T, R, and G refer to tumorigenic, reactive, and gastric-like, respectively]. An SVM was then used to classify the samples, and the accuracy of the algorithm was assessed through LVO CV studies. (Because of the limited number of samples, we were unable in this study to assess the accuracy of the algorithm through the “gold standard” of an independent test of naïve samples. It is relatively challenging to obtain these samples, as reflected by the fact that our study, despite its limited size, actually constitutes one of the largest molecular profiling studies [CGH or expression microarray] performed on gastric cancer to date.) Nevertheless, the SVM algorithm, trained on all 746 genes, successfully classified the tumor samples to reasonably high degrees of sensitivity and specificity (81%; Table 1). Higher predictive accuracies (94%) were obtained, however, when the OVA SVM was trained on only the top 10 genes both positively and negatively correlated to each of the three classes (total number, 60 genes).
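The OVA reduction, the selection of the genes most positively and negatively correlated with each class, and the LVO CV loop can be sketched as below. Two points are assumptions and should be read as such: a nearest-centroid rule is used here as a simple stand-in for the SVM, and gene selection is repeated inside every cross-validation fold, which the text does not specify. All function names are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def ova_labels(labels, target):
    """Quasibinary labels: 1.0 for the target class, 0.0 for all others."""
    return [1.0 if lab == target else 0.0 for lab in labels]

def top_genes(samples, labels, target, k):
    """Indices of the k genes most positively and the k genes most
    negatively correlated with membership of the target class."""
    y = ova_labels(labels, target)
    scores = sorted((pearson([s[g] for s in samples], y), g)
                    for g in range(len(samples[0])))
    return [g for _, g in scores[-k:]] + [g for _, g in scores[:k]]

def predict(train, train_labels, genes, x):
    """Nearest-centroid stand-in classifier (not an SVM)."""
    best, best_d = None, float("inf")
    for cls in sorted(set(train_labels)):
        members = [s for s, lab in zip(train, train_labels) if lab == cls]
        centroid = [mean(s[g] for s in members) for g in genes]
        d = sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
        if d < best_d:
            best, best_d = cls, d
    return best

def loo_accuracy(samples, labels, k):
    """Leave-one-out CV; genes are reselected without the held-out sample."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = sorted({g for cls in set(train_labels)
                        for g in top_genes(train, train_labels, cls, k)})
        if predict(train, train_labels, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)
```

With three classes and k = 10, the combined predictor set contains at most 60 genes, matching the total quoted above.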
Table 1. Classification accuracies delivered by the OVA SVM algorithm
Predictive accuracy was assessed by LVO CV (47 samples). Numbers in parentheses refer to the total number of unique samples that were misclassified.
The SVM algorithm, although extremely powerful for binary class prediction scenarios, is associated with certain issues that render it less than ideal for use in a multiclass prediction setting, such as the reliance on rank-based gene selection and OVA reduction (see “Discussion”). As the requirement for small numbers of genes is particularly relevant in the development of diagnostic applications, we then applied a novel classification methodology (GA/MLHD) to classify the gastric tumors (11). One advantage of the GA/MLHD methodology is its ability to identify predictor gene sets of drastically fewer genes that nevertheless deliver comparable or slightly higher predictive accuracies than the conventional OVA SVM approach. We applied the GA/MLHD methodology to the gastric tumor data set and selected optimal predictor sets after multiple independent runs of 100 generations each. Using this approach, we identified several optimal predictor gene sets of minimal feature size (12–17 genes) that yielded a CV classification accuracy of 100% (see Supplementary Information 2 for predictor sets). Similar results were also obtained when the samples were randomly divided 100 times in a 66/33% training-/test-set manner, in which an independently generated predictor set was used for each split (Supplementary Information 2). In contrast, a predictor set of comparable size (18 genes), created using the OVA SVM, exhibited a lower CV classification accuracy of 89–91% (Table 1).
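The repeated 66/33% training-/test-set evaluation can be illustrated with the short sketch below, which only generates the random splits. The function name, the fixed seed, and the absence of class stratification are assumptions; in the procedure described above, a fresh predictor set would be derived from each training portion and scored on the matching test portion.

```python
import random

def repeated_splits(n_samples, n_repeats=100, train_fraction=2 / 3, seed=0):
    """Return n_repeats random (train_indices, test_indices) pairs."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = round(n_samples * train_fraction)
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        splits.append((sorted(order[:n_train]), sorted(order[n_train:])))
    return splits

# 47 profiled tumors: each split holds 31 training and 16 test samples.
splits = repeated_splits(47)
```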
DISCUSSION
The pathogenesis of gastric cancer is complex and dependent on both extrinsic (e.g., microbial infections) and intrinsic factors (e.g., hypochlorhydria; Ref. 25 ). Although associations between gastric cancer and various genotypes [e.g., E-cadherin (26) and interleukin receptor IL-1RN gene polymorphisms (27)] have been reported, relatively little is still currently known about the fundamental pathobiology of gastric adenocarcinoma. In this report, several different molecular assays were used to analyze a common set of gastric cancer specimens. Using CGH, we identified several specific chromosomal aberrations that occur frequently in gastric cancer, including novel genomic aberrations such as amplifications at the 16p locus. It is likely that these chromosomal aberrations reflect the selective retention of genomic fragments housing “driver” genes whose products functionally contribute to gastric carcinogenesis. For example, the 20q region harbors several genes implicated in tumor formation, such as AIB1, BTAK, and PTPN1. For other frequently observed aberrations (e.g., 13q), there are as yet no specific driver genes that have been implicated. To identify these novel genes, we are currently attempting to integrate the CGH data with the microarray expression results. An initial attempt to combine these two data sets revealed a relatively poor correlation between the presence of an amplified chromosomal fragment and the transcriptional overexpression of genes on that fragment (data not shown), 8 but this may be attributable to the relatively low resolution achievable by the chromosomal CGH assay. Addressing this issue through the use of higher-resolution assays such as array-CGH will be an important task for future research.
As an alternative to CGH, which detects large-scale genomic aberrations, we also used MSI studies to address the possibility that the high proportion of low- and no-CNA tumors in our series might be associated with microlevel genomic instability. Although only a small fraction of the no- and low-CNA tumors [i.e., 6 (19%) of 32] were MSI-H, this observation carries several caveats. For example, protocol discrepancies may explain much of the reported variation of MSI in gastric cancer (from 9 to ∼44%; Refs. 28 and 29), and the standard marker loci used in the MSI assay may not be equally mutable in all DNA mismatch repair-deficient tumors. Indeed, a study of MSI in gastric cancer showed that the rate of instability of different markers in the same tumors ranged from 0 to 77% (30). Furthermore, MSI-H gastric cancers are known to harbor mutations in genes that are distinct from those found in other tumor types (31). Nevertheless, if a majority of CNA-absent gastric cancers are truly mismatch repair-proficient (as our data indicate), then this may suggest the existence of an alternative pathway capable of driving the oncogenic potential of no/low-CNA gastric tumors in the absence of processes causing either classical micro- or macro-genomic instability (the former being measured by MSI and the latter by CGH).
We also discovered that the gastric cancers could be divided on the basis of their expression profiles into three major groups (tumorigenic, reactive, and gastric-like), and that patients with gastric-like tumors exhibited a significantly better overall survival than patients of the other two groups. The clinical usefulness of the molecular subtypes became more apparent when used to prognosticate patients presenting at intermediate clinical stages. Our current hypothesis is that each of the molecular subtypes is associated with a distinct biological behavior, which ultimately contributes to the differing survival rates. For example, tumorigenic tumors may be more clinically and metabolically aggressive, whereas gastric-like tumors may progress along a more indolent course. In addition, the expression signatures found in each subtype suggest that certain therapies may be more effective against some tumor subtypes than against others. For example, reactive tumors, by virtue of their association with numerous endothelial growth markers, may be more susceptible to antiangiogenic therapies and strategies that target the surrounding normal stroma.
Finally, we used various supervised learning approaches to define a minimal predictor gene set that could accurately predict the class of an unknown gastric sample. To date, much less work has been done specifically on algorithms for multiclass prediction than for binary prediction. The popular OVA SVM approach (10), for example, is associated with certain issues that render it less than ideal for use in a multiclass setting. Because it relies on converting the multiclass scenario into a series of quasibinary class prediction problems (the OVA approach), distinct sets of predictor genes need to be selected for each quasibinary class distinction, leading to a final combined predictor gene set that can be fairly large and unwieldy, especially for the development of diagnostic assays. As an alternative, we used a methodology that we developed (GA/MLHD), which was created specifically for use in a multiclass prediction setting (11). In addition to being able to automatically determine the optimal number of genes that should belong to a predictor gene set (a number that normally has to be prespecified), the GA/MLHD approach does not rely on a rank-based gene selection strategy, and deliberately selects genes that are uncorrelated in expression to each other to belong to a predictor gene set. Although the strength of this approach is primarily seen in scenarios involving many classes (i.e., more than five; see Ref. 10), the application of the GA/MLHD methodology to the gastric cancer data set allowed us to define a series of small (<20) gene sets that delivered very high classification accuracies (100% CV accuracy, as compared with 87% for an OVA SVM based on 21 genes). We are hopeful that the GA/MLHD methodology will also prove useful in other complex multiclass prediction settings for other cancers.
In conclusion, our results offer several insights and suggest multiple logical avenues for future research into gastric cancer, which may ultimately lead to improved methods of diagnosis, treatment, and prevention of this important and complex disease.
Acknowledgments
We thank the NCC Tissue Repository for tissue specimens, Choon Wei Wee and Cheryl Lee for clone resequencing, Alwin Loh and Ivy Sng for assistance with histological review, National Medical Research Council and National Cancer Centre for financial support, the Lee Foundation for the purchase of clones, and Pulivarthi Rao for CGH training. P. T. thanks Hui Kam Man for his encouragement and support.
Footnotes
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
1 Supported in part by National Medical Research Council, National Cancer Centre, and the Lee Foundation.
2 Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org).
3 S. T. T. and S. H. L. contributed equally to this work.
4 To whom requests for reprints should be addressed, at Defence Medical Research Institute, National Cancer Centre, 11 Hospital Drive, Singapore 169610, Republic of Singapore. E-mail: cmrtan{at}nccs.com.sg
5 The abbreviations used are: CGH, comparative genomic hybridization; CNA, copy number abnormality; MSI, microsatellite instability; HC, hierarchical clustering; SVM, support vector machine; LVO, leave-one-out; CV, cross-validation; OVA, one-versus-all; MSI-H, high MSI; MSI-L, low MSI; GA/MLHD, genetic algorithm/maximum likelihood discriminant analysis.
6 The entire expression data set is available at www.omniarray.com/gastric_cancer.html.
7 P. Tan, unpublished observations.
8 P. Tan and O. L. Kon, unpublished observations.
- Received September 25, 2002.
- Accepted April 11, 2003.
- ©2003 American Association for Cancer Research.