Experimental Therapeutics, Molecular Targets, and Chemical Biology |
1 Bioinformatics and Drug Design Group, Department of Pharmacy; 2 Center for Computational Science and Engineering; and Departments of 3 Biological Sciences and 4 Physics, National University of Singapore, Singapore, Singapore
Requests for reprints: Yu Zong Chen, Department of Pharmacy, National University of Singapore, S16, Level 8, 6 Science Drive 2, Singapore 117546, Singapore. Phone: 65-6516-6877; Fax: 65-6774-6756; E-mail: phacyz{at}nus.edu.sg.
| Abstract |
|---|
| Introduction |
|---|
We explored a new signature selection method aimed at reducing the chance of erroneously eliminating predictor genes because of noise in microarray data sets. In our approach, non–predictor genes were eliminated by consensus scoring of a large number of training and test sets generated from repeated random sampling (9), and by incorporating a multistep gene-ranking consistency evaluation procedure into a well-established signature selection method. Our method was tested on a well-studied colon cancer data set (16) and two other data sets, 86 samples of lung adenocarcinoma (21) and 60 samples of hepatocellular carcinoma (22), so that our derived signatures could be adequately evaluated and compared with those of other studies using different sampling sets, class differentiation methods, and feature selection methods. The performance of the signatures selected for the colon cancer data set was further evaluated by applying them to predict colon cancer outcomes in an independent colon cancer data set separately generated from the Stanford Microarray Database (23), and their performance was compared with that of previously derived colon cancer signatures applied to the same data set.
| Materials and Methods |
|---|
SVMs (24) project feature vectors into a high-dimensional feature space by using a kernel function. The linear SVM procedure can then be applied to the feature vectors in this feature space: It constructs a hyperplane that separates two different classes of feature vectors with a maximum margin. This hyperplane is constructed by finding a vector $\mathbf{w}$ and a variable $b$ that minimize $\|\mathbf{w}\|^2$ subject to the following conditions: $\mathbf{w}\cdot\mathbf{x}_i + b \geq +1$ for $y_i = +1$ (cancer patients) and $\mathbf{w}\cdot\mathbf{x}_i + b \leq -1$ for $y_i = -1$ (normal people). Here, $\mathbf{x}_i$ is a feature vector, $y_i$ is the group index, $\mathbf{w}$ is a vector normal to the hyperplane, $|b| / \|\mathbf{w}\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\mathbf{w}\|$ is the Euclidean norm of $\mathbf{w}$. After the determination of $\mathbf{w}$ and $b$, a given vector $\mathbf{x}$ can be classified by using $\mathrm{sign}[(\mathbf{w}\cdot\mathbf{x}) + b]$; a positive or negative value indicates that the vector $\mathbf{x}$ belongs to the positive or negative class, respectively.
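The decision rule and margin conditions described above can be illustrated with a minimal sketch. The values of `w`, `b`, `X`, and `y` are hypothetical toy numbers, not taken from the study, and the tie-breaking at a decision value of exactly zero is an arbitrary choice made here.

```python
import numpy as np

def svm_decision(w, b, x):
    """Classify x by sign(w.x + b); a value of exactly 0 is assigned to +1 here."""
    return 1 if float(np.dot(w, x)) + b >= 0 else -1

def satisfies_margin(w, b, X, y):
    """Check the margin conditions y_i * (w.x_i + b) >= 1 for every row of X."""
    return bool(np.all(y * (X @ w + b) >= 1.0))

# Hypothetical toy hyperplane and two feature vectors (one per class).
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1, -1])

labels = [svm_decision(w, b, x) for x in X]
```

Scaling `w` down shrinks the decision values, so the same hyperplane direction with a smaller norm can fail the margin conditions even though the predicted signs are unchanged.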
The performance of SVM classification can be measured by the numbers of true positives TP (cancer patients correctly predicted as cancer patients), false negatives FN (cancer patients incorrectly predicted as normal), true negatives TN (normal individuals correctly predicted as normal), and false positives FP (normal individuals incorrectly predicted as cancer patients). Three indicators, sensitivity Qp = TP / (TP + FN), specificity Qn = TN / (TN + FP), and overall accuracy Q = (TP + TN) / (TP + FN + TN + FP), were used to measure the predictive performance.
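As a small illustration of these three indicators, the sketch below computes Qp, Qn, and Q from the four counts. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity Qp, specificity Qn, overall accuracy Q)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each class needs at least one sample")
    qp = tp / (tp + fn)                    # fraction of cancer patients recovered
    qn = tn / (tn + fp)                    # fraction of normal samples recovered
    q = (tp + tn) / (tp + fn + tn + fp)    # fraction of all samples correct
    return qp, qn, q

# Hypothetical test set: 20 cancer samples and 10 normal samples.
qp, qn, q = classification_metrics(tp=18, fn=2, tn=8, fp=2)
```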
Feature selection method. Predictor genes of each training test set were selected by using SVM recursive feature elimination (RFE-SVM), which is a wrapper method that selects predictor genes by eliminating non–predictor genes according to a gene-ranking function generated from a class differentiation system (28). Wrapper methods generally perform better than other feature selection methods (28). RFE-SVM is the best performing wrapper method and has thus been more widely used in cancer microarray analysis (24, 27).
The ranking criterion of RFE-SVM is based on the change in the objective function upon removing each feature. To improve the efficiency of training, this objective function is represented by a cost function $J$ for the $k$th feature, computed by using the training set only. When a given feature is removed or its weight $w_k$ is reduced to zero, the change in the cost function is given by $DJ(k) = \frac{1}{2}\frac{\partial^2 J}{\partial w_k^2}(Dw_k)^2$. The case of $Dw_k = w_k - 0$ corresponds to the removal of feature $k$. In our case, the change in the cost function can be estimated as $DJ(k) = \frac{1}{2}\boldsymbol{\alpha}^T H \boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T H(-k)\,\boldsymbol{\alpha}$, where $\boldsymbol{\alpha}$ is the vector of coefficients obtained from SVM training and $H$ is the matrix with elements $y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$. $H(-k)$ is the matrix computed by using the same method as that of matrix $H$ but with its $k$th component removed. The change in the cost function indicates the contribution of the feature to the decision function and serves as an indicator of gene ranking position.
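A rough sketch of this ranking computation follows, using the form commonly given for RFE-SVM, in which the score of feature k is the difference between ½αᵀHα and the same quantity recomputed with feature k removed. Several things here are assumptions rather than details from the study: the Gaussian (RBF) kernel and its `gamma` value, the helper names, and the toy `alpha` coefficients, which in practice would come from a trained SVM.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix (assumed kernel; not specified by the text)."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def h_matrix(X, y, gamma):
    """H with elements y_i * y_j * K(x_i, x_j)."""
    return np.outer(y, y) * rbf_kernel(X, X, gamma)

def ranking_scores(X, y, alpha, gamma):
    """Change in cost for each feature k, holding alpha fixed."""
    full = 0.5 * alpha @ h_matrix(X, y, gamma) @ alpha
    scores = np.empty(X.shape[1])
    for k in range(X.shape[1]):
        X_minus_k = np.delete(X, k, axis=1)   # drop feature k from every sample
        reduced = 0.5 * alpha @ h_matrix(X_minus_k, y, gamma) @ alpha
        scores[k] = full - reduced
    return scores

# Toy data: feature 0 varies across samples, feature 1 is constant.
X = np.array([[0.0, 5.0], [1.0, 5.0], [3.0, 5.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 1.0, 0.5])             # hypothetical coefficients
scores = ranking_scores(X, y, alpha, gamma=0.5)
```

In this toy case the constant feature leaves the kernel unchanged when removed, so its score is zero and it would be the first candidate for elimination.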
Gene-ranking consistency evaluation. The microarray data set was randomly divided into a training set (containing half of the samples) and an associated test set (containing the other half). By using repeated random sampling (9), 10,000 training test sets, each containing a unique combination of samples, were generated. These 10,000 training test sets were randomly placed into 20 sampling groups, each containing 500 training test sets. Every sampling group was then used to derive a signature based on consensus scoring and evaluation of gene-ranking consistency of the corresponding 500 training and 500 test sets. The 20 different signatures derived from these sampling groups were compared to test the level of stability of selected predictor genes.
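The sampling scheme can be sketched as below. This is only an illustration: the function name, the seed, and the small sizes in the usage line are hypothetical, and "unique combination" is interpreted here as a unique training set.

```python
import math
import random

def make_sampling_groups(n_samples, n_splits, n_groups, seed=0):
    """Draw n_splits unique half/half splits and deal them into n_groups groups."""
    half = n_samples // 2
    if n_splits > math.comb(n_samples, half):
        raise ValueError("not enough distinct splits for this sample size")
    rng = random.Random(seed)
    everyone = frozenset(range(n_samples))
    seen, splits = set(), []
    while len(splits) < n_splits:
        train = frozenset(rng.sample(range(n_samples), half))
        if train in seen:                 # skip duplicates so each split is unique
            continue
        seen.add(train)
        splits.append((train, everyone - train))
    rng.shuffle(splits)                   # random placement into groups
    size = n_splits // n_groups
    return [splits[i * size:(i + 1) * size] for i in range(n_groups)]

# Small illustrative run; the text describes 10,000 splits in 20 groups of 500.
groups = make_sampling_groups(n_samples=12, n_splits=40, n_groups=4, seed=1)
```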
In each group containing 500 training and 500 associated test sets, gene subsets were selected by RFE-SVM from each training set and the performance of each gene subset was evaluated on the associated test set. To derive a gene-ranking criterion consistent for all iterations, the RFE gene-ranking function at every iteration step was derived from an SVM class differentiation system with a universal set of globally optimized variables that gave the best average class differentiation accuracy over the 500 test sets.
To reduce the chance of erroneous elimination of predictor genes due to noise in microarray data, additional gene-ranking consistency evaluation steps were implemented on top of the normal RFE procedure in all sampling sets. In step 1, for every test set, subsets of genes ranked in the bottom 10% (if no gene was selected in the current iteration, this percentage was gradually increased to the bottom 40%) with a combined score lower than that of the first few top-ranked genes were selected, so that the collective contribution of these genes is less likely to outweigh that of higher-ranked ones. In step 2, for every test set, the genes selected in step 1 were further evaluated to choose those not ranked in the upper 50% in the previous iteration, so as to ensure that these genes are consistently ranked lower. In step 3, a consensus scoring scheme was applied to the genes selected in step 2 such that only those appearing in >90% (if no gene was selected in the current iteration, this percentage was gradually reduced to 60%) of the 500 test sets were eliminated.
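The consensus rule of step 3 might look roughly like the sketch below. The 5-point decrement is an assumption (the text says only that the threshold was "gradually reduced"), and the function name and input layout, one set of step-2 candidate genes per test set, are hypothetical.

```python
from collections import Counter

def consensus_eliminate(candidates_per_test_set, start_pct=90, floor_pct=60, step_pct=5):
    """Return (genes to eliminate, threshold used) under a relaxing consensus rule."""
    n = len(candidates_per_test_set)
    # Count in how many test sets each gene was nominated for elimination.
    counts = Counter(g for cands in candidates_per_test_set for g in set(cands))
    for pct in range(start_pct, floor_pct - 1, -step_pct):
        # Integer comparison of count/n > pct/100 avoids floating-point drift.
        chosen = {g for g, c in counts.items() if c * 100 > pct * n}
        if chosen:
            return chosen, pct
    return set(), None                     # nothing reaches even the floor

# Ten hypothetical test sets: g1 is nominated in all, g2 in seven, g3 in three.
nominations = [{"g1", "g2", "g3"}] * 3 + [{"g1", "g2"}] * 4 + [{"g1"}] * 3
eliminated, used_pct = consensus_eliminate(nominations)
```

Relaxing the threshold only when nothing qualifies keeps each iteration conservative: a gene is dropped only when most test sets agree it ranks low.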
For each sampling set, different SVM variables were scanned; various RFE iteration steps were evaluated to identify the globally optimal SVM variables and RFE iteration steps that give the highest average class differentiation accuracy for the 500 test sets. The 20 different signatures derived from these sampling sets were then compared to test the level of stability of selected predictor genes.
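Selecting the globally optimal setting amounts to taking the combination with the highest mean test accuracy. A minimal sketch, assuming the per-test-set accuracies have already been collected into a dictionary keyed by (SVM variable, RFE iteration step); the keys and accuracy values shown are hypothetical.

```python
from statistics import mean

def select_global_optimum(accuracy_by_setting):
    """Return (setting, mean accuracy) for the setting with the best average."""
    best = max(accuracy_by_setting, key=lambda s: mean(accuracy_by_setting[s]))
    return best, mean(accuracy_by_setting[best])

# Hypothetical scan: each list holds accuracies over the test sets of one group.
scan = {
    ("C=1", 10): [0.90, 0.92],
    ("C=10", 10): [0.95, 0.93],
    ("C=10", 20): [0.91, 0.99],
}
best_setting, best_mean = select_global_optimum(scan)
```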
| Results and Discussion |
|---|
The stability levels of the 20 derived signatures can be estimated from the percentage of predictor genes shared by all 20 signatures. From Table 3, 80% of the top 50 ranked genes and 69% to 93% of all genes in each signature were shared by all 20 signatures. This suggests that our selected signatures are fairly stable. One reason is that an SVM class differentiation system with a universal set of globally optimized variables, which gave the best average class differentiation accuracy over the 500 test sets, was used to derive the RFE gene-ranking function at every iteration step and for every test set. In earlier studies using RFE or other wrapper methods for selecting signatures, non–predictor genes have been eliminated in multiple iterations, and at every iteration step a different class differentiation system, characterized by a different set of optimized variables, has been constructed (24, 27). As gene elimination is variable dependent, these selected predictor genes are likely path dependent and heavily influenced by sampling method, composition, order of gene evaluation, computational algorithm, and variables. These characteristics partly explain the highly unstable and patient-dependent characteristics of the previously derived signatures (16). Another reason is that an additional gene-ranking consistency evaluation is done on top of the normal RFE procedure to reduce the chance of erroneous elimination of predictor genes.
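The stability measure used here, the fraction of each signature made up of genes common to all signatures, can be computed with a set intersection. The sketch below uses hypothetical gene labels and a hypothetical function name.

```python
def shared_fraction(signatures):
    """Return (genes shared by all signatures, shared fraction per signature)."""
    gene_sets = [set(s) for s in signatures]
    shared = set.intersection(*gene_sets)
    return shared, [len(shared) / len(s) for s in gene_sets]

# Three hypothetical signatures of different sizes.
sigs = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "e", "f", "g"}]
shared, fractions = shared_fraction(sigs)
```

Because the shared set is fixed, larger signatures necessarily show lower shared fractions, which is why the text reports a range rather than a single percentage.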
The number of predictor genes in our signatures ranges from 112 to 157 (Table 3), which is substantially higher than the 6 to 60 genes in the previously derived signatures. It was reported that there are 291 known cancer genes (34), 15 cancer-associated pathways (35), 34 angiogenesis genes (36, 37), and 43 tumor immune tolerance genes (38). Because of biological differences and the complex nature of cancers, a signature applicable to many patients is expected to include a substantial percentage of these cancer-related genes, together with some of their interacting partners and consequence genes (34). Moreover, because of measurement variability, a certain number of irrelevant genes may inevitably be included in a signature. Therefore, it is not surprising to find cancer signatures with 110 to 150 predictor genes. Moreover, for target discovery, which is a very important application of gene selection from microarray analysis, it is probably unrealistic to assume that only a few genes stand out from the thousands of genes with sufficient clarity to allow target selection (8).
The 104 predictor genes shared by all 20 signatures (Table 4; Supplementary Tables S1 and S2) include 48 cancer-related genes (4 anticancer targets, 3 oncogenes, 8 tumor suppressors, 2 angiogenesis genes, 1 tumor immune tolerance gene, 4 cancer genes, 3 tumor markers, 17 cancer-gene interacting genes, and 6 cancer pathway–affiliated genes). In our analysis, anticancer targets were obtained from the latest version of therapeutic target database5 (39, 40), and the cancer-related genes and cancer pathways were taken from recent publications (34–38, 41) and references in Supplementary Table S1.
With a significantly higher number of cancer-related genes than the 2 to 20 found in the 10 previously derived signatures, our signatures seem to more closely reflect the complex nature of cancer, which is known to involve the collective actions of many genes of different functions (34–38, 41). Moreover, our signatures include 52 of the 107 previously derived predictor genes, and those selected by a higher number of other studies tend to be ranked higher by our gene-ranking function (Supplementary Table S2). Regardless of their possible roles in cancer, these genes have shown capability for colon cancer outcome prediction, so it is not surprising that they are included in our signatures.
To further evaluate the predictive capability of our selected signatures, we collected the gene expression profiles of 34 colon cancer cell lines and 8 normal colon tissues from the Stanford Microarray Database (23). The predictive capability of our selected signatures and of the 10 previously derived signatures was evaluated by using the SVM classification system and 500 training test sets randomly generated from this data set using the same procedure described in Materials and Methods. The performance on the associated test sets is shown in Table 5. The overall accuracy for the 104 predictor genes is 96.8%, with a SD of 3.3%. Using genes selected by other methods, the overall accuracies were found to range from 80.5% to 94.9%, with a SD of 2.9% to 6.6%. These results suggest that the signatures selected using our method can perform more stably and more accurately than those selected by other methods.
It has been reported that six samples in the colon cancer data set might have been wrongly labeled (42). These include three tumor tissues (T33, T36, T30) that are more likely to be normal and three normal tissues (N8, N34, N36) that are more likely to be cancerous. Four of these six samples (T33, T36, T30, N36) are misclassified as their opposite labels by >90% of the 500 SVM models. Another one (N34) is misclassified in 74% of the 500 SVM models. Misclassification of T33, T36, and T30 into their opposite labels is actually consistent with the opinion that these are more likely normal tissues. Likewise, misclassification of N36 and N34 is consistent with the opinion that they are more likely cancerous. Despite the "incorrect" labeling of six samples, our SVM models are "fooled" by only one of these samples (N8). These results suggest that our method and derived SVM models are less sensitive to incorrect labeling of a small percentage of samples.
Lung adenocarcinoma and hepatocellular carcinoma data set. The lung adenocarcinoma data set contains the expression profiles of 7,129 genes from 86 lung adenocarcinoma patients (21). These 86 patients have been divided into two groups, survivable (62 patients) and nonsurvivable (24 patients), based on whether the patient was still alive in a postsurgery follow-up survey (21). This data set6 has been analyzed in several previous studies (9, 43, 44). Although these studies show good performances for separating the survivable and nonsurvivable patients, few of the selected predictor genes are shared by these reported signatures. In this work, the relevant data was subjected to the standard preprocessing procedure, as described by Guyon et al. (24).
In multiple random sampling, this data set was randomly divided into a training set containing 43 samples and an associated test set containing the other 43 samples. To reduce computational cost, 3,000 training test sets, each containing a unique combination of samples, were generated. These 3,000 training test sets were randomly placed into six sampling groups, each containing 500 training test sets. Every sampling group was then used to derive a signature based on consensus scoring and evaluation of gene-ranking consistency of the corresponding 500 training and 500 test sets. Finally, the six different signatures derived from these sampling groups were compared to test the level of stability of selected predictor genes.
The results are summarized in Table 6. The number of predictor genes in our signatures ranges from 42 to 56. A total of 36 predictor genes, representing 64.3% to 85.7% of all genes in each signature, were shared by all six signatures. The predictive capability of our selected signatures was evaluated by using an additional 500 training test sets randomly generated from the original lung adenocarcinoma microarray data set. The average survival prediction accuracies of our signatures over these 500 test sets are 95.5% to 96.7%, which are comparable with those derived by other studies, but our signatures are significantly more stable, as shown by the high percentages of selected predictor genes shared by all signatures. These results suggest that our method is capable of selecting fairly stable signatures for predicting lung adenocarcinoma survivability.
The hepatocellular carcinoma oligonucleotide array data set (22) contains the expression profiles of 7,129 genes from 60 hepatocellular carcinoma patients. These 60 patients have been divided into two groups, recurrent (20 patients) and nonrecurrent (40 patients), based on a postsurgery follow-up survey (22). This data set7 has been analyzed in several previous works (9, 45). Although these studies show good performances for separating the recurrent and nonrecurrent patients, few of the selected predictor genes are shared by these reported signatures. The relevant data was also processed by using the same standard preprocessing procedure as that described in the literature (24).
Using the same procedure as that of the lung adenocarcinoma data set, six different sets of hepatocellular carcinoma recurrence signatures were derived from 3,000 training test sets. The results are summarized in Table 6. The number of predictor genes in our signatures ranges from 25 to 30, of which 16 genes were shared by all six signatures. Thus, 53.3% to 64.0% of all genes in each signature were shared by all six signatures. The predictive capability of our selected signatures was evaluated by using the SVM classification system and 500 training test sets randomly generated from this hepatocellular carcinoma data set using the same procedure described in Materials and Methods. The performance was evaluated by using the associated test sets, and the average accuracies of our signatures over these 500 test sets were found to range from 98.1% to 99.3%, which is comparable with those derived by other studies (9, 45), but our signatures are significantly more stable, as they contain a high percentage of shared predictor genes.
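As a quick arithmetic check, the shared-gene percentages quoted for the two data sets follow from dividing the number of shared genes by the largest and smallest signature sizes (36 shared genes among 42 to 56 for lung adenocarcinoma; 16 shared genes among 25 to 30 for hepatocellular carcinoma):

```latex
\[
\frac{36}{56} \approx 64.3\%, \qquad
\frac{36}{42} \approx 85.7\%, \qquad
\frac{16}{30} \approx 53.3\%, \qquad
\frac{16}{25} = 64.0\%
\]
```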
It is noted that the numbers of predictor genes selected for the lung adenocarcinoma and hepatocellular carcinoma data sets are significantly smaller than the number selected for the colon cancer data set. One possible reason for this difference is that the expression profiles of some cancer genes important for differentiating cancer and noncancer patients may not be significantly different among cancer patients of different survival or recurrence groups. As a result, a higher number of cancer genes is expected to be selected in the signatures of the colon cancer data set than in those of the lung adenocarcinoma and hepatocellular carcinoma data sets.
| Summary |
|---|
| Acknowledgments |
|---|
| Footnotes |
|---|
6 http://bidd.nus.edu.sg/group/cjttd/ttd.asp
7 http://dot.ped.med.umich.edu:2000/ourimage/pub/Lung/index.html
8 http://surgery2.med.yamaguchi-u.ac.jp/research/DNAchip/hcc-recurrence/index.html
Received 5/2/07. Accepted 8/3/07.
| References |
|---|