Tumor-derived cell lines are used as in vitro cancer models, but their ability to accurately reflect the phenotype and genotype of the parental histology remains questionable, given the prevalence of documented cell line–specific cytogenetic changes. We have addressed the issue of whether copy number alterations seen in tumor-derived cell lines reflect those observed in studies of fresh tissue by carrying out a meta-analysis of array-based comparative genomic hybridization data that considers both copy number alteration frequencies and the occurrence of cancer gene amplifications and homozygous deletions. Pairwise correlation comparisons between the data sets of seven diagnosis-specific matched tumor and cell line groups indicate that the trends in aberration frequencies are highly correlated between tumors and cell line sets of matched cancer histology relative to unmatched pairings. Despite their similarities, cell lines showed uniformly higher locus-specific alteration frequencies (P = 0.004) and several recurring cell line–specific alterations emerged. These include the previously documented losses of 13q and 9p and gains of 20q, as well as additional undescribed cell line–specific gains of 5p, 7p, and 17q and losses of 18q and 4q. These results indicate that, on average, cell lines preserve in vitro the genetic aberrations that are unique to the parent histology from which they were derived while acquiring additional locus-specific alterations. These data may enable a more predictive understanding of individual cell lines as in vitro models of cancer biology and therapy. [Cancer Res 2007;67(8):3594–600]
- Comparative Genomic Hybridization
- Molecular Oncology
- Molecular cytogenetics
Cell lines derived from human cancers have been crucial to building our understanding of the molecular pathophysiology of cancer and its treatment. Of equal importance, they form an in vitro model system for rational drug discovery and development because they are easy to maintain and manipulate in vitro and in animal xenograft models. Given their prominence as models, understanding in what ways and to what degrees cell lines grown under artificial conditions reflect their parental in vivo genetic architecture is of fundamental importance to cancer biology.
Since they were first established, cell lines have been used to assay for drug sensitivity. Monolayers are generally poor at reflecting the in vivo sensitivity of the parent histology to classic chemotherapeutics ( 1), although a better correlation can be found with three-dimensional culture methods ( 2). In more recent years, novel drugs have been developed against specific oncogenic aberrations that are vital to the transformed phenotype. These include trastuzumab and lapatinib, which both target the ERBB2 subchromosomal amplification in breast cancer; imatinib, which targets BCR-ABL translocation in chronic myelogenous leukemia as well as activating mutations in KIT; and gefitinib, which is preferentially active against tumors harboring mutant EGFR. In each of these cases, in vitro drug sensitivity closely parallels the in vivo situation. In other words, this is dependent on the presence or absence of the relevant genetic aberration. Moreover, genetically determined in vivo secondary drug resistance arising from additional mutations [e.g., secondary KIT mutations associated with acquired imatinib resistance ( 3) or EGFR mutations with acquired gefitinib resistance ( 4, 5)] can be accurately modeled in cell lines. Thus, in these cases, cell lines that share the relevant genotype of the parent tumor accurately model the drug sensitivity phenotype.
With the advent of genome-wide technologies to assay nucleic acids and proteins, a more complete molecular phenotype of cancer has been emerging. Transcriptional profiling, for example, has led to a new understanding of distinct classes of breast cancer (as well as other histologies) that have significant prognostic and therapeutic implications (reviewed in ref. 6). While there have been some studies showing a transcriptional profiling similarity between cell lines and their histology of origin (most notably using the well-characterized NCI-60 panel; ref. 7), the overall concordance is not particularly high and these cell lines seem to lose the tissue-specific up-regulation of genes ( 8). This difference is not surprising given that transcriptional regulation of many genes is an immediate function of the cellular environment, which is markedly different in vitro than in vivo.
The study of subchromosomal copy number alterations, as in the case of ERBB2 amplification, has revealed mechanisms for tumorigenesis and progression. Array-based comparative genomic hybridization (aCGH) technology in the study of cancers has greatly expanded in recent years, furthered by assays with widespread availability ( 9, 10) and increasing resolution ( 11). aCGH-based studies of cancers have primarily used survey style approaches in which tissues recognized as being members of a histologically homogeneous panel are interrogated for recurring copy number alterations. However, a comprehensive knowledge of copy number alterations within and across histologies is far from complete. Although cell lines are commonly used in large-scale aCGH studies, they are normally separated from fresh-frozen tissue panels in subsequent analyses due to their suspected discordance with tumor population ( 12). This is likely a result of a number of factors. Most notably, the cell line immortalization process has been implicated as a source of cytogenetic changes ( 13, 14). In addition, multiple growth passages, to which commercially available cell lines are routinely subjected, have been shown to be associated with random genomic instability ( 15). Finally, past studies have noted differences in gene expression patterns between cell lines and their fresh-frozen tissue counterparts ( 16– 18). Despite these observations, more recent analyses of genetic aberrations from larger panels of cell lines and tumors indicate a close concordance of genetic changes within individual histologies (e.g., breast cancer; ref. 19).
We undertook the current analysis to evaluate the degree to which cell lines display an accurate genome-wide model of the DNA copy number aberrations found in human cancers. Specifically we asked: (a) Do common aberrations occur in the same genomic regions in cell lines as their primary tumor counterparts? (b) Are sporadic aberrations resulting from cell line immortalization and growth passages recurrent and predictable, which would facilitate the ability to use cell line panels to model in vivo tumors? To answer these questions, we compiled and compared 19 aCGH data sets from 7 different cancer types with the purpose of estimating the similarity between cell line–tumor groups at the data set level. Specifically, this required focusing on trends in genome-wide gain and loss frequencies between data sets stratified by histology and DNA origin (i.e., tumor or cell line). We report that, when taken as a group, (a) copy number aberrations in cell lines from a given histology reflect their cell of origin and (b) specific genomic regions are prone to more frequent gain or loss in cell lines as compared with those seen in vivo. Finally, we begin to identify the patterns of common alterations observed in both tumor and cell line populations that can serve as cancer type delineators.
Materials and Methods
Data collection. Data were obtained either from our groups at the University of Pennsylvania or, where available, from published reports ( Table 1 ). At the University of Pennsylvania, aCGH was conducted on a panel of fresh, frozen, and paraffin-embedded tumors (n = 206) and cell lines (n = 109) from four cancer types. All samples were assayed twice on an array that was constructed with 4,135 bacterial artificial chromosome (BAC) clones spaced at ∼1-Mb intervals across the entire human genome. Clones were mapped to the human genome build 34 (June 2003) using BAC end sequences (69%), a sequence tag site (28%), or a full clone sequence (3%). Array details and hybridization protocols are described in detail in ref. 10. For quality control purposes, low intensity and variable spots were removed from the set before averaging the Cy3/Cy5 ratios for all replicates. aCGH data were obtained from public data resources for fresh tumors of five cancer types (n = 445) as well as cancer cell line data for three cancer types (n = 112). For analyses, all samples were organized into separate aCGH data sets that were stratified by (a) cancer type, (b) fresh tumor– versus cell line–derived DNA, and (c) data source (University of Pennsylvania generated data or public data). The resulting final data set consisted of 872 distinct samples from 19 separate data sets representing 7 different cancer types ( Table 1). Notably, cell lines in common between two separate lung cancer data sets ( 20, 21) were removed in one set such that each cell line was represented in just one set.
Data processing. Data were processed with the objective of accurately describing copy number alteration in each sample while structuring data to allow comparisons between samples from different platforms with distinct probe sets. Several steps were used to accomplish this. First, all samples were normalized under the assumption that the mean copy number of every sample is diploid (mean log2 copy number ratio = 0). Copy number break points were estimated with a circular binary segmentation algorithm (22) implemented by the DNAcopy package in the R programming language.
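The normalization step can be illustrated with a minimal Python sketch, assuming it amounts to subtracting each sample's mean log2 ratio so that the sample mean becomes 0. The function name and the input values are hypothetical, and the segmentation step (performed with DNAcopy in R) is not reproduced here.

```python
from statistics import mean

def center_log2_ratios(log2_ratios):
    """Shift one sample's log2 ratios so that their mean is 0 (assumed diploid)."""
    offset = mean(log2_ratios)
    return [value - offset for value in log2_ratios]

# Hypothetical probe-level log2 ratios for a single sample
sample = [0.30, 0.10, -0.20, 0.60]
centered = center_log2_ratios(sample)
```

After centering, a value of 0 corresponds to the assumed diploid baseline, so gains and losses in different samples are measured against the same reference point.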
For genome-wide copy number alteration frequency analysis, segmentation output was assigned to 1-Mb bins across the entire genome, where the log2-based metric assigned by the segmentation algorithm represented the relative copy number status for each bin in each sample. This procedure attained a common data format for all data sets with no probe dependence and allowed direct comparisons between data that were drawn from different assays with different probes. All raw and processed data are available in MIAME-compliant format (23). Low-level copy number gains (≤5 copies) and heterozygous losses were called for each bin by applying log2 thresholds of >0.25 and <−0.25, respectively. Similarly, high-level amplifications (>5 copies) and homozygous deletions were identified with thresholds of >0.81 and <−1.0, respectively. For published studies, it was confirmed that our calculated genome-wide aberration frequencies reflected those measured by the study from which the data were obtained. Due to sex mismatching between test and reference in some samples, chromosomes X and Y were excluded from all analyses.
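The binning and thresholding described above can be sketched as follows. The thresholds are those quoted in the text; everything else is an assumption made for illustration, in particular the rule that a bin takes the value of the segment covering its midpoint, the mutually exclusive call categories, and all function and variable names.

```python
BIN_SIZE = 1_000_000  # 1-Mb bins, as in the text

# log2 thresholds quoted in the text
GAIN, LOSS = 0.25, -0.25
AMPLIFICATION, HOMOZYGOUS_DELETION = 0.81, -1.0

def classify(log2_ratio):
    """Map a bin's log2 value to a copy number call (categories are exclusive here)."""
    if log2_ratio is None:
        return "missing"
    if log2_ratio > AMPLIFICATION:
        return "amplification"
    if log2_ratio > GAIN:
        return "gain"
    if log2_ratio < HOMOZYGOUS_DELETION:
        return "homozygous_deletion"
    if log2_ratio < LOSS:
        return "loss"
    return "neutral"

def bin_segments(segments, chrom_length):
    """Give each 1-Mb bin the mean of the segment that covers the bin midpoint.

    `segments` holds (start, end, segment_mean) tuples for one chromosome.
    The midpoint rule is an assumption; uncovered bins are set to None.
    """
    n_bins = -(-chrom_length // BIN_SIZE)  # ceiling division
    values = []
    for i in range(n_bins):
        midpoint = i * BIN_SIZE + BIN_SIZE // 2
        value = next((m for s, e, m in segments if s <= midpoint < e), None)
        values.append(value)
    return values

# Hypothetical segmentation output for a 5-Mb chromosome
segments = [(0, 2_400_000, 0.05), (2_400_000, 5_000_000, 0.90)]
bins = bin_segments(segments, chrom_length=5_000_000)
calls = [classify(v) for v in bins]
```

Because every sample ends up as one value per bin, samples from platforms with different probe sets can be compared position by position.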
Data analysis. The fraction of the genome experiencing either DNA copy number gain or loss in each sample was estimated by compiling the total number of segments classified as gained or lost in that sample based on the threshold scores. Aberration rates across each data set were collated by determining the frequency of change for each 1-Mb bin. The gain and loss frequencies assigned to each bin were used to measure similarities in aberration trends between data sets. Specifically, unsupervised hierarchical clustering was done using Pearson distance as the metric, with distance scores calculated from the genome-wide aberration trends of each data set. Tumor- and cell line–specific aberrations were calculated by subtracting the mean tumor aberration rate of each bin from that of the cell lines of the same histology. To determine the specific alterations driving the similarities between data sets, all bins estimated to have a gain or loss frequency of >25% in any data set were mapped to a cytogenetic band and subjected to further clustering analysis. To analyze known cancer-related alterations, a set of 323 cancer genes (24) was mapped and assigned a copy number status based on the previously mentioned log2 thresholds.
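As an illustration of the per-bin frequencies and the distance metric, the sketch below computes gain frequencies for two small hypothetical data sets and the Pearson distance between them. Defining Pearson distance as 1 minus the Pearson correlation coefficient is an assumption (a common convention), as are the names and the input values.

```python
from math import sqrt

def gain_frequency(samples, threshold=0.25):
    """Fraction of samples whose log2 value exceeds `threshold` in each bin."""
    n_bins = len(samples[0])
    return [sum(s[i] > threshold for s in samples) / len(samples)
            for i in range(n_bins)]

def pearson_distance(x, y):
    """1 - Pearson correlation between two per-bin frequency profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (norm_x * norm_y)

# Hypothetical binned log2 values: two samples per data set, three bins each
tumors = [[0.3, 0.0, -0.4], [0.4, 0.1, 0.0]]
cell_lines = [[0.5, 0.3, 0.0], [0.6, 0.0, 0.0]]

tumor_freq = gain_frequency(tumors)
cell_line_freq = gain_frequency(cell_lines)
distance = pearson_distance(tumor_freq, cell_line_freq)
```

Computing this distance for every pair of data sets gives a matrix that a hierarchical clustering routine can take as input; the linkage method is not stated in the text, so none is assumed here.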
Results
Regional trends in gain and loss frequencies seem to be histology specific. Hierarchical clustering based on 1-Mb bin gain and loss frequencies shows that data sets tend to cluster with those of matched histology. Across the clustering results of all 19 data sets, the only data sets whose nearest neighbor is not of matched histology are the sarcoma tumor and cell line groups, a single breast tumor data set, and a lung tumor data set (Fig. 1). The tendency for both tumor and cell line data sets of matched histologies to cluster together is highly statistically significant (P < 0.0001) based on a permutation-based randomized class assignment (n = 1,000). Four independent colon cancer data sets clustered together, as did three melanoma data sets; data from these histologic groups were derived from different platforms, arguing against platform-specific clustering. Additionally, two distinct lung cancer cell line data sets derived from different platforms (BAC-based arrays versus “SNP Chip” arrays) clustered together (data from refs. 20 and 21, respectively).
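A permutation test of this kind can be sketched as below. The test statistic used here, the number of data sets whose nearest neighbor shares their histology label, is an assumption, because the text does not state the exact statistic; the distance matrix, labels, and function names are likewise hypothetical.

```python
import random

def matched_neighbor_count(distances, labels):
    """Count data sets whose nearest neighbor carries the same histology label."""
    count = 0
    for i, row in enumerate(distances):
        nearest = min((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        if labels[nearest] == labels[i]:
            count += 1
    return count

def permutation_p_value(distances, labels, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels match neighbors as well as the real ones."""
    rng = random.Random(seed)
    observed = matched_neighbor_count(distances, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if matched_neighbor_count(distances, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_permutations + 1)

# Hypothetical symmetric distance matrix for six data sets (three histologies)
labels = ["breast", "breast", "colon", "colon", "lung", "lung"]
distances = [
    [0.0, 0.1, 0.8, 0.9, 0.7, 0.8],
    [0.1, 0.0, 0.9, 0.8, 0.8, 0.7],
    [0.8, 0.9, 0.0, 0.2, 0.9, 0.8],
    [0.9, 0.8, 0.2, 0.0, 0.8, 0.9],
    [0.7, 0.8, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.7, 0.8, 0.9, 0.1, 0.0],
]
observed, p_value = permutation_p_value(distances, labels)
```

Shuffling the labels while keeping the distances fixed asks how often a random assignment of histologies would pair data sets as consistently as the real assignment does.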
The occurrence of patterns of frequent alterations seems to drive the similarities between data sets and recapitulate what is seen across the entire genome (Fig. 2; Supplementary Table S1). For example, gains of 19p13.3 are common to both tumors and cell lines from breast, lung, and ovarian cancers, as well as in melanomas, while being virtually absent in colon and pancreatic tumors and cell lines. Similarly, gains of 13q21.33 are relatively common in colon tumors and cell lines, whereas this aberration is rare in all other histologies. The largest number of shared aberrations seems to drive the similarity between breast tumors and cell lines, whereas few are seen in colon and pancreatic cancers. Lung and ovarian cancers and melanomas seem to be intermediate in this respect.
Pairwise analysis of histology-matched tumor and cell line pairs shows uniformly higher locus-specific alteration rates in cell lines (P = 0.004; Supplementary Table S2). To identify cell line– and tumor-specific alterations, the mean gain or loss frequency was calculated for each 1-Mb bin of every tumor group and subtracted from that of the histology-matched cell line group. Large positive or negative values represent 1-Mb regions frequently gained or lost in cell lines but not in tumors or vice versa. For genome-wide copy number gains, breast and lung had the largest difference between cell lines and tumors (μ = 13.7%, σ = 14.4% and μ = 9.1%, σ = 11.6%, respectively), whereas sarcoma and ovary showed the smallest difference across the entire genome (μ = 0.6%, σ = 7.5% and μ = 2.1%, σ = 8.5%, respectively). Specific regions that differed between tumors and cell lines were identified in each cancer type by querying for those with a difference in frequency that was 2 SD larger than the mean difference for that cancer type. These regions are those that differ most between tumor and cell line data sets of matched histology. Several DNA copy number alterations seem to occur consistently at disproportionately higher frequencies in cell lines in at least three cancer groups (Supplementary Table S3). These include gains of large genomic regions such as 20q12-13.33 (∼24 Mb) and 17q23.2-24.3 (∼11 Mb), as well as more localized gains such as 5q35.1-35.3 (∼5 Mb) and 11q13.2-13.4 (∼6 Mb). Additionally, more frequent losses of 18q12.2-23 and 9p23-21.3 are seen in cell lines than in tumors (Supplementary Table S3).
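The per-bin comparison and the 2 SD cutoff can be illustrated with a short sketch. It assumes the difference is taken as cell line frequency minus tumor frequency and that the population standard deviation is used; neither choice of estimator is stated in the text, and the frequencies below are hypothetical.

```python
from statistics import mean, pstdev

def divergent_bins(cell_line_freq, tumor_freq, n_sd=2.0):
    """Flag bins whose cell line minus tumor difference exceeds mean + n_sd * SD."""
    diffs = [c - t for c, t in zip(cell_line_freq, tumor_freq)]
    mu = mean(diffs)
    sigma = pstdev(diffs)  # population SD; the estimator is an assumption
    cutoff = mu + n_sd * sigma
    flagged = [i for i, d in enumerate(diffs) if d > cutoff]
    return mu, sigma, flagged

# Hypothetical gain frequencies for ten 1-Mb bins of one histology
tumor_freq = [0.10] * 10
cell_line_freq = [0.10] * 9 + [0.60]
mu, sigma, flagged = divergent_bins(cell_line_freq, tumor_freq)
```

Running the same function separately on gain and loss frequencies gives candidate cell line–specific gains and losses for one histology; regions recurring across several histologies can then be tallied.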
Genomic rearrangements, including high-level DNA amplifications and homozygous deletions of known disease-related genes, are often defining features in cancer tissues. Consistent with the genome-wide data, copy number gains of cancer genes (n = 323) are more frequent in cell lines than in tumors (P < 0.0001, t test). Similarly, cell lines exhibit more frequent high-level amplifications (log2 ratio, >0.81) of cancer-related genes than tumors (P < 0.0001, paired t test). Total homozygous losses follow a similar trend, with higher overall rates of occurrence in cell lines (P = 0.0139, paired t test). The disproportionately increased occurrence of amplification and homozygous deletion of several genes seems to be consistent with the divergent frequencies of gain and loss between tumors and cell lines. For example, breast cell lines show a 27% (6 of 22) amplification rate of the SS18L1 locus (20q13.33), whereas only ∼1% (1 of 90) of tumors are amplified at this locus (Table 2). This gene falls in a region that shows significantly higher overall gain frequencies in breast, melanoma, colon, and lung cell lines than in their respective tumors. Similarly, the frequency of homozygous loss (log2 ratio, <−1.0) of CDKN2A (9p21.3) seems to be higher in lung and melanoma cell lines than in tumors [7 of 40 (17.5%) cell lines versus 3 of 51 (5.9%) tumors and 5 of 42 (11.9%) cell lines versus 1 of 145 (0.8%) tumors, respectively]. These two histologies are two of four (breast and sarcoma are the others) that showed overall higher loss frequencies of this locus in cell lines. Interestingly, pancreatic samples also had a recurring loss of CDKN2A in cell lines, whereas there were no occurrences of this aberration in tumors [3 of 24 cell lines (12.5%) versus 0 of 13 tumors (0%)]. Pancreatic tumors and cell lines showed no difference in overall loss frequency at this locus.
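The arithmetic behind a paired t test on matched cell line and tumor frequencies can be sketched as follows. The four count pairs are the locus and histology examples quoted above, reused purely as sample input; they are not the 323-gene input of the reported tests, and the function name is hypothetical. Only the t statistic and degrees of freedom are computed, because a P value would require a t distribution that the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(x, y):
    """Return the paired t statistic and degrees of freedom for matched values."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample SD of the differences
    return mean(diffs) / standard_error, n - 1

# (altered, total) counts quoted in the text, used only as sample input
cell_line_counts = [(6, 22), (7, 40), (5, 42), (3, 24)]
tumor_counts = [(1, 90), (3, 51), (1, 145), (0, 13)]

cell_line_rates = [k / n for k, n in cell_line_counts]
tumor_rates = [k / n for k, n in tumor_counts]
t_stat, dof = paired_t_statistic(cell_line_rates, tumor_rates)
```

A positive statistic indicates that, across the paired entries, the alteration rate is on average higher in cell lines than in tumors.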
Several other cancer-related genes also showed differences in amplification frequencies between data groups. Most notably, higher rates of cell line amplification of the MYC locus (8q24.21) appear in breast, ovary, lung, and colon cancers than in their respective tumor sets. A full list of cancer-related gene DNA copy number gains and losses can be seen in Supplementary Tables S4 and S5, respectively.
Discussion
In this study, copy number alteration data representing 7 tumor types and 19 different data sets derived from 7 different copy number alteration detection platforms showed strong correlation between regional gain and loss frequencies within cancer types. Although cell lines showed higher overall aberration rates than tumors, these histology-wise pairings showed clear tendencies for similar DNA copy number alterations. Several recurring deviations are apparent. These include previously described cell line–specific gains of 20q12-13.33 and losses of 13q22 (13, 14) as well as newly identified cell line–specific gains of 5q35.1-35.3 and losses of 18q12.2-23. Homozygous loss of the tumor suppressor locus CDKN2A and amplification of the MYC oncogene seem to be more frequent in cell lines in several histologies, indicating that the dysregulation of these genes may be acquired as part of cell immortalization or that tumors carrying these changes are preferentially selected when tumors are chosen for transformation.
The general concordance of results from independent surveys of recurring copy number aberrations in cancers shows that traditional histology-based groupings are proper first-tier stratifiers. For example, two studies of colon cancers yield very similar trends in genome aberration frequencies despite their distinct tissue panels and array platforms (25, 26). Similarly, two separate surveys of non–small-cell lung cancer cell lines provided independent validation of important, previously described cytogenetic changes (20, 21).
Recent studies have used DNA copy number alterations for the molecular pathology of cancers. For example, in a pool of tumor types, common copy number gains of 1q have been described as distinct to breast cancers, whereas 13q gains are largely unique to colon cancers (27). Similar results were observed here, as these alterations were among the components driving tumor/cell line relationships (Fig. 2). Although few alterations were common (>25%) in only a single cancer type, gains of 8q24.21 (encompassing the MYC locus), gains of 20q13.31, and losses of 18q21.1 seem to occur frequently in this panel of cancers. Conversely, whereas gains of 14q13.1 are relatively common in lung and pancreatic cancers, they are largely absent in the others. Jointly, these and previous results (27) suggest that each histology broadly bears a range of unique DNA aberrations. Copy number alterations have also been used to discern several cancer subtypes. For example, more frequent gains of 11q and 17q have been seen in acral and mucosal melanomas compared with those originating from the skin (28). We would expect that cell line models should represent genotypic subtypes just as they do overall histology. For example, it has been shown that colon tumors exhibiting microsatellite instability (MSI+) harbor a set of alterations distinct from that of microsatellite-stable tumors (MSI−; ref. 26). Concordantly, when all colon cancers are stratified by MSI status and by tumor or cell line origin, the two MSI+ sample sets segregate from the MSI− cancers when subjected to the genome-wide clustering described previously (Supplementary Fig. S1).
Collectively, these results help quantify cancer cell lines as accurate, reflective models for investigating in vitro genomic alterations in human cancers. Cancer cell lines exist as appealing models for studying DNA copy number and, by extension, therapeutic response prediction for several reasons. First, cancer cell lines provide a more homogeneous cell population where cell-to-cell variation in copy number is thought to be reduced. Tumor heterogeneity in primary lesions can limit the ability to accurately describe copy number alterations due to the infusion of normal cells, causing a diluted signal ( 20), which is consistent with the observation that alterations occur at higher frequencies in cell lines than tumors. Most importantly, tumors cannot be analyzed for copy number alterations in vivo, whereas the use of cell lines for aCGH analysis opens the possibility of time course analysis ( 29) and drug treatments ( 30, 31). In parental cell lines and their derivatives, drug response can be engineered and studied in relationship to basal and evolved genomic changes (e.g., ref. 32). Our analyses substantiate the translation of these observations to primary tissues by suggesting that relatively large-scale copy number genetic aberrations seen in cell lines in vitro accurately reflect their parent histology. Specifically, these results support the notion that cell lines can serve as relevant in vitro models for developing specific therapies and imply that, by understanding the exact genetic determinants of a phenotype in a cell line, it is possible to accurately target similar genotypes in a patient population. In the case of well-known genetic aberrations, using cell lines as in vitro models for biomarker discovery and validation is already routinely done (i.e., modeling the effectiveness of kinase inhibitors on cells that have the relevant target amplification or deletion). 
More generally, the use of cell lines with defined genetic aberrations allows inferences to be made not only on which histologies but also the specific patients that may respond to a given therapy. This is an observation that could be deduced from previous meta-analysis of gene expression data ( 33) but has not yet been observed as a trend in microarray-based DNA copy number data. In addition, these analyses validate the nature of published CGH study designs, where the calculation of the most common alterations yields biomarkers that are most likely relevant to the oncogenic phenotype ( 34). By comparing data across a wide spectrum of cell line histologies, loci that are more commonly aberrant in cell lines than in tumors (perhaps as a result of artificial selection pressure) can be identified. Cell lines harboring these recurrent aberrations may be less faithful models of the primary tumor. Further, understanding which recurrent loci are more associated with cell lines will allow them to be accounted for. For example, models aimed at predicting phenotypes that use one or more of these discordant loci may be suspect.
There are several apparent limitations of these analyses. Most significantly, cell line panels may not provide true representation of the range of phenotypes of the parent histology. For example, DNA amplification of the N-Myc locus in human neuroblastomas, a biomarker for a malignant phenotype, occurs at significantly higher rates in cell lines than tumors (35). The overrepresentation of such genetic changes may reflect a bias in those tumors selected for or those capable of undergoing transformation. Furthermore, although all data sets in this study are derived from sub-megabase resolution microarrays, platform variation can confound analyses. Probe density (i.e., genome-wide spacing), noise levels, and mechanical factors (e.g., normalization process) can vary between platforms and laboratories. Each of these variables is capable of affecting the accurate description of copy number aberrations. Finally, CGH analyses reflect an average of genomic aberrations that have occurred in populations of cells. Due to the confounding dilution effect caused by noncancerous cell populations, all primary tissues in this study (University of Pennsylvania and published data) were subject to macrodissection under light microscopy to maximize the percent tumor. Resulting tumor cell proportions were variable for these studies, with reported minimums ranging from >70% (21) to 50% (36). Although it has been shown that aCGH assays can tolerate up to 50% infusion of normal cells (37), less pure tumors are likely to be prone to increased rates of false negatives. Further, heterogeneous clonal populations may be more common in tumors than in cancer cell lines and can lead to an uninformative profile when assayed by CGH. Ultimately, this possibility can confound the translation from cell line model to tumor genome and may help explain why aberrations appear at uniformly lower frequencies in tumor data sets. In the future, this may be remedied by techniques that effectively evaluate the genome of single cells (38).
Several data sets showed discordance between tumor and cell line populations, including lung tumors and sarcomas. Whereas tumor heterogeneity and normal cell infusions can account for nonuniform frequencies of alterations, the pattern differences noted in both lung and sarcoma tumors from their cell line models could be the result of the diversity of pathologies within these respective populations. For example, sarcomas represent a host of distinct molecular subtypes (reviewed in ref. 39) that are likely to be subject to a unique set of DNA copy number alterations. Tumor and cell line populations were represented disproportionately in several prominent sarcoma subtypes, such as leiomyosarcomas (5% of cell lines, 19% of tumors) or teratomas (0% of cell lines, 15% of tumors). Finally, few studies have been devoted to documenting specific changes to DNA copy number profiles associated with the cell line transformation process. These meta-analyses focus on trends across panels of cancers. Although they do not address directly whether individual cell lines maintain the copy number profile of the parental tumor, they do rediscover findings of previous studies. For example, this meta-analysis suggested disproportionately high rates of loss of 9p21.3, a specific acquired alteration that is described in the immortalization of normal epithelium (40). Similarly, the 20q instability associated with cell lines was also noted in several cancers (14, 41). This suggests that querying large sample sets is also a viable means of discriminating cell line–specific copy number instability.
Although genome-wide RNA profiles of cancer cell lines have often been used in recent years to model drug response and other characteristics in vitro (42), it may be advantageous to conduct these investigations with DNA-based profiles or with combined DNA/RNA profiles. First, although careful in vitro modeling of oncogenic activation can dissect RNA profiles that correlate with drug response and may correlate with in vivo tumor profiles (43), the relationship between cell lines used for in vitro studies and their fresh-frozen tissue counterparts is often questionable (16–18). In addition, it is unknown whether successful transcriptional meta-profiles (33) could be independently calculated by exclusively using cell lines. Moreover, the above-mentioned caveats of analyzing potentially heterogeneous cell populations still apply and may be further confounded by the obvious effect of in vitro culture conditions on the transcriptome. When a specific genetic aberration is known, its relevance to modeling drug activity is much more obvious [e.g., BRAF mutations and MEK inhibition (44), MET amplifications and response to MET inhibitors (45), and the above-mentioned studies of KIT and ERBB inhibitors]. Moreover, the measurement of discrete genetic aberrations is more easily translated into clinical usefulness.
In summary, we have shown that aCGH-based surveys of tumor cell lines preferentially cluster with their parent histology when data set–wide trends in copy number aberrations are considered. This finding supports their usefulness as faithful in vitro models of in vivo tumors across a wide range of solid tumors. Further, this observation enables an analysis of the discrete aberrations that may distinguish specific tumor types. Future work will be needed to elucidate these loci to help understand the histologic specificity of underlying oncogenic aberrations as well as to determine how they interact with known aberrations (i.e., oncogenic mutations).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).
- Received October 3, 2006.
- Revision received January 31, 2007.
- Accepted February 13, 2007.
- ©2007 American Association for Cancer Research.