Aberrant DNA methylation at CpG islands is thought to contribute to cancer initiation and progression, but mechanisms that establish and maintain DNA methylation status during tumorigenesis or normal development remain poorly understood. In this study, we used methyl-CpG immunoprecipitation to generate comparative DNA methylation profiles of healthy and malignant cells (acute leukemia and colorectal carcinoma) for human CpG islands across the genome. While searching for sequence patterns that characterize DNA methylation states, we discovered several nonredundant sequences in CpG islands that were resistant to aberrant de novo methylation in cancer and that resembled consensus binding sites for general transcription factors (TF). Comparing methylation profiles with global CpG island binding data for specific protein 1, nuclear respiratory factor 1, and yin-yang 1 revealed that their DNA binding activity in normal blood cells correlated strictly with an absence of de novo methylation in cancer. In addition, global evidence showed that binding of any of these TFs to their consensus motif depended on their co-occurrence with neighboring consensus motifs. In summary, our results have two major implications. First, they point to a major role for cooperative binding of TFs in maintaining the unmethylated status of CpG islands in health and disease. Second, they suggest that the majority of de novo methylated CpG islands are characterized by the lack of sequence motif combinations and the absence of activating TF binding. Cancer Res; 70(4); 1398–407
- DNA Methylation
- Gene Regulation
Introduction
The epigenetic status of a normal cell can be drastically altered during aging and, even more markedly, during malignant transformation (1). A commonly observed alteration is the aberrant DNA methylation of CpG islands that often targets tumor suppressor genes and may play a role in disease initiation and progression (1). Exactly how and why certain CpG islands are prone to methylation, whereas others remain unmethylated, is largely unknown.
In the past, different mechanisms for cancer-dependent, aberrant de novo methylation have been proposed largely based on the behavior of individual CpG islands. One possibility includes an initial random methylation event that is selected for during progressive proliferation (1). A second possibility comprises the targeted recruitment of DNA methyltransferases to methylation targets by cis-acting factors (2, 3), histone methyltransferases such as G9a (4, 5), or EZH2 (6). The third possibility includes the loss of chromatin boundaries or the absence of “protective” transcription factors (TF), leading to the spreading of DNA methylation into affected CpG islands (7).
The latter possibility is supported mainly by anecdotal evidence (2, 6, 8–15), showing that the unmethylated state of chromosomal DNA is maintained or even established by DNA-binding proteins. Although several previous studies identified specific nucleotide sequences (16–20) or structural features (21) that correlated with methylation-prone or methylation-resistant CpG islands in cancer samples, they did not identify sequence motifs resembling consensus sites for known TFs, arguing against a protective role of TFs.
In the present study, we show that CpG islands that remain unmethylated in normal and malignant cells (leukemia and colon cancer) contain specific sequence motifs that are identical to consensus sequences for general TFs. Sites that are stably bound by these factors in normal cells are highly resistant to de novo methylation. We also show that the stable binding of TFs in normal cells to their respective motifs requires the presence of neighboring consensus sites for other cis-acting factors, highlighting the importance of combinatorial interactions in defining TF-bound and TF-unbound regions as well as in conferring resistance to de novo methylation.
Materials and Methods
DNA preparation from normal cells and clinical samples
Colorectal cancer samples were collected from 10 patients who underwent colon resection for biopsy-proven invasive colorectal adenocarcinoma. The study was approved by the local ethical committee. Each resection specimen was staged and graded by routine pathology analysis according to the tumor-node-metastasis classification by the American Joint Committee on Cancer. DNA from frozen colon tissues was isolated using the Puregene DNA Purification kit (Gentra) according to the supplier's recommendation. Three normal colon DNAs (male, ages 50–63 y) were purchased from BioChain Institute, Inc. Leukemic blasts and bone marrow cells from acute leukemia patients were collected during routine diagnostic bone marrow aspirations. Patients had given informed consent to additional sample collection and analyses according to a protocol approved by the local ethical committee. The human myeloid leukemia cell lines THP-1 and U937 (American Type Culture Collection) were grown in RPMI 1640 (Biochrom KG) supplemented with 10% FCS (Life Technologies). Human peripheral blood monocytes from several healthy donors (male, ages 20–28 y) were purified by Ficoll gradient centrifugation and subsequent elutriation. DNA from hematopoietic cell types was prepared using the Blood & Cell Culture DNA kit (Qiagen). DNA concentration was determined with the NanoDrop spectrophotometer and quality was assessed by agarose gel electrophoresis.
The recombinant MBD-Fc protein was produced as previously described (22). Methyl-CpG immunoprecipitation (MCIp) was performed as described with slight modifications. In brief, genomic DNA was sonicated to a mean fragment size of 350 to 400 bp. Each sample (2 μg) was incubated with 150 μL of protein A–Sepharose 4 Fast Flow beads (GE Healthcare) coated with 60 μg of purified MBD-Fc protein in 2 mL Ultrafree-MC centrifugal filter devices (Amicon/Millipore) for 3 h at 4°C in buffer containing 300 mmol/L NaCl. Beads were centrifuged to recover unbound DNA fragments (300 mmol/L fraction) and subsequently washed with buffers containing increasing NaCl concentrations (400, 500, and 550 mmol/L). Densely CpG-methylated DNA was eluted with a high-salt buffer (1,000 mmol/L NaCl), and all fractions were desalted using the MinElute PCR Purification kit (Qiagen). The separation of CpG methylation densities of individual MCIp fractions was controlled by quantitative PCR using primers covering the imprinted SNRPN.
Chromatin immunoprecipitation and ligation-mediated PCR
Chromatin immunoprecipitation (ChIP) analysis of purified peripheral blood monocytes was performed essentially as described (23). Precipitation of precleared chromatin from 10 × 10⁶ cells was done overnight at 4°C using 5 μg of anti–yin-yang 1 (YY1; Santa Cruz Biotechnology), anti–nuclear respiratory factor 1 (NRF1; Abcam), anti–specific protein 1 (Sp1; Upstate), and anti-rabbit IgG (Upstate). After reversion of cross-links, enriched DNA fragments were recovered using the QIAquick PCR Purification kit (Qiagen). The quality of each ChIP was controlled at known target sites by quantitative PCR. For ChIP-on-Chip analysis, all samples as well as an aliquot of equally treated input DNA were amplified by ligation-mediated PCR for subsequent labeling as described in Supplementary Materials and Methods.
Microarray handling and analysis
Enriched methylated DNA fragments of the high-salt MCIp fractions were labeled with Alexa Fluor 5-dCTP (cancer cells) and Alexa Fluor 3-dCTP (normal cells) using the BioPrime Total Genomic Labeling System (Invitrogen) according to the manufacturer's instructions. Amplified ChIP material was labeled with Alexa Fluor 5-dCTP and the genomic input with Alexa Fluor 3-dCTP. Comparative MCIp- and ChIP-versus-input hybridizations on 244K CpG island oligonucleotide microarrays (Agilent) were performed using the recommended stringent protocol (Agilent). Images were scanned immediately after washing using a DNA microarray scanner (Agilent) and processed using Feature Extraction Software 9.5.1 (Agilent) and the standard comparative genomic hybridization protocol. Processed signal intensities were further normalized using GC-dependent regression and imported into Microsoft Office Excel 2007 for further analysis. Microarray data have been submitted and are available from the National Center for Biotechnology Information (NCBI)/Gene Expression Omnibus repository (comparative MCIp hybridizations: GSE17455, GSE17510, and GSE17512; ChIP-on-Chip hybridizations: GSE16078). MCIp microarray data for cell lines as well as ChIP-on-Chip data for TFs are also available as Genome Browser tracks in Supplementary Materials and Methods.
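The text does not specify the form of the GC-dependent regression. As a minimal sketch, assuming a simple linear dependence of probe log ratios on probe GC content (an assumption, not the published procedure), the correction could look as follows. The function name `gc_normalize` and the example values are hypothetical.

```python
from statistics import mean

def gc_normalize(log_ratios, gc_fractions):
    """Subtract a least-squares linear GC trend from probe log ratios.

    Assumption: the GC effect is linear; the paper does not state the model.
    """
    gc_mean = mean(gc_fractions)
    lr_mean = mean(log_ratios)
    sxx = sum((g - gc_mean) ** 2 for g in gc_fractions)
    if sxx == 0:
        # All probes share one GC content: only center the ratios.
        return [r - lr_mean for r in log_ratios]
    sxy = sum((g - gc_mean) * (r - lr_mean)
              for g, r in zip(gc_fractions, log_ratios))
    slope = sxy / sxx
    intercept = lr_mean - slope * gc_mean
    # Residuals after removing the fitted GC trend.
    return [r - (intercept + slope * g)
            for g, r in zip(gc_fractions, log_ratios)]

# Hypothetical probes whose log ratios rise linearly with GC content.
gc = [0.40, 0.50, 0.60, 0.70]
ratios = [0.10, 0.20, 0.30, 0.40]
print(gc_normalize(ratios, gc))
```

In this toy case the trend is perfectly linear, so the residuals are zero up to floating-point error; real data would leave the GC-independent signal behind.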
Algorithm for de novo motif finding
Motif discovery was performed using a comparative algorithm similar to those previously described (24). An in-depth description and benchmarking of the software suite HOMER (Hypergeometric Optimization of Motif EnRichment)⁶, which was developed for motif discovery, will be published elsewhere.⁷ Briefly, sequences were divided into target and background sets for each application of the algorithm. Background sequences were then selectively weighted to equalize the distributions of CpG content in target and background sequences, to avoid comparing sequences of different sequence content. Motifs were found separately by first exhaustively screening all oligo sequences for enrichment in the target set compared with the background set using the cumulative hypergeometric distribution. Up to two mismatches were allowed in oligo sequences to increase the sensitivity of the method. The top 50 sequences of each length with the lowest P values were then converted into probability matrices and heuristically optimized to maximize hypergeometric enrichment of each motif in the given data set. As optimized motifs were found, they were removed from the data set to facilitate the identification of additional motifs.
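The background weighting step can be illustrated with a short sketch. This is not the HOMER implementation; it assumes a simple scheme in which sequences are binned by CpG content and each background sequence is weighted by the ratio of the target share to the background share in its bin. The function names and the bin count are hypothetical.

```python
from collections import Counter

def cpg_fraction(seq):
    """Fraction of dinucleotide positions in seq that are CpG."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    return seq.count("CG") / (len(seq) - 1)

def cpg_bin(seq, n_bins=10):
    """Index of the CpG-content bin for seq (last bin includes 1.0)."""
    return min(int(cpg_fraction(seq) * n_bins), n_bins - 1)

def background_weights(targets, backgrounds, n_bins=10):
    """One weight per background sequence, in input order.

    After weighting, the background's CpG-bin distribution matches the
    target's. Background sequences in bins without targets get weight 0.
    """
    t_bins = Counter(cpg_bin(s, n_bins) for s in targets)
    b_bins = Counter(cpg_bin(s, n_bins) for s in backgrounds)
    n_t, n_b = len(targets), len(backgrounds)
    weights = []
    for s in backgrounds:
        b = cpg_bin(s, n_bins)
        weights.append((t_bins[b] / n_t) / (b_bins[b] / n_b))
    return weights

# Hypothetical toy sequences, for illustration only.
targets = ["CGCGCGAT", "ACGTACGT"]
backgrounds = ["CGCGCGTT", "ACGAACGT", "ACGTTCGA", "TTTTTTTT"]
print(background_weights(targets, backgrounds))
```

Weighted motif counts in the background can then be compared with counts in the target set without CpG content acting as a confounder.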
Genomic locations are based on the March 2006 human reference sequence (NCBI Build 36.1) that was produced by the International Human Genome Sequencing Consortium.
All statistical testing of enrichment data (motifs or attributes) was performed using a cumulative hypergeometric distribution (or Fisher's exact test, referred to as the hypergeometric test). Statistical testing of differences in mRNA level distributions was done using the two-sided Mann-Whitney U test.
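For illustration, the cumulative hypergeometric test can be written in a few lines. In this sketch the argument names are assumptions chosen for motif enrichment: `N` is the number of all CpG island regions, `K` the number of regions containing a motif, `n` the number of regions in the class of interest, and `k` the number of regions in that class that contain the motif.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """Upper-tail probability P(X >= k) of the hypergeometric distribution.

    X counts motif-containing regions among n regions drawn without
    replacement from N regions, K of which contain the motif.
    """
    lower = max(k, 0)
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(lower, upper + 1))
    return hits / total

# Hypothetical counts: 10 regions, 4 with the motif, 3 in the target class,
# of which 2 contain the motif.
print(hypergeom_sf(2, 10, 4, 3))
```

The same upper-tail value is the one-sided P value of Fisher's exact test on the corresponding 2 × 2 table, which is why the two names are used interchangeably above.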
Additional methods are provided in Supplementary Materials and Methods.
Results
Comparative DNA methylation profiles of normal and leukemia cells
To globally define methylation-prone and methylation-resistant CpG islands, we initially analyzed the methylation status of 23,000 CpG islands of the human genome in acute leukemia cell lines as well as normal blood monocytes using our previously described MCIp technique (22, 25) as shown in Supplementary Fig. S1A. A typical scatter plot of a comparative hybridization of MCIp-enriched material (Fig. 1) highlights the three types of hybridization behavior: probes that show low signal intensities in both samples (absence of DNA methylation), probes indicating specific enrichment (aberrant DNA methylation) in the leukemia samples, and probes that show high signal intensities but low signal ratios in both samples (methylated in both samples). We initially performed comparative MCIp hybridizations of two well-established leukemia cell lines (human acute monocytic leukemia cell line THP-1 and human histiocytic lymphoma cell line U937) and their normal myeloid counterpart (peripheral human blood monocytes) and extensively validated microarray results by bisulfite conversion and subsequent matrix-assisted laser desorption/ionization–time-of-flight mass spectrometric (MALDI-TOF MS) analysis (26) to show the reproducibility of this method. Three independent replicates of each cell line were highly similar (mean r² = 0.79 and 0.87 for log10 ratios of THP-1 and U937 monocyte comparisons, respectively), and on region or single-probe level, the microarray data correlated well with methylation ratios obtained by MALDI-TOF MS analyses. A detailed description and analysis of the validation set (1,150 amplicons covering 140 genes and ∼13,500 individual CpG dinucleotides) is available in Supplementary Materials and Methods. Confirming previous observations (27), the positional annotation of microarray probes showed that regions around known transcription start sites (TSS) are less often targeted by de novo methylation in leukemia cells than promoter-distal sites (Supplementary Fig. S2A).
For further global analyses, individual probe signals were combined to cover and assign methylation ratios to whole CpG island regions. The comparative analysis on the region level also separated the three different classes of CpG island regions: unmethylated in both, aberrantly hypermethylated, and methylated in both (Supplementary Fig. S2C). We were unable to identify specific properties associated with the latter type of CpG islands, which is heterogeneous and includes both monoallelic (e.g., imprinted regions) and biallelic (tissue- or soma-specific) DNA methylation events (28) and therefore concentrated on properties of unmethylated or de novo methylated genes. Confirming earlier observations (19, 22), the comparison of global mRNA expression data of normal and leukemia cells between the two major CpG island classes showed that the majority of de novo methylated genes is characterized by low or absent transcription irrespective of the position of CpG islands relative to TSS (Supplementary Fig. S2D).
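How probe signals were combined into region-level values is not detailed here. A minimal sketch, assuming that each CpG island region is assigned the mean log10 ratio of its probes (the averaging rule and all identifiers are assumptions for illustration), might be:

```python
from collections import defaultdict
from statistics import mean

def region_ratios(probes):
    """Average probe-level log10 ratios per CpG island region.

    probes: iterable of (region_id, log10_ratio) pairs.
    Returns a dict mapping region_id to the mean log10 ratio.
    """
    by_region = defaultdict(list)
    for region_id, ratio in probes:
        by_region[region_id].append(ratio)
    return {rid: mean(vals) for rid, vals in by_region.items()}

# Hypothetical probe values for two regions.
probes = [("cgi1", 0.0), ("cgi1", 1.0), ("cgi2", 2.0)]
print(region_ratios(probes))
```

A median or a trimmed mean would be a reasonable alternative if single outlier probes are a concern.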
Sequence motifs associate with CpG island regions that remain unmethylated or become hypermethylated in cancer
We next used the de novo motif discovery algorithm HOMER to search for sequence patterns that are associated with CpG island regions that are either specifically and highly methylated in leukemia cell lines or not methylated in any sample (also see Supplementary Materials and Methods) and were able to identify a set of eight nonredundant sequence motifs that were highly enriched in either population in comparison with all CpG island regions on the array (Fig. 2A). Two repetitive motifs were highly enriched in the hypermethylated CpG island population. More strikingly, our de novo motif search revealed several sequences highly enriched in unmethylated CpG island regions that corresponded to consensus binding sites for known TFs, including nuclear TF Y (NFY), GA-binding protein (GABP), Sp1, NRF1, and YY1. These motifs were enriched with high significance and showed a clear enrichment/depletion in unmethylated or methylated CpG island regions, respectively (Fig. 2B). We next obtained comparative methylation profiles of samples from acute leukemia (n = 8, compared with normal monocytes) and colorectal carcinoma patients (n = 10, compared with normal colon) and analyzed the distribution of the above-identified motifs. All sequence motifs were significantly enriched in either unmethylated or methylated CpG island regions in both primary tumor types (Fig. 2B).
Motifs isolated from unmethylated CpG islands were previously described as prominent constituents of proximal promoters (29, 30). Indeed, all six sequence motifs identified in unmethylated regions were enriched within the proximal promoter regions of known genes (Supplementary Fig. S3A and B), whereas repeat sequences showed no specific enrichment around TSSs. Motif searches with groups of unmethylated or methylated CpG island regions that were classified according to their genomic position (promoter/intragenic/intergenic) additionally identified a CTCF consensus motif specifically enriched in unmethylated intergenic CpG island regions (Supplementary Fig. S3A and B; Supplementary Materials and Methods). Despite the significant overrepresentation of the protective motifs in promoters, they were also enriched in unmethylated CpG island regions that were located in intergenic or intragenic regions (Supplementary Fig. S3C). By plotting average MCIp signal intensities against motif distance for each of the protective motifs, we showed that signal ratios were lowest at the center and progressively increased with distance. Distance-related differences in signal ratios markedly increased in leukemia cells, suggesting that these motifs are indeed associated with lower methylation levels and that this association depends on motif distance (Supplementary Fig. S4A). This concurs with the preferential de novo methylation of CpG island shores detected in colon cancer (27). Averaged DNA methylation ratios of individual CpGs derived from high-throughput reduced representation bisulfite sequencing of mouse primary tissues (31) show a similar motif distance–dependent distribution (Supplementary Fig. S4B and C).
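The distance analysis can be sketched as a simple binning procedure. The example below assumes probes and motif centers on a single chromosome, with each probe assigned to its nearest motif center; the bin size, the distance cutoff, and all names are hypothetical choices, not values taken from the study.

```python
from collections import defaultdict
from statistics import mean

def mean_signal_by_distance(probes, motif_centers, bin_size=100, max_dist=1000):
    """Mean probe signal per distance bin around the nearest motif center.

    probes: list of (position, signal) pairs on one chromosome.
    motif_centers: non-empty list of motif center positions.
    Returns {bin_start: mean signal} for probes within max_dist.
    """
    bins = defaultdict(list)
    for pos, signal in probes:
        dist = min(abs(pos - c) for c in motif_centers)
        if dist <= max_dist:
            bins[(dist // bin_size) * bin_size].append(signal)
    return {b: mean(v) for b, v in sorted(bins.items())}

# Hypothetical probes: signals are lowest close to the motif centers.
probes = [(1000, -1.0), (1050, -0.5), (1150, -0.25), (1900, 0.5), (5000, 2.0)]
print(mean_signal_by_distance(probes, motif_centers=[1000, 2000]))
```

Plotting the returned bin means against bin start would give a profile of the kind described above, with the lowest values at the motif center.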
Sequence motifs and TF binding in normal cells correlate with CpG methylation status in leukemia
To study the correlation between motif appearance, TF binding in normal cells, and aberrant DNA methylation in the tumor cell lines, we performed ChIP-on-chip experiments for the TFs Sp1, NRF1, and YY1 in normal monocytes. Like their consensus sites, these factors preferentially bound to promoter regions (Supplementary Fig. S5A), often bound in the vicinity of each other (Supplementary Fig. S5B), and showed enrichment of the other protective motifs around their binding sites (Supplementary Fig. S5C). Some motifs showed preferences in terms of orientation or distance to each other (Supplementary Fig. S6A). In most cases, motif distances showed periodic preferences, which is in line with steric preferences caused by the helical structure of DNA (Supplementary Fig. S6A). Genes associated with TF-bound CpG islands generally showed significantly higher mRNA levels in normal progenitors (CD34+ cells), normal blood monocytes (CD14+ cells), or a leukemia cell line as compared with all genes (Supplementary Fig. S6B), and binding of more than one factor generally increased the overall expression level of associated genes (Supplementary Fig. S6C).
The direct comparison of TF binding patterns in normal cells with aberrant methylation profiles of leukemia cell lines showed that the two events were mutually exclusive in all three cases (Fig. 3; Supplementary Fig. S7). We also observed that TF binding was not detected at every motif. The comparison of bound and nonbound motifs using the de novo motif-finding algorithm revealed that TF-bound motifs were coenriched for the consensus motifs of the other protective factors (Fig. 4A). A sequence motif was more likely bound if at least one or two other motifs were located in close proximity (Fig. 4B), and genes associated with TF-bound motifs showed significantly higher mRNA levels as compared with genes that were associated with nonbound motifs (Fig. 4C). The data suggest that the stable binding of these general TFs (as measured by ChIP) to their consensus motif depends on the presence of neighboring motifs that are cooperatively bound by other general TFs. Thus, the combinatorial presence of two or more of the identified consensus sequences may serve to stabilize TF binding and to confer the resistance of certain CpG islands (preferably those acting as promoters) to aberrant methylation.
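The relationship between binding and neighboring motifs can be illustrated with a small counting sketch. It assumes motif sites on one chromosome, each annotated with a motif name and a ChIP-derived bound flag, and counts the distinct other motif types within a fixed window. The window size, the data layout, and the example sites are hypothetical.

```python
from collections import defaultdict

def bound_fraction_by_neighbors(sites, window=100):
    """Fraction of bound sites, grouped by number of neighboring motif types.

    sites: list of (position, motif_name, is_bound) tuples on one chromosome.
    A neighbor is a site of a different motif type within `window` bp.
    Returns {neighbor_type_count: fraction of sites that are bound}.
    """
    tallies = defaultdict(lambda: [0, 0])  # count -> [bound, total]
    for i, (pos, motif, bound) in enumerate(sites):
        neighbor_types = {
            m for j, (p, m, _) in enumerate(sites)
            if j != i and m != motif and abs(p - pos) <= window
        }
        tally = tallies[len(neighbor_types)]
        tally[0] += int(bound)
        tally[1] += 1
    return {n: b / t for n, (b, t) in sorted(tallies.items())}

# Hypothetical sites: a cluster of three motifs and one isolated motif.
sites = [
    (100, "SP1", True),
    (150, "NRF1", True),
    (180, "YY1", True),
    (5000, "SP1", False),
]
print(bound_fraction_by_neighbors(sites))
```

The pairwise loop is quadratic in the number of sites, which is acceptable for a sketch; sorting sites by position and scanning a sliding window would scale better to genome-wide data.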
Properties of CpG island–associated genes in conjunction with CpG island methylation status and TF binding
Finally, we asked whether DNA methylation status or TF binding events at CpG islands are associated with attributes or properties of the corresponding genes or their products. Thirteen databases were analyzed for enrichment of specific terms or properties, including gene ontology terms, pathway association, protein domains or interactions, chromosomal localization, and predicted miRNA targets in regions that were associated with a DNA methylation status or bound by any of the three TFs Sp1, NRF1, or YY1 (Fig. 5; Supplementary Table S4). Hierarchical clustering of enrichment P values clearly separated the three classes of CpG islands into functional groups. DNA methylation–free and TF-bound regions included terms that were associated with basic cellular functions required for cell survival and proliferation. In line with earlier observations (32), CpG island regions that are commonly targeted by aberrant DNA methylation in both myeloid cell lines showed highly significant associations with gene ontology terms related to developmental processes, TF or receptor functions, as well as homeobox proteins, which are often targeted by Polycomb group repressors. Interestingly, these associations were also found in regions that contained unbound consensus motifs for at least one of the three general TFs above and, to a lesser extent, in regions that were also methylated in normal somatic cells (human blood monocytes; Supplementary Table S4).
Discussion
The hypothesis that a TF provides methylation protection dates back to the reports of two independent groups in 1994, showing that an Sp1-binding site is necessary to protect the APRT gene from de novo methylation (9, 13) in humans and mice. Because Sp1-deficient animals had no obvious “methylation defects,” the concept of methylation protection by TFs has been controversially discussed. Anecdotal evidence clearly supports a role of specific DNA-binding proteins in establishing and maintaining DNA methylation patterns, but it is unclear whether the reported observations represent isolated cases or whether methylation protection represents a general mechanism. Earlier computational studies largely failed to identify defined consensus motifs for known TFs. Only one recent survey of methylation states at CpG islands in normal human tissues described the association of unmethylated CpG islands with the consensus motif for Sp1 (28).
Using a powerful de novo motif analysis, our study shows that several defined sequence motifs are strongly enriched in CpG islands that are generally resistant to de novo methylation in cancer. These sequence motifs were previously shown to represent the most conserved motifs in mammalian promoters (30), but the observed correlation is also evident at intergenic, promoter-distal CpG islands that are not directly associated with transcription. We also show that the sole presence of a consensus motif for any of the general factors is not sufficient to confer “protection” from de novo methylation. In fact, protection from de novo methylation requires the stable binding of these factors to their binding sites, which in turn requires the presence of neighboring motifs that are cobound by at least one other ubiquitous (or in some cases cell type–specific) TF (a schematic model describing the methylation protection hypothesis is shown in Fig. 6).
Most resistant CpG islands were bound by combinations of ubiquitous TFs and also associated with basic cellular functions, whereas “methylation-prone” CpG islands generally associated with organismal development, differentiation, and cell communication, which are frequently regulated by cell type–specific TFs. Interestingly, genes that are associated with CpG islands that were commonly methylated in normal and cancer cells were enriched for predicted targets of specific (mostly uncharacterized) miRNAs; however, the relevance of this observation is uncertain and requires functional validation. We also observed that methylation-prone regions are significantly enriched for certain repeat motifs (GAGA and CACA), implying that they may also act as cis-acting sequences and direct de novo DNA methylation. GAGA resembles the consensus motif for Drosophila GAGA-binding factor, a trithorax group member that has been implicated in preventing heterochromatin spreading (33); however, a mammalian homologue has not been described thus far. CA repeats have not been previously linked to DNA methylation or chromatin structure.
With the exception of the Sp1/Sp3 motif, none of the other motifs has previously been associated with the establishment or maintenance of DNA methylation (8, 9, 28) but all are known to recruit epigenetic modifiers to their binding sites. NFY, a regulator of many cell cycle control genes, actively recruits coactivators (such as p300) that induce histone acetylation at NFY-bound promoters (34). Ubiquitously expressed NRF1 and GABP (also called NRF2) are able to recruit coactivators (PGC-1 and p300/CBP) that create a chromatin environment favoring transcription (35, 36). YY1 has been shown to recruit Polycomb group proteins that control H3K27 methylation, a mark that has previously been implicated in aberrant silencing mechanisms during tumorigenesis (6, 37). However, a recent study by Lindroth and colleagues (38) elegantly showed that H3K27 methylation (recruited by YY1) and CpG DNA methylation at the murine Rasgrf1 locus are mutually exclusive, suggesting that both epigenetic marks are interdependent and antagonistic. This is also consistent with a recent study globally mapping key histone modifications and subunits of Polycomb-repressive complexes 1 and 2 (PRC1 and PRC2) in embryonic stem cells (39), which identified a YY1-like motif enriched in CpG islands that were not targeted by PRC2. Additional motifs identified in this study (ETS, NFY, AP-1, MYC, and NRF1; ref. 39) partially overlapped with those observed in the present study, further corroborating the negative correlation of repressive epigenetic marks and cis-acting sequences conferring transcriptional activity. In line with several recent observations showing that the DNA methylation status correlates with histone modifications (31, 40, 41), the factors binding the identified sequences likely share the ability to recruit RNA polymerase II (Pol II) and to create an “active” chromatin environment that may prevent or at least impede de novo CpG methylation at particular CpG islands.
An analogous study recently showed that the presence of RNA Pol II, active or stalled, predicts the epigenetic fate of promoter CpG islands in cancer (42). Because the recruitment of RNA Pol II requires cis-acting factors such as Sp1 (43), a large overlap between TF and Pol II binding is expected and the association of Pol II with resistance to de novo methylation is likely a consequence of its interaction with TFs present at the promoter. However, the fact that TF-bound, promoter-distal sites were equally resistant to de novo methylation in our study suggests that cis-acting factors may have a protective role independent of Pol II binding.
In conclusion, our data provide strong experimental and computational evidence that specific sequence motifs are associated with the DNA methylation states of CpG islands in normal and malignant cells. Most of these motifs are identical to consensus motifs for known general TFs, and our data suggest that the combinatorial binding of these factors plays a dominant role in regulating the DNA methylation status at a large set of CpG islands. Our findings also imply that the aberrant methylation patterns in cancer cells may at least in part result from a “loss of protection.”
Disclosure of Potential Conflicts of Interest
M. Ehrich is a shareholder and employee of Sequenom, Inc.
Grant Support: Wilhelm Sander Stiftung and Deutsche Krebshilfe (M. Rehli).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).
Microarray data are deposited with Gene Expression Omnibus (gene expression analyses: GSE16076; comparative methyl-CpG immunoprecipitation hybridizations: GSE17455, GSE17510, and GSE17512; ChIP-on-Chip hybridizations: GSE16078).
Author contributions: C. Gebhard, M. Ehrich, and M. Rehli designed research; C. Gebhard, L. Schwarzfischer, E. Schilling, and M. Klug performed research; C. Benner, W. Dietmaier, C. Thiede, E. Holler, and R. Andreesen contributed new reagents or analytic tools; C. Gebhard, M. Ehrich, and M. Rehli analyzed data; M. Rehli wrote the paper.
⁷ C. Benner et al., in preparation.
- Received September 28, 2009.
- Revision received November 17, 2009.
- Accepted December 8, 2009.
- ©2010 American Association for Cancer Research.