| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Systems Biology and Emerging Technologies |
The Simons Center for Systems Biology, Institute for Advanced Study, Princeton, New Jersey
Requests for reprints: Jun S. Song, The Simons Center for Systems Biology, Institute for Advanced Study, Einstein Drive, Princeton, NJ 08540. Phone: 609-734-8038; Fax: 609-951-4438; E-mail: jssong{at}ias.edu.
| Abstract |
|---|
|
|
|---|
(ER
) binding sites in cell lines, the precise mechanism governing the gene-to-enhancer association is still unknown and no quantitative model that can predict the estrogen responsiveness of genes has been hitherto proposed. This article presents an integrative genomics approach to construct a predictive model that can explain more than 70% of estrogen-induced expression profiles. The proposed method combines a recent map of the insulator protein CCCTC-binding factor (CTCF) with previous ER location studies and expression profiling in the presence of the translation inhibitor cycloheximide, providing evidence that CTCF partitions the human genome into distinct ER-regulatory blocks. It is shown that estrogen-responsive genes with a decreased transcription level (down-regulated genes) have a markedly different relative distribution of ER binding sites compared with those with an increased transcription level (up-regulated genes). Finally, Bayesian belief networks are constructed to quantify the effects of ER-binding distance from genes as well as the insulating effects of CTCF on the estrogen responsiveness of genes. This work thus represents a stride toward understanding and predicting the distal activities of steroid hormone nuclear receptors. [Cancer Res 2008;68(21):9041–9] | Introduction |
|---|
|
|
|---|
(ER-
) in the breast cancer cell line MCF7 (1–3), shifting the traditional focus of study from proximal promoters to distal enhancers, the preponderance of which has now been shown also for other transcription factors such as androgen receptor and p63 (4, 5). Despite the success in constructing a genomic map of ER-binding sites, the mechanism by which ER regulates a certain subset of genes while not affecting others remains a profound mystery. In fact, the distal nature of ER activities renders the binding site information alone quite insufficient for predicting estrogen-responsive genes. For instance, our analysis of these data shows that only 24% of genes with ER binding in [–1 kb,1 kb] proximal promoters and only 15% of genes with ER in exons or introns are actually regulated by ER. In addition to using chromatin immunoprecipitation (ChIP), a common approach for identifying genes regulated by a transcription factor involves time course expression profiling before and after overexpression or overactivation of the transcription factor, e.g., via estrogen induction in the case of ER (2, 6). This method, however, is complicated by the differential expression of numerous secondary targets that are not directly regulated by the transcription factor under investigation but rather by other transcriptionally induced transcription factors. Ideally, a combination of high-throughput expression and ChIP experiments can provide a powerful tool to discover functional transcription factor binding sites and a confident set of primary target genes by focusing on differentially expressed genes with transcription factor binding sites near the genes. Unfortunately, estrogen and androgen receptors pose formidable exceptions to this ideal scenario because they can bind quite far (>100 kb) from the genes that they regulate while having no effect on intermediate genes. Along this line, a crucial role has been ascribed to the so-called "pioneering factor" FoxA1, a forkhead protein that selectively localizes to H3K4me2-modified enhancer regions and remodels chromatin for subsequent ER and androgen receptor binding activities (1, 7). Around 50% of ER binding sites have nearby FoxA1, and FoxA1 may thus play an important role in establishing cell type–specific gene regulation by ER and androgen receptor (1, 2, 7). However, only a small subset (<40%) of estrogen-responsive genes can be explained by any given study to date. Understanding how the estrogen receptor selectively chooses genes that it regulates thus remains a daunting challenge.
A recent study attempted to circumvent these difficulties by using the translation inhibitor cycloheximide in time course expression profiling (6). In addition to eliminating secondary estrogen-target genes, cycloheximide can prevent potential repression of estrogen-induced primary transcription by secondary negative feedback loops and can also stabilize mRNAs by weakening the activities of RNases (8), allowing low-level transcripts to accumulate for detection. Despite these apparent advantages of using cycloheximide to identify the genes directly targeted by ER, the cytotoxicity of cycloheximide and the accompanying genetic perturbation introduced into cells may raise serious doubts against the validity of this approach. Our article revisits the use of cycloheximide in studying ER regulatory networks and provide strong support for the use of cycloheximide in this particular setting. We accomplish this task by integrating the data available from high-throughput ER ChIP and expression studies with those from another genome-wide assay of the so-called CCCTC-binding factor (CTCF; ref. 9).
CTCF is a nuclear protein that is ubiquitously expressed across cell types. It binds to diverse DNA sequences by combinatorial use of its 11-zinc finger DNA-binding domains. CTCF is essential and highly conserved from fruit fly to human, with its binding sites being highly conserved across vertebrates (10). It is the only major protein identified in vertebrates that is involved in the establishment of insulators that can block enhancer activities. CTCF has been shown to be involved in the regulation of gene imprinting, monoallelic gene expression, X chromosome inactivation, and escape from X-linked inactivation (11). The mechanism of insulator function, however, still remains unknown. The distribution of CTCF binding sites in the genome correlates with gene density, with
46% of sites lying in intergenic regions, 20% near transcriptional start sites, 22% in introns, and 12% in exons (9). Regions that are depleted of CTCF binding sites often include clusters of related gene families and genes that are transcriptionally coregulated; on the other hand, regions that are enriched in CTCF binding sites tend to contain genes with tissue-specific multiple alternative promoters (9). The distribution of CTCF binding sites is thus consistent with its role as an insulator, and it remains to be shown in a genome-wide fashion whether CTCF can block the activity of distally located transcription factors such as estrogen receptor.
By combining the aforementioned data sets, we here report for the first time integrated computational models that can predict estrogen-responsive genes with 60% to 70% sensitivity and 70% to 80% specificity, demonstrating the insulator function of CTCF in a global scale. In particular, we show that CTCF may block transcriptional activation of genes by ER and that most estrogen-responsive genes do not have CTCF between them and nearby ER binding sites. We also construct Bayesian belief networks to quantify the distance effect of ER and FoxA1-binding sites as well as the insulating effect of CTCF on gene regulation.
| Materials and Methods |
|---|
|
|
|---|
Expression data normalization and analysis. The expression data from refs. 2 and 6 were normalized together using RMA (12) and using the latest probe mapping to the human genome NCBI Build 36 (13). The cutoffs for deciding down-regulated and up-regulated genes in the presence of cycloheximide were obtained by constructing a three-component mixture model of log fold-change distribution, as described in Results. Gaussian mixture models of log fold-change distribution were estimated using an expectation maximization algorithm.1 The 5th and 95th quantiles of the middle distribution consisting of noise and weakly responsive genes were chosen as cutoffs. These cutoffs corresponded to the 2nd and 97th quantiles in the overall fold-change distribution of Bourdeau and colleagues. Equivalent cutoffs for selecting down-regulated and up-regulated genes of Carroll and colleagues were taken to be the 2nd and 97th quantiles of the distribution of the maximum statistic max(fold change at 3 h, fold change at 6 h).
Correlation of expression of estrogen-inducible genes within and across CTCF blocks. UGT1A1, UGT1A2, ..., UGT1A10 comprise a gene family with alternative transcription start sites (TSS) and were excluded from "Within CTCF" comparisons to remove any potential bias. The data set of Wang and colleagues was obtained from Gene Expression Omnibus GSE2034.
Bayesian network analysis. For each gene, the distance DER from TSS to the nearest ER binding site in the genome was computed and discretized into 30 bins, and likewise for the distance DFoxA1 to the nearest FoxA1. Discretization was performed using the Data PreProcessor (14) with the option "Equal Frequency." The boundaries of DER bins were located at –1169719, –756064, –541760, –410268, –320005, –251895, –198610, –159728, –125557, –97298, –73124, –51441, –30947, –15326, –684, 11100, 27113, 44859, 67912, 91809, 119665, 152739, 196654, 247969, 312264, 405354, 542869, 737070, 1139259. Similarly, the boundaries of DFoxA1 bins were at –510165, –317673, –235483, –180471, –144864, –117117, –95399, –77092, –61167, –46743, –33846, –22280, –12887, –3749, 557, 6754, 15363, 26293, 36663, 49181, 62833, 79941, 99614, 123768, 151500, 188368, 240405, 333340, 516120. The FoxA1 sites at FDR 1% were obtained from ref. 7. BN PowerPredictor was used to select the four significant Bayesian belief networks shown in Fig. 5A, maximizing the area under the receiver operating characteristic curve (14) for training data sets. BN PowerPredictor estimates the conditional probabilities graphically encoded in Fig. 5A by simply counting the frequency of events in training data sets. We used an unbiased success rate of 0.5 as a prior for the Bernoulli variable CLASS. Posterior probability of CLASS is then computed for each test gene from these conditional probabilities and prior distribution.
|
| Results |
|---|
|
|
|---|
To examine the effect of cycloheximide on estrogen-regulated genes, we normalized together the data from Carroll and colleagues and Bourdeau and colleagues (see Materials and Methods). Density and quantile-quantile plots of fold changes from the two data sets, shown in Fig. 1A and B
, respectively, indicated that the data of Carroll and colleagues had roughly three times more differentially expressed genes than the data of Bourdeau and colleagues at a given fold-change cutoff in the same cell line MCF7, consistent with the fact that cycloheximide eliminates many secondary targets of ER from being differentially expressed after estrogen treatment. The fold-change distribution of Bourdeau and colleagues has heavy tails and is not easily decomposed into estrogen-responsive and nonresponsive genes because of tight mixing between signal and noise. To find a statistical way of demarcating the subdistribution of up-regulated genes directly targeted by ER, we constructed a Gaussian mixture model of fold changes consisting of three components. We fitted the log fold-change distribution P using an expectation-maximization algorithm1 as P = 0.64 N(–0.022,0.1) + 0.33 N(0.015,0.24) + 0.03 N(0.45, 0.77), where N(µ,
) is normal distribution with mean µ and SD
. Figure 1C shows that the model fits the original distribution well (Kolmogorov-Smirnov test P > 0.13). A simpler two-component model did not yield a good fit (Kolmogorov-Smirnov test P = 2.7 x 10–14; see Supplementary Fig. S1). We interpreted the component N(–0.022,0.1) as describing pure noise, N(0.015,0.24) as a mixture of noise and weakly responsive genes, and N(0.45, 0.77) as strongly responsive genes. By taking the 5th quantile and 95th quantile of the distribution N(0.015,0.24) for weakly responsive genes as cutoffs, we identified 509 up-regulated and 326 down-regulated genes, consistent with the estimates of 544 up-regulated and 235 down-regulated direct targets found by Bourdeau and colleagues using their 1.4 fold-change cutoff. We shall use the genes identified in this analysis as estrogen-responsive genes throughout this article. It should be noted that the fold-change method agreed well with significance analysis of microarrays (SAM; ref. 16), with 90% of our top 500 genes appearing in the list of top 500 genes found by SAM; fold changes and SAM q values also had a very high Pearson correlation of –0.93.
|
|
Up-regulated estrogen-responsive genes have ER binding sites within CTCF blocks. Three hundred forty-eight (68%) up-regulated genes have ER binding within the same CTCF blocks as the genes. To assess the significance of this phenomenon, we computed the percentage of genes with ER as a function of fold-change cutoff (see Fig. 3A ). We simulated 6,285 random ER binding sites and computed the fraction of up-regulated estrogen-responsive genes (+cycloheximide) in CTCF blocks with ER, repeating the procedure 10,000 times. Figure 3A shows that a significant number of ER binding sites lie in the same CTCF blocks as up-regulated genes (P value <10–4). It can also be seen that much less significant fraction of the early up-regulated genes identified by Carroll and colleagues have ER binding in CTCF blocks, thus supporting the use of cycloheximide treatment in filtering out genes not directly targeted by ER. Interestingly, because highly estrogen-responsive genes tend to have ER within 20 kb, whereas CTCF blocks are much larger, simulating random CTCF sites did not significantly change the fraction of genes having ER within CTCF blocks, indicating that Fig. 3A may be capturing the distance effect of ER rather than CTCF itself (see Supplementary Fig. S3A). Supplementary Fig. S3B shows the corresponding analysis for ER and FoxA1 overlapping sites. Similar graphs are obtained when one uses fold-change ranks instead of fold changes (data not shown).
|
CTCF can act as an insulator of ER regulation. It has been shown that CTCF can act as boundary elements that group together coregulated genes into regulatory blocks (9, 10). To see whether CTCF also plays a similar role in ER-mediated gene regulation, we examined the correlation of expression of estrogen-inducible genes across cell lines in various expression profiling studies. Figure 3B shows the boxplots of pair-wise Pearson correlation across NCI60 cell lines (17) between up-regulated genes within same CTCF blocks and between up-regulated genes located within 100 kb but in different CTCF blocks. The correlation of genes within the same CTCF blocks was significantly higher than that of genes located within 100 kb but in different CTCF blocks (P = 6.87 x 10–9 for the data of Bourdeau and colleagues in the presence of cycloheximide, P = 9.261 x 10–11 for the data of Carroll and colleagues), consistent with the idea that the genes within CTCF blocks were coregulated by ER. Although it may be true that nearby genes can be coregulated, the observed high correlation was not because the genes were closer to each other within blocks than between blocks; in fact, Supplementary Fig. S5 shows that the opposite was true for the gene pairs in our analysis, as we had restricted the maximum distance between genes across CTCF blocks to be 100 kb. A significant difference in correlation was also observed across 286 primary ER+ and ER– breast cancer samples (ref. 18; see Supplementary Fig. S6A). To find further support for the insulator effect of CTCF on ER regulation, we randomized the CTCF boundaries by taking the midpoints of original blocks as new boundary locations, unless a block contained more than one differentially expressed gene, in which case the midpoint between two randomly chosen differentially expressed gene TSSs was chosen. The expression levels of estrogen-regulated genes within the random CTCF blocks were not significantly more correlated across NCI60 cell lines than those in different blocks (P = 0.04; see Supplementary Fig. S6B), indicating that the high correlation of estrogen-responsive genes observed in Fig. 3B depended on the structure of CTCF boundaries.
To further verify that CTCF can indeed act as an insulator of ER, we checked that the genes in CTCF blocks without ER (non–ER blocks) but immediately adjacent to ER blocks with up-regulated genes have significantly low fold changes (P = 1.3 x 10–16, using t test; see Fig. 3C). Supplementary Fig. S8 shows that randomizing CTCF sites yields more "leaky" ER activities across CTCF boundaries. In addition, of the 110 genes that lie in non–ER blocks but have a nearest ER binding site within 20 kb in an immediately adjacent CTCF block, only 3 were up-regulated (P = 5.6 x 10–4). ER activation of genes, therefore, is mostly confined to ER blocks and does not cross the boundaries into neighboring blocks. This observation, together with the fact that a significant number of up-regulated genes have ER binding sites within their CTCF blocks, supports that CTCF can indeed delimit the range of ER activities.
Distance effect of ER binding sites. Distance distribution of nearest ER binding sites from up-regulated genes in ER blocks had mean 85 kb (SD 242 kb), 10% trimmed mean 47 kb (SD 63 kb; see Materials and Methods). For down-regulated genes, the mean was 216 kb (SD 391 kb), 10% trimmed mean 151 kb (SD 118 kb), comparable with nonresponsive genes (mean 245 kb and SD 590 kb, 10% trimmed mean 148 kb and SD 180 kb). As can be seen in Fig. 4A , ER was located significantly closer to the TSSs of up-regulated genes than those of down-regulated or nonresponsive genes (P < 10–14, Wilcoxon rank sum test), whereas there was no significant difference between the nearest ER locations for down-regulated and nonresponsive genes (P > 0.7, Wilcoxon rank sum test). Although ER binding sites lie close to up-regulated genes within CTCF blocks and most highly up-regulated genes have ER binding within 20 kb, there was no significant global correlation between the distance of ER binding sites and fold change among up-regulated genes (see Fig. 4B). Pearson correlation between fold change and log(|D|) was –0.09 with a P value of 0.097.
|
30% less than that of down-regulated genes found by Carroll and colleagues. Bayesian networks for predicting estrogen-responsive genes. Although 44% of the genes directly targeted by ER lie within 20 kb of ER binding sites, the precise effect of the distance between ER/FoxA1 binding sites and TSSs has not been hitherto quantified. We here used Bayesian networks to model the distance effect of ER/FoxA1 binding sites on differential expression of genes. A Bayesian network is a probabilistic graphical model that allows computation of joint probabilities in terms of a limited number of conditional probabilities, with arrows encoding the conditional dependence of the head node on the tail node. Bayesian networks have been previously applied to successfully predict 73% to 79% of the gene expression patterns in Saccharomyces cerevisiae (20). After feature selection, four models shown in Fig. 5A were analyzed using BN PowerPredictor (14) and the performance of each model was tested using 5-fold cross-validation (see Materials and Methods).
The first network in Fig. 5A involving only the distance DER to the nearest ER in the genome yielded 56% sensitivity and 81% specificity for predicting estrogen-responsive genes. Roughly, this model predicted a gene to be estrogen regulated if it had an ER binding site within 50 kb of its TSS. The second model that included only the information about whether a gene had an ER binding site within its CTCF block increased the sensitivity to 68% but lowered the specificity to 65%. The third model combining the information about ER in CTCF block and distance to the nearest ER had 66% sensitivity and 72% specificity (68% sensitivity on training sets). Incorporating the distance to the nearest FoxA1 site in the genome (the fourth model in Fig. 5A) yielded 92% sensitivity and 79% specificity on training sets, but only 55% sensitivity and 74% specificity on test sets, indicating that this model was overfitted. We have also analyzed the model containing only the information about nearest FoxA1 sites; however, it had only 50% sensitivity and 66% specificity on test sets. Overall, the third model gave the best compromise between sensitivity and specificity and also had the greatest area under the receiver operating characteristic curve (Supplementary Fig. S9).
We have examined other factors that could potentially correlate with estrogen responsiveness, but no significant influence was found: There was a slight correlation between the number of ER binding sites within CTCF blocks and expression fold change (Supplementary Fig. S10A). Although the ER blocks with an up-regulated gene had significantly more ER binding sites than those without an up-regulated gene (Supplementary Fig. S10B), the distance-normalized number of ER binding sites was similar between CTCF blocks with and without differentially expressed genes. There was also no correlation between the number of ER motifs in nearest ER binding sites and fold change (correlation = 0.01 and P = 8.517 x 10–1). We also checked that there was no orientation bias in differentially expressed genes within CTCF blocks.
| Discussion |
|---|
|
|
|---|
Unlike the up-regulated genes, we observed that down-regulated genes had a broad distribution of ER binding sites and had low fold changes. The weak response of down-regulated genes is consistent with the fact that cycloheximide prevents the translation of AP1 and NRIP1, which are transcriptionally activated by ER and which have been shown to play critical roles in ER-mediated repression of genes (2, 19). Our study thus supports the hypothesis that ER repression of genes involves secondary cofactors that are themselves induced by estrogen.
CTCF in MCF7 versus IMR90. To date, a genome-wide map of CTCF binding sites in MCF7 is not available. It has been shown, however, that the CTCF binding sites are mostly invariant across cell types. For example, a comparison of 44 genomic regions representing 1% of the human genome (ENCODE regions) in the hematopoietic progenitor cell line U937 and primary human fibroblasts IMR90 yielded an agreement of 67% at a strict confidence level of P < 10–6; the overlap increased at lower cutoffs (9). CTCF sites have also been mapped in CD4+ T cells using ChIP-seq (21) and a genome-wide comparison between the sites and those in IMR90 had a 71% overlap. We thus believe that the majority of the CTCF binding sites found in IMR90 will also be present in MCF7 and, at the same time, that accounting for the cell type–specific CTCF sites will only increase the power of our analysis.
CTCF blocks form regulatory units for understanding and predicting ER activities. For the past 3 years, high-throughput sequencing and ChIP-ChIP studies have revealed that ER can bind in distal enhancers to regulate its targeted genes (1–3). Although it was observed that highly regulated genes tended to have ER binding within 20 kb (2, 7), various distance cutoffs were chosen somewhat arbitrarily for analyses and no quantitative model has yet been formulated to capture the precise distance effect of ER binding locations. Here, we trained a Bayesian belief network to discover that just based on the information about nearest ER locations, our model classified genes that have nearest ER within 50 kb as estrogen responsive. As expected, this naïve approach capturing only the ER distance effect yielded a high level of 44% false negatives.
We have presented in this article several lines of evidence that CTCF can demarcate the range of ER activities, grouping coregulated estrogen-responsive genes into distinct blocks. As can be seen in Fig. 5B, incorporating the information about whether genes reside in ER-containing CTCF blocks increased the performance of our Bayesian network. Considering conserved motifs in promoters further improved the sensitivity and specificity by
3% (data not shown), supporting the hypothesis that the specificity of estrogen-induced genes is in part determined by the presence of particular cofactors in promoters. Interestingly, Bourdeau and colleagues could not find c-Myc as enriched in up-regulated genes, although c-Myc has been shown to localize to estrogen-responsive promoters and interact with ER (22). c-Myc is an important oncoprotein with a diverse transcriptional program in cell proliferation, growth, and apoptosis, interacting not only with ER but with other cofactors (23). The broad distribution of c-Myc and its non–ER-specific activities throughout the genome thus may make it difficult to find c-Myc as an enriched motif in estrogen-responsive genes. Our analysis shows that CTCF also helps in this regard; moreover, by considering the CTCF blocks with and without ER separately, we were able to detect c-Myc as a potential cofactor of ER in up-regulated promoters.
Many aspects of ER regulation of genes still remain unclear. For instance, about 30% of up-regulated genes did not have ER binding sites in their CTCF blocks. It could be that ER was actually present, but the specific epitopes recognized by antibodies were masked by interacting proteins such as FoxA1. Furthermore, there are more than 6,000 ER binding sites in the genome but only
500 direct target genes, and more than 6,000 non–estrogen-responsive genes (35% of RefSeq genes studied here) had ER binding within their CTCF blocks. Therefore, we do not yet understand the functions, if any, of the majority of ER binding regions. It is possible that the specificity of ER activation of genes can be influenced by local chromatin structure or other epigenetic states, such as the recently found estrogen-induced dimethylation of histone 3 arginine 17 in the E2F1 promoter by CARM1 (24). Future studies in these directions will greatly facilitate our understanding of how ER functions in normal development and cancer.
| Disclosure of Potential Conflicts of Interest |
|---|
|
|
|---|
| Acknowledgments |
|---|
We thank H. Girgis, A.J Levine, and I. Tagkopoulos for helpful comments on the manuscript; M. Brown, M. Lupien, and C. Meyer for discussions; and Y.S. Song for bringing the work of Bourdeau and colleagues to our attention.
| Footnotes |
|---|
1 http://algorithmics.molgen.mpg.de/Software/PyMix/ ![]()
Received 7/ 9/08. Revised 8/ 6/08. Accepted 8/22/08.
| References |
|---|
|
|
|---|
binding sites. PLoS Genet 2007;3:e87.[CrossRef][Medline]
responsive promoters. Mol Cell 2006;21:393–404.[CrossRef][Medline]This article has been cited by other articles:
![]() |
J. Eeckhoute, R. Metivier, and G. Salbert Defining specificity of transcription factor regulatory activities J. Cell Sci., November 15, 2009; 122(22): 4027 - 4034. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Lupien and M. Brown Cistromics of hormone-dependent cancer Endocr. Relat. Cancer, June 1, 2009; 16(2): 381 - 389. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. W. Pike, M. B. Meyer, and M. L. Martowicz New Techniques in Transcription Research Extend Our Understanding of the Molecular Actions of the Vitamin D Hormone IBMS BoneKEy, May 1, 2009; 6(5): 169 - 180. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Eeckhoute, M. Lupien, C. A. Meyer, M. P. Verzi, R. A. Shivdasani, X. S. Liu, and M. Brown Cell-type selective chromatin remodeling defines the active subset of FOXA1-bound enhancers Genome Res., March 1, 2009; 19(3): 372 - 380. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| Cancer Research | Clinical Cancer Research |
| Cancer Epidemiology Biomarkers & Prevention | Molecular Cancer Therapeutics |
| Molecular Cancer Research | Cancer Prevention Research |
| Cancer Prevention Journals Portal | Cancer Reviews Online |
| Annual Meeting Education Book | Meeting Abstracts Online |