Gene set control analysis predicts hematopoietic control mechanisms from genome-wide transcription factor binding data

Transcription factors are key regulators of both normal and malignant hematopoiesis. Chromatin immunoprecipitation (ChIP) coupled with high-throughput sequencing (ChIP-Seq) has become the method of choice to interrogate the genome-wide effect of transcription factors. We have collected and integrated 142 publicly available ChIP-Seq datasets for both normal and leukemic murine blood cell types. In addition, we introduce the new bioinformatic tool Gene Set Control Analysis (GSCA). GSCA predicts likely upstream regulators for lists of genes based on statistical significance of binding event enrichment within the gene loci of a user-supplied gene set. We show that GSCA analysis of lineage-restricted gene sets reveals expected and previously unrecognized candidate upstream regulators. Moreover, application of GSCA to leukemic gene sets allowed us to predict the reactivation of blood stem cell control mechanisms as a likely contributor to LMO2 driven leukemia. It also allowed us to clarify the recent debate on the role of Myc in leukemia stem cell transcriptional programs. As a result, GSCA provides a valuable new addition to analyzing gene sets of interest, complementary to Gene Ontology and Gene Set Enrichment analyses. To facilitate access to the wider research community, we have implemented GSCA as a freely accessible web tool (http://bioinformatics.cscr.cam.ac.uk/GSCA/GSCA.html).

Cell type-specific gene expression is an inherent property of all multicellular organisms and indeed represents a major determinant that underlies the generation of differentiated cell types with distinct functionality. Elucidating the molecular mechanisms controlling cell type-specific expression has the power to reveal fundamental insights into the regulatory circuitry controlling both human and model organism development. Moreover, identification of control mechanisms in normal cells provides potential avenues for manipulating cellular fates, as exemplified by the recent explosion in cellular reprogramming studies [1]. It also enables the rational design of new therapies aiming to revert abnormal pathological cellular states back to their normal condition [1].
The blood or hematopoietic system has long been recognized as a powerful model system for studying cell typespecific gene expression [2]. Within the blood system, more than 10 distinct mature hematopoietic lineages (e.g., red blood cells, T cells, B cells) are generated from pluripotent hematopoietic stem cells (HSCs) via a sequence of intermediate progenitors, often represented as a lineage differentiation tree. Both the mature lineages as well as the various immature blood stem and progenitor populations can be purified based on the expression of combinations of specific cell surface markers, thus enabling powerful studies of cellular differentiation.
Transcription factors have long been recognized as major regulators of hematopoietic cell type specification [3][4][5][6]. To understand the mechanisms underlying cell type specification by transcription factors, it will be essential to identify their transcriptional targets. An important advancement in this research area was provided by the introduction of chromatin immunoprecipitation (ChIP) coupled to massively parallel sequencing (ChIP-Seq), which allows genome scale identification of all DNA sequences (regions) bound by a given transcription factor (TF) in a given cell type [7]. The technique has been rapidly adopted with over 100 individual studies now deposited in public databases for the murine hematopoietic system alone. This wealth of new data represents unprecedented opportunities to unravel the transcriptional control mechanisms that mediate expression of specific sets of genes within the various hematopoietic cell lineages [8].
Gene ontology [9] overrepresentation analysis provides information on various types of functional categories enriched within a given gene set of interest [10] and GSEA determines whether a gene set of interest shows statistically significant expression differences between two or more cell types [11]. However, neither of these approaches explicitly links a gene set to transcriptional control mechanisms. In this study, we report a new computational framework for linking gene sets with transcriptional control, called Gene Set Control Analysis (GSCA). Unlike previous algorithms developed to provide functional enrichment [10], GSCA links gene sets to likely upstream regulators responsible for coordinated expression. By exploiting multiple transcription factor binding patterns from genome-wide ChIP-Seq studies, GSCA can provide previously unattainable insights into possible transcriptional control mechanisms operating in both normal and malignant cells. To gain insights into combinatorial control mechanisms (i.e. multiple transcription factors occupying the same binding site in a gene locus), we further developed a novel tool called combinatorial-GSCA (C-GSCA). Through integrated analysis of 142 blood-specific ChIP-Seq binding datasets, C-GSCA identifies likely combinatorial transcriptional control mechanisms by revealing TF cooccupancy patterns specifically associated with gene regulatory elements from a given gene set. A web-based implementation of GSCA and C-GSCA allows user-friendly access for the wider research community, and thus provides a substantial new addition to the bioinformatic toolbox for hematopoietic gene set analysis.

Methods
ChIP-seq compendium Binding events for 35 transcription factors in seven major hematopoietic lineages were obtained from Hannah et al. [8]. Sixty new ChIP datasets from 18 publications and ENCODE murine datasets were analyzed, starting from the raw data set in each case, and peaks were identified in each sample using the protocol described previously [8]. A supplementary website (http://bioinformatics. cscr.cam.ac.uk/BLOOD_compendium_PUBLISHED.html) lists the number of peaks, reference, and peak calling method for each of the ChIP dataset. All binding events were mapped to genes using the same protocol described previously [12]. Binding events in the promoter and gene body were associated to the corresponding gene, whereas intergene peaks were associated to the nearest gene on either side within 50 kb, such that each peak is assigned to at most two genes.
Tissue-specific enhancer elements in mouse were downloaded from [13] and p value was calculated for overlap between each of the 61 tissue-specific enhancer regions and blood-specific regulatory regions [8] using a hyper-geometric test (Supplementary  Table 1, online only, available at www.exphem).

GSCA method
Of 270,261 genomic regions bound by at least one TF (N), for a set of user-defined genes, we calculate the number of genomic regions mapped to the genes (n). For each ChIP-Seq ChIP dataset, the number of peaks (m) near user defined genes (k) is calculated. The p value is calculated using a hypergeometric test (Fischer exact test).
cGSCA method A matrix of binding events with 270,261 genomic regions as rows and overrepresented ChIP-seq data sets (K) from GSCA step as columns is generated. The ChIP-seq data sets (K columns) are then clustered using a hierarchic clustering with Pearson's correlation coefficient as a distance measure.

Reference data set
Gene sets for 80 clusters of tightly coexpressed genes (their induction patterns) in 38 hematopoietic cell types were obtained from Novershtern et al. [14]. Human genes were mapped to orthologous mouse genes using MGI mammalian orthology (http://www. informatics.jax.org/orthology.shtml). We calculated the p value for each gene set with respect to each signature cluster using a hypergeometric test. We used the number of Novershtern clusters significantly overrepresented (Bonferroni corrected p ! 0.001) for one or more transcription factor targets as a measure to evaluate performance while comparing different methods.

Gene expression datasets
Nine gene expression signatures (d-erythroid, differentiated, d-lymphoid, d-myeloid, r-myelolymphoid, s-erythroid, s-mpp, s-myelolymphoid, and stem) were obtained from [15]. Differentially expressed genes in various leukemia datasets were downloaded from their respective publications. Gene lists were then interrogated against the ChIP-seq compendium using both GSCA and C-GSCA.

GSCA web tool
The GSCA output was produced using R, and the web user interface of the application was done using Perl/CGI/HTML. R commands are executed through the perl-cgi script to produce the image. The web tool can be accessed at the following URL: http://bioinformatics.cscr.cam.ac.uk/GSCA/GSCA.html.

Definition of a candidate regulatory genome in mouse hematopoiesis
We recently reported a compendium of more than 50 TF ChIP-Seq experiments in mouse blood cells collected from publicly available datasets [8]. We have doubled the compendium by adding 60 new ChIP datasets from 18 recently published studies [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33] and ENCODE murine unpublished datasets to obtain genome-wide binding patterns for 53 unique transcription factors in 15 major blood lineages and three types of leukemia (Table 1). TFbound peaks were determined for all new datasets using the same parameters as before [8], which resulted in a total of 270,261 genomic regions bound by at least one transcription factor. When added together, these 270,261 regions corresponded to 936 Mb, thus constituting 5.78% of the mouse genome. ChIP-Seq samples of the same transcription factor in related cell types were merged together to provide a consolidated set of 78 samples (Table 1).
Pennacchio et al. [13] developed a phylogenetic conservation and motif based approach to predict tissue specific enhancers, which allowed them to annotate w5,500 highconfidence mouse tissue-specific enhancers for 61 murine tissue types by integrating tissue-specific expression data, conservation information, and cis-regulatory motifs. Only 4 of these 61 tissues corresponded to hematopoietic cells, and predicted only enhancers for those four tissues showed significant overlap with our ChIP-enriched regions (B220 þ B cells, p 5 1.9e-10; CD4 þ T cells, p 5 1.4e-4; CD8 þ T cells, p 5 7.0e-7; lymph node, p 5 1.0e-4; see Supplementary table 1). This analysis therefore supports the validity of a compendium built on TF binding events in hematopoietic cells.
A new GSCA tool matches weighted TF-peak lists to gene sets We next explored whether our blood-specific TF ChIP-Seq peak catalogue could be used to predict transcriptional control mechanisms that may regulate the coordinated expression of a given set of genes. Computational tools for the identification of statistically significant overlaps between a given gene set and peak regions from single ChIP-Seq experiments have been described previously [34,35]. However, these tools do not exploit the ever-increasing number of datasets for multiple TFs in the same or related cell types.
Novershtern et al. [14] reported gene expression profiles in 38 distinct purified populations of human hematopoietic cells ranging from hematopoietic stem cells, through multiple progenitor and intermediate maturation states, to 12 terminally differentiated cell types. Using the Module Networks algorithm [36], they identified 80 modules or gene sets of tightly coexpressed genes with distinct expression patterns and enrichment for specific biological functions, which they termed induction patterns. When we used the 80 Novershtern modules as gene sets, 37 of 80 gene sets (Supplementary Table 2, online only, available at www.exphem) showed a statistically significant correlation with one or more TF peak files from our compendium when using the previously described ChIP Enrichment Analysis (ChEA) [34] and Csan [35] tools. Of note, there was a good overlap between the cell type used for ChIP-Seq and the expression/induction patterns as annotated by Novershtern et al. (Supplementary Table 2). For example, gene set 727 with induction pattern ''Late Erythroid'' was associated with Eto2 in Erythroid, Scl, and Ldb1 in HSCs and Scl in MELs, and gene set 979 overrepresented for ''immune response'' genes with induction pattern ''Late MYE'' was associated with Cebpa, Cebpb, P65, Pparg, and Stat1 in macrophages.
Because the ChEA [34] and Csan [35] tools could associate candidate upstream regulators to less than half of the 80 Novershtern gene sets, we set out to develop an alternative approach by incorporating the concept of weighted peak-to-gene mapping recently reported as part of the Genomic Regions Enrichment of Annotations Tool (GREAT) [37]. GREAT links a list of ChIP-Seq peak regions to gene lists with particular functional significance and unlike previous approaches incorporates binding sites not only in the promoter region of a gene. Taking inspiration from this approach, we developed a new tool by mapping each peak to its nearest gene within 50 kb and then considering the number of binding events in each gene locus to calculate the significance of association between a gene locus and a given upstream regulator.
(Essentially this is the reverse of GREAT, which associates peaks with genes, whereas our new procedure associates genes with peaks). Specifically, our new tool determines the number of binding events in the loci of genes of interest for each ChIP dataset (Fig. 1A, red arrows), and then calculates a p value using a simple hypergeometric test. Datasets with statistically significant overlaps (corrected p value cutoff !0.001) are then selected by interrogating all ChIP datasets independently against the gene list ( Fig. 1B). When applied to the 80 gene modules from Novershtern et al. [14], our new tool reported significant associations with ChIP-Seq peaks for 65 gene modules (Supplementary  Table 2, online only, available at www.exphem), which corresponds to 81% of all gene sets compared with only 46% using the previously reported ChEA and Cscan tools. Incorporation of weighted gene lists therefore results in a significant increase in the percentage of gene modules that can be linked to candidate upstream regulators. We named this new approach Gene Set Control Analysis, or GSCA. Only 61% of all Novershtern gene sets (49 of 80 gene modules) were enriched when the binding events only in promoters were selected, thus highlighting the likely importance of binding to nonpromoter regions, which compose 57% of all binding events in our datasets.

GSCA correlates relevant combinations of transcription factors with hematopoietic gene sets
To investigate the potential biological relevance of the candidate upstream regulatory transcription factors matched with the 65 Novershtern gene sets by GSCA, we again used the induction patterns defined by Novershtern et al. as a measure of lineage-specific expression. The majority of gene sets (97%) showed good correspondence between the induction patterns and the cell types in which the TFs had been chipped (Supplementary Table 3, online only, available at www.exphem). For example, gene sets 667 and 829 (enriched for T cell receptor activity) were associated by GSCA with Stats and Gata3 in T cells, whereas gene sets 649 and 961 (enriched for B cell receptor activity) were associated with Pu.1, E2A, and Pax5 in B cells. Gene set 721 (involved in inflammatory and antibacterial response) was linked by GSCA with Cebpa, Cebpb, P65, Pu.1, and Stat1 in macrophages. Gene sets 727 and 889 with Late Ery induction pattern (en-riched for protein amino acid glycosylation and blood group antigen functional annotations) significantly overlapped only with targets of Eto2, Gata2, Ldb1, Mtgr1, and Scl in erythroid cells. Taken together therefore, there is good concordance between the induction patterns of Novershtern gene sets and the matching ChIP-Sequencing TF datasets identified by GSCA.
Combinatorial regulatory pattern discovery from multi factor ChIP-Seq data Compared with previous tools, our new GSCA tool performs better by associating gene lists with ChIP-Seq peaks by calculating weighted associations between factors and genes based on the number of binding events within a gene locus. However, all individual ChIP-Seq datasets are treated independently, thus making it difficult to infer whether two overrepresented transcription factors work combinatorially (e.g. whether they show statistically significant co-occupancy of the same regulatory regions), rather than binding to overlapping sets of gene loci, but using distinct cis-regulatory regions. To address this issue of combinatorial binding, we developed a new tool called combinatorial GSCA (C-GSCA), and then applied this new tool to our hematopoietic ChIP-Seq compendium. For a given gene list, we first run GSCA to select the TFs showing overrepresented binding. Assuming that m TFs are selected out of 78 ChIP-seq datasets, we generate a binary matrix (n Â m) of m columns representing the m ChIP datasets and n rows representing the genomic regions occupied by two or more of the m TFs, with 1s and 0s indicating the presence or absence of binding, respectively. We filter genomic regions bound by only one factor (w16% of genomic regions; Supplementary Table 2, online only, available at www.exphem) because they are not informative in terms of combinatorial control mechanisms. We then perform hierarchical clustering of n overrepresented ChIP datasets using Pearson's correlation coefficient as a distance measure. Unlike GSCA, all overrepresented ChIP datasets are considered together, making the prediction of combinatorial control feasible (Fig. 2).
Using ChIP-Seq analysis of 10 transcription factors in the hematopoietic progenitor cell line HPC7, we have shown previously that combinatorial interactions between a heptad of TFs (SCL, LYL1, LMO2, GATA2, RUNX1, ERG, and FLI-1) were overrepresented in the loci of genes specifically expressed in HSPCs and therefore associated with gene sets specifically expressed in HSCs [12]. When the heptad-bound genes were interrogated using GSCA, 49 of 78 ChIP-Seq datasets were enriched, thus identifying multiple new transcription factors as candidate upstream regulators in addition to the seven factors (Supplementary Figure 1, online only, available at www.exphem). Using C-GSCA, these 49 datasets could be split into four cell type-specific groups of T cells, macrophages, HSCs, and erythroid (Supplementary Figure 1, online only, available at www.exphem). This observation suggests that gene loci bound by the heptad in blood stem and progenitor cells not only include genes specifically expressed in HSCs, but could also include a subset of genes affiliated with various different hematopoietic differentiation programsdan observation that would be consistent with the concept of lineage priming developed in the 1990s [38]. These results suggested that the C-GSCA procedure outlined here may be useful more generally to associate hematopoietic gene sets to upstream regulators and thus able to predict combinatorial control mechanisms driving the expression of a given gene set.
We next applied the new C-GSCA tool to all 80 hematopoietic gene sets from the Novershtern et al. study [14], which allowed us to associate 65 of the 80 Novershtern gene sets overrepresented for ChIP datasets using GSCA for combinatorial TF signatures. Mxi1, and Tbp in mouse erythroleukemia (MEL) (Fig. 3A). Similarly, gene set 745 with induction pattern ''NK þ T cell'' is linked with Myb in myeloid progenitors and Stat3, Stat4, and Stat5 in T cells (Fig. 3B). Indeed, more than 60% (40 of 65) of the overrepresented Novershtern gene sets with matched upstream regulators were linked with more than one combinatorial pattern (Supplementary  Table 3, online only, available at www.exphem). Therefore, unlike the GSCA approach (Fig. 1), C-GSCA has the potential to identify distinct subsets of candidate upstream regulators for a given gene set (Fig. 2).

GSCA web tool
As GSCA and C-GSCA provide potentially powerful ways of predicting candidate upstream regulators for a given list, we developed a web tool to facilitate gene set control analysis for the wider community (http://bioinformatics.cscr.cam.ac.uk/ GSCA/GSCA.html). In this section we provide a brief explanation of the functionality of the GSCA web tool using a recent transcriptome analysis of murine HSCs and early multipotent, bipotent, and unipotent progenitors [15], which reported nine gene expression signatures ranging from those characteristic for the most immature HSCs to those affiliated with differentiation into the individual hematopoietic lineages. We interrogated these nine experimentally obtained gene expression signatures using the GSCA web tool. Eight of these nine mouse stem-progenitor gene signatures showed significant overlap with multiple ChIP-Seq data sets, thus providing an independent test case to examine the biological relevance of predicted combinatorial regulatory signatures in  Figure 4A shows a screenshot of the web tool in which users can paste a query gene list or upload it from a file (human or mouse).
Upon choosing GSCA, a gene list of interest is interrogated against 78 ChIP-Seq datasets across 15 blood cell types. GSCA calculates the significance of overlap between each ChIP-Seq dataset and the gene set of interest and displays all ChIP-Seq datasets, with those showing enrichment in yellow color. For example, the self-renewing signature (stem signature from Ng et al. [15]) is provided as a test dataset for the users and shows statistically significant overlap with multiple transcription factors in HPC7 and progenitors. When the same stem signature gene list is analyzed using C-GSCA, the overrepresented ChIP datasets are clustered into two distinct cell type specific clusters HPC7 and MK progenitors (Fig. 4B). Six of the seven transcription factors in the HPC7 cluster overlap with the heptad signatureda binding pattern that we have previously shown is overrepresented in the loci of genes specifically expressed in HSPCs and therefore associated with gene sets specifically expressed in HSCs [12]. Similarly, the gene signature associated with the third wave of the myeloid lineage program (d-my signatures) from Ng et al. [15] shows statistically significant overlap with two combinatorial binding events, Cebpa, Cebpb, Stat1, P65, and Pu.1 in macrophages and Myb in myeloid progenitors. In addition to showing the functionality of the web tool, these results suggest that combinatorial control signatures generated by C-GSCA have the potential to provide insights into combinatorial transcriptional control mechanisms, and that the GSCA web tool provides access to this type of analysis to the wider community.

GSCA analysis of gene sets associated with hematologic malignancies
We have shown that GSCA can be used to link lineagespecific gene sets to combinations of candidate upstream regulatory TFs, and these associations are consistent with expectations based on current knowledge of regulatory control within hematopoiesis. This consistency attests to the potential robustness of the GSCA approach and suggests that it may also be useful to reveal biological insights into transcriptional programs operating in malignant hematopoietic cells, where diagnostic or prognostic gene sets have been derived for many types of leukemia, yet the combinations of TFs driving expression of these gene sets remain largely unknown. We therefore explored the utility of GSCA for linking leukemic gene sets with candidate upstream regulators.
We first analyzed a gene set recently reported by McCormack et al. [39], in which the investigators showed that overexpression of Lmo2 in T-lymphoid progenitors induced a preleukemic state characterized by extensive self-renewal capacity. When the authors performed comparative gene expression profiling of normal and LMO2 expressing thymocytes, they noted upregulation of several HSC specific genes and suggested that ectopic expression of Lmo2 might activate an HSC specific transcription program. To test this hypothesis further, we analyzed the list of genes upregulated in Lmo2 transgenic DN thymocytes [39] by GSCA. This analysis suggested that the LMO2 overexpression gene set was under the transcriptional control of stem cell transcription factors such as Scl, Gata2, Runx1, Fli1 and Erg and also showed a strong overlap with LMO2 binding itself in nonleukemic progenitor cells.
We next analyzed gene expression profiling data generated as part of a recent study investigating transcriptional programs downstream of mixed lineage leukemia (MLL) transformation in mouse models of acute myeloid leukemia (AML) [40]. Expression analyses following MLL-AF9 withdrawal had prompted the authors to propose a model whereby MLL-AF9 enforces a Myb-coordinated program of aberrant self-renewal that involves genes linked to leukemia stem cell potential and poor prognosis in human AML patients. Of note, when we analyzed the genes downregulated following MLL-AF9 withdrawal by GSCA, we observed statistically significant overlaps with the two Myb ChIP-Seq datasets in our compendium (Fig. 5A). In addition, GSCA also recovered associations with MAX and the MAX interacting protein MXI1, both of which have also been linked to a range of human cancers [41]. GSCA analysis therefore not only corroborated the findings by Zuber et al. [40]; it also provided additional hypotheses on likely mechanisms that might control transcriptional programs downstream of MLL-AF9 in AML.
The final leukemic gene set analyzed by GSCA was taken from a 2009 study of the transcriptional programs in leukemic stem cells [42]. Comprehensive gene expression profiling analysis had lead the authors to speculate that leukemia stem cells in an MLL-driven mouse model of AML are characterized by a transcriptional program shared with embryonic rather than adult stem cells. This conclusion was subsequently challenged when it was suggested that the overlap with embryonic stem cell transcriptional programs was the reflection for a shared dependence on c-MYC activity rather than related to the stemness phenotype of ES cells [43]. Analysis of the leukemia stem cell associated gene set from  [14] with ''NK þ T cell'' induction pattern. C-GSCA is able to separate combinatorial patterns in T cells and myeloid progenitors.  [15] showing two cell type-specific distinct combinatorial patterns. the Somervaille et al. [42] study by GSCA revealed a strong association with c-MYC ChIP-Seq datasets (Fig. 5B). However, there were also statistically significant associations with many additional ChIP-Seq datasets. GSCA analysis was therefore supportive of a role for c-MYC in the similarity between leukemic and embryonic stem cell expression signatures, but suggested that TFs more specifically expressed within blood cells also make important contributions to the leukemia stem cell transcriptional program. Of note, genes negatively associated with the leukemia stem cell phenotype in the study by Somervaille et al. [42] did not show the overlap with c-MYC, but it showed a distinct pattern of correlated ChIP-Seq datasets for the hematopoietic TFS, which interestingly contained several datasets for mature macrophages and was thus consistent with a relatively immature differentiation stage for the leukemia stem cells (Supplementary Figure 2, online only, available at www.exphem).
The application of GSCA to leukemic expression datasets supports the notion that integrated analysis of genome-wide transcription factor binding maps has significant potential as a new addition to the toolbox used by experimentalists to derive new hypothesis for experimental validation, which in the case of our current implementation of GSCA analysis would be geared specifically toward the identification of transcriptional mechanisms that control the behavior of normal and leukemic blood cells.

Discussion
Gene expression arrays have been used widely to characterize genes responsible for a particular cellular phenotype. The differentially expressed genes thus obtained can then be used for functional enrichment analysis. However, the important question of ''What upstream regulatory mechanisms are responsible for the differential expression?'' is not specifically addressed when using current approaches for gene set analysis, such as Gene Ontology or Gene Se Enrichment analysis tools.
As a result of the rapid progress in next-generation sequencing technology, ChIP-Seq analysis has become a favorite tool to investigate in vivo binding events because it offers higher resolution, less noise, and greater coverage compared with other techniques [44]. Nevertheless, the generation of genome-wide binding maps for multiple transcription factors across different cell types remains a formidable challenge for individual labs [45]. ChIP-Seq datasets from different labs can, however, be integrated at the computational level, which we recently demonstrated using 53 mouse ChIP-Seq experiments from different laboratories across the hematopoietic differentiation tree [8]. Since then, we have added 60 new ChIP datasets, thus more than doubling the size of the original compendium. In addition to highlighting a potentially major portion of the total regulatory genome involved in hematopoietic gene expression, a data compendium of this scale should have the potential to provide new insights into regulatory mechanisms governing gene sets of interest.
To explore this further, we developed GSCA to identify enriched combinatorial binding patterns of transcription factors regulating a given gene set. This method uses experimental binding evidence, keeping the cell type specific context, unlike prediction methods based on overrepresentation of cis-regulatory sequence motifs in the promoters [46]. Using 80 clusters of tightly coexpressed genes in 38 hematopoietic cell types [14], we demonstrated that the transcriptional control mechanisms predicted are biologically coherent, and that GSCA performs better than current methods. Of note, this analysis also demonstrated that GSCA can be used in a cross-species fashion, with human gene sets analyzed using a murine ChIP-Seq compendium in this particular instance. The rationale for this crossspecies capability is provided by recent observations from ChIP-Seq data for the same transcription factor in multiple species where it was shown that, although a significant proportion of binding locations (peaks) are not conserved, there tends to be what was termed binding site turnover for these sites where loss of binding in one species is accompanied by gains elsewhere in the same gene locus in the other species [47]. The conserved and many of the nonconserved binding sites therefore map to the same gene loci, such as in human-mouse comparisons. Just as for many other gene set analysis tools, cross-species capability in GSCA is facilitated by the use of standard gene symbols that are standardized across mammals.
We further illustrated the utility of the GSCA tool to unravel potential regulatory mechanisms underlying a range of leukemia gene sets, thus suggesting potential future application of GSCA to build hypotheses to investigate transcriptional control mechanisms responsible for the expression of gene sets with diagnostic, prognostic, or therapeutic relevance. Finally, we built a web tool to facilitate similar analysis for the wider scientific community. Complementary to gene ontology functional overrepresentation analysis, GSCA calculates overrepresentation of binding events for a gene list of interest, thus predicting possible transcriptional control mechanisms.
Given the significant investment into several collaborative projects such as the ENCODE (Encyclopaedia of DNA Elements) and modENCODE (model organism ENCODE) initiatives [48,49], we are likely to witness a near exponential increase in ChIP-Seq datasets over the coming years. Although our current implementation of the GSCA web tool is geared toward predicting candidate upstream regulators within hematopoietic cells, the approach can be applied easily to other tissues when sufficient ChIP-Seq data become available.
Research Counsel, Cancer Research UK, Wellcome Trust, and the Leukemia and Lymphoma Society.