Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape

The large collections of ChIP-seq data rapidly accumulating in public data warehouses provide genome-wide binding site maps for hundreds of transcription factors (TFs). However, the extent of the regulatory occupancy space in the human genome has not yet been fully apprehended by integrating public ChIP-seq data sets and combining it with ENCODE TFs map. To enable genome-wide identification of regulatory elements we have collected, analysed and retained 395 available ChIP-seq data sets merged with ENCODE peaks covering a total of 237 TFs. This enhanced repertoire complements and refines current genome-wide occupancy maps by increasing the human genome regulatory search space by 14% compared to ENCODE alone, and also increases the complexity of the regulatory dictionary. As a direct application we used this unified binding repertoire to annotate variant enhancer loci (VELs) from H3K4me1 mark in two cancer cell lines (MCF-7, CRC) and observed enrichments of specific TFs involved in biological key functions to cancer development and proliferation. Those enrichments of TFs within VELs provide a direct annotation of non-coding regions detected in cancer genomes. Finally, full access to this catalogue is available online together with the TFs enrichment analysis tool (http://tagc.univ-mrs.fr/remap/).


SUPPLEMENTAL TABLES 2
Table S1 List of the 132 transcription factors available in the public catalogue. 2 Table S2 Biotypes of genes not recovered by our catalogue. 5 Table S3 Overlap statistics of VELs from CRC and MCF--7 cell lines with our regulatory catalogue (peaks and CRMs). 5

SUPPLEMENTAL FIGURES 6
Figure S1 Analysis and pipeline flowchart 6 Figure S2 Quality assessment of public datasets 7 Figure S3 Variation of summits of merged peaks relative to the average summit of non-redundant peaks.
8 Figure S4 Clusters of highly interconnected and strongly specific colocalized TFs. 9 Figure S5 Normalized distance from TFBS to the center of Variant Enhancer Loci.
10 Figure S6 Breast cancer VELs correlate with gene expression variations.
10 Figure S7 Matrix of the percentages of significant overlapping peaks for each pair of TFs. 11 Figure S8 Determination of the colocalisation specificity of two TFs (example for SOX2).
12 Figure S9 Transcription factors specifically enriched within lenient sets of gained and lost VELs in  SUPPLEMENTAL TABLES

Figure S1 Analysis and pipeline flowchart
Schematic overview of our analysis pipeline used to process public TF ChIP-seq data, assess their quality following approved guidelines, and finally construct the ReMap catalogue while including ENCODE TF ChIP-seq data.

Figure S2 Quality assessment of public datasets
Here each analysed dataset (n=696) is represented by a colored dot according to its assigned score. Indeed, to assess the quality of public datasets, we computed a score based on the cross-correlation and the FRiP (fraction of reads in peaks) metrics developed by the ENCODE Consortium and the phantompeak tools. This score is computed as follow. Two thresholds based on ENCODE studies were defined for each of the two cross-correlation ratios (Grey lines; Normalized Strand Coefficient: 1.05 and 1.10 on the x-axis; Relative Strand Coefficient: 0.8 and 1.0 on the y-axis, see M&M for details). A basal score ranging from 0 to 4 was assigned to each dataset corresponding to the number of thresholds it exceeds for NSC and RSC (2 thresholds for each score). Finally, this basal score was incremented by one if the FRiP is equal or higher than 1%. We observe that datasets having a minimum score of 2 exceed at least one threshold of RSC or NSC, which are both scores independent of peak calling procedures. Thus for our analyses, datasets having a score less than or equal to 1 (red dots), as well as datasets with fewer than 100 identified peaks were discarded for further downstream analyses. Reds dots within accepted thresholds are datasets with less than 100 peaks. Datasets with scores greater to 1 were kept for further analyses (green dots, n=395).  MEF2C  NCOA1  NCOR2  STAG1  CEBPA  RAD21  TCF4  FOXH1  NCOR1  RBPJ  SMC1A  CEBPB  REST  ESR2  VDR  SPI1  SMAD2+3  NFYB  CTNNB1  CTCF  NFYA  HNF4A  RXRA  SMAD4  TAL1  RUNX1T1  TP53  ESR1  ZNF143  HSF1  FOXA1  TFAP2C  GATA3  NR3C1  NR3C3  NOTCH1  E2F6  GTF2B  SMAD3  SMAD1  GATA1  STAT1  GATA2  AR  NFKB1  TCF3  CDX2  MBD4  NANOG  E2F7  TP63  GABPA  YY1  CDK8  CDK9  FOXM1  RELA  RUNX1  TCF7L2  KLF4  E2F4  SOX2  TCF12  SMARCA4  STAT5A  NR2F2  POU5F1  MAX  FLI1  SETDB1  MYC  JUND  ARNT  ETS1  BRD4  PHF8 Figure S4 Clusters of highly interconnected and strongly specific colocalized TFs. The colocalisation network in Figure 4 reveals 12 clusters of strongly specific TFs isolated using a partitioning algorithm. Nodes of these sub-networks indicate individual TFs, edge colours depict the percentages of overlap between TFBS and weights the colocalisation specificity between two TFs. Enrichments of sub-networks in Gene Ontology Biological Process annotations were calculated using DAVID tool and are indicated below each sub-network. No annotations were enriched in the last three sub-networks.   In this figure we are computing the overlap of non-redundant binding sites for each couple of transcription factors using IntervalStats tool. The specificity of this analysis is that we used each TF both as a query and as a reference in each couple of TFs (eg; TFa vs TFb, TFb vs TFa). Thus, this analysis forms an asymmetric matrix of the percentages of significant overlapping peaks between two TFs. For a TF pair (TFa vs TFb) and for each peak in the query set TFa, IntervalStats computes a p-value of the overlap of a peak with the reference set of TFBS for TFb. We identified significant overlapping peaks with the reference TF with a p-value threshold defined at 0.05. The result is an asymmetric matrix of the percentages of significant overlapping peaks between two TFs. TFs used as queries by IntervalStats are indicated on the right side of the heatmap and TFs used as references at the bottom. The significant overlapping peaks in the query set of TFBS are indicated as percentages ranging from 0% of overlap (blue) to 100% of overlap (red). This asymmetric matrix allowed to determine a list of strongly and moderately specific TFs by identifying outliers based on the percentages of significant overlapping peaks (See Figure S8). AR  ARID3A  ARNT  ATF1  ATF2  ATF3  ATRX  BACH1  BATF  BCL11A  BCL3  BCL6  BCLAF1  BCOR  BDP1  BHLHE40  BRCA1  BRD2  BRD3  BRD4  BRF1  BRF2  CBFB  CBX3  CCNT2  CDK8  CDK9  CDX2  CEBPA  CEBPB  CEBPD  CHD1  CHD2  CREB1  CTBP2  CTCF  CTCFL  CTNNB1  DCP1A  E2F1  E2F4  E2F6  E2F7  EBF1  EGR1  ELF1  ELF5  ELK1  ELK4  ELL2  EOMES  ERG  ESR1  ESR2  ESRRA  ETS1  ETV1  EZH2  FAM48A  FLI1  FOS  FOSL1  FOSL2  FOXA1  FOXA2  FOXH1  FOXM1  FOXP1  FOXP2  GABPA  GATA1  GATA2  GATA3  GATA6  GATAD1  GPS2  GREB1  GTF2B  GTF2F1  GTF3C2  HDAC1  HDAC2  HDAC6  HDAC8  HMGN3  HNF4A  HNF4G  HSF1  IKZF1  IRF1  IRF3  IRF4  JUN  JUNB  JUND  KDM5A  KDM5B  KLF4  MAFF  MAFK  MAX  MAZ  MBD4  MED12  MEF2A  MEF2C  MEIS1  MTA3  MXI1  MYBL2  MYC  NANOG  NCOA1  NCOR1  NCOR2  NELFE  NF1C  NFATC1  NFE2  NFKB1  NFYA  NFYB  NIPBL  NOTCH1  NR2C2  NR2F2  NR3C1  NR3C3  NRF1  ONECUT1  ORC1  PAX5  PBX3  PHF8  PML  POU2F2  POU5F1  PPARG  PPARGC1A  PRAME  PRDM1  PRDM14  RAC3  RAD21  RB1  RBBP5  RBPJ  RCOR1  RELA  REST  RFX5  RNF2  RPC155  RUNX1  RUNX1+3  RUNX1T1  RUNX2  RUNX3  RXRA  SAP30  SETDB1  SFMBT1  SIN3A  SIRT6  SIX5  SMAD1  SMAD2+3  SMAD3  SMAD4  SMARCA4  SMARCB1  SMARCC1  SMARCC2  SMC1A  SMC3  SMC4  SNAPC1  SNAPC4  SNAPC5  SOX2  SP1  SP2  SP4  SPI1  SREBP1  SRF  STAG1  STAT1  STAT2  STAT3  STAT4  STAT5A  STAT5B  SUZ12  TAF1  TAF2  TAF3  TAF7  TAL1  TAp73a  TAp73b  TBL1  TBL1XR1  TBP  TCF12  TCF3  TCF4  TCF7L2  TEAD4  TFAP2A  TFAP2C  TFAP4  THAP1  TLE3  TP53  TP63  TRIM28  UBTF  USF1  USF2  VDR  WRNIP1  YY1  ZBTB33  ZBTB7A  ZEB1  ZKSCAN1  ZNF143  ZNF217  ZNF263  ZNF274  ZNF76  ZZZ3   ZZZ3   ZNF76   ZNF274   ZNF263   ZNF217   ZNF143   ZKSCAN1   ZEB1   ZBTB7A   ZBTB33   YY1   WRNIP1   VDR   USF2   USF1   UBTF   TRIM28   TP63   TP53   TLE3   THAP1   TFAP4   TFAP2C   TFAP2A   TEAD4   TCF7L2   TCF4   TCF3   TCF12   TBP   TBL1XR1   TBL1   TAp73b   TAp73a   TAL1   TAF7   TAF3   TAF2   TAF1  For each transcription factor, a list of strongly and moderately specific colocalized TFs was determined by identifying outliers based on the percentages of significant overlapping peaks. Outliers were defined as TFs that have a percentage of overlap exceeding 1.5 and 3.0, respectively for moderately and strongly colocalized TFs, the interquartile range (IQR) above the 3 rd quartile of percentages. For SOX2, three strongly colocalized TFs (NANOG, POU5F1/Oct4 and SMARCA4) and 11 moderately colocalized TFs (AR, CEBPB, FOS, FOXA1, FOXH1, JUND, KLF4, RAD21, SMC3, STAG1 and TEAD4) were identified. To assess the MCF-7 VELs detection procedure we generated VELs for MCF-7 with relaxed thresholds (SD=0.75, n=5295 for Gain, n=5087 for Loss). Variant Enhancer Loci detections were performed using the exact same procedure as described in Akhtar-Zaidi B. et al using a function of Shannon Entropy. We observe a similar enrichment of TFs between our initial VELs ( Figure 5c)