Predicting context specific enhancer-promoter interactions from ChIP-Seq time course data

16 We have developed a machine learning approach to predict context specific enhancer-promoter interactions using evidence from changes in genomic protein occupancy over time. The occupancy of estrogen receptor alpha (ERα), RNA polymerase (Pol II) and histone marks H2AZ and H3K4me3 were measured over time using ChIP-Seq experiments in MCF7 cells stimulated with estrogen. A Bayesian classifier was developed which uses the correlation of temporal binding patterns at enhancers and promoters and genomic proximity as features to predict interactions. This method was trained using experimentally determined interactions from the same system and was shown to achieve much higher precision than predictions based on the genomic proximity of nearest ERα binding. We use the method to identify a genome-wide confident set of ERα target genes and their regulatory enhancers genome-wide. Validation with publicly available GRO-Seq data demonstrates that our predicted targets are much more likely to show early nascent transcription than predictions based on genomic ERα binding proximity alone. 17 18 19 20 21 22 23 24 25 26 27


28
Gene expression is dependent upon the binding of transcription factor (TF) proteins to genomic regions 29 which regulate transcriptional initiation (Nagarajan et al., 2014). In eukaryotic cells, these regulatory 30 genomic regions are referred to as promoters and enhancers. The transcriptional competence of DNA in 31 eukaryotes is determined by its organization in chromatin. Chromatin structure is dynamically regulated 32 at multiple levels, including ATP-dependent chromatin remodelling and histone modifications (Bernstein applied on a genome-wide scale (3C,4C; Simonis et al., 2007). Capture-HiC methods have recently been 48 developed (Mifsud et al., 2015;Javierre et al., 2016) to improve genomic resolution through focussing on 49 predetermined genomic regions, e.g. promoters, and show promise but are not yet widely used. Data from 50 these technologies can also be noisy and subject to various sources of bias which can be problematic to 51 correct (van Steensel and Dekker, 2010). In addition, the physical contact between two chromatin regions 52 does not determine a functional interaction (Shlyueva et al., 2014) with stimulus-dependant behaviour of 53 chromatin looping adding a further layer of complexity (Drissen et al., 2004;Vakoc et al., 2005). For 54 these reasons, complementary approaches to infer enhancer-promoter interactions by exploiting readily 55 available sources of genomic data, such as ChIP-Seq and RNA-Seq data, are of interest. 56 ChIP-seq experiments enable the discovery of the genomic location of transcriptionally relevant 57 proteins such as TFs, RNA polymerase and modified histones. Multiple ChIP-Seq datasets can be 58 combined with data from other relevant genomics assays to identify active promoters and enhancers 59 using genomic segmentation algorithms (Zhu et al., 2013;Ernst et al., 2011). Others have also used 60 ChIP-seq and RNA-seq datasets to infer enhancer-promoter interactions. For example, Ernst et al. (2011) 61 used histone mark data from multiple cell-types to identify active enhancers and promoters from which 62 enhancer-associated data was correlated with expression data from genes within 125kbp to identify likely 63 interactions. Thurman et al. (2012) used DNase I hypersensitivity (DHS) data from multiple cell-types to 64 correlate and link distal DNase hypersensitivity sites (within 500kbp) to those within putative gene targets. 65 Similarly, Andersson et al. (2014) predicted enhancer-promoter links by correlating CAGE enhancer RNA 66 to CAGE promoter RNA.

67
Approaches for discovering cell-type specific interactions include PreSTIGE (Corradin et al., 2014), 68 RIPPLE (Roy et al., 2015), and the method developed by Marstrand and Storey (2014). PreSTIGE uses a 69 method based on the Shannon entropy to identify cell-type specific interactions between enhancers and method uses three features: evolutionary conservation, correlation of enhancer scores derived from histone 82 marks from RNA-seq data, and an average of correlations between TF ChIP-Seq and gene expression 83 across 12 cell-types. A distance constraint is also imposed to aid inference.

84
The majority of the above methods require data from multiple cell-types and therefore do not allow 85 discovery of interactions given data from one cell-type. Most existing methods also assume a stringent 86 distance constraint and are therefore unable to discover distal links beyond this constraint. Finally, these 87 methods do not take into account evidence from time course data. 88 We show how ChIP-Seq time course data that reports TF and RNA polymerase occupancy at multiple 89 time points after cellular stimulation can be used to predict enhancer-promoter interactions within 90 chromosomes. We have developed a Bayesian classifier that combines evidence from the correlation 91 of ChIP-Seq time course data at enhancers and across gene bodies with the genomic separation of 92 interacting elements as features. We apply our method to time course data from MCF7 breast cancer cells 93 after stimulation with estradiol and we benchmark performance against publically available ChIA-PET 94 data from this system. We show that our method performs much better than association by proximity, 95 identifying many more interactions than predictions based on proximity alone. Estrogen Receptor (ER-α) 96 and RNA polymerase (Pol II) ChIP-Seq time course data are shown to be highly informative for predicting 97 interactions. We also stratify our predicted interactions to those that lie within Topologically Associating 98 Domains (TADS; Dixon et al., 2012)    Alignment to a reference human genome 168 Raw reads from the experiments were mapped onto the human reference genome (NCBI build37) using 169 the Genomatix Mining Station (version 3.5.2) to enable further analysis. The sequencing depth, i.e. the 170 total number of sequenced reads, was very similar for each dataset, however, on average only 81%, 76%, 171 67%, 61%, 64% of ER-α, Pol-II (rep 1), Pol-II (rep 2), H3K4me3, and H2AZ ChIP-seq reads were 172 mapped uniquely to the genome. The non-uniquely mapped reads were discarded from further analysis.

173
Using the statistical criterion provided by MACS, we established that our sequencing depth allows for no 174 duplicates of reads, thus we discarded any duplicated reads as they are most likely an artefact in ChIP-Seq.

176
The MACS package (v2.0, p-value: 1e-7, no control, estimation of λ local off) Zhang et al. (2008) was used 177 for peak-calling and applied to each of the 0, 5, 10, 20, ..., 320 min time course datasets to estimate ER-α 178 binding locations. The last two time points (640 and 1280 mins) were not included as the number of ER-α 179 mapped reads was found to be very low at these times compared to earlier times. Persistent co-occurring 180 ER-α binding locations (i.e occurring at least twice across two time points after t = 0) were merged by a 181 union operation (similar to the mergeBED method from BEDTools (Quinlan and Hall, 2010)), otherwise 182 they were discarded. The method is illustrated in Figure S1. Since our analysis is aimed at intergenic 183 ER-α-bound enhancers, we ignored the consensus peaks which overlapped with either gene bodies or 184 upstream 300bp-long regions by which the genes were extended to account for a promoter region. 186 We calculated the mapped read counts for each individual time point ChIP-seq dataset over the consensus 187 ER-α binding sites to create time series over enhancer regions for each of our antibodies. To normalise 188 the counts, we divided each read count over the total number of uniquely mapped and non-duplicated 189 reads across all time points and multiplied the resultant values by the total number of mapped reads in 190 the t = 0 min dataset. We concatenated the normalised counts to produce time series for each ChIP-seq 191 dataset. We refer to each enhancer time series as X X X j,n , where j ∈ J (number of intergenic enhancers) and  approximately pre-specified number of clusters. The method is similar to k-means but can achieve much 204 better optimisation of the k-means objective function than the standard EM algorithm.

4/17
To reduce the effect of noise, for Pol II we clustered only the pairs of the time series for which the 206 Pearson correlation coefficient was at least 0.2 between replicates and the total number of mapped reads 207 was at least 30. For ER-α, due to lack of replicates, we only clustered the time series with more than 100 208 reads in total across all times. Prior to the clustering we standardized each time series to z-scores to bring 209 all time series onto the same scale. We obtained 20 and 22 clusters for Pol II time series over enhancer 210 and genes, respectively. Similarly we obtained 21 and 21 clusters for ER-α time series over enhancer and 211 genes. We also jointly clustered time series of PolII and ER-α. The results of the clustering can be seen 212 in Figure S2. series (X X X j,n ,Y Y Y k,n ), and for each dataset n ∈ N, where N is a number of time course ChIP-seq datasets.

231
Additionally, the data vector also contains the Euclidean distance d j,k calculated between the genomic 232 coordinates of the canonical TSS of a gene k to the centre of an enhancer j. 233 The joint likelihood of the model can be written as: (1) The model provides a probability of observing a particular D D D j under a given structure I I I j . Due to its 234 regulatory role, an enhancer is unlikely to regulate a high number of genes, thus we can expect that the 235 true P(I I I j ), which in the Bayesian treatment is a prior distribution over the structures, would be sparse.

236
Moreover, we could expect that D j,k and D j,k ′ of any two interacting pairs k, k ′ would be interlinked, as 237 correlations between gene-enhancer pairs are not independent variables. These dependencies would be 238 reflected in a true form of the likelihood P(D D D j |I I I j ). Lastly, we could also expect that the N + 1 attributes 239 i.e correlations c j,k,n and distance d j,k of a pair j, k of the vector D j,k would also be correlated.

241
The modelling of all dependencies however is difficult given the relative sparsity of our training data.

242
We therefore restrict the form of the joint distribution and construct an approximate joint probability  a) The joint distribution factorises 246 We assume that the likelihood P(D D D j |I I I j ) can be factorised and written in the form: where I I I j = I j,1 , . . . , I j,K and D D D j = D j,1 , . . . , D j,K . Hence the distribution of each D j,k is conditionally 247 independent of other allocations and conditional only on the indicator variable I j,k .

5/17
b) An enhancer regulates a single gene 249 We assume further, that an enhancer j can interact with only one gene k. We restrict the event space of P(D j , I j ) to its subspace P(D j , I I I (1) j,k ), where I I I (1) j,k = 0, . . . , 1 kth , . . . , 0 . From (2) the events are given by: The prior distribution P(I I I j ) follows a multivariate Bernoulli distribution, and thus the restriction is equivalent to setting the probabilities of all the structures I I I j with non-singular number of contacts i.e. I I I j,k we assume that the prior is uniform across these sparse vectors, i.e.

P(I I I
(1) so that each I I I (1) j,k is equally likely a priori. Assuming that the attributes are conditionally independent, the likelihood component P(D j,k |I j,k ) becomes: where d j,k is a distance from the centre of an enhancer j to the TSS of a gene k, whereas c j,k,n is a 252 correlation between the time series of the n th time course dataset between an enhancer j and gene k. 253 Combining the assumption of the factorisable likelihood (2) with the conditional independence of attributes (5) yields, Restricting the event space to single enhancer-gene events (3) The assumption of conditional independence of features in (5)  The posterior distribution under the model is: The posterior distribution can be used to find the probability of each structure I I I and computational cost associated with KDE, employing the same approach for negatives was infeasible.

287
Their size, however, also entails less requirement for optimised fitting, and thus to select the bandwidth 288 we resorted to the Scott's rule (Scott, 2015).

290
We trained the classifier on the odd chromosomes and estimated the training error. Similarly, we tested 291 the method on the even chromosomes and obtained the test error. Since the test data is not used to build 292 the classifier (i.e. fit the feature densities), its predictions on the test data can be considered unbiased. 293 We measured the performance in two ways. Firstly, we evaluated and plotted precisions against the True   306 We used our model to infer gene targets with strong evidence of being regulated by at least one enhancer. The probability of gene k having at least one active regulatory link from an enhancer under our model is defined,

Prediction of target genes
(1 − P (I I I (1) where the product above is equal to the probability that no enhancers regulate the gene. scores defined in Eqn. (9), we assessed how many of our predicted distally regulated genes were differentially expressed at early time points. Using the EdgeR processed GRO-seq data we filtered the GRO-seq 313 determined DE genes at 10, 40, 160 min after E2 stimulation with q-value (multiple hypotheses testing 314 adjusted p-values from EdgeR) of less than 0.05, 0.01, 0.001. For each q-value, we combined the DE 315 genes from each of the time points into a single list.

317
We demonstrate our method using ChIP-Seq time course data collected from the MCF7 breast cancer of cells in estrogen free media to estradiol. ChIA-PET data are also available in this system and were 326 used to evaluate our method's performance (Fullwood et al., 2009;Li et al., 2010Li et al., , 2012.

ER-α bound enhancers overlap experimentally determined promoter interaction regions 328
To locate binding events formed after stimulation with estradiol, we determined a set of genomic loci  ChIP-seq time series data 343 We calculated the number of mapped reads for each of our ChIP-seq datasets over promoter-extended-gene 344 bodies and over our consensus ER-α binding sites to create time series data for genes and enhancers (see 345 Materials and Methods). We clustered the ER-α and Pol II data to help visualise the occupancy dynamics 346 at enhancers and genes. As shown in Fig. 1, the clusters show substantial differences in occupancy 347 dynamics across both genes and enhancers. This is expected for Pol II which shows a broad range of 348 response profiles in this system (Honkela et al., 2015). Additionally, some differences in ER-α profiles 349 were also detected, suggesting that occupancy is not solely determined by the nuclear concentration of substantially from the background for all four datasets, with interacting regions more highly correlated 357 on average. This difference is most pronounced for ER-α and Pol II ( Fig. 2a and Fig. 2b) while there 358 is a much smaller difference for the histone marks H2AZ and H3K4me3 ( Fig. 2c and Fig. 2d). We 359 also compare the distribution of genomic separation for interacting and non-interacting promoters and 360 enhancers in Fig. 2e. Although a highly informative feature, there is a substantial overlap in the positive 361 and background distance densities due to a large separation of many ER-α bound enhancers from their 362 target promoters; therefore, distance alone is insufficient for accurate prediction of interactions. We note 363 that our ChIA-PET data does not contain very short ChIA-PET links. Links of a size shorter than 4.5kB 364 are usually considered to be the result of self-ligations and are filtered out Li et al. (2010). In Figure S3 we 365 plotted the corresponding histograms using data from all chromosomes. We observe that the distribution 366 9/17 does not change with the addition of data from even chromosomes.

367
Naive Bayes classifier performance 368 We developed a Naive Bayes classifier which integrates several discriminative features to estimate the 369 probability of interactions between enhancer and putative target genes. Fig. 3 shows predicted interactions 370 with only a small number confirmed by ChIA-PET (green). Interactions are shown using different shading 371 for classification probabilities above 0.72, 0.54, 0.49 thresholds corresponding to 0.2, 0.25, 0.3 FDR levels 372 (posterior probabilities with the highest TPR which are associated with the selected FDRs (1-precision)) 373 estimated using the training data (combination of features: Pol II, ER, distance).
The  infers 26.7 times more interactions than predictions based on proximity alone (see Table 1). In addition to 389 considering precision-recall curves, we also tested how often using maximum a posteriori probabilities 390 (MAP) to link all enhancers (in the training and test data) to their most probable promoters would result in 391 correct assignments according to the ChIA-PET data (right-most column of plots in Fig. 4a and Fig. 4b).

392
The mean performance in the MAP case is reduced and the added value of the ChIP-Seq data relative 393 to the proximity information is also reduced. This is because for many enhancers the ChIP-Seq data 394 signal is relatively weak and therefore focussing on the enhancer-promoter pairs with higher classification 395 probabilities (as in the PR curves approach) produces better quality prediction on average than when we 396 make predictions for all enhancers.

403
The majority (79%) of enhancer-promoter interactions lie within domains. The PR curves in Fig. 4d 404 and Fig. 4e show that the ER-α and distance features provide the greatest contribution to performance.

405
The Pol-II feature is also informative but does not add much to performance when combined with the 406 ER-α data. Interestingly, within domains the "data-alone" model possesses much higher predictive 407 power than in the chromosome-wide model.  beyond domain boundaries, the number of false positives is greatly reduced. Nevertheless, we see that 409 incorporation of the distance feature still improves classification performance within domains.

410
On the contrary (see Fig. 4e and Fig. 4f our default parametrisation where we switched that parameter off. Figure S11 shows that the distributions 428 of features remain similarly unchanged.

429
Validation of ER-regulated target gene predictions 430 Finally, we used our method to provide a highly confident (FDR of 0.25) list of directly ER-regulated 431 target genes in this system. This list (Table S1) includes 1978 genes with at least one predicted enhancer 432 link. In Fig. 5 we compared our set of predicted distally regulated genes against a list of early differentially 433 expressed genes obtained from GRO-seq experiments (Hah et al., 2011). PR curves showed that the larger 434 the value of the score (see Materials and Methods), which is roughly proportional to the number of times 435 a gene is predicted to be a target of distal enhancer, the higher the chance that the gene is differentially 436 expressed. Using a score based only on proximity of ER-α binding events is much less predictive of early 437 differential expression.

439
We have developed a Bayesian method which is capable of integrating genomic distance with a correlation 440 of ChIP-seq time series in order to predict physical interactions between enhancers and promoters. 441 We evaluated the performance of our method against ChIA-PET predicted links and using different