Predicting stimulation-dependent enhancer-promoter interactions from ChIP-Seq time course data

We have developed a machine learning approach to predict stimulation-dependent enhancer-promoter interactions using evidence from changes in genomic protein occupancy over time. The occupancy of estrogen receptor alpha (ERα), RNA polymerase (Pol II) and histone marks H2AZ and H3K4me3 were measured over time using ChIP-Seq experiments in MCF7 cells stimulated with estrogen. A Bayesian classifier was developed which uses the correlation of temporal binding patterns at enhancers and promoters and genomic proximity as features to predict interactions. This method was trained using experimentally determined interactions from the same system and was shown to achieve much higher precision than predictions based on the genomic proximity of nearest ERα binding. We use the method to identify a genome-wide confident set of ERα target genes and their regulatory enhancers genome-wide. Validation with publicly available GRO-Seq data demonstrates that our predicted targets are much more likely to show early nascent transcription than predictions based on genomic ERα binding proximity alone.


INTRODUCTION
Gene expression is dependent upon the binding of transcription factor (TF) proteins to genomic regions 30 which regulate transcriptional initiation (Nagarajan et al., 2014). In eukaryotic cells, these regulatory  method uses three features: evolutionary conservation, correlation of enhancer scores derived from histone 83 marks from RNA-seq data, and an average of correlations between TF ChIP-Seq and gene expression 84 across 12 cell-types. A distance constraint is also imposed to aid inference.

85
The majority of the above methods require data from multiple cell-types and therefore do not allow 86 discovery of interactions given data from one cell-type. Most existing methods also assume a stringent 87 distance constraint and are therefore unable to discover distal links beyond this constraint. Finally, these 88 methods do not take into account evidence from time course data. 89 We show how ChIP-Seq time course data that reports TF and RNA polymerase occupancy at multiple 90 time points after cellular stimulation can be used to predict enhancer-promoter interactions within 91 chromosomes. We have developed a Bayesian classifier that combines evidence from the correlation 92 of ChIP-Seq time course data at enhancers and across gene bodies with the genomic separation of EGTA, 20 mM HEPES pH 7.6, once with the solution as before but with 500 mM NaCl, once with 155 solution of composition 0.25 M LiCl, 0.5% DOC, 0.5% NP-40, 1 mM EDTA, 0.5 mM EGTA, 20 mM 156 HEPES pH 7.6 and two times with 1 mM EDTA, 0.5 mM EGTA, 20 mM HEPES pH 7.6. A control 157 library was generated by sequencing input DNA (non-ChIP genomic DNA 67%, 61%, 64% of ER-α, Pol-II (rep 1), Pol-II (rep 2), H3K4me3, and H2AZ ChIP-seq reads were 173 mapped uniquely to the genome. The non-uniquely mapped reads were discarded from further analysis.

174
Using the statistical criterion provided by MACS, we established that our sequencing depth allows for no 175 duplicates of reads, thus we discarded any duplicated reads as they are most likely an artefact in ChIP-Seq.  Figure S1. Since our analysis is aimed at intergenic 184 ER-α-bound enhancers, we ignored the consensus peaks which overlapped with either gene bodies or 185 upstream 300bp-long regions by which the genes were extended to account for a promoter region. This is 186 a limitation of the data used and the method could potentially work with different ChIP-Seq data. With 187 other enhancer-associated ChIP-Seq data then we could also potentially apply the method to intronic 188 enhancers.  Manuscript to be reviewed final number of clusters. The R implementation of AP can search through values of p to achieve an approximately pre-specified number of clusters. The method is similar to k-means but can achieve much 208 better optimisation of the k-means objective function than the standard EM algorithm.

209
To reduce the effect of noise, for Pol II we clustered only the pairs of the time series for which the 210 Pearson correlation coefficient was at least 0.2 between replicates and the total number of mapped reads 211 was at least 30. For ER-α, due to lack of replicates, we only clustered the time series with more than 100 212 reads in total across all times. Prior to the clustering we standardized each time series to z-scores to bring 213 all time series onto the same scale. We obtained 20 and 22 clusters for Pol II time series over enhancer 214 and genes, respectively. Similarly we obtained 21 and 21 clusters for ER-α time series over enhancer and 215 genes. We also jointly clustered time series of PolII and ER-α. The results of the clustering can be seen 216 in Figure S2. series (X X X j,n ,Y Y Y k,n ), and for each dataset n ∈ N, where N is a number of time course ChIP-seq datasets.

235
Additionally, the data vector also contains the Euclidean distance d j,k calculated between the genomic 236 coordinates of the canonical TSS of a gene k to the centre of an enhancer j. 237 The joint likelihood of the model can be written as: (1) The model provides a probability of observing a particular D D D j under a given structure I I I j . Due to its 238 regulatory role, an enhancer is unlikely to regulate a high number of genes, thus we can expect that the 239 true P(I I I j ), which in the Bayesian treatment is a prior distribution over the structures, would be sparse.

240
Moreover, we could expect that D j,k and D j,k ′ of any two interacting pairs k, k ′ would be interlinked, as i.e correlations c j,k,n and distance d j,k of a pair j, k of the vector D j,k would also be correlated.

245
The modelling of all dependencies however is difficult given the relative sparsity of our training data. 246 We therefore restrict the form of the joint distribution and construct an approximate joint probability  a) The joint distribution factorises 250 We assume that the likelihood P(D D D j |I I I j ) can be factorised and written in the form: where I I I j = I j,1 , . . . , Manuscript to be reviewed b) An enhancer regulates a single gene 253 We assume further, that an enhancer j can interact with only one gene k. We restrict the event space of P(D j , I j ) to its subspace P(D j , I I I (1) j,k ), where I I I (1) j,k = 0, . . . , 1 kth , . . . , 0 . From (2) the events are given by: The prior distribution P(I I I j ) follows a multivariate Bernoulli distribution, and thus the restriction is equivalent to setting the probabilities of all the structures I I I j with non-singular number of contacts i.e. I I I j,k we assume that the prior is uniform across these sparse vectors, i.e.

P(I I I
(1) so that each I I I (1) j,k is equally likely a priori.

254
c) The distribution of attributes is independent 255 Assuming that the attributes are conditionally independent, the likelihood component P(D j,k |I j,k ) becomes: where d j,k is a distance from the centre of an enhancer j to the TSS of a gene k, whereas c j,k,n is a 256 correlation between the time series of the n th time course dataset between an enhancer j and gene k. 257 Combining the assumption of the factorisable likelihood (2) with the conditional independence of attributes (5) yields, Restricting the event space to single enhancer-gene events (3) results in, The assumption of conditional independence of features in (5) and the fact that each vector I I I The posterior distribution under the model is: The posterior distribution can be used to find the probability of each structure I I I and computational cost associated with KDE, employing the same approach for negatives was infeasible.

291
Their size, however, also entails less requirement for optimised fitting, and thus to select the bandwidth 292 we resorted to the Scott's rule (Scott, 2015).

294
We trained the classifier on the odd chromosomes and estimated the training error. Similarly, we tested 295 the method on the even chromosomes and obtained the test error. Since the test data is not used to build 296 the classifier (i.e. fit the feature densities), its predictions on the test data can be considered unbiased. 297 We measured the performance in two ways. Firstly, we evaluated and plotted precisions against the True   310 We used our model to infer gene targets with strong evidence of being regulated by at least one enhancer. The probability of gene k having at least one active regulatory link from an enhancer under our model is defined,

Prediction of target genes
(1 − P (I I I (1) where the product above is equal to the probability that no enhancers regulate the gene.

321
We demonstrate our method using ChIP-Seq time course data collected from the MCF7 breast cancer

ER-α bound enhancers overlap experimentally determined promoter interaction regions 332
To locate binding events formed after stimulation with estradiol, we determined a set of genomic loci  substantially from the background for all four datasets, with interacting regions more highly correlated 361 on average. This difference is most pronounced for ER-α and Pol II (Fig. 2a and Fig. 2b) while there 362 is a much smaller difference for the histone marks H2AZ and H3K4me3 (Fig. 2c and Fig. 2d). We  Figure S3 we 369 plotted the corresponding histograms using data from all chromosomes. We observe that the distribution Manuscript to be reviewed does not change with the addition of data from even chromosomes.

371
Naive Bayes classifier performance 372 We developed a Naive Bayes classifier which integrates several discriminative features to estimate the 373 probability of interactions between enhancer and putative target genes. Fig. 3
The  Table 1 shows that using the probability cut-offs to infer links across 23 chromosomes 390 our model (combination of features: PolII, ER, distance) consistently outperforms the distance-alone 391 model in terms of the number of uncovered true links. We show that at FDR equal to 0.20 our model 392 infers 26.7 times more interactions than predictions based on proximity alone (see Table 1). In addition to 393 considering precision-recall curves, we also tested how often using maximum a posteriori probabilities 394 (MAP) to link all enhancers (in the training and test data) to their most probable promoters would result in 395 correct assignments according to the ChIA-PET data (right-most column of plots in Fig. 4a and Fig. 4b).

396
The mean performance in the MAP case is reduced and the added value of the ChIP-Seq data relative 397 to the proximity information is also reduced. This is because for many enhancers the ChIP-Seq data 398 signal is relatively weak and therefore focussing on the enhancer-promoter pairs with higher classification 399 probabilities (as in the PR curves approach) produces better quality prediction on average than when we 400 make predictions for all enhancers.

407
The majority (79%) of enhancer-promoter interactions lie within domains. The PR curves in Fig. 4d 408 and Fig. 4e show that the ER-α and distance features provide the greatest contribution to performance.

409
The Pol-II feature is also informative but does not add much to performance when combined with the 410 ER-α data. Interestingly, within domains the "data-alone" model possesses much higher predictive 411 power than in the chromosome-wide model.  our default parametrisation where we switched that parameter off. Figure S11 shows that the distributions 432 of features remain similarly unchanged. 433 We have used eight timepoints for this study, but most epigenomic time course datasets from a 434 single stimulation have fewer available timepoints. We therefore assessed the performance for reduced 435 datasets with six and four timepoints in Fig. S12. Inclusion of data with less timepoints reduced the 436 performance of the data-only method substantially, but combining data with the prior still leads to a 437 significant improvement over the prior-only model even with only four timepoints.

438
Validation of ER-regulated target gene predictions 439 Finally, we used our method to provide a highly confident (FDR of 0.25) list of directly ER-regulated 440 target genes in this system. This list (Table S1)

448
We have developed a Bayesian method which is capable of integrating genomic distance with a correlation 449 of ChIP-seq time series in order to predict physical interactions between enhancers and promoters. 450 We evaluated the performance of our method against ChIA-PET predicted links and using different