A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features

Technological advances have lead to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. Distribution of protein Chriz (Chromator) and histone modification H3K4me3 were selected as the most informative features for the prediction of TADs characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChiP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML


31
Machine learning has proved to be an essential tool for studies in the molecular biology of the eukaryotic  However, this approach has two substantial limitations. First, the prediction of TAD state as a 61 categorical output depends on the TAD calling procedure. It requires setting a threshold for the TAD 62 boundary definition and it is insensitive to sub-threshold boundaries. 63 Alternatively, the TAD status of a region may be derived from a Hi-C map either by calculation of    Hi-C datasets for three cultured Drosophila melanogaster cell lines were taken from Ulianov et al. (2016).

102
Cell lines Schneider-2 (S2) and Kc167 from late embryos and DmBG3-c2 (BG3) from the central nervous 103 system of third-instar larvae were analysed. The Drosophila genome (dm3 assembly) was binned at the 104 20-kb resolution resulting in 5950 sequential genomic regions of equal size. Each bin was described  locus. The procedure is briefly described below.

132
When parameter gamma is fixed, Armatus annotates each genomic bin as a part of a TAD, inter-TAD, 133 or TAD boundary. The higher the gamma value is used in Armatus, the smaller on average the TADs 134 sizes are. We perform the TAD calling with Armatus for a set of parameters and characterize each bin by 135 transitional gamma at which this bin switches from being a part of a TAD to being a part of an inter-TAD 136 or a TAD boundary. We illustrate the TADs annotation and calculation of transitional gamma in Figure 1A.   (MSE), instead of precision, recall or accuracy, as for binary variables. However, the distribution of the 155 target in our problem is significantly unbalanced (see Figure 1D), because the target value of most of the 156 objects is in the interval between 0 and 3. Thus the contribution of the error on objects with a high true 157 target value may be also high in the total score when using MSE. 158 We note that the biological nature of genomic bins with high transitional gamma is different from 159 other bins. Transitional gamma equal to 10 means that the bin never transformed from being a part of where N is the number of data points, y true i is the true value for data point number i, y pred i is the predicted   Figure 2. 184 We exploited the following parameters of the biLSTM RNN in our experiments.

185
The sequence length of the RNN input objects is a set of consecutive DNA bins with fixed length that 186 was varied from 1 to 10 (window size).

188
The weighted Mean Square Error loss function was chosen and models were trained with a stochastic 189 optimizer Adam (Kingma and Ba, 2014).

190
Early Stopping was used to automatically identify the optimal number of training epochs.

191
The dataset was randomly split into three groups: train dataset 70%, test dataset 20%, and 10% data for 192 validation.

193
To explore the importance of each feature from the input space, we trained the RNNs using only one The quality metric adapted for our particular machine learning problem, wMSE, demonstrates the 208 same level of improvement of predictions for different models (see Table 2). Therefore, we conclude that 209 wMSE can be used for downstream assessment of the quality of the predictions of our models.  The context-aware prediction of TAD state is the most reliable 220 The alternative model that we studied was biLSTM neural network, which provides explicit accounting 221 for linearly ordered bins in the DNA molecule. 222 We have investigated the hyperparameters set for biLSTM and assessed the wMSE on various input 223 window sizes and numbers of LSTM Units. As we demonstrate in Figure 3, the optimal sequence length is      (Table 3). On average, the smallest errors were produced on the test set of the BG3 cell Manuscript to be reviewed Computer Science In general, the quality of the prediction has mostly improved, suggesting the universality of the  leading to the clarification of other factors' roles in the formation of smaller TAD boundaries that are 334 beyond the resolution of our models. 335 We also note that transitional gamma is just one of multiple measures of the TAD state for a genomic 336 region. We motivate the use of transitional gamma by the fact that it is a parameter-independent way

343
Here we selected features that had been reported to be associated with the chromatin structure. We 344 note there might be other factors contributing to the TAD formation that were not included in our analysis.

345
The exploration of a broader set of cell types might be a promising direction for this research, as well as 346 the integration of various biological features, such as raw DNA sequence, to the presented models. We 347 also anticipate promising outcomes of applying our approach to study the chromatin folding in various 348 species except for Drosophila. 349 The code is open-source and can be easily adapted to various related tasks. in Drosophila. 356 Recurrent neural networks incorporate the information about epigenetic surroundings. The highest 357 prediction scores were obtained by the models with the biologically interpretable input size of 120 kb that 358 aligns with the average TAD size for the 20 kb binning in Drosophila. Thus, we propose that the explicit 359 accounting for linearly ordered bins is important for chromatin structure prediction.

360
The top-influencing TAD-forming factors of Drosophila are Chriz and histone modification H3K4me2.

361
The chromatin factors that influence the prediction most are stable across the cell lines, which suggests

Manuscript to be reviewed
Computer Science