Incorporating Gene Expression in Genome-wide Prediction of Chromatin Accessibility via Deep Learning

Regulatory elements (REs) in human genome are major sites of non-coding transcription which lack adequate interpretation. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it remains a big challenge to systematically and accurately characterize REs in the context of a specific cell type. To address this problem, we proposed DeepCAGE, an deep learning framework that incorporates transcriptome profile of human transcription factors (TFs) for accurately predicting the activities of cell type-specific REs. Our approach automatically learns the regulatory code of input DNA sequence incorporated with cell type-specific TFs expression. In a series of systematic comparison with existing methods, we show the superior performance of our model in not only the classification of accessible regions, but also the regression of DNase-seq signals. A typical scenario of usage for our method is to predict the activities of REs in novel cell types, especially where the chromatin accessibility data is not available. To sum up, our study provides a fascinating insight into disclosing complex regulatory mechanism by integrating transcriptome profile of human TFs.

chromatin accessibility data is not available. Second, expression-based methods, such like BIRD [19], fail to utilize the comprehensive DNA sequence information, hence cannot make prediction of new loci that are not contained in the training dataset. DeepCAGE takes both the DNA sequence information and TFs gene expression data into consideration and adopts the architecture of densely connected convolutional neural network which has been experimentally proved to alleviate vanishing-gradient problem [20]. In a series of systematic evaluation, DeepCAGE not only achieves state-of-the-arts performance in the binary chromatin accessible status classification experiments, but also recovers continuous DNase-seq signal in regression settings. To make DeepCAGE more understandable, we proposed a strategy for visualizing the weights in the first convolutional layer. Interestingly, many known motifs were successfully recovered by DeepCAGE. We finally summarize DeepCAGE, as an effective and precise predictive model for dissecting regulatory code, which could shed light on the understanding the comprehensive regulatory mechanism.

Overview of DeepCAGE model
DeepCAGE consists of a hybrid architecture which takes DNA sequence and gene expression data as inputs and predicts chromatin accessibility in a binary value (classification) or continuous signal (regression) ( Figure 1). The input 1000 base pair (bp) DNA sequence is first extracted from the hg19 reference genome around the midpoint of a 200 bp bin and then converted to a one-hot matrix. The four columns of the matrix correspond to the four nucleotides A, C, G and T and each row represents a nucleotide in DNA sequence. For any position in DNA sequence with 'N', the corresponding row is filled with all zeros. After one-hot encoding, the DNA sequence then was fed to a densely connected deep convolutional neural network which contains three dense blocks. In each dense block, there are five layers and each layer connects to every The schematic of DeepCAGE model. On the one hand, the input DNA sequence is first converted to a one-hot matrix and goes through a densely connected deep convolutional neural network (DenseNet) which contains three dense blocks. On the other hand, the input DNA sequence is scanned with 402 non-redundant motifs and the motif scores are extracted. The expressions of the corresponding 402 human transcription factors (TFs) are extracted and quantile normalized. We combine normalized expression and motif scores in an element-wise product manner and then concatenated by the output of the DenseNet. The prediction is made based on the hybrid features. other layer in a feed-forward fashion. If traditional convolutional networks with have L connections then densely connected deep convolutional neural network will have L(L+1)/2 connections. Each layer in a dense block is constructed by two convolution operations with small filters size 1 × 1 and 3 × 1 of which 1 × 1 filters aim at reducing feature maps. Note that the first convolution operation is applied for reducing the concatenated channels to a fixed number. The dense block architecture has been proven to alleviate the vanishing-gradient problem and strengthen the feature propagation [20]. Between input and the first dense block, there is a transition module which contains a convolutional layer and max-pooling layer. The convolutional layer has 160 filters with size 4 × 15 for extracting low-level features and detecting DNA binding motifs while the max-pooling layer is applied for finding the most significant activation signal in a given sliding window of each filter. Between blocks, the similar transition module is utilized for extracting high-level features and dimension reduction. Note that the rectified linear units (ReLU) are used after each convolution operation for keeping positive activations and setting negative activation values to zeros. Batch normalization (https://arxiv.org/abs/1502.03167) and dropout strategy [21] are also applied after each ReLU function for reducing internal covariate shift and avoiding overfitting, respectively. Besides the neural network, the input DNA sequence is also scanned for searching potential transcription factors (TFs) binding sites with non-redundant motifs of 402 human core TFs from HOCOMOCO database [22] with HOMER motif finder tool from [23]. We then take the maximal score of all motif binding sites for each TF given a DNA sequence. For the gene expression input, only the gene expression levels (TPM values) of the above 402 human core TFs from a specific cell type are extracted. Next, the gene expression levels are log transformed after adding a pseudocount of 1 and then quantile normalized. Then we combine TFs gene expression data with motif score data in an element-wise product manner. Finally, we concatenate this feature to the output of the densely connected deep convolutional neural network, thus forming a hybrid feature. This hybrid feature will go through a fully connected layer with 512 hidden nodes and the final prediction is made by a sigmoid layer (see detailed hyperparameters in Supplementary Table 1). For DeepCAGE regression model, there are two major differences. First, the output layer directly applies a linear transformation as Y = instead of a sigmoid layer. Second, mean square error (MSE) is used as the loss function.

DNase-seq data processing
DNase-seq data across 55 different human cell types were downloaded from the ENCODE project [24] in narrowPeak and bam format for measuring chromatin accessibility (accession IDs are listed in Supplementary Data 1-2). The human hg19 reference genome was first divided into 200 base pair (bp) nonoverlapping bins. Considering the fact that one cell type may contain multiple replicates, we denote each bin (locus) as positive if it overlaps with the peak regions from at least half of the replicates, and denote as negative otherwise (Supplementary Figure 1). Binary label for locus in cell type represents the accessible state which was used for training and testing in classification experiments. For regression experiments, multiple replicates bam files from the same cell type are first pooled together. The number of pooled reads falling into each bin was counted for each cell type. To further eliminate the effect of different sequencing depths, bin read counts for each cell type were first divided by the total count , and then multiplied by a constant ( = { }), which is the minimal read count across all cell types. After such a procedure, the raw pooled read count for bin in cell type was converted into a normalized read count ̃= / . The normalized read counts were further log transformed after adding a pseudocount of 1. The transformed data represent the level of DNase I hypersensitivity and were used for training and testing in the regression models.

RNA-seq data processing
RNA-seq data across the same 55 different human cell types were downloaded from the ENCODE project [24] in tsv format for quantifying gene expression level (accession IDs are listed in Supplementary Data 3). The expression levels of the 402 core human transcription factors (TF) were considered and extracted. After further log transformation and quantile normalization based on TPM values, the normalized expression within each cell type was averaged across multiple replicates and the mean expression profile of each cell type was finally used for training and testing the predictive models.

Training-test data partition
We applied a five-fold cross-validation for randomly splitting training and test data. The 55 cell types were partitioned into a training dataset with 44 cell types and a test dataset with 11 cell types in each fold (see detailed five-fold partition in Supplementary Table 2-3). Due to the fact that not all genomic loci are regulatory elements, we defined 'known loci' which are denoted as positive in at least two training cell types, and we further defined 'novel loci' which are denoted as positive in at least two test cell types and do not overlap with 'known loci' in the meanwhile. Take the 5 th fold for example, there are 1,435,610 known loci and 54,171 novel loci respectively. The number of training examples can be as large as about 63 million (1,435,610 × 44). We provided both single-GPU and multi-GPU model for implementing DeepCAGE. In multi-GPU settings, the backpropagation process (GPU) and generation of the next batch of training data (CPU) is paralleled to ensure computing efficiency.

Model evaluation
Predictive models are evaluated based on test dataset. We defined two types of metrics in two perspectives, namely cell-type-wise evaluation metrics and locus-wise evaluation metrics respectively (see Supplementary Figure 1). Let × and ̂× be the true label matrix and predicted matrix where and denote the number of loci and test cell types, respectively. In the case of classification experiments, and ̂ denote the true and predicted chromatin accessible state of locus in cell type , respectively. Due to the extremely unbalanced datasets, we applied cell-type-wise auPR (area under Precision-Recall curve) and locus-wise auPR to evaluate the performance of classification models, where cell-type-wise auPR is calculated based on * = ( 1 , 2 , … , ) and ̂ * = (̂1 ,̂2 , … ,̂), locus-wise auPR is calculated based on * = ( 1 , 2 , … , ) and ̂ * = (̂1,̂2, … ,̂). In the case of regression experiments, and ̂ denote the true and predicted processed Dnase-seq signal. We applied cell-type-wise Pearson's r and locus-wise Pearson's r for evaluating the performance of regression models. Cell-type-wise Pearson's r is calculated based on * and ̂ * while locus-wise Pearson's r is calculated based on * and ̂ * . We further introduced prediction squared error (PSR) which is calculated by PSR = ∑ ∑ ( −̂) 2 / ∑ ∑ ( − ̅ * ) 2 . Note that ̅ * is the mean of * ( ̅ * = ∑ / ) and PSR is a statistic which considers both cell-type-wise prediction and locus-wise prediction. Besides, we introduced two statistics, namely cell range and cell variability, to describe the activity of a locus based on the true DNase-seq signal across test cell types. Cell range of locus is calculated by max( * ) − min ( * ) while cell variability of locus is defined by the standard deviation of * . The various metrics in both classification and regression experiments provide a comprehensive and systematical evaluation of predictive models.

Models comparison
Basset [18], DeepSEA [16] and DanQ [17] are three sequence-based neural network models which take DNA sequences as input. They cannot make cross-cell-type prediction without incorporating cell type specific information. BIRD [19] is an expression-based model which takes gene expression data as predictor variables. It can only estimate chromatin accessibility of pre-defined regions which may ignore potential novel regulatory elements (REs) that were not included in the pre-defined regions. ChromDragoNN [25] is a recently deep learning method which takes both DNA sequence and gene expression data as input. However, it ignores the motif information of each corresponding transcription factor (TF).

Gradient importance score
We proposed a strategy for prioritizing the importance of transcription factors (TFs) given a pair of cell type and a DNA locus. Typically, we first extended the locus to -100kb to 100kb from the midpoint. Then we calculated the average absolute gradient of predicted openness within above region with respect to TFs' expression.
Here, ̂ denotes the predicted openness of locus in cell type .
denotes the expression of TF in cell type . ℒ is the regulatory elements set that contains all loci within the above region. Gradient importance scores (GIS) give an intuition of which transcription factors (TFs) can play an important role in a specific cell type.

Motif analysis
We converted the weights of the filters from the first convolutional layer into position weight matrices (PWMs) by counting subsequence occurrences in a set of input sequences that activate a filter to a threshold value. All subsequences with activation value that greater than the threshold of each filter were pooled together and aligned. Then the PWMs were composed by the frequencies of the 4 nucleotides (A, C, G and T) at each position. We regard a subsequence at position as activated if ∑ ∑ , . × denotes the size of the filters which is 4 × 15 in the first convolutional layer. denotes the maximal activation value of filter which is represented by = max ).
is the control coefficient which is set to 0.7 in the experiments. We identity motifs using the TomTom v4.12.0 tool [26] with the E-value threshold 0.01 with comparison to the known motifs in the JASPAR 2018 database [27]. Besides, we calculated the information content (IC) of recovered motifs based on the information entropy. IC = ∑ ( log 2 − log 2 ) , where is the element in PWM matrix and , denote the nucleotide type and position, respectively (i = 0,1,2,3). denotes the background frequency of nucleotide (default: = 0.25).

DeepCAGE accurately predicts binary chromatin accessibiity status
We designed a series of experiments to systematically evaluate the performance of DeepCAGE in capturing genome accessibility code from the viewpoint of binary classification. DNase-seq and RNA-seq data across 55 cell types were downloaded from the ENCODE project [24]. The 55 cell types were partitioned into a training dataset with 44 cell types and a test dataset with 11 cell types using five-fold cross-validation. We then defined 'known loci' and 'novel loci' as potential regulatory elements which are determined by training data and test data respectively (see Methods). Predictive models were trained under 'known loci' of training cell types and tested under both 'known loci' and 'novel loci' of test cell types. We compared DeepCAGE to four methods, including Basset [18], DeepSEA [16], DanQ [17] and ChromDragoNN [25]. First, we evaluated the performance of the predictive models under 'known loci' of test cell types. Due to the extremely unbalanced data with a proportion of positive loci from 2.6% to 29% across 55 cell types, these models were assessed in term of cell-type-wise auPR (area under Precision-Recall curve) (see Methods). DeepCAGE achieves the highest performance among all the baseline models with the mean cell-type-wise auPR of 0.418, compared to 0.166 of Basset, 0.195 of DeepSEA, 0.188 of DanQ and 0.319 of ChromDragoNN (Figure 2A). DeepCAGE outperforms baseline models, especially sequence-based models by a large margin as our model can discern the difference of openness landscape among test cell types while sequence-based models fail to capture the cell-type specific information. Second, we evaluated the predictive models under 'novel loci' of test cell types. It is a more difficult task as 'novel loci' are not included in the training process which can reflect the ability of cross-locus prediction. DeepCAGE also achieves the highest performance among all the baseline models with the mean cell-type-wise auPR of 0.181, compared to 0.107 of Basset, 0.104 of DeepSEA, 0.110 of DanQ and 0.151 of ChromDragoNN (Figure 2A). We then assessed the performance of DeepCAGE in term of locus-wise auPR (see Methods). Note that locus-wise auPR does not exist in sequence-based baseline methods as they cannot make any different prediction of a locus in test cell types. We divided the 'known loci' into three groups based on the percentage of test cell types in which a locus is denoted as positive. For active test cell types less than 10%, DeepCAGE achieves a mean locus-wise auPR of 0.578. The mean locus-wise auPR continues to increase as the regarding loci are active in more test cell types ( Figure 2B). To sum up, DeepCAGE substantially achieves both high cell-type-wise and locus-wise prediction accuracy after incorporating with gene expression data of human core transcription factors (TFs).

DeepCAGE recovers continuous degree of chromatin accessibility
In the above classification experiments, we only consider the binary status of a locus in a specific cell type. However, binary label cannot reflect the difference of accessibility among sequences that share the same label. To address this problem, we further proposed DeepCAGE regression model (see Methods) which can predict the continuous degree of chromatin accessibility of a locus in a specific cell type. We define the degree of accessibility of a DNA sequence as the normalized average raw reads count that fall into the corresponding region (see Methods). We evaluate whether DeepCAGE can help recover continuous degree of chromatin accessibility using the same datasets. First, we compared DeepCAGE to two baseline methods, BIRD [19] and ChromDragoNN [25]. Regression models were assessed in term of cell-type-wise Pearson's r and prediction squared error (PSE) (see Methods). DeepCAGE achieves a mean cell-type-wise Pearson's r of 0.785, compared to BIRD of 0.637 and ChromDragoNN of 0.735 ( Figure 3A). DeepCAGE achieves cell-type-wise Pearson's r larger than 0.85 in 18.2% of the test cell types and it even achieves cell-typewise Pearson's r larger than 0.9 in two cell types (see examples in Figure 3B). DeepCAGE also achieves the minimal prediction square error (0.42) comparing to baseline methods (0.77 and 0.57) ( Figure 3C). DeepCAGE again outperforms baseline models by a quite large margin.
To further explore the performance of DeepCAGE under loci with difference properties, we introduced two statistics, cell range and cell variability (see Methods) for describing the activity of a considering locus based on the true DNase-seq signal cross test cell types. 'Known loci' and 'novel loci' were first divided into three groups based on the two statistics. 'Low' denotes the loci of which statistics were lower than the 1 st quartile and 'high' represents the loci of which statistics were higher than the 3 rd quartile. 'Median' means that statistics of loci were between the 1 st quartile and the 3 rd quartile. Loci of 'median' cell range and cell variability are relatively easier to be predicted by DeepCAGE (Figure 3D), which is consistent to BIRD model [19]. For example, DeepCAGE achieves a median locus-wise Pearson's of 0.512 under 'known loci' with median cell range across test cell types, compared to 0.435 of 'known loci' with low cell range and 0.399 of 'known loci' with high cell range. Comparing 'known loci' to 'novel loci', DeepCAGE achieves a slight lower performance under the loci within the same statistics range. Take cell variability for example, DeepCAGE achieves a median locus-wise Pearson's r of 0.384, 0.514 and 0.448 under 'known loci' with three cell variability ranges (low, median, high) respectively. The performance decreases slightly when it comes to 'novel loci' with three cell variability ranges (0.320, 0.431 and 0.302). Based on the above observation. we further divided 'known loci' into five groups based on the number of active test cell types when considering a locus. If a locus within a test cell type is denoted as active if the true normalized DNaseseq signal is greater than 1, and denoted as inactive otherwise, we found that the performance of DeepCAGE varies a lot under the loci with different number of active test cell types ( Figure 3E). DeepCAGE tents to have a worse performance under loci of which the number of active/inactive test cell types is less than 2. The reason might be that the training dataset contains most of the 'median' type locus, thus DeepCAGE has a more powerful generalization ability when making a prediction under 'median' type locus. We also provided a vivid example of which we visualized both the true (green) and predicted (yellow) DNase-seq signal of a same genomic region across three test cell types (GM12878, HepG2 and H1-hESC) in the UCSC genome browser [28] ( Figure 3F). Note that we also provided the 'mean signal' (red) as a reference which was calculated by taking the average DNase-seq signals across all training cell types. Obviously, DeepCAGE can well discriminate the difference of DNase-seq signals among test cell types while 'mean signal' fails.

Gradient importance score helps prioritize cell-type related TFs
We proposed a simple strategy for prioritizing cell-type related transcription factor (TF) according to the absolute gradient with regard to the expression of each TF ( Figure 4A, see Methods). Take the K562 cell line for example, we calculated the average gradient importance score (GIS) of all TFs from all test loci within up-streaming 100k bp to down-streaming 100k bp of p53 gene, which was proved to have a key role in myeloid blast transformation [29]. 402 human core TFs were then prioritized by the average GIS score in K562 cell line ( Figure 4B). Interestingly, many top ranked TFs in K562 cell type are related to functions in leukemia cell validated by previous literatures ( Figure 4B). E.g. EGR-1 (rank 1st ) is involved in regulating PMA-induced megakaryocytic differentiation of K562 cell line [30], the inhibition of E2F7 (rank 3rd ) may lead to a reduction of miRNAs involved in leukemic cell lines [31], JunB (rank 5th ) gene expression is inactivated by methylation in chronic myeloid leukemia [32]. The GO terms enriched by the top 5% prioritized TFs coding genes contain biological processes of leukocyte differentiation and hematopoietic development ( Figure 4C). The gradient importance score gives us an intuitive explanation of which TF can play an important role in predicting the chromatin accessibility given a specific cell type and a region.

Model ablation analysis of DeepCAGE
In order to evaluate contributions of DNA sequence, TFs' gene expression and TF motif scores to the performance of DeepCAGE, we performed a model ablation analysis by removing the TFs' gene expression or TF motif scores. Take the DeepCAGE regression model for example, the mean cell-type-wise Pearson's correlation decreases by 11.2% and 1.4% when removing gene expression and motif scores of 402 core human TFs, respectively (Supplementary Figure 2). Gene expression data can significantly help improves the performance of DeepCAGE in cross cell type prediction while TFs motif scores achieve slight incremental performance. The potential reason is that a large proportion of DNA sequence motifs have already been learned in the convolution layers of neural network. The motif scores can only achieve complementary contribution when contained in the prediction task.

DeepCAGE automatically learns the binding motifs of TFs
We explored the features that were automatically learned by DeepCAGE by investigating the weights of 160 filters from the first convolutional layer. The weights were converted into 160 position weight matrices (PWMs) through a proposed strategy (see Methods). We then compared to PWMs learned by DeepCAGE to the known motifs from JASPAR 2018 database [27]. 30% of the learned motifs can be matched to known motifs in JASPAR database with an E-value threshold of 0.01. Among the 48 matched filters, 52% have at least one matched core human TF used in DeepCAGE model. We first calculated the information content based on information entropy (see Methods) and set each of the filter weights to zeros and denoted the decreased cell-type-wise Pearson's correlation as the influence score for each motif. We showed several learned unmatched motifs that have high influence score ( Figure 5A) and illustrates a few examples of learned motifs that could be matched to known motifs in JASPAR database ( Figure 5B). The results demonstrated that DeepCAGE can not only help us find potential binding motifs, but also has the potential to guide us to find novel motifs which are not discovered by experiments yet.

DISCUSSION
Accurately predicting regulatory elements (REs) has become a fundamental problem in computation biology. In this paper, we demonstrated that the combination in software (CNNs), hardware (GPUs) and abundant genomic data can enable drastic performance boost on such problems. We introduced DeepCAGE, a deep learning framework that integrates a densely connected convolutional neural network to automatically extract DNA sequence signatures, capture TF binding motifs and implicate driving activity of transcription factors (TFs). DeepCAGE showed superior performance in both classification and regression settings. More importantly, it overcomes the limitations of previous sequence-based methods and expression-based methods. DeepCAGE could not only make cross cell type prediction, but also enables genome-wide prediction of chromatin accessibility, including novel loci that are not contained in the training set. A typical scenario to use DeepCAGE is to predict the activities of genome-wide regulatory elements in cell types where the chromatin accessibility data is not available.
To make DeepCAGE more understandable, we proposed two perspectives for model interpretation. First, the gradient importance score can give us an intuitional measurement of each transcription factor (TF) that Figure 5. The convolutional layer in DeepCAGE recovers both known and novel motifs. (A) Green dots and yellow dots represent known motifs and novel motifs recovered by DeepCAGE, respectively. The x-axis describes the information content (see Methods) and the y-axis denotes the influence score, which is calculated by setting the filter weights to zeros and taking average changes of openness over test cell types. (B) we display matched motifs with an E-value threshold 0.01 in the format of sequence logos (above: known motif from the JASPAR database, below: motif learned by DeepCAGE).
contributes to the prediction take of each test cell type with regard to a specific locus. Second, the visualization of kernel weights in the first convolutional layer proves to contain abundant known and novel TF binding motifs. Model interpretation can help DeepCAGE well dissect regulatory landscape under various cell conditions. Certainly, our model can be further improved from many aspects. First, DeepCAGE only considers the gene expression of 402 transcription factors (TFs) of which non-redundant binding motif could be found in HOCOMOCO database. Surely we can collect motif information of more TFs as prior knowledge by integrating multiple motif databases such as JASPAR [27], TRANSFAC [44] and UniPROBE [45] together. Second, we ignored the expression of genes which direct the synthesis of non-TF proteins. However, some proteins such as chromatin regulators (CRs), a class of enzymes with specialized function domains, can shape and maintain the epigenetic state in a cell context-dependent fashion [46], thus could also provide information for inferring chromatin accessible state. Third, DeepCAGE could not further determine the functions of a predicted region under a specific cell type currently. With the annotation of cis-regulatory elements in genome such as promoters, enhancers and silencers. DeepCAGE may further uncover the relationship between different kind of genomic regulatory elements and the genome-wide transcriptome profile.
To sum up, with DeepCAGE, a research can infer the chromatin accessibility code of the cell types with interests given corresponding gene expression data (RNA-seq, DNA microarray, etc.). We expect to see wide downstream applications of DeepCAGE with either public or in-house gene expression samples. We also hope DeepCAGE could help unveil the complex regulatory mechanism underlying various genetic signals.

DATA AVAILABILITY
DeepCAGE is an open-source software which can be downloaded from the GitHub repository (https://github.com/kimmo1019/DeepCAGE). Detailed instructions of DeepCAGE is provided for users.

SUPPLEMENTARY DATA
Supplementary Data are available online.