Motif signatures of transcribed enhancers

In mammalian cells, transcribed enhancers (TrEn) play important roles in the initiation of gene expression and maintenance of gene expression levels in spatiotemporal manner. One of the most challenging questions in biology today is how the genomic characteristics of enhancers relate to enhancer activities. This is particularly critical, as several recent studies have linked enhancer sequence motifs to specific functional roles. To date, only a limited number of enhancer sequence characteristics have been investigated, leaving space for exploring the enhancers genomic code in a more systematic way. To address this problem, we developed a novel computational method, TELS, aimed at identifying predictive cell type/tissue specific motif signatures. We used TELS to compile a comprehensive catalog of motif signatures for all known TrEn identified by the FANTOM5 consortium across 112 human primary cells and tissues. Our results confirm that distinct cell type/tissue specific motif signatures characterize TrEn. These signatures allow discriminating successfully a) TrEn from random controls, proxy of non-enhancer activity, and b) cell type/tissue specific TrEn from enhancers expressed and transcribed in different cell types/tissues. TELS codes and datasets are publicly available at http://www.cbrc.kaust.edu.sa/TELS.


INTRODUCTION
In mammalian cells, spatial and temporal activation of gene transcription and maintenance of expression levels is coordinated (mainly) by interactions between DNA regulatory elements, the most prominent being promoters and enhancers 1  show that transcription in enhancers mediated by Polymerase II (POL2) occurs on a genome-wide scale 5 . Enhancers' transcription produces eRNAs, a class of non-coding RNAs whose functions are unclear 6,7 . It is interesting to note that it may be difficult to clearly separate, based on transcriptional activation similarity, enhancers and promoters since both act as promoters but generate different classes of transcripts [8][9][10] .
Several enhancer identification methods, covering both experimental and computational approaches, has been subject of reviews 11,12 . Using the available enhancer-related information 13,14 , a number of studies linked variations in enhancer sequences to disease phenotypes, and development/progression of cancer [15][16][17][18][19] . Thus, deciphering the genomic characteristics of enhancers may help better understanding enhancers' functional roles.
Up to now, there are several approaches to analyse enhancers' DNA characteristics and associate sequence properties to enhancers' activities 20,21 . However, only a limited number of cases in terms of studied enhancer sequences, sequence motifs (e.g., kmers of length 6-8 bp), organisms (e.g., mouse or Drosophila), and tissues, have previously been examined or validated experimentally 22

Identification of DNA signatures of TrEn in all FANTOM5 facets
We used TELS to compile a comprehensive catalog of motif signatures that discriminate effectively TrEn in 112 cell types/tissues included in the 'all facets' dataset from 'all facets random controls' (see Materials and Methods for data description). The average classification performance across 112 cell types/tissues in terms of PPV and GM is 85.94% and 86.06%, respectively (Supplementary Figure 1).
Considering a performance threshold of 80% PPV, we observe that the identified motif signatures classify TrEn with high accuracy in 95% of cases (107 out of 112 cell types/tissues). This suggests that the identified combinations of sequence motifs capture a great portion of the sequence specificities required in TrEn. Figure 1 shows that the performance maximization across different cell types/tissues is achieved using different combinations of motifs. The number of selected motifs ranges from 204 to four, which correspond to the maximum and minimum numbers of motifs that discriminate efficiently TrEn from random controls, a proxy of non-enhancer activity. It is apparent that the model complexity, in terms of number of predictor variables, does not affect the classification efficiency, which appears almost always quite high.
To investigate further the identified sets of motif signatures, we select nine tissues that belong to three different developmental stages namely ectoderm (brain, spinal-cord, and eye), mesoderm (kidney, heart and spleen) and endoderm (lung, liver and pancreas) according to the tissue appears similar to lung and pancreas that belong to different developmental stage, both in terms of input sequences and selected motifs.
These observations suggest that TrEn display cell type/tissue specific motif signatures that are identified by TELS. Overall, the results presented in this subsection support the hypothesis that specific genomic characteristics enable TrEn to operate in a highly cell type/tissue specific manner and for this reason the identified motifs vary across different cell types/tissues.

Identification of DNA signatures of TrEn expressed in at least one cell type/tissue from FANTOM5
In this subsection we identify motif signatures that allow us to discriminate the 'robust set' of TrEn from the 'robust set random control' with maximized classification performance (see Materials and Methods for data description). Our results show that TELS achieves this goal with an average PPV and GM of 79.70% and 80.47%, respectively. Figure 3 shows the characteristic ROC curve, using the most informative combination of 31 motifs. Overall, using this motif signature we report an average Area Under Curve (AUC) of 0.854 with standard deviation of 0.009.
Next, we compare the results obtained using the 'robust set' of TrEn with the results achieved in the previous subsection by analysing the 'all facets' dataset. In particular, by aggregating the motif signatures per cell type/tissue from the 'all facets' dataset we observe that specific motifs are selected with high frequency across 112 cell types/tissues. Figure 4 shows the set of 31 informative motifs obtained by the 'robust set' (x-axis) and the selection frequency of every individual motif (y-axis) across 112 cell types/tissues from the 'all facets' dataset. We observe that six out of the 31 motifs are selected with more than 80% of times across different cell types/tissues. Notably, these motifs, namely CG, CGA, TCG, CGT, ACG and TA, almost always help maximizing discrimination performance. We also observe that two di-nucleotides appear very discriminative, CG and TA, and this finding has been also explored in experiments in Drosophila 20 and reported by other independent studies 5 . From 31 reported motifs, 10 are tri-nucleotides rich in CG, whereas 19 are tetra-nucleotides again rich in CG.

Identification of DNA signatures of TrEn transcribed exclusively in only one cell type/tissue from FANTOM5
The hypothesis we investigate in this subsection is whether or not FANTOM5 enhancers expressed  Figure 5 is in concordance with Figure 1 showing that the maximization of the classification performance is achieved using motif signatures of different sizes. However, the classification performance as indicated by PPV, is much lower compared to the case depicted in

Comparison to other sets of motif signatures
We addition, the enhancer's sequence length has a great impact on the sequence composition and on the selection of motif signatures. We note that in our study, the TrEn length does not rely on ad-hoc windows of fixed length surrounding ChIP-seq peaks or STARR-seq peaks but derives from the TrEn length estimated in ref. (5). There are also differences in the selection of machine-learning models and tuning of model parameters (e.g., C parameter for SVM or number of SVMs in the ensemble).
Here, to make the comparison as fair as possible we use two independent classifiers, the K- From the technical point of view, TELS achieves comparable classification performance using three independent classification methods namely LR, BDT, and KNN. This indicates that the findings are not biased to one particular classification model (i.e., we used LR during the motif selection).
Moreover, our results indicate that the classification algorithm used for assessing the motifs' importance resulted in no bias in the motif selection process

Validation using ENCODE enhancers
As an additional validation process we test the discriminative capabilities of the identified motif signatures on completely independent datasets. For this purpose, we utilize chromatin-defined enhancers reported by the ENCODE consortium 36

Data availability
The primary datasets included in our study derive from the FANTOM5 Atlas of TrEn 5 Figure 3) as well additional information about Materials and Methods.

Definition of positive and negative datasets
To identify motif signatures of TrEn and develop discrimination models, we use the following three datasets that are considered 'positive' data: Generating 'negative' data for the previously described 'positive' datasets without experimental validation (i.e., MPRA or STARR-seq) is a difficult task since it is unclear based only on computational consideration to infer whether or not a particular DNA sequence has enhancer activity in a cell type/tissue. To mitigate this problem we considered two alternative approaches: i) based on artificially generated sequences; and ii) using 'real' TrEn expressed in different cell types/tissues than the one of interest.
The rational behind (i) derives from studies showing that mutations in enhancer sequences disrupts the enhancer's activity 12,16,18,19 . Thus, for every TrEn in the 'all facets' dataset as well as for the 'robust set', we mutate randomly 10% of the genomic positions in every TrEn sequence. We decided to mutate 10% of the positions and not less to have a better proxy of the non-enhancers. In this way, from every 'real' TrEn we generate a random negative sample that has exactly the same sequence length but with shuffled DNA composition 22 . We generate in total 197,373 negative controls across 112 cell types/tissues named as 'all facets random controls' and 38,554 negative controls named as 'robust set random control' respectively. To note, that throughout the previous data generation process, we make sure that none of the randomly generated sequences belong to the superset of TrEn identified by FANTOM5. In (ii) we follow the 'one vs. all' paradigm and for every cell type/tissue specific 'exclusively transcribed' enhancer set, we generate a negative set that contains exclusively transcribed enhancers from all other cell types/tissues but not from the one of interest.
This process resulted to 96 negative sets and such dataset is denoted as 'negative exclusively transcribed'.
For all 'positive' and 'negative' datasets, we use the enhancer genomic coordinates provided in the bed format to generate the actual DNA sequences in the fasta format using SAMtools 23 and the UCSC reference genome (assembly version hg19).

DNA sequence encoding
To encode the input datasets for further use by TELS, we transform all 'positive' and 'negative' data samples into numerical vectors. In TELS, we focus on small sequence motifs and not on preselected

TELS implementation
TELS works in two phases. In phase (1), TELS identifies candidate combinations of sequence motifs that characterize the class of interest. In phase (2) for every candidate combination of motifs, TELS assesses its significance by measuring the classification performance for discriminating 'positive' from 'negative' data. The objective of TELS is to select the combination of motifs that maximize separation between 'positive' and 'negative' data. Typically, determining the relative importance of a set of predictor variables via computational techniques may be used to associate differences between the considered data classes. This information can be further utilized to identify sequence characteristics that are predictive of TrEn cell type/tissue specific activities.
In phase (1), to identify candidate combinations of motifs TELS uses filtering feature selection (FS) techniques. The FS problem in bioinformatics is very well studied [25][26][27][28][29][30][31][32][33] and it is well documented that FS is a strongly 'data-dependent' process. TELS first ranks the 346 individual variables using the Gini-index based FS. We decided to use Gini-index after comparison with two other state-of-the-art algorithms for FS, namely minimum redundancy maximum relevance criterion (mRMR) 34  To mitigate this problem in phase (2), we follow a greedy approach, and assess the significance of all possible combinations of ranked features (starting with the top-1, top-2, top-3 and up to top-K, where K is 346, the total number of variables we use) by measuring the classification performance of every candidate combination. Our objective is to select the combination of motifs that based on MCC minimizes the classification error (i.e., maximizes classification performance). For this task TELS utilizes the LR classifier. LR is a simple linear classification method, which runs fast and avoids extensive optimization of model parameters that frequently leads to poor performance on unseen data 24 . The implementation is made in Matlab R2014b using built-in functions for LR ('glmfit' function with the default setting).
In one classification run with LR and one candidate motif set, we randomly split the 'positive' and 'negative' data into testing and training sets. To achieve better generalization on unseen data, we use 20% of the total size of 'positive' and 20% of the total size of 'negative' samples for training and the remaining 80% from both is kept for testing. We finally report the average classification performance of 300 runs when for each run the above mentioned random split of data is performed.
Consequently, we estimate the average classification performance of 300 runs for all candidate combinations of top-ranked motifs (i.e., 346) and this guarantees an equitable selection of the combination that maximizes classification performance. We consider Geometric mean of Sensitivity

CONCLUSION
In this study we proposed TELS, a novel machine-learning algorithm for identifying predictive short motif signatures of TrEn. We tested TELS using CAGE-defined enhancers from FANTOM5. This allowed us to compile a comprehensive catalog of cell type/tissue specific combinations of short motif signatures that maximize discrimination using different negative datasets. Specifically, our analysis reported combinations of motifs that allowed us to discriminate effectively TrEn from random controls, a proxy of non-enhancer activity. We also reported combinations of motifs that maximize classification performance of exclusively transcribed enhancers in one cell type/tissue from enhancers transcribed exclusively in all other cell types/tissues. In addition, by analysing the so-called 'robust set' of TrEn we were capable of finding 31 motif signatures predictive of TrEn broad activity. We also showed that this set is quite universal and can discriminate with high performance chromatin-defined enhancers from six different ENCODE cell-lines. However, we note that findings obtained by any computational method require further targeted wet-lab experiments even when supported by statistical evidence.
Nonetheless, within the present bioinformatics method, many improvements are possible: (a) Replacing LR with a non-linear classifier such as DBT might improve further the already high discrimination performance; (b) Performing the same analysis on TrEn data obtained by single-cell analysis, if available by FANTOM, will eliminate potential biases caused by cell population heterogeneity and may lead to more fine-grained results about the enhancer genomic landscape; (c) Applying the same analysis to CAGE-defined promoters from FANTOM5 will answer equally important questions about promoters' sequence characteristics 'encrypted' within their genomic sequence; (d) Stratifying TrEn data by their expression levels similar to ref. (9), and analysing with TELS the sequence composition of different expression subclasses, may complement the findings presented in this study.