Keywords
Motif assessment, Motif comparison, Motif scoring functions, ChIP-seq, Motif enrichment, Motif quality
Understanding gene regulation remains a long-standing problem in biological research. The main players, transcription factors (TFs), are proteins that bind to short and potentially degenerate sequence patterns (motifs) at gene regulatory sites to promote or repress expression of target genes. The search for a code to predict binding sites and model binding affinity of TFs has led to several experimental techniques and motif discovery algorithms being developed (Figure 1).
A position weight matrix (PWM) is the most common representation of TF binding specificity. For a motif of length L, the corresponding PWM is a 4×L matrix giving the probability of observing each base b (A, C, G or T) at each position i = 1, …, L. Other representations have been introduced1–4, but the PWM remains popular due to its simplicity and ease of use, as well as the ease of visualizing it as a sequence logo5. Moreover, Weirauch et al. showed that a well-trained PWM performs comparably to more complex models6. Motifs can be found using a variety of methods, including algorithms that perform de novo motif discovery from sequences containing binding sites7–9 and in vitro methods such as protein binding microarrays (PBM)10 and high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX)11.
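As a toy illustration (not from the original study), the following sketch shows how a PWM assigns a probability to a candidate site under the usual assumption of independent positions; the matrix values are invented for the example:

```python
import numpy as np

# Hypothetical 4xL PWM for a motif of length 4; rows are A, C, G, T and
# each column holds the probability of observing that base at the position.
PWM = np.array([
    [0.80, 0.10, 0.10, 0.70],   # A
    [0.10, 0.10, 0.10, 0.10],   # C
    [0.05, 0.70, 0.10, 0.10],   # G
    [0.05, 0.10, 0.70, 0.10],   # T
])
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_probability(site, pwm):
    """Probability of a site under the PWM, assuming independent positions."""
    return float(np.prod([pwm[BASE_INDEX[b], i] for i, b in enumerate(site)]))

print(pwm_probability("AGTA", PWM))  # consensus site: 0.8 * 0.7 * 0.7 * 0.7
```

The consensus site scores highest because it takes the most probable base from each column; any mismatch multiplies in a smaller probability.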
Initially, the low resolution of the available experimental techniques for TF binding specificity detection was a hindrance to the quality of binding models. However, next generation sequencing and techniques like chromatin immunoprecipitation (ChIP) followed by deep sequencing (ChIP-seq)12 and exonuclease cleavage in ChIP-exo13 that measure TF in vivo occupancy, have improved the resolution to single-nucleotide level. In addition to providing high resolution data for motif discovery, they are a useful resource to test the quality of the available motifs since they are TF specific. However, no benchmark capable of assessing the growing range of published motifs is available, with largely subjective quality measures14.
Despite advances in techniques for analysing TF binding specificity, both in vitro and in vivo, the quality of the derived models has not improved commensurately. Although this may be explained by saturation of the PWM model’s ability to describe TF binding, the lack of a robust approach to test model quality and single out the best-performing models is an equally likely cause. How are the algorithms being developed, tested and improved? Furthermore, the number of motif-finding algorithms applied to dissimilar data sets, and consequently the number of motif models for a single TF, continues to increase. There are at least 44 PWM motif models available in 14 different databases for Hnf4a alone. How does the end-user decide which motif to use? In this study, we review and test the approaches used to evaluate TF binding models.
The available motif assessment techniques can be divided into three categories: assessment by binding site prediction, by motif comparison, or by sequence scoring and classification.
Binding site prediction. Early reviews and assessments of motif-finding algorithms tested tools on their ability to predict sites that were known or had been inserted into the sequence. Tompa et al. tested motif discovery algorithms on their ability to predict the sites of inserted motifs, using statistical measures of site sensitivity and the correlation coefficient15. In this first comprehensive study, they found that the motif assessment problem is complex and acknowledged that inserting random motifs fails to capture the biological reality of TF binding. Later, Hu et al.16 used real RegulonDB binding data in a large-scale analysis of five motif-finding algorithms. The tools available at the time performed poorly – “15–25% accuracy at the nucleotide level and 25–35% at the binding site level for sequences of 400 nt long” – largely due to the poor quality of RegulonDB annotations17.
Sandve and colleagues18–20 benchmarked motif discovery algorithms using sequences with real binding sites from TRANSFAC and with sites inserted into background generated by a third-order Markov model, respectively. Quest and colleagues21 developed the Motif Tool Assessment Platform (MTAP) to automate testing of motif discovery tools; however, it was computationally expensive and was made obsolete by new experimental data and algorithms.
The most comprehensive assessment based on binding site prediction so far has been by the Regulatory Sequence Analysis Tools (RSAT) consortium. In their ‘matrix quality’ script, they use theoretical – information content (IC) and E-values – and empirical scores computed by predicting binding sites in RegulonDB, ChIP-chip and ChIP-seq positive and negative control sequences17.
Inadequate knowledge of TF binding sites has been the main obstacle to assessing motifs and algorithms by binding site prediction. Evaluating only sites that are known, or artificially inserted, cannot account for unknown true sites, so tools that correctly identify such sites may be penalized. Until TF binding sites are well annotated, this technique cannot be used with confidence.
Motif comparison. Novel motifs can be assessed by comparison to ‘reference motifs’ using the sum of squared deviations, Euclidean distance and other statistics that measure divergence between two PWMs22,23. Thomas-Chollier et al. proposed a motif comparison approach for their RSAT algorithm that combines multiple metrics, including Pearson’s correlation, width-normalized correlation, logo dot product, correlation of IC, normalized Sandelin–Wasserman similarity, sum of squared distances and normalized Euclidean similarity for each matrix pair24. Each metric is then converted to a rank, and the mean of the ranks is taken as the overall score.
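For illustration, a minimal sketch of two such divergence measures for a pair of already-aligned, equal-width PWMs (function names are ours; real tools such as RSAT additionally search over offsets and reverse complements):

```python
import numpy as np

def pwm_pearson(pwm_a, pwm_b):
    """Pearson correlation of two aligned, equal-width PWMs (columns flattened)."""
    a = np.asarray(pwm_a, dtype=float).ravel()
    b = np.asarray(pwm_b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

def pwm_euclidean(pwm_a, pwm_b):
    """Euclidean distance between two aligned PWMs (0 means identical)."""
    diff = np.asarray(pwm_a, dtype=float) - np.asarray(pwm_b, dtype=float)
    return float(np.sqrt((diff ** 2).sum()))
```

Identical matrices give a correlation of 1 and a distance of 0; the further the column probabilities diverge, the lower the correlation and the larger the distance.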
Assessing motifs by comparison, as currently implemented, only tests similarity to the available motifs with little information on quality and ranks of the motifs. It assumes accuracy of ‘reference motifs’, with no way of assessing novel ones. In addition, the definition of ‘reference motifs’ remains largely subjective.
Assessment by scoring. Motif assessment has since shifted towards scoring positive sequences known to contain binding sites and negative background sequences without binding sites, driven by high-throughput sequencing techniques6,25–27. This avoids the need to identify binding sites a priori by focusing on the ability to classify the two sets of sequences. The differences in the assessments arise from the choice of sequences to use as positive and negative, the thresholds used to identify binding sites, the length of the sequences in both sets, the scoring function and the statistic used to quantify the performance of the tool.
For ChIP-seq data, the main differences are the length of the sequences (250bp25, 600bp27, 100bp6 or 60bp28) and the choice of negative sets (300bp downstream25,27; random sequences; sequences 5000bp from a transcription start site (TSS) or random genomic sequences6; or flanking sequences28). In addition, Agius et al.28 tested PWMs and support vector regression (SVR) models on 36bp sliding windows of the test sequences, a deviation from the other techniques. All these differences, in addition to the scoring functions and statistics used, can lead to variations in the results of comparative analyses. Users and algorithm developers therefore have to frequently re-invent the wheel to test their tools.
Figure 1 shows the evolution of experimental motif discovery assessment techniques. We have not focused on the experimental techniques or motif discovery algorithms as excellent reviews are already available14,29. Rather, we focus on TF binding models represented as a PWM and aim to determine how the choice and length of benchmark sequences, scoring functions, and the statistics influence motif assessment. We hope that this study will highlight some of the pitfalls in the previous motif assessments and advise design of a standard in motif assessment that will ensure comparability and reuse of results.
Human uniform ChIP-seq data were downloaded from the ENCODE consortium30 (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform). For each peak file, we used BEDTools v2.17.031 to extract the 500 highest scored sequences (after eliminating repeat masked sites) of 50, 100, and 250bp centred on the ChIP-seq peaks as a positive set. A similar negative set was extracted 500bp downstream from the positive sequences.
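The actual extraction was performed with BEDTools, but the windowing logic can be sketched in Python; the exact placement of the downstream negative window is our interpretation:

```python
def centred_window(peak_start, peak_end, width):
    """Window of `width` bp centred on the midpoint of a ChIP-seq peak."""
    mid = (peak_start + peak_end) // 2
    start = mid - width // 2
    return start, start + width

def downstream_negative(pos_start, pos_end, offset=500):
    """Matched negative window of the same width, `offset` bp downstream
    of the positive window (our reading of the shift described above)."""
    width = pos_end - pos_start
    return pos_end + offset, pos_end + offset + width
```

For each of the 500 top-scoring peaks, the positive window is re-centred on the peak at 50, 100 or 250bp width, and the negative window is the same width shifted 500bp downstream.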
We used motifs from a number of databases and publications, listed in Table 1. We converted these motifs from their various formats into MEME format and scored the positive and negative sequences with the GOMER, occupancy, energy and log-odds scoring functions. We quantify how each motif performs using AUC, Spearman’s rank correlation, MNCP and Pearson correlation (Figure 2). This was implemented in a Python module, freely available from https://github.com/kipkurui/Kibet-F1000Research. The repository also contains the raw data and an IPython notebook that documents how to reproduce the analysis described in this paper.
Table 1. Motif databases and publications from which motifs were obtained.

Database | Source | Size | Reference |
---|---|---|---|
JASPAR | Mixed | 127 | 32 |
UniPROBE | PBM | 386 | 10 |
Jolma | HT-SELEX | 843 | 11 |
Zhao | PBM-BEEML | 419 | 33 |
POUR | ChIP-seq | 292 | 34 |
HOCOMOCO | Mixed | 426 | 35 |
SwissRegulon | Mixed | 297 | 36 |
TF2DNA | 3D Structures | 1314 | 37 |
HOMER | ChIP-seq | 264 | 38 |
Chen2008 | PBM | 12 | 39 |
3DFOOTPRINT | 3D Structures | 297 | 40 |
GUERTIN | ChIP-seq | 609 | 41 |
CIS-BP | Mixed | 734 | 42 |
ZLAB | ChIP-seq | 409 | 43 |
When testing motifs by scoring ChIP-seq data, multiple scoring functions are available, which may affect the outcome. In the section that follows, we describe the scoring functions tested, as well as provide a review of how they have been previously applied.
GOMER scoring. The GOMER scoring framework was introduced by Granek et al.44 and later adapted for PBM sequence scoring45,46. It computes the probability g(Si, Θ) that a TF with PWM Θ binds at least one of the subsequences k of Si of length L, where L is the number of positions in the PWM, under the assumption that each site can be bound independently.
See Chen et al.45 for more details.
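A minimal sketch of the GOMER idea, scoring the forward strand only (the full framework also scores the reverse complement); function names are ours:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def site_probability(site, pwm):
    """PWM probability of one subsequence (independent positions)."""
    return float(np.prod([pwm[BASE_INDEX[b], i] for i, b in enumerate(site)]))

def gomer_score(sequence, pwm):
    """P(at least one site bound): 1 - product over windows k of (1 - p_k),
    treating each window as an independent binding opportunity."""
    L = pwm.shape[1]
    p_unbound = 1.0
    for k in range(len(sequence) - L + 1):
        p_unbound *= 1.0 - site_probability(sequence[k:k + L], pwm)
    return 1.0 - p_unbound
```

Because every window contributes, longer sequences with many weak sites can still accumulate an appreciable score, which is the intended behaviour of the framework.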
Occupancy score. The occupancy of a PWM Θ of length L on a subsequence k of S is the product of the probabilities of each base of the subsequence under the PWM, as in Equation 1.
For a sequence, the sum of the occupancies of all subsequences (sum occupancy)25,47, the maximum score (maximum occupancy)27, or the average occupancy (average motif affinity–AMA) have been used.
Sum occupancy (Equation 3) is the sum of these per-subsequence occupancies over the whole sequence.
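Since Equations 1–3 are not reproduced here, the three sequence-level variants can be sketched as follows (forward strand only; function names are ours):

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def occupancies(sequence, pwm):
    """Per-window occupancy: product of PWM base probabilities (Equation 1)."""
    L = pwm.shape[1]
    return [float(np.prod([pwm[BASE_INDEX[b], i]
                           for i, b in enumerate(sequence[k:k + L])]))
            for k in range(len(sequence) - L + 1)]

def sum_occupancy(sequence, pwm):
    """Sum of occupancies over all windows."""
    return sum(occupancies(sequence, pwm))

def max_occupancy(sequence, pwm):
    """Occupancy of the best-matching window."""
    return max(occupancies(sequence, pwm))

def average_motif_affinity(sequence, pwm):
    """Average occupancy over all windows (AMA)."""
    occ = occupancies(sequence, pwm)
    return sum(occ) / len(occ)
```

The three reductions differ in emphasis: the maximum rewards a single strong site, while the sum and average also credit multiple weaker sites.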
BEEML-PBM energy scoring. The energy scoring framework of binding energy estimation by maximum likelihood for protein binding microarrays (BEEML-PBM)4, computes the logarithm of base frequencies with the idea that this is proportional to the energy contributions of the bases. The binding energy at each location is computed; the lower the binding energy, the higher the binding affinity. It has mainly been used to score PBM data6,27.
The probability that sequence Si is bound is given by Equation 4, where, for a sequence Si of length T, the binding energy E(Si) is given by Equation 5: for a binding site of length L, ε(b, m) is the energy contribution of base b at position m, and Si(b, m) is an indicator function of site m (1 if base b is present, 0 otherwise).
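Equations 4 and 5 are not reproduced above; a reconstruction consistent with the BEEML-PBM description (our notation; μ denotes the TF chemical potential) is:

```latex
% Eq. 4: probability that sequence S_i is bound
P(S_i \,\text{bound}) = \frac{1}{1 + e^{\,E(S_i) - \mu}}

% Eq. 5: binding energy of a site, summed over motif positions m = 1..L
E(S_i) = \sum_{m=1}^{L} \sum_{b \in \{A,C,G,T\}} \epsilon(b, m)\, S_i(b, m)
```

The logistic form makes low energy correspond to high binding probability, matching the statement that lower binding energy implies higher affinity.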
Log likelihood scoring. In log likelihood scoring, used by the majority of the MEME Suite tools48, the score for a given site is the sum of the log-odds ratios of the PWM at the match site. For a sequence S of length N scored using PWM Θ, the log-odds score is given by Equation 6, where p is the background probability and I is the indicator function for base b at position i (1 if the base is present, 0 otherwise).
The score for a given sequence can then be derived by summing individual scores or by finding the maximum score. Sum log-odd scoring has generally been used by MEME Suite tools while maximum log-odds scoring has also been used to compare motifs represented differently (PWM, k-mer and SVM models) against one another27,28. Each of these approaches has inherent advantages but may produce inconsistent results.
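A sketch of log-odds scoring against a uniform background, supporting both the sum and maximum reductions described above (the pseudocount and function names are ours):

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_odds_matrix(pwm, background=0.25, pseudocount=1e-6):
    """Convert a probability PWM into log2 odds against a background frequency.
    The pseudocount avoids log(0) for zero entries."""
    return np.log2((pwm + pseudocount) / background)

def log_odds_score(sequence, pwm, reduce=max):
    """Score every window with the log-odds matrix and combine with `reduce`
    (max for maximum log-odds, sum for sum log-odds)."""
    lom = log_odds_matrix(pwm)
    L = pwm.shape[1]
    window_scores = [
        sum(lom[BASE_INDEX[b], i] for i, b in enumerate(sequence[k:k + L]))
        for k in range(len(sequence) - L + 1)
    ]
    return float(reduce(window_scores))
```

An uninformative (uniform) PWM scores every window at roughly zero, while positive scores indicate windows more likely under the motif than under the background.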
With the scores of each motif for the sequences in hand, binding prediction can be evaluated with various statistics. The area under the receiver operating characteristic curve (AUC)49 has been widely used, especially with the advent of PBM6,25,45. In addition to popularizing AUC, Clarke et al.49 also introduced a novel metric, the mean normalized conditional probability (MNCP), for quantifying the correlation between DNA features and gene regulation. This statistic has been applied to motif assessment in GIMME motifs50 and is said to be less affected than AUC by the presence of false positives, since it places emphasis on true positives. We use MNCP to test how it contributes to better prediction, in an effort to encourage its use.
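To make the difference concrete, a small sketch of both statistics over score lists: a rank-based AUC, and our reading of MNCP as the mean, over positive sequences, of the enrichment of positives above each positive's score:

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: probability a random positive outscores a random negative."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def mncp(pos_scores, neg_scores):
    """Mean normalized conditional probability (after Clarke & Granek):
    mean over positives of (fraction of positives >= s) / (fraction of all >= s)."""
    everything = pos_scores + neg_scores
    ratios = []
    for rank, s in enumerate(sorted(pos_scores, reverse=True), start=1):
        frac_pos = rank / len(pos_scores)
        frac_all = sum(1 for x in everything if x >= s) / len(everything)
        ratios.append(frac_pos / frac_all)
    return sum(ratios) / len(pos_scores)
```

AUC is bounded by 1, whereas MNCP exceeds 1 whenever positives are enriched at the top of the ranking, which is why it rewards motifs with many high-scoring true positives.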
Pearson and Spearman’s rank correlation are still widely used as measures of motif performance. Spearman’s rank correlation has been used for PBM and ChIP-seq sequences25, while Pearson’s correlation was used by Weirauch et al.6. However, Weirauch et al. cautioned against the use of Spearman’s correlation for PBM data, citing its inability to exclude low-intensity probes. We examine the usefulness of correlation in motif assessment.
In addition to comparing the scoring approaches, we use CentriMo version 4.10.0 in differential mode51 – an option that tests differences in motif enrichment between two sequence sets – in a novel way for motif assessment. We set differential mode parameters for local rather than central enrichment of all the input motifs in the positive (primary) and negative (control) set, as described in the Data section above, by using a very large threshold. The negative log of the E-value is used as the measure of a motif’s enrichment and rank. Motif enrichment has previously been performed43 using the FIMO algorithm52 to scan for motif matches in sequences and calculate an enrichment value.
The size of the putative binding region – the length of the sequences in each data set – is to some extent a proxy for the accuracy of the ChIP-seq experiment: if the experiment was accurate, a narrow region should contain the true site.
For the three variants of sequence length, we did not observe a significant effect (p=0.113 for 50 vs 100bp; p=0.0545 for 50 vs 250bp; p=0.678 for 100 vs 250bp; Wilcoxon rank-sum test) on the scoring of the sequences (Figure 3). The scores assigned at each sequence length, however, seem to indicate how the TFs bind. Motifs with higher scores at shorter sequence lengths (50 or 100bp) are generally enriched at the ChIP-seq peak, which is also a strong indicator of direct binding53. This is consistent with a previous observation that a successful ChIP-seq experiment localizes binding within about 100bp of the true site54. Others with significantly better AUC values at 250bp sequence length, like Elf1 (p=0.017, Wilcoxon rank-sum test) and Sp1 (p=0.013, Wilcoxon rank-sum test)55, are known to bind cooperatively.
Transcription factors bind their sites in a sequence-specific manner, and some have alternative binding motifs depending on the tissue or cell line. Unless the interest is tissue-specific binding, when more than one data set is available, an average should be used. For example, as shown in Figure 4, the Foxa motif from the POUR data set is significantly differentially enriched only in the A549 cell line and much less so in the other cell lines.
In light of this possible effect, the results displayed throughout this paper are based on the mean score of all the available ChIP-seq data sets to avoid a bias towards cell line-specific motifs.
Generally, the AUC and MNCP statistics are in strong agreement. However, in some cases, such as Hnf4a and Ctcf, they are not (Figure 5). The motifs ranked higher only by MNCP are generally long or have high IC (Table 2). These are highly specific motifs, confirming that MNCP favours specific motifs, which yield more true positives. When energy scoring is used, there is agreement between the scores assigned by AUC and MNCP, hinting that, like MNCP, energy scoring also places emphasis on true positive hits.
We tested the ability of PWM models to discriminate between positive (top 500 peaks of width 100bp centred on the peak) and negative (500 regions 100bp wide located 500bp downstream of the positives) sequence sets using five scoring functions. Maximum and sum log-odds scoring had low discriminative power for most of the motifs under all three statistical measures (Figure 6). However, sum log-odds scoring performed reasonably well (AUC scores over 0.55) for some TF motifs, such as Max, Nrf1, Tcf3 and Pax5.
There is no significant difference in performance when GOMER, energy or occupancy scores (sum, maximum and AMA) are used for scoring (Figure 6) with the AUC statistic (see Table_S1 for details of statistical significance). We also did not observe any significant difference (p=0.85, Wilcoxon rank-sum test) between sum and maximum occupancy (Table 3), contrary to a claim by Orenstein et al.25. The variation in the scores is particularly reduced when the MNCP statistic is used (Figure 7), although Ctcf, Egr1 and Hnf4a score significantly higher with energy scoring. For other TFs, like Pou2f2 and Esrra, the preference is reversed. These observations reflect the inherent features of the scoring functions or the statistics used.
Motif length has little bearing on the quality of a motif, independent of other factors. However, highly specific motifs with very high IC, such as those from POUR, have lower performance (Figure 8). Likewise, motifs with low IC also perform poorly in vivo.
The heat map in Figure 8 shows how the motif scores from the four discriminative functions correlate with motif length, full-length IC and average IC. The examples have no consistent correlation between the IC and the scores (Figure 8A). However, there is a negative correlation between both the total IC and motif length, and the scores except for sum log-odds scoring, which has no significant correlation (p=0.34, correlation p-value). This shows that motif length, rather than the IC of the motifs, negatively influences the scores assigned by these functions. This is not a general rule. Some TFs exemplify a different scenario. For example, Egr1 (Figure 8B) has a strong positive correlation between IC and scores and a negative correlation with motif length, showing that it has a highly specific binding site56. Mef2a, on the other hand, has a positive correlation between motif length and scores showing preference for longer low information motifs (Figure 8C). This could also reflect variability in binding sites. Ctcf has the highest negative correlation for average IC, with a neutral to positive correlation for motif length (Figure 8D), which may indicate preference for longer low IC motifs.
We have shown that the effect of scoring algorithms is TF-specific. We also test how the different databases (DBs) rank against each other overall. For TFs with more than one motif in a given DB, we use the best performing one to represent the DB. We also use motif enrichment-based assessment with CentriMo version 4.10.0 to corroborate the results of the scoring-based techniques. Motif enrichment analysis compares how motifs are enriched in foreground sequences relative to background sequences. When comparing two or more motifs for the same TF, the level of enrichment of each motif in sequences known to contain possible binding sites for that TF, relative to some background, provides a measure of motif quality.
Figure 9 provides a summary of the ranking of the various databases for the given TFs. We observed that the performance of the majority of the motif databases did not differ much, except for TF2DNA motifs, although the ranking and performance of individual motifs do differ. This further supports the observation of TF-specific performance of scoring functions, algorithms and DBs. It also shows that no single database currently outperforms the others for all TFs. There is agreement in the ranking of the best (ZLAB and HOCOMOCO) and worst performing (TF2DNA and SwissRegulon) DBs. We observe that, compared with GOMER (Figure 9A), the score for all DBs drops when using energy scoring (Figure 9B), except for POUR motifs. This shows that POUR motifs, or at least the best performing ones, are favoured by energy scoring. It is also noteworthy that POUR and GUERTIN DB motifs, discovered and tested on ENCODE ChIP-seq data, perform poorly.
We have presented a comparative analysis of the effect of scoring functions, ChIP-seq test data processing and statistics on motif assessment. We chose to focus on TF binding models represented as PWMs, since they are the most commonly used. The review reveals the complexity of the motif assessment problem, showing that no adequate solution is available so far. The available techniques focus on testing motif discovery algorithms or the experimental techniques used, but little has been done to compare the available motifs for a given TF. There is a need for an accessible, easy-to-use tool to aid end-users in choosing motifs.
The use of 100 or 250bp sequence length provides necessary discrimination for the TFs tested (Figure 3). The performance was also found to be TF specific, an observation that could reflect inherent binding behaviour; direct, indirect or cooperative binding of the TF. This supports the observation that direct binding can be inferred from ChIP-seq peaks53. We also confirm that 100bp provides acceptable specificity in motif assessment given that most TF binding sites are less than 30bp54.
Since TF binding is cell line specific57, users should be aware of bias caused by use of one cell line in an assessment. We observe that the use of more than one cell line reduces the bias towards cell line specific motifs (Figure 4).
The MNCP rank-order metric is similar to AUC but is derived by plotting true positive hits against the scores of all sequences. This places emphasis on true positives and therefore makes MNCP less affected by false positives. Our analysis confirms this observation and demonstrates the power of MNCP compared with AUC, which penalizes specific motifs (Figure 5). We propose that energy scoring has the same benefit, though further research may be needed to validate this. Although there is no clear winner among the scoring functions, the occupancy-based (GOMER, AMA, sum and maximum occupancy) and energy scoring functions are preferred. We recommend using occupancy scoring with the MNCP statistic, or energy scoring with either the AUC or MNCP statistic.
There is no significant correlation (p=0.513, correlation p-value) between the IC and the motif scores (Figure 8). This contrasts with the observations that the best-quality motifs may have low IC6 or high IC58. Weirauch et al. did not normalize for motif length, which results in high-IC motifs that are generally longer but not necessarily more specific6. A shorter motif with higher IC per position will be more specific but have lower total IC. We argue that the effect of IC on motif quality depends on TF binding behaviour: TFs with short, specific binding sites will favour high IC, while those with long, variable binding sites will be more accurately modelled with low IC. Furthermore, it has been shown that low-IC regions flanking motif sites contribute to specificity of binding in vivo58.
We have also shown that the techniques used in motif assessment have a direct effect on motif discovery. We observe how motifs generated from similar data using the same techniques can have highly variable performance, as in the POUR, ZLAB and GUERTIN motifs (Figure 9). This difference in quality can be explained by the motif discovery algorithms used, the data processing, and the assessment techniques used in each motif discovery pipeline. POUR motifs are learned from full-length sequences of the top 250 peaks using five motif-finding algorithms (MEME, MDscan, Trawler, AlignAce and Weeder)34; the ZLAB group used 100bp of the top 500 sequences centred on the ChIP-seq peaks with MEME-ChIP59; while GUERTIN reports the top five motifs for each technique, generated using MEME. For quality assessment, POUR34 used TFM-PVALUE60 to match motifs against the test ChIP-seq data set, selecting the most enriched motifs against a background composed of intergenic non-repetitive regions. The ZLAB group used FIMO52, which uses a log likelihood score for motif scanning.
The worst performing motifs, from TF2DNA, are generated from 3D models of TFs built from experimental or homology-modelled PDB files. When generating these models, the authors determined model accuracy by similarity to UniPROBE and JASPAR motifs at a given threshold, claiming that their technique successfully identifies true motifs 41–81% of the time, depending on the quality of the templates used to model the 3D structures. This supports our view that motif comparison against ‘reference motifs’ is not a reliable measure of motif quality. Although JASPAR and UniPROBE are widely used, reliance on reference motifs is problematic, especially with the advent of motif databases like HOCOMOCO and CIS-BP that have motifs with better prediction quality. As a general principle, it is not reasonable to use historical data as a benchmark for assessing potentially better new methods.
We have confirmed that motif assessment shows TF-specific variability, an observation previously made61. Assessments should focus less on how a particular motif database or algorithm performs and more on how different motifs for a particular TF perform compared with the other available motifs. For the end user, no single database can provide the sole measure of quality of new data. This raises the need for collation of the different motifs, tested using a variety of motif assessments, to provide end users with information on their ranks.
We have demonstrated that the scoring techniques used in motif assessment influence ranking of motifs in a TF-specific manner. Motif assessments and tools developed to date have focused on comparing algorithms, experimental techniques or databases. This does not help the user choose which motif to use for a given TF. Some TFs reviewed here have at least 44 PWM motifs available, raising the need for a tool that can be utilized to rank these motifs. We have also shown that data processing as well as motif assessment can have a significant influence on the quality of motifs derived. Therefore, the choice of an assessment approach should be given as much thought as that of the motif discovery algorithm to use. We have also shown the effect of IC on motif quality is influenced by TF binding behaviour.
In short, a single measure of motif quality is likely to remain elusive, pointing to the need for tools and methods for comparative analysis using multiple approaches. Lessons learned from this analysis will be useful in a number of ways. Firstly, we are working on a web-based application that will allow users to compare the motifs available in different databases for a specific TF. Secondly, we intend to extend the motif comparison approach to avoid ‘reference motifs’ bias. Thirdly, having shown the effect of motif scoring on motif discovery, we intend to use the robust motif assessment techniques we introduce to improve motif finding.
Data, software, supplementary files and documentation for ‘Transcription factor motif quality assessment requires systematic comparative analysis’ are available from Github: https://github.com/kipkurui/Kibet-F1000Research.
Archived files at the time of publication are available from Zenodo: doi: 10.5281/zenodo.3372669.
CK designed and performed the analysis and wrote the first draft. PM supervised the work and contributed to subsequent drafts. All authors read and approved the final manuscript.
The financial assistance of the South African National Research Foundation (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF. PM funding: NRF/IFR Grant 85362; CK: DST Innovation Doctoral Scholarship.
References
1. Kulakovskiy I, Levitsky V, Oshchepkov D, Bryzgalov L, et al.: From binding motifs in ChIP-Seq data to improved models of transcription factor binding sites. J Bioinform Comput Biol. 2013; 11(1): 1340004.
Competing Interests: No competing interests were disclosed.