2008 Special Issue
Combining experts in order to identify binding sites in yeast and mouse genomic data
Introduction
Binding site prediction is both biologically important and computationally interesting. One challenging aspect is the imbalanced nature of the data, which has allowed us to explore some powerful techniques for addressing this issue. In addition, the nature of the problem allows biological heuristics to be applied to the classification problem: specifically, we can remove some of the final predicted binding sites as not being biologically plausible.
Computational predictions are invaluable for deciphering the regulatory control of individual genes and, by extension, for aiding the automated construction of the genetic regulatory networks to which these genes contribute. Improving the quality of computational methods for predicting the location of transcription factor binding sites (TFBS) is therefore an important research goal. Currently, experimental methods for characterising the binding sites found in regulatory sequences are both costly and time consuming, so computational predictions are often used to guide experimental techniques. Larger scale studies, reconstructing the regulatory networks for entire systems or genomes, are particularly reliant on computational predictions, there being few alternatives available.
DNA molecules are composed of a long chain of linked monomers, known as nucleotide bases, which come in four different types. The sequence of bases in a DNA sequence can be used to encode information necessary for the proper function of many biological systems. Two important examples include the gene sequences which encode an organism’s complement of proteins and the regulatory sequences which by binding transcription factors help determine the coordinated expression of the proteins in space and time. Functional annotation of DNA sequences has taken an increasingly important role in the post-genomic era. Many regions of considerable functional importance, such as binding sites for transcription factors, consist of subtle signals encoded in the DNA sequence. Detection of these regions in genomic sequences is a critical step in our evolving understanding of gene regulation and gene regulatory networks. Transcription factor binding sites are notoriously variable from instance to instance and they can be located at considerable distances from the gene being regulated in higher eukaryotes. Computational prediction of cis-regulatory binding sites is widely acknowledged as a difficult task (Tompa et al., 2005).
Computational analysis of DNA sequences typically relies on a string based representation where four characters represent the sequence of nucleotides defining a DNA sequence. The use of string based representations of DNA sequences has made possible the application of a wide range of powerful computational algorithms to the analysis of DNA sequences. A limitation common to many if not all algorithmic approaches is that they are inherently constrained with respect to the range of binding sites that they can be expected to reliably predict. For example, co-regulatory algorithms would only be expected to successfully find binding sites common to a set of co-expressed promoters, not any unique binding sites that might also be present. Scanning algorithms are likewise limited by the quality of the position weight matrices available for the organism being studied.
Given the differing aims of these algorithms, it is reasonable to suppose that an efficient method for integrating predictions from these diverse strategies should increase the range of detectable binding sites. Furthermore, an efficient integration strategy may be able to use multiple sources of information to remove many false positive predictions, while also strengthening our confidence in many true positive predictions. Algorithmic predictions prone to high rates of false positives are particularly costly to experimental biologists who use the predictions to guide experiments. High rates of false positive predictions also limit the utility of prediction algorithms in regulatory network reconstruction. Reduction of the false positive rate is therefore a high priority.
In this paper we show how algorithmic predictions can be combined so that a Support Vector Machine (SVM) can subsequently perform a new prediction that significantly improves on the performance of any one of the individual algorithms. Moreover, we show how the number of false positive predictions can be reduced by around 80%. We use two different datasets: for our major study we use a set of annotated yeast promoters taken from the SCPD (Zhu & Zhang, 1999), and then, in order to validate the method with a complex multi-cellular species, the mouse, we use a set of 47 experimentally annotated promoters extracted from the ABS (Blanco, Farre, Alba, Messeguer, & Guigo, 2006) and ORegAnno (Montgomery et al., 2006) databases.
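As a minimal sketch of the combination step, assume that each base algorithm emits one score in [0, 1] per nucleotide. The scores can then be stacked into one feature vector per nucleotide, which is the form a classifier such as an SVM expects. The algorithm names, the scores, and the helper `stack_predictions` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def stack_predictions(per_algorithm: Dict[str, List[float]]) -> List[List[float]]:
    """Turn {algorithm: per-nucleotide scores} into one feature vector per nucleotide."""
    names = sorted(per_algorithm)  # fixed column order, independent of dict order
    lengths = {len(per_algorithm[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all algorithms must score the same sequence positions")
    # zip(*rows) transposes: row i of the result holds every algorithm's score
    # for nucleotide i.
    return [list(column) for column in zip(*(per_algorithm[name] for name in names))]


# Hypothetical scores for a 5-nucleotide stretch from three made-up algorithms.
scores = {
    "algo_a": [0.1, 0.9, 0.8, 0.2, 0.0],
    "algo_b": [0.0, 0.7, 0.6, 0.1, 0.1],
    "algo_c": [0.3, 0.4, 0.9, 0.0, 0.2],
}
features = stack_predictions(scores)
```

Each row of `features` would then be paired with a binary label (binding site or background) taken from the annotation and passed to the classifier for training.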
Background
The use of a non-linear classification algorithm for the purposes of integrating different sources of evidence relating to cis-regulatory binding site locations, such as the predictions generated from a set of cis-regulatory binding site prediction algorithms, is explored in this paper. This is achieved by first generating a number of algorithmic predictions (a real number between 0 and 1 representing the probability that a nucleotide is part of a binding site, see Section 3) for a set of
Description of the data
High quality experimentally annotated datasets were used in this study. In all cases it is important to be aware that such annotations are limited to positive observations and as such cannot guarantee completeness. It is possible that additional binding sites exist in the sequences used and will here be classified as background. Any additional binding sites which are present but which are not included in the annotations will necessarily affect our evaluation of prediction accuracy in this study.
Performance metrics
As approximately 8% of the yeast dataset (see Table 3) is annotated as being a part of a binding site, this dataset is imbalanced (as is the mouse dataset). If the algorithms are to be evaluated in a useful manner, simple error rates are inappropriate and it is therefore necessary to use other metrics. Several common performance metrics, such as Recall (also known as Sensitivity), Precision (also known as Specificity), False Positive rate (FP-Rate) and F-Score, can be defined using a confusion
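The metrics named above can be sketched from the four confusion counts as follows. The formulas are the standard textbook definitions, and the F-Score is assumed to be the harmonic mean of Precision and Recall; the paper's own definitions are not visible here. The counts are hypothetical, chosen only to mirror the roughly 8% positive rate.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "fp_rate": fp_rate,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Hypothetical case: 1000 nucleotides, 80 of them (8%) in binding sites,
# and a classifier that labels everything as background.
all_background = confusion_metrics(tp=0, fp=0, tn=920, fn=80)
```

In this hypothetical case the accuracy is 0.92 even though the recall is 0, which illustrates why a simple error rate is uninformative on data this imbalanced.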
Techniques for learning imbalanced datasets
In earlier work (Sun et al., 2005b) we investigated a range of trainable classifiers applied to this problem. These include majority voting, weighted majority voting, single layer neural networks, AdaBoost, decision trees and SVMs. These produced varying results, but the more sophisticated classifiers, the SVM and AdaBoost methods, clearly outperformed the others.
We therefore use an SVM with a Gaussian kernel, implemented with LIBSVM (Chang & Lin, 2001). However, without addressing the imbalance of the
Biologically constrained post-processing
One important concern when applying classifier algorithms to the output of many binding site prediction algorithms is that the classifier decisions could result in biologically infeasible results. The original algorithms only predict reasonable, contiguous sets of base pairs as constituting complete binding sites. However, when combined in our meta-classifier, each base pair is predicted independently of the neighbouring base pairs, and it is therefore possible to get lots of short predicted
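One way to picture this kind of post-processing is a filter that resets any run of predicted binding-site positions shorter than a minimum length. This is only a sketch: the paper's actual rule is cut off in this excerpt, and both the helper `drop_short_runs` and the threshold `min_len` are hypothetical.

```python
from typing import List


def drop_short_runs(labels: List[int], min_len: int) -> List[int]:
    """Reset runs of 1s shorter than min_len to 0; longer runs are kept."""
    out = list(labels)  # work on a copy so the input is not modified
    i, n = 0, len(out)
    while i < n:
        if out[i] == 1:
            j = i
            while j < n and out[j] == 1:  # find the end of this run of 1s
                j += 1
            if j - i < min_len:
                out[i:j] = [0] * (j - i)
            i = j
        else:
            i += 1
    return out


# Hypothetical per-nucleotide predictions (1 = predicted binding site).
predicted = [0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]
cleaned = drop_short_runs(predicted, min_len=5)
```

With `min_len=5`, the isolated single position and the two-base run are discarded, while the five-base run survives.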
Results
Before presenting the main results we should point out that predicting binding sites accurately is extremely difficult. For the yeast dataset the performance of the best individual original algorithm (Fuzznuc) is as given in Table 5.
Here we can see over three times as many false positives as true positives. This makes the predictions almost useless to a biologist as most of the suggested binding sites will need expensive experimental validation and most will not be useful. Therefore a key aim
Discussion
The identification of regions in a sequence of DNA that are regulatory binding sites is a very difficult problem. Individually, the original prediction algorithms are inaccurate and consequently produce many false positive predictions. Our results show that by combining the predictions of the original algorithms we can make a significant improvement over their individual results. This suggests that the predictions that they produce are complementary, perhaps giving information about different
References
- et al. (2002). New computational approaches for analysis of cis-regulatory networks. Developmental Biology.
- et al. (2000). Functional discovery via a compendium of expression profiles. Cell.
- et al. (2004). Classification and knowledge discovery in protein databases. Journal of Biomedical Informatics.
- et al. (2006). Transcription binding site prediction using Markov models. Journal of Bioinformatics and Computational Biology.
- Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In 15th European...
- et al. (2000). Efficient detection of unusual words. Journal of Computational Biology.
- Bailey, T.L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in...
- et al. (2003). FootPrinter: A program designed for phylogenetic footprinting. Nucleic Acids Research.
- Blanco, Farre, Alba, Messeguer, & Guigo (2006). ABS: A database of annotated regulatory binding sites from orthologous promoters. Nucleic Acids Research.
- Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector...
- SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
- Methods for comparison.
- The UCSC genome browser database. Nucleic Acids Research.