Elsevier

Neural Networks

Volume 21, Issue 6, August 2008, Pages 856-861

2008 Special Issue
Combining experts in order to identify binding sites in yeast and mouse genomic data

https://doi.org/10.1016/j.neunet.2008.07.004

Abstract

The identification of cis-regulatory binding sites in DNA is a difficult problem in computational biology. To obtain a full understanding of the complex machinery embodied in genetic regulatory networks it is necessary to know both the identity of the regulatory transcription factors and the location of their binding sites in the genome. We show that using an SVM together with data sampling to classify the combined results of individual algorithms specialised for predicting binding site locations can produce significant improvements upon the original algorithms. The resulting classifier produces fewer false positive predictions and so reduces the expensive experimental work needed to verify the predictions.

Introduction

Binding site prediction is both biologically important and computationally interesting. One challenging aspect is the imbalanced nature of the data, which has allowed us to explore some powerful techniques for addressing this issue. In addition, the nature of the problem allows biological heuristics to be applied to the classification problem. Specifically, we can remove some of the final predicted binding sites as not being biologically plausible.

Computational predictions are invaluable for deciphering the regulatory control of individual genes and, by extension, for aiding the automated construction of the genetic regulatory networks to which these genes contribute. Improving the quality of computational methods for predicting the location of transcription factor binding sites (TFBS) is therefore an important research goal. Currently, experimental methods for characterising the binding sites found in regulatory sequences are both costly and time consuming, so computational predictions are often used to guide experimental techniques. Larger scale studies, which reconstruct the regulatory networks for entire systems or genomes, are particularly reliant on computational predictions, there being few alternatives available.

DNA molecules are composed of a long chain of linked monomers, known as nucleotide bases, which come in four different types. The order of the bases in a DNA sequence encodes information necessary for the proper function of many biological systems. Two important examples are the gene sequences, which encode an organism’s complement of proteins, and the regulatory sequences, which, by binding transcription factors, help determine the coordinated expression of the proteins in space and time. Functional annotation of DNA sequences has taken an increasingly important role in the post-genomic era. Many regions of considerable functional importance, such as binding sites for transcription factors, consist of subtle signals encoded in the DNA sequence. Detection of these regions in genomic sequences is a critical step in our evolving understanding of gene regulation and gene regulatory networks. Transcription factor binding sites are notoriously variable from instance to instance and, in higher eukaryotes, can be located at considerable distances from the gene being regulated. Computational prediction of cis-regulatory binding sites is widely acknowledged as a difficult task (Tompa et al., 2005).

Computational analysis of DNA sequences typically relies on a string-based representation in which four characters represent the nucleotides that make up a DNA sequence. This representation has made it possible to apply a wide range of powerful computational algorithms to the analysis of DNA sequences. A limitation common to many, if not all, algorithmic approaches is that they are inherently constrained with respect to the range of binding sites that they can be expected to reliably predict. For example, co-regulatory algorithms would only be expected to successfully find binding sites common to a set of co-expressed promoters, not any unique binding sites that might also be present. Scanning algorithms are likewise limited by the quality of the position weight matrices available for the organism being studied.

Given the differing aims of these algorithms it is reasonable to suppose that an efficient method for integrating predictions from these diverse strategies should increase the range of detectable binding sites. Furthermore, an efficient integration strategy may be able to use multiple sources of information to remove many false positive predictions, while also strengthening our confidence in many true positive predictions. Algorithmic predictions prone to high rates of false positives are particularly costly to experimental biologists who use the predictions to guide experiments. High rates of false positive predictions also limit the utility of prediction algorithms in regulatory network reconstruction. Reduction of the false positive rate is therefore a high priority.

In this paper we show how algorithmic predictions can be combined so that a Support Vector Machine (SVM) can subsequently perform a new prediction that significantly improves on the performance of any one of the individual algorithms. Moreover, we show how the number of false positive predictions can be reduced by around 80%. We use two different datasets: for our major study we use a set of annotated yeast promoters taken from the SCPD (Zhu & Zhang, 1999), and then, in order to validate the method with a complex multi-cellular species, the mouse, we use a set of 47 experimentally annotated promoters extracted from the ABS (Blanco, Farre, Alba, Messeguer, & Guigo, 2006) and ORegAnno (Montgomery et al., 2006) databases.

Background

This paper explores the use of a non-linear classification algorithm for integrating different sources of evidence relating to cis-regulatory binding site locations, such as the predictions generated by a set of cis-regulatory binding site prediction algorithms. This is achieved by first generating a number of algorithmic predictions (a real number between 0 and 1 representing the probability that a nucleotide is part of a binding site, see Section 3) for a set of
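As a rough illustration of the integration step, and assuming each base algorithm yields one score in [0, 1] per nucleotide, the per-algorithm score tracks can be stacked into one feature vector per nucleotide and labelled from the annotation. The algorithm names, the toy scores and the helper `build_feature_vectors` are hypothetical; they only show a plausible data layout for a downstream classifier, not the exact encoding used in the paper.

```python
from typing import Dict, List, Tuple


def build_feature_vectors(
    scores: Dict[str, List[float]],
    annotation: List[int],
) -> Tuple[List[List[float]], List[int]]:
    """Stack per-algorithm score tracks into one feature vector per nucleotide.

    scores: hypothetical mapping from algorithm name to a list of values in
        [0, 1], one per nucleotide of the same promoter sequence.
    annotation: 1 if the nucleotide is annotated as part of a binding site,
        0 if it is treated as background.
    """
    names = sorted(scores)  # fixed column order across nucleotides
    length = len(annotation)
    for name in names:
        if len(scores[name]) != length:
            raise ValueError(f"track {name!r} does not match sequence length")
        if any(not 0.0 <= s <= 1.0 for s in scores[name]):
            raise ValueError(f"track {name!r} has a score outside [0, 1]")
    features = [[scores[name][i] for name in names] for i in range(length)]
    return features, list(annotation)


# Toy example: three hypothetical algorithms scoring a 6-nucleotide sequence.
toy_scores = {
    "algo_a": [0.1, 0.9, 0.8, 0.2, 0.0, 0.1],
    "algo_b": [0.0, 0.7, 0.9, 0.6, 0.1, 0.0],
    "algo_c": [0.2, 0.1, 0.8, 0.1, 0.0, 0.3],
}
toy_annotation = [0, 1, 1, 0, 0, 0]
X, y = build_feature_vectors(toy_scores, toy_annotation)
print(X[2], y[2])  # feature vector and label of the third nucleotide
```

Each row of `X` can then be passed, with its label in `y`, to any trainable classifier.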

Description of the data

High quality experimentally annotated datasets were used in this study. In all cases it is important to be aware that such annotations are limited to positive observations and as such cannot guarantee completeness. It is possible that additional binding sites exist in the sequences used and will here be classified as background. Any additional binding sites which are present but which are not included in the annotations will necessarily affect our evaluation of prediction accuracy in this study.

Performance metrics

As approximately 8% of the yeast dataset (see Table 3) is annotated as being part of a binding site, this dataset is imbalanced (as is the mouse dataset). Simple error rates are therefore inappropriate for evaluating the algorithms in a useful manner, and other metrics are needed. Several common performance metrics, such as Recall (also known as Sensitivity), Precision (also known as Specificity), False Positive rate (FP-Rate) and F-Score, can be defined using a confusion
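To make the point about error rates concrete, the sketch below computes the named metrics from confusion-matrix counts using their standard textbook definitions (the F-Score is taken here to be the harmonic mean of Precision and Recall, which is an assumption about the variant intended). The counts are invented for a hypothetical set of 1000 nucleotides with 8% positives and are not results from the paper.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Metrics derived from confusion-matrix counts (standard definitions)."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "precision": precision,
        "fp_rate": fp_rate,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Hypothetical classifier that labels every nucleotide as background:
# high accuracy, yet it finds no binding sites at all.
all_background = confusion_metrics(tp=0, fp=0, tn=920, fn=80)

# Hypothetical classifier that finds half of the sites at the cost of
# three false positives per true positive.
noisy = confusion_metrics(tp=40, fp=120, tn=800, fn=40)

for name, m in (("all background", all_background), ("noisy", noisy)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The trivial classifier reaches 92% accuracy with zero Recall, which is why accuracy alone says little on data this imbalanced.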

Techniques for learning imbalanced datasets

In earlier work (Sun et al., 2005b) we investigated a range of trainable classifiers applied to this problem. These included majority voting, weighted majority voting, single layer neural networks, AdaBoost, decision trees and SVMs. They produced varying results, but the more sophisticated classifiers, the SVM and AdaBoost, clearly outperformed the others.

Here we therefore use an SVM with a Gaussian kernel (using LIBSVM (Chang & Lin, 2001)). However, without addressing the imbalance of the
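The text here is cut off before it names the sampling technique, so the following is only a sketch of two generic ways of rebalancing a training set: random under-sampling of the majority (background) class and SMOTE-style interpolation between minority vectors, in the spirit of Chawla et al. (2002). The function names, the neighbour count `k` and the toy vectors are assumptions made for illustration and may differ from the configuration used in the paper.

```python
import random
from typing import List

Vector = List[float]


def undersample(majority: List[Vector], keep: int, rng: random.Random) -> List[Vector]:
    """Randomly keep `keep` majority-class vectors, without replacement."""
    return rng.sample(majority, keep)


def smote_like(minority: List[Vector], n_new: int, k: int, rng: random.Random) -> List[Vector]:
    """Create synthetic minority vectors by interpolating between a randomly
    chosen minority vector and one of its k nearest minority neighbours."""
    if len(minority) < 2 or k < 1:
        raise ValueError("need at least two minority vectors and k >= 1")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [v for v in minority if v is not base]
        # Sort the remaining minority vectors by squared Euclidean distance.
        others.sort(key=lambda v: sum((a - b) ** 2 for a, b in zip(base, v)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7]]
majority = [[i / 100, (i * 7 % 100) / 100] for i in range(40)]

balanced_majority = undersample(majority, keep=6, rng=rng)
balanced_minority = minority + smote_like(minority, n_new=3, k=2, rng=rng)
print(len(balanced_majority), len(balanced_minority))
```

Synthetic points are only added to the training data; the test data should keep its original class balance so that the reported metrics remain meaningful.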

Biologically constrained post-processing

One important concern when applying classifier algorithms to the output of many binding site prediction algorithms is that the classifier decisions could be biologically infeasible. The original algorithms only predict reasonable, contiguous sets of base pairs as constituting complete binding sites. However, when their outputs are combined in our meta-classifier, each base pair is predicted independently of the neighbouring base pairs, and it is therefore possible to get lots of short predicted
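One simple way to enforce such a constraint, sketched below under the assumption that predictions are a 0/1 label per base pair, is to relabel any run of consecutive positive predictions shorter than a minimum length as background. The helper `remove_short_sites` and the threshold `min_length=5` are hypothetical; the threshold used in the paper is not stated in this excerpt.

```python
from typing import List


def remove_short_sites(predictions: List[int], min_length: int) -> List[int]:
    """Relabel runs of consecutive 1s shorter than `min_length` as background (0)."""
    cleaned = list(predictions)
    n = len(cleaned)
    i = 0
    while i < n:
        if cleaned[i] == 1:
            j = i
            while j < n and cleaned[j] == 1:
                j += 1  # j ends one past the last base pair of this run
            if j - i < min_length:
                for pos in range(i, j):
                    cleaned[pos] = 0
            i = j
        else:
            i += 1
    return cleaned


# Per-base-pair predictions: runs of length 1, 5 and 2.
raw = [0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]
print(remove_short_sites(raw, min_length=5))
# -> [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Only the run of five survives, so the filter removes isolated predictions without touching sites of plausible length.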

Results

Before presenting the main results we should point out that predicting binding sites accurately is extremely difficult. For the yeast dataset the performance of the best individual original algorithm (Fuzznuc) is given in Table 5.

Here we can see over three times as many false positives as true positives. This makes the predictions almost useless to a biologist as most of the suggested binding sites will need expensive experimental validation and most will not be useful. Therefore a key aim

Discussion

The identification of regions in a sequence of DNA that are regulatory binding sites is a very difficult problem. Individually the original prediction algorithms are inaccurate and consequently produce many false positive predictions. Our results show that by combining the predictions of the original algorithms we can make a significant improvement on their individual results. This suggests that the predictions that they produce are complementary, perhaps giving information about different

References (26)

  • C.T. Brown et al. New computational approaches for analysis of cis-regulatory networks. Developmental Biology (2002)
  • T.R. Hughes et al. Functional discovery via a compendium of expression profiles. Cell (2000)
  • P. Radivojac et al. Classification and knowledge discovery in protein databases. Journal of Biomedical Informatics (2004)
  • I. Abnizova et al. Transcription binding site prediction using Markov models. Journal of Bioinformatics and Computational Biology (2006)
  • Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In 15th European...
  • A. Apostolico et al. Efficient detection of unusual words. Journal of Computational Biology (2000)
  • Bailey, T.L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in...
  • M. Blanchette et al. FootPrinter: A program designed for phylogenetic footprinting. Nucleic Acids Research (2003)
  • E. Blanco et al. ABS: A database of annotated regulatory binding sites from orthologous promoters. Nucleic Acids Research (2006)
  • Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector...
  • N.V. Chawla et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (2002)
  • R.J. Henery. Methods for comparison
  • D. Karolchik et al. The UCSC genome browser database. Nucleic Acids Research (2003)