Elsevier

Methods

Volume 93, 15 January 2016, Pages 15-23
Enhancing protein function prediction with taxonomic constraints – The Argot2.5 web server

https://doi.org/10.1016/j.ymeth.2015.08.021

Highlights

  • Argot2.5 is a web server for automated function prediction based on Gene Ontology.

  • The performance has been improved with respect to the previous version, Argot2.

  • False positive predictions have been substantially reduced at low score thresholds.

  • Taxon constraints reduce erroneous annotation transfers between distant taxa.

Abstract

Argot2.5 (Annotation Retrieval of Gene Ontology Terms) is a web server designed to predict protein function. It is an updated version of Argot2, enriched with new features that enhance its usability and overall performance. The algorithmic strategy groups Gene Ontology terms by semantic similarity to infer protein function. The tool was evaluated on two independent benchmarks and compared with Argot2, PANNZER, and a BLAST-based baseline method, achieving better performance thanks to key changes at critical steps of the pipeline. The most effective changes are: (a) selecting the input data from sequence similarity searches performed against a clustered version of the UniProt databank, together with a remodeling of the weights given to Pfam hits; and (b) applying taxonomic constraints to filter out annotations that cannot apply to proteins of the species under investigation. The taxonomic rules are derived from our in-house tool, FunTaxIS, which extends those provided by the Gene Ontology Consortium. The web server is free for academic users and is available online at http://www.medcomp.medicina.unipd.it/Argot2-5/.

Introduction

The functional annotation of gene products is a crucial step in understanding the biology of living organisms, in all its physiological and pathological aspects. Since 2000, the Gene Ontology Consortium (GOC) has provided a powerful resource that collects the multitude of known functions in a structured vocabulary, the Gene Ontology [1], which is organized as a directed acyclic graph and facilitates access to functional data through automatic tools.

The development of bioinformatics methods for gene product annotation is an active field of research, and an international challenge is held periodically to assess such methods and provide a snapshot of the state of the art [2]. Automated tools predict function using different criteria, but most rely on sequence-similarity-based approaches to find matches between the input gene or protein and a database of already characterized entities (at either the whole-sequence or the domain level), in order to transfer functional features [3], [4].

Despite the considerable effort of the scientific community, the main outcome of the first Critical Assessment of Function Annotation (CAFA) is that there is significant room for improvement, because all automatic pipelines that participated in the challenge lacked both recall and precision [2]. This problem also affects annotations in the Gene Ontology Annotation database (GOA), in particular those generated automatically: even though they are a valuable resource for many proteins and many organisms [5], their error rate is difficult to quantify, let alone control. The GOC is constantly trying to limit this phenomenon by implementing a number of automatic checks that verify file formats and, in part, the data coming from annotation submitters. For example, in 2011 they introduced an “Annotation black list” which specifies protein:GO term combinations that are not allowed as annotations [6]. Furthermore, Deegan et al. [7] proposed a more general approach to prevent incorrect associations between certain functions and specific taxa, providing a list of “taxon constraints” which explicitly defines such incompatibilities. Nevertheless, novel methods are needed to improve annotation quality, either by correcting existing errors or by preventing the generation of new ones. Our group already contributed to this research field through the development of Argot2 [9], a tool for the automated prediction of protein function which placed among the ten best-performing algorithms at CAFA and CAFA2. However, the context in which the algorithm works has changed dramatically since its initial development, in particular as regards the size of the databanks: UniProt [8], for example, has nearly quadrupled its number of entries, while the number of annotations in GOA has increased by a factor of six. Such expansion negatively impacts the tool's performance, affecting both the execution time and the management of intermediate steps. In addition, new resources have become available to improve predictive ability, such as the taxonomic constraints mentioned above.

In this paper, we present Argot2.5 (Annotation Retrieval of Gene Ontology Terms), a tool designed for high-throughput annotation of large sequence data sets which improves upon its predecessor [9] through multiple novel features: (1) a clustered version of UniProt is used for BLAST searches in order to remove redundancy and reduce search time; (2) a novel semantic similarity measure is adopted and tested; (3) an extended set of taxonomic constraints is applied according to the in-house tool FunTaxIS (http://www.medcomp.medicina.unipd.it/funtaxis), which expands the list provided by the GOC [7].
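To make the idea behind feature (3) concrete, the following is a minimal Python sketch of how taxon constraints could be used to filter predicted GO terms. It is not the FunTaxIS or Argot2.5 implementation: the `TaxonConstraint` class, the two rule kinds, the function names, and the rule `GO:9999999` are all assumptions made for illustration, and the lineage is abbreviated. The sketch also ignores the propagation of constraints to descendant terms in the GO graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConstraint:
    """Hypothetical rule tying a GO term to an NCBI taxon."""
    go_term: str
    kind: str       # "only_in_taxon" or "never_in_taxon"
    taxon_id: int   # NCBI Taxonomy identifier


def violates(constraint, lineage):
    """Return True if the rule forbids its GO term for this lineage."""
    in_taxon = constraint.taxon_id in lineage
    if constraint.kind == "only_in_taxon":
        return not in_taxon
    if constraint.kind == "never_in_taxon":
        return in_taxon
    raise ValueError(f"unknown constraint kind: {constraint.kind}")


def filter_predictions(predicted_terms, lineage, constraints):
    """Drop predicted GO terms that are incompatible with the lineage."""
    blocked = {c.go_term for c in constraints if violates(c, lineage)}
    return [term for term in predicted_terms if term not in blocked]


if __name__ == "__main__":
    # Abbreviated lineage of S. cerevisiae (4932): Eukaryota, Fungi, species.
    yeast_lineage = {2759, 4751, 4932}
    rules = [
        # Illustrative rule: lactation restricted to Mammalia (40674).
        TaxonConstraint("GO:0007595", "only_in_taxon", 40674),
        # Made-up rule with a placeholder GO identifier.
        TaxonConstraint("GO:9999999", "never_in_taxon", 4751),
    ]
    predictions = ["GO:0007595", "GO:0006412", "GO:9999999"]
    print(filter_predictions(predictions, yeast_lineage, rules))
```

Representing the lineage as a set of taxon identifiers keeps each rule check a constant-time membership test, which matters when many predictions are filtered per query.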

Section snippets

Materials and methods

The Argot2.5 algorithm is based on the Argot2 approach [9]. Briefly, the algorithm starts from DNA/protein sequences and performs a BLAST [10] search against the UniProt [8] database and a HMMER3 [11] search against Pfam [12]. The results retrieved from these steps are used to query the GOA databank: the collected GO terms are ranked according to both the significance of the hit they come from (given by the e-value) and their occurrence in the results. The terms are then further grouped by means of
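The ranking step described in this snippet can be illustrated with a short Python sketch. It assumes that each hit contributes a weight of -log10(e-value) to every GO term annotated to it, and that a term's score is the sum of these weights, so that both hit significance and term occurrence raise the rank. This weighting, the cap `max_weight`, the function name, and the accession and GO identifiers are assumptions for illustration only; the snippet does not give the exact formula used by Argot2.5.

```python
import math
from collections import defaultdict


def rank_go_terms(hits, annotations, max_weight=300.0):
    """Rank GO terms by summed hit weights.

    hits:        iterable of (hit_id, e_value) pairs from a similarity search
    annotations: mapping hit_id -> iterable of GO term identifiers
    max_weight:  assumed cap, also used when the e-value is reported as 0
    """
    scores = defaultdict(float)
    for hit_id, e_value in hits:
        if e_value <= 0:
            weight = max_weight
        else:
            weight = min(max_weight, -math.log10(e_value))
        if weight <= 0:
            continue  # e-values >= 1 carry no evidence in this sketch
        for term in annotations.get(hit_id, ()):
            scores[term] += weight
    # Highest score first; ties broken by term identifier for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder hit and GO identifiers, not real search results.
    hits = [("HIT1", 1e-50), ("HIT2", 1e-10), ("HIT3", 1e-5)]
    annotations = {
        "HIT1": ["GO:A"],
        "HIT2": ["GO:B"],
        "HIT3": ["GO:B", "GO:C"],
    }
    for term, score in rank_go_terms(hits, annotations):
        print(term, round(score, 2))
```

In this toy input, `GO:B` is supported by two weaker hits and still ranks below `GO:A`, which comes from a single highly significant hit; a different weighting scheme could reverse that order.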

Yeast benchmark test set

The performance of Argot2.5 was assessed on the whole yeast proteome. We simulated a scenario in which no closely related species are present in the reference databases by removing all proteins belonging to Fungi (NCBI Taxonomy: 4751) from both UniProt and Pfam. This setting is very challenging for two main reasons: (1) some query proteins find no significant alignment in the databases, or only a few weak matches with high e-values (see also Sections 3.2 Benchmark test set dissection:

Conclusions

We present Argot2.5, a revised algorithm and an updated web server that improve upon the previous platform, Argot2; the new functionalities enhance both performance and usability. Performance was assessed and compared with the previous version on the well-characterized S. cerevisiae proteome, after removing fungal proteins from the reference databases, and on the CAFA challenge dataset. A simple BLAST-based algorithm was used as a baseline.

Taken together, the

Acknowledgements

This research is supported by “progetto d’Ateneo” PRAT CPDA138081/13, Grant “assegno senior” and “bando giovani studiosi” GRIC13AAI9 from the University of Padova.

References (27)

  • S. Kumar et al.

    Identification of a set of genes with developmentally down-regulated expression in the mouse brain

    Biochem. Biophys. Res. Commun.

    (1992)
  • M. Ashburner et al.

    Gene ontology: tool for the unification of biology. The Gene Ontology Consortium

    Nat. Genet.

    (2000)
  • P. Radivojac et al.

    A large-scale evaluation of computational protein function prediction

    Nat. Methods

    (2013)
  • R. Sharan et al.

    Network-based prediction of protein function

    Mol. Syst. Biol.

    (2007)
  • A.K. Tiwari et al.

    A survey of computational intelligence techniques in protein function prediction

    Int. J. Proteomics

    (2014)
  • N. Skunca et al.

    Quality of computationally inferred gene ontology annotations

    PLoS Comput. Biol.

    (2012)
  • R.P. Huntley et al.

    The GOA database: gene ontology annotation updates for 2015

    Nucleic Acids Res.

    (2014)
  • J.I. Deegan et al.

    Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development

    BMC Bioinf.

    (2010)
  • UniProt: a hub for protein information, Nucleic Acids Res, 43 (2015)...
  • M. Falda et al.

    Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms

    BMC Bioinf.

    (2012)
  • S.F. Altschul et al.

    Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

    Nucleic Acids Res.

    (1997)
  • S.R. Eddy

    Accelerated Profile HMM Searches

    PLoS Comput. Biol.

    (2011)
  • R.D. Finn et al.

    Pfam: the protein families database

    Nucleic Acids Res.

    (2014)

Cited by (40)

    • Genetic basis underlying the serological affinity of leptospiral serovars from serogroups Sejroe, Mini and Hebdomadis

      2022, Infection, Genetics and Evolution
      Citation Excerpt :

      Genomic regions containing genes with higher similarity value among samples of srg Sejroe than of non-Sejroe serogroup were depicted for further analysis. Proteins encoded in depicted regions were functionally annotated using the ARGOT webserver (Lavezzo et al., 2016). To verify the presence and the conservedness of those proteins in the 722 leptospiral samples from the NCBI RefSeq Genome Database, we submitted them to a sequence similarity search using the BLAST algorithm (blastp, Altschul et al., 1990) against the proteome of each leptospiral sample available at the NCBI database.

    • Genome comparison and transcriptome analysis of the invasive brown root rot pathogen, Phellinus noxius, from different geographic regions reveals potential enzymes associated with degradation of different wood substrates

      2020, Fungal Biology
      Citation Excerpt :

      Secondary metabolism genes were detected using antiSMASH 4.0 (Blin et al., 2017). Gene Ontology (GO) information was obtained using the Argot 2.5 web server (Lavezzo et al., 2016); only predictions with a total score ≥200 or with internal confidence ≥0.99 and with total score ≥2.0 were kept. Fisher’s exact tests, with p-values adjusted for multiple comparisons (Benjamini-Hochberg), were used to compare the number of transcripts in the 15 more abundant transcriptome GO terms to the number of corresponding genes assigned to the same GO terms.

1 These authors contributed equally to this work.