Enhancing protein function prediction with taxonomic constraints – The Argot2.5 web server
Introduction
The functional annotation of gene products is a crucial step for understanding the biology of living organisms, in all its physiological and pathological aspects. Since 2000, the Gene Ontology Consortium (GOC) has provided a powerful resource to collect the multitude of known functions in a structured vocabulary, the Gene Ontology [1], which is organized as a directed acyclic graph and facilitates access to functional data through automatic tools.
The development of bioinformatics methods for gene product annotation is an active field of research, and an international challenge is held periodically to assess such methods and provide a snapshot of the state of the art [2]. Automated tools predict function using different criteria, but most of them rely on sequence-similarity-based approaches to find matches between the input gene or protein and a database of already characterized entities (at either the whole-sequence or the domain level), in order to transfer functional features [3], [4].
Despite the considerable effort of the scientific community, the main outcome of the first Critical Assessment of Function Annotation (CAFA) is that there is significant room for improvement, because all the automatic pipelines that participated in the challenge suffered from a lack of both recall and precision [2]. This problem also affects annotations in the Gene Ontology Annotation database (GOA), in particular those generated automatically: even though they are a valuable resource for many proteins and many organisms [5], their error rate is difficult to quantify, let alone control. The GOC constantly tries to limit this phenomenon by implementing a number of automatic checks which verify file formats and, in part, the data coming from annotation submitters. For example, in 2011 it introduced an “Annotation black list” which specifies protein:GO term combinations that are not allowed as annotations [6]. Furthermore, Deegan et al. [7] proposed a more general approach to prevent incorrect associations between certain functions and specific taxa, providing a list of “taxon constraints” which explicitly define such incompatibilities. Nevertheless, novel methods are needed to improve annotation quality, either by correcting existing errors or by preventing the generation of new ones. Our group has already contributed to this research field through the development of Argot2 [9], a tool for the automated prediction of protein function which placed in the top ten of the best performing algorithms at CAFA and CAFA2. However, the context in which the algorithm works has changed dramatically since its initial development, in particular as regards the size of the databanks: UniProt [8], for example, has nearly quadrupled its number of entries, while the number of annotations in GOA has increased by a factor of six. Such expansion negatively impacts the tool's performance, by affecting both the execution time and the management of intermediate steps.
In addition, new resources have become available to improve the predictive ability, such as the taxonomic constraints mentioned before.
In this paper, we present Argot2.5 (Annotation Retrieval of Gene Ontology Terms), a tool designed for high-throughput annotation of large sequence data sets which improves upon its predecessor [9] through several novel features: (1) a clustered version of UniProt is used for BLAST searches in order to remove redundancy and reduce search time; (2) a novel semantic similarity measure is adopted and tested; (3) an extended set of taxonomic constraints is applied according to the in-house developed tool FunTaxIS (http://www.medcomp.medicina.unipd.it/funtaxis), which expands the list provided by the GOC [7].
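As a rough illustration of how taxonomic constraints can prune predictions, the following Python sketch discards GO terms that are incompatible with the lineage of the query organism. It is not the FunTaxIS or Argot2.5 implementation: the `TaxonConstraint` class, the function names and the small constraint table are hypothetical, and the sketch only assumes that constraints take the "only in taxon" / "never in taxon" form used for GO taxon constraints. Propagation of constraints to descendant GO terms is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConstraint:
    """Hypothetical container for the constraints attached to one GO term."""
    only_in: frozenset = frozenset()   # term is valid only inside these taxa
    never_in: frozenset = frozenset()  # term is never valid inside these taxa


def is_allowed(go_term, lineage, constraints):
    """Return True if go_term is compatible with the organism's lineage.

    lineage is the set of NCBI taxon IDs from the organism up to the root.
    """
    constraint = constraints.get(go_term)
    if constraint is None:
        return True  # unconstrained terms are always kept
    if constraint.only_in and not (constraint.only_in & lineage):
        return False  # organism lies outside every permitted taxon
    if constraint.never_in & lineage:
        return False  # organism lies inside a forbidden taxon
    return True


def filter_predictions(predictions, lineage, constraints):
    """Keep only (go_term, score) pairs that satisfy the taxon constraints."""
    return [(term, score) for term, score in predictions
            if is_allowed(term, lineage, constraints)]


# Toy data for illustration only (not taken from the paper).
# Lineage of S. cerevisiae: cellular organisms, Eukaryota, Fungi, species.
yeast_lineage = frozenset({131567, 2759, 4751, 4932})
toy_constraints = {
    # lactation restricted to Mammalia (taxon 40674)
    "GO:0007595": TaxonConstraint(only_in=frozenset({40674})),
}
toy_predictions = [("GO:0007595", 0.8), ("GO:0006412", 0.9)]

print(filter_predictions(toy_predictions, yeast_lineage, toy_constraints))
```

Representing the lineage as a set makes each check a simple set intersection, so the filter stays cheap even when applied to every predicted term of a large data set.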
Materials and methods
The Argot2.5 algorithm is based on the Argot2 approach [9]. Briefly, the algorithm starts from DNA/protein sequences and performs a BLAST [10] search against the UniProt [8] database and a HMMER3 [11] search against Pfam [12]. The results retrieved from these steps are used to query the GOA databank: the collected GO terms are ranked according to both the significance of the hit they come from (given by the e-value) and their occurrence in the results. The terms are then further grouped by means of
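The ranking step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the published scoring scheme: the text only says that terms are ranked by hit significance and by occurrence, so converting each e-value to a weight of -log10(e-value) and summing the weights per GO term are assumptions made here for illustration, and `rank_go_terms` and `min_evalue` are hypothetical names.

```python
import math
from collections import defaultdict


def rank_go_terms(hits, min_evalue=1e-300):
    """Rank GO terms collected from database hits.

    hits: iterable of (evalue, go_terms) pairs, one per BLAST/HMMER hit.
    Assumed weighting: each hit contributes -log10(evalue) to every GO term
    it carries, so terms seen in many significant hits score highest.
    """
    scores = defaultdict(float)
    for evalue, go_terms in hits:
        # Clamp e-values reported as 0 and ignore negative weights (evalue > 1).
        weight = max(0.0, -math.log10(max(evalue, min_evalue)))
        for term in set(go_terms):  # count each term once per hit
            scores[term] += weight
    # Highest score first; ties broken by term ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy hits for illustration only.
toy_hits = [
    (1e-50, ["GO:0006412", "GO:0003735"]),
    (1e-10, ["GO:0006412"]),
    (1e-5, ["GO:0005840"]),
]
for term, score in rank_go_terms(toy_hits):
    print(term, round(score, 2))
```

Summing log-scaled weights lets a term supported by several moderate hits compete with a term supported by a single very strong hit, which is one simple way to combine significance and occurrence.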
Yeast benchmark test set
The performance of Argot2.5 was assessed over the whole yeast proteome, simulating the scenario in which no closely related species are present in the reference databases by removing all proteins belonging to Fungi (NCBI Taxonomy: 4751) from both UniProt and Pfam. This setting is very challenging for two main reasons: (1) some query proteins do not find any significant alignment in the databases, or find only a few weak matches with high e-values (see also Sections 3.2 Benchmark test set dissection:
Conclusions
We present Argot2.5, a revised algorithm and an updated web server which improve upon the previous platform, Argot2; the new functionalities enhance both performance and usability. Performance assessment and comparison with the previous version were carried out on the well-characterized S. cerevisiae genome, with fungal proteins removed from the reference databases, and on the CAFA challenge dataset. A simple BLAST-based algorithm was used as the baseline.
Taken together, the
Acknowledgements
The research is supported by “progetto d’Ateneo” PRAT CPDA138081/13, Grant “assegno senior” and “bando giovani studiosi” GRIC13AAI9 from University of Padova.
References (27)
- et al. Identification of a set of genes with developmentally down-regulated expression in the mouse brain. Biochem. Biophys. Res. Commun. (1992)
- et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000)
- et al. A large-scale evaluation of computational protein function prediction. Nat. Methods (2013)
- et al. Network-based prediction of protein function. Mol. Syst. Biol. (2007)
- et al. A survey of computational intelligence techniques in protein function prediction. Int. J. Proteomics (2014)
- et al. Quality of computationally inferred gene ontology annotations. PLoS Comput. Biol. (2012)
- et al. The GOA database: gene ontology annotation updates for 2015. Nucleic Acids Res. (2014)
- et al. Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development. BMC Bioinf. (2010)
- UniProt: a hub for protein information, Nucleic Acids Res, 43 (2015)...
- et al. Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms. BMC Bioinf. (2012)
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res.
- Accelerated Profile HMM Searches. PLoS Comput. Biol.
- Pfam: the protein families database. Nucleic Acids Res.
1 These authors contributed equally to this work.