GO Bench: shared hub for universal benchmarking of machine learning-based protein functional annotations

Abstract

Motivation: Gene annotation is the problem of mapping proteins to their functions, represented as Gene Ontology (GO) terms and typically inferred from primary sequence. It is a multi-label, multi-class classification problem that has attracted growing interest for its use in characterizing the millions of proteins with unknown function. However, there is no standard GO dataset for benchmarking newly developed machine learning models within the bioinformatics community, so the significance of reported improvements remains unclear.

Results: GO Bench is the first effort to provide an easy-to-use, configurable hub for training and evaluating gene annotation models. It provides easy access to pre-specified datasets and carries out the non-trivial steps of preprocessing and filtering all data according to custom presets through a web interface. The GO Bench web application can also be used to evaluate and display any trained model on leaderboards for annotation tasks.

Availability and implementation: The GO Bench datasets are freely available at www.gobench.org. Code is hosted at github.com/mofradlab, with repositories for the website code, core utilities and usage examples (Supplementary Section S.7).

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
The Gene Ontology (GO) is a curated network of over 40 000 terms capturing the possible functions of a protein (Ashburner et al., 2000). Matching a protein with one or more GO terms is a useful method for characterization, but out of hundreds of millions of sequenced proteins, only a small fraction are functionally annotated. Thus, a goal for the bioinformatics community is predicting annotations from sequence, with the potential to aid researchers in understanding organism proteomes (Ghatak et al., 2019) or identifying the significance of mutations (Ding et al., 2004). Dozens of machine learning methods have emerged for GO annotation (Zhou et al., 2019), but there are several barriers to evaluating new models, starting with the technical difficulty of generating a strong dataset with relevant metrics. Creating a complete dataset often requires downloading and parsing the 90 GB Gene Ontology Annotation (GOA) database and developing custom scripts. Additionally, it is difficult to directly compare results from models built on different datasets. The current platform for fair comparison of GO annotation models is the CAFA competition (Zhou et al., 2019), but because it relies on new experimental data for evaluation, it is only run every few years and its difficulty may vary between rounds. In practice, teams either reproduce the dataset-construction effort for each model they wish to compare against or replicate the CAFA dataset; both are costly and inefficient.

Our GO Bench website offers several simple, standardized datasets for the GO annotation problem and allows users to customize their datasets through a simple interface while preserving comparability between results. It also allows researchers to upload model predictions, in the same style as for the CAFA competition, and view benchmarking statistics for all submitted models. GO Bench's primary purpose is to generate customizable but compatible datasets of GO annotations through a user-customized pipeline (Fig. 1). In addition, GO Bench allows users to upload model predictions in order to compare their results on a public leaderboard. For reproducibility, or for customized projects, we release our codebase as an open-source Python package, along with several code repositories showing sample usage (Supplementary Section S.7).

Data collection: 566 996 high-quality protein sequences are obtained from SwissProt (The UniProt Consortium, 2020), and the GOA dataset is cross-referenced for roughly 1.3 million annotations (Huntley et al., 2015). Annotations carry evidence codes representing their quality (Chibucos et al., 2017) and are filtered for experimental evidence only by default. However, there is the option to include lower-quality annotations, such as those that are computationally predicted but human-reviewed, because they may improve model performance in practice.

Train-test: For model evaluation following training, we pre-generate two data divisions: a random split, which assigns proteins to train or test at random, and a cluster50 split, which groups proteins of high (50%) sequence similarity. The latter is used by default, since it prevents highly similar proteins from appearing in both training and testing data and artificially inflating results. For compatibility, proteins used in 'Benchmarking GO function predictions using negative annotations' are filtered from the testing data (Warwick Vesztrocy and Dessimoz, 2020). A minimal sketch of these two steps, evidence filtering and cluster-aware splitting, follows.
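The sketch below illustrates evidence-code filtering over a GAF-style annotation file and a split that assigns whole similarity clusters to train or test. The GAF column layout is the standard format, but the evidence-code set shown and all function names are illustrative assumptions, not GO Bench's actual package API.

```python
import random
from collections import defaultdict

# Experimental evidence codes (Chibucos et al., 2017); widen this set to
# admit lower-quality annotations, e.g. curated but non-experimental ones.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def load_annotations(gaf_path, allowed=EXPERIMENTAL):
    """Map each protein accession to its GO terms, keeping allowed evidence."""
    annotations = defaultdict(set)
    with open(gaf_path) as handle:
        for line in handle:
            if line.startswith("!"):  # GAF comment/header lines
                continue
            cols = line.rstrip("\n").split("\t")
            protein, go_term, evidence = cols[1], cols[4], cols[6]
            if evidence in allowed:
                annotations[protein].add(go_term)
    return annotations

def cluster_split(protein_to_cluster, test_fraction=0.2, seed=0):
    """Assign whole similarity clusters to train or test, so that proteins
    sharing >=50% identity never straddle the split."""
    clusters = defaultdict(list)
    for protein, cluster_id in protein_to_cluster.items():
        clusters[cluster_id].append(protein)
    ids = sorted(clusters)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    test = {p for cid in ids[:n_test] for p in clusters[cid]}
    train = {p for cid in ids[n_test:] for p in clusters[cid]}
    return train, test
```

Splitting by whole clusters, rather than by individual proteins, is what prevents near-duplicate sequences from leaking between training and testing data.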
Annotation propagation: The GO contains not only functional terms but also an inheritance structure in which 'child' terms qualify as instances of their 'parents' (Ashburner et al., 2000). Only the most specific term is recorded when a protein annotation is discovered; thus, annotations are optionally propagated up through the GO structure before use (sketched below).

Term filtering: The GO contains over 40 000 terms, but thousands have few or even zero associated proteins. GO Bench typically filters out terms that are too infrequent for a model to learn from, based on a user-determined cutoff. Other factors, such as the quality of evidence for an annotation or its date of creation, are also available for filtering.

Leaderboard: Users may submit model predictions to a leaderboard containing several different testing datasets and metrics. In particular, our leaderboard evaluates models on the Smin and Fmax metrics used by the popular CAFA competition (Clark and Radivojac, 2013; Zhou et al., 2019), as well as on the Fscore metric known to work well for datasets with many positive but unlabeled datapoints (Bekker and Davis, 2020) (Supplementary Section S.5.3). Model performance is graphically displayed for each leaderboard entry; a sketch of the Fmax computation is also given below.
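Two of the steps above are simple enough to sketch in code. First, annotation propagation is a transitive closure over the GO graph: each recorded term is expanded to include all of its ancestors. The parent map would be parsed from a GO release; its construction, and the function names here, are illustrative assumptions rather than GO Bench's actual API.

```python
def propagate(terms, parents):
    """terms: set of GO IDs annotated to one protein.
    parents: dict mapping each GO ID to the set of its direct parents."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))  # walk up toward the root
    return closed
```

Second, the protein-centric Fmax metric of CAFA (Clark and Radivojac, 2013) takes the best F1 over a sweep of score thresholds, averaging precision over proteins with at least one prediction and recall over all proteins. The array layout and edge-case handling below are illustrative assumptions; GO Bench's evaluator (Supplementary Section S.5.3) may differ in detail.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=np.linspace(0.0, 1.0, 101)):
    """y_true: (n_proteins, n_terms) binary array of propagated annotations.
    y_score: (n_proteins, n_terms) predicted scores in [0, 1]."""
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & (y_true == 1)).sum(axis=1)        # true positives per protein
        n_pred, n_true = pred.sum(axis=1), y_true.sum(axis=1)
        covered = n_pred > 0                           # proteins with >=1 prediction
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp / np.maximum(n_true, 1)).mean()   # benchmark proteins have >=1 term
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```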

Results and conclusion
We include several baseline models: the CAFA version of BLAST, a convolutional model based on DeepGOPlus (Kulmanov and Hoehndorf, 2020), and a feedforward network taking 3-mer distributions of sequences as input, similar to the original ProtVec paper (Asgari and Mofrad, 2015). The naive model is based only on GO annotation frequencies (sketched below). In Table 1, we show our results for the Molecular Function category with a cluster50 train/test split and only experimental evidence codes in the testing data.

Note to Table 1: Models were evaluated in the Molecular Function domain on the Fmax, Smin and Fscore metrics (detailed in Supplementary Section S.1.1), with the train/test split by cluster (Supplementary Section S.2). The 3-mer model was trained on experimental annotations only, on all human-reviewed annotations (non-IEA) and finally on all annotations. The highest scores are bolded for each metric.

Notably, our baseline 3-mer model performs best when trained on all but completely computational annotations. To our knowledge, the GO Bench project provides the first directly usable datasets for training and benchmarking GO annotation models. The datasets allow for flexible trade-offs while maintaining a shared backbone of testing data splits, which guarantees compatibility between results. On these datasets, we offer competitive baselines and find that typically untapped data sources can substantially improve performance. We expect GO Bench to be widely used by the bioinformatics community, because it unifies and simplifies efforts to evaluate new models through the standard benchmark presented in the GO Bench leaderboard.
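For concreteness, the naive baseline referenced above can be expressed in a few lines: every protein receives the same score for a GO term, namely that term's relative frequency among training proteins. The names below are illustrative, not part of the GO Bench package.

```python
from collections import Counter

def naive_predictor(train_annotations):
    """train_annotations: dict mapping each protein to its set of GO terms.
    Returns a function that scores any protein identically, by term frequency."""
    counts = Counter(t for terms in train_annotations.values() for t in terms)
    n = len(train_annotations)
    frequencies = {term: c / n for term, c in counts.items()}
    return lambda _sequence: frequencies
```

Despite its simplicity, this baseline is a standard point of comparison in CAFA-style evaluations, since frequent, shallow GO terms become easy to 'predict' once annotations are propagated.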

Funding
This work was supported by the National Science Foundation [17284077].
Conflict of Interest: none declared.

Data availability
The data underlying this article are available at www.gobench.org. The datasets were derived from sources in the public domain: UniProt (www.uniprot.org) and GOA (https://www.ebi.ac.uk/GOA/index).