NetGO 3.0: Protein Language Model Improves Large-scale Functional Annotations

As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve performance. However, it mainly utilizes proteins with experimentally supported functional annotations, without leveraging valuable information from the vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved performance comparable to that of the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively. NetGO 3.0 is freely accessible at https://dmiip.sjtu.edu.cn/ng3.0.


Introduction
Proteins are complex molecules that play essential roles in various biological activities. To understand the underlying mechanisms of an organism as a physical system, annotating the functions of proteins is a crucial task. Gene Ontology (GO) was established in 1998 to describe functional information on genes, RNAs, and proteins at varying levels of detail. It contains three domains, molecular function (MF), biological process (BP), and cellular component (CC), with over 40,000 terms [1]. As of November 2022, the Universal Protein Knowledgebase (UniProtKB) contained more than 230 million raw protein sequences, but fewer than 0.1% of them had experimental annotations [2]. It is thus desirable to develop high-performance computational methods to achieve automated function prediction (AFP) without costly experiments [3].
AFP is a large-scale multi-label classification problem in which multiple related GO terms are assigned to a target protein. In the last few years, several high-performance web servers have been developed for AFP, such as INGA 2.0 [4], DeepGOWeb [5], MetaGO [6], and QAUST [7]. Under the learning to rank (LTR) framework [8], GOLabeler [9], NetGO [10], and NetGO 2.0 [11] achieved state-of-the-art performance in the recent community-wide Critical Assessment of Functional Annotation (CAFA) [3]. Specifically, NetGO 2.0 integrates protein information from different sources, such as sequences, protein domains, protein-protein interaction networks, and scientific literature, to encode proteins in a computer-understandable way. However, it does not leverage valuable information from unannotated proteins (> 99.9% of all known proteins).
Recently, the idea of pre-training in natural language processing [12] has been extended to build protein language models using self-supervised learning on millions of sequences [13][14][15]. Most protein language models predict the masked or next amino acid within a sequence and generate protein embeddings that can generalize across downstream tasks (more details are given in File S1). Some recent studies have explored protein language models for AFP [16,17]. However, they share a common limitation: less frequent GO terms (e.g., those with fewer than 40 annotated proteins), which account for around 75% of total annotations in the CAFA setting, are excluded from the evaluation. In this work, we predicted the associations between proteins and each GO term based on embeddings from Evolutionary Scale Modeling (ESM)-1b, a model trained on over 250 million protein sequences [13]. Our experimental results showed that the learned representations were helpful for AFP. Therefore, we developed NetGO 3.0 by incorporating ESM-1b embeddings to improve the performance extensively, which highlights the predictive power of protein language models for AFP.

Method
Protein language models
A challenging problem is how to represent protein sequences as fixed-length vectors that capture the realistic sequence-function relationship. Traditional methods rely on a holistic understanding of protein properties. Recently, protein language models have provided a solution that interprets protein sequences as sentences and amino acids as words to extract fundamental features of a protein with rich and systematic information. Protein language models train nonlinear neural networks with an unsupervised objective on a large-scale dataset of protein sequences [13][14][15][21].
Generally, protein language models apply deep learning models such as recurrent neural networks (RNNs) and Transformers to learn statistical embeddings of proteins from very large numbers of sequences. UniRep represented protein sequences as fixed-length vectors using a long short-term memory (LSTM) network trained on 24 million sequences [15]. Tasks Assessing Protein Embeddings (TAPE) distilled protein properties from sequences by semi-supervised learning based on ResNet, LSTM, and Transformer, and then evaluated their performance on five biologically relevant tasks [21]. Moreover, a multi-task learning framework was recently proposed to incorporate structural information (e.g., contact maps and structural similarity prediction tasks) to enrich protein language models [22]. Furthermore, researchers applied protein language models to study protein molecular function prediction [17]. UDSMProt put forward a task-agnostic representation for proteins and achieved good performance on protein-level prediction tasks, namely enzyme class prediction and GO prediction [16]. However, neither of the latter two methods considered less frequent GO terms.
In this study, we propose a new component of NetGO 3.0, logistic regression (LR)-ESM, which utilizes ESM-1b to generate protein-level representations by average pooling across all residue-level embeddings [13]. ESM-1b is a 34-layer Transformer-based model with 650 million parameters, trained on 250 million protein sequences from the Universal Protein Archive (UniParc) database.

Implementation
NetGO 2.0 integrates seven component methods: Naïve, BLAST-KNN, LR-3mer, LR-InterPro, Net-KNN, LR-Text, and Seq-RNN. In NetGO 3.0, we replaced Seq-RNN with LR-ESM, which predicts function based on a protein language model. Specifically, LR-ESM utilizes ESM-1b, a 34-layer Transformer-based model trained on the UniParc database with 250 million sequences [13], to generate protein embeddings and complete the prediction. As ESM-1b has a limit on protein sequence length, we kept the first 1000 amino acids for those protein sequences longer than 1024. We then used ESM-1b to encode each amino acid of a target protein as an embedding of size 1280. To obtain the protein-level embedding, we applied average pooling over all amino acid positions, which collects information from sequence data alone. Finally, LR-ESM used the protein embeddings as input to train LR classifiers and estimated the association between target proteins and each GO term.
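The LR-ESM steps described above (truncation, per-residue encoding, average pooling, and one LR classifier per GO term) can be sketched as follows. This is only an illustrative sketch in plain Python: `toy_encoder` is a hypothetical stand-in for ESM-1b that returns small fake vectors (the real embeddings have size 1280), and the GO term weights are made-up values rather than trained parameters.

```python
import math

def truncate(sequence, length_limit=1024, keep=1000):
    """Keep the first `keep` residues of sequences longer than `length_limit`."""
    return sequence[:keep] if len(sequence) > length_limit else sequence

def toy_encoder(sequence, dim=4):
    """Hypothetical stand-in for ESM-1b: one deterministic fake vector per residue."""
    return [[((ord(aa) * (d + 1)) % 7) / 7.0 for d in range(dim)] for aa in sequence]

def mean_pool(residue_embeddings):
    """Average the per-residue vectors into a single protein-level vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]

def lr_score(embedding, weights, bias):
    """Logistic-regression score of one GO term for one protein embedding."""
    z = bias + sum(w * x for w, x in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-z))

# One (weights, bias) pair per GO term; the values below are illustrative only.
go_term_models = {"GO:0005515": ([0.5, -0.2, 0.1, 0.3], -0.1)}

sequence = truncate("MKTAYIAKQR")
protein_vec = mean_pool(toy_encoder(sequence))
scores = {term: lr_score(protein_vec, w, b) for term, (w, b) in go_term_models.items()}
```

Because pooling averages over positions, proteins of any length map to a vector of the same size, which is what allows a single LR classifier per GO term to be trained on all proteins.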

Benchmark datasets
As NetGO 2.0 collected the data following the setting of CAFA, we utilized the same benchmark dataset to evaluate the performance of NetGO 3.0 and the competing methods. Table S1 reports the number of proteins in the benchmark dataset.
To take advantage of the latest annotation data, we collected sequences and GO terms before January 2022 from Universal Protein (UniProt) [2], Gene Ontology Annotation (GOA) [23], and GO [1]. Similarly, we trained and updated our model on the new dataset by following the standard protocols of NetGO 2.0 [11]. Training data are all experimental annotation data before January 2020. Validation data are all experimental no-knowledge and limited-knowledge proteins annotated from January 2020 to December 2020. Testing data are all experimental no-knowledge proteins annotated between January 2021 and December 2021. More details on the new dataset and the definitions of no-knowledge and limited-knowledge proteins are given in File S1 and Table S2.
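The time-based split above can be illustrated with a short sketch. The date boundaries follow the text, while the function names are hypothetical. The `knowledge_type` helper assumes the usual CAFA definitions: a no-knowledge protein has no prior experimental annotation in any GO domain, and a limited-knowledge protein has prior annotations only in domains other than the target one. The definitions actually used are those in File S1.

```python
from datetime import date

TRAIN_END = date(2020, 1, 1)  # training: annotations before January 2020
VALID_END = date(2021, 1, 1)  # validation: January 2020 to December 2020
TEST_END = date(2022, 1, 1)   # testing: January 2021 to December 2021

def assign_split(annotation_date):
    """Map the date of an experimental annotation to a dataset split."""
    if annotation_date < TRAIN_END:
        return "train"
    if annotation_date < VALID_END:
        return "valid"
    if annotation_date < TEST_END:
        return "test"
    return None  # outside the collected time range

def knowledge_type(domains_before, target_domain):
    """Classify a protein by the GO domains it was annotated in before the cutoff."""
    if not domains_before:
        return "no-knowledge"
    if target_domain not in domains_before:
        return "limited-knowledge"
    return "known"
```

For instance, a protein first annotated in MF in mid-2021 with no earlier experimental annotation would be a no-knowledge test protein under these assumptions.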

Results
We compared the performance of NetGO 3.0 with the competing methods on the benchmark dataset from NetGO 2.0. The performance was evaluated by the area under the precision-recall curve (AUPRC) and two standard metrics in CAFA, the maximum F1-score (Fmax) and the minimum semantic distance (Smin). The definitions of these three metrics are given in Section S1 of File S1.
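As an illustration of the protein-centric Fmax used in CAFA, the following sketch computes it from toy predictions. It assumes the standard CAFA formulation: at each threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all evaluated proteins, and the maximum F1 over thresholds is reported. The exact definition used here is in File S1, and propagation of scores along the GO hierarchy is omitted for brevity.

```python
def fmax(predictions, truths, thresholds=None):
    """predictions: {protein: {term: score}}; truths: {protein: non-empty set of terms}."""
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truths.items():
            predicted = {g for g, s in predictions.get(protein, {}).items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(true_terms))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

toy_predictions = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"A": 0.8}}
toy_truths = {"P1": {"A"}, "P2": {"A", "C"}}
toy_fmax = fmax(toy_predictions, toy_truths)
```

In this toy case the best threshold keeps only the high-scoring term "A" for both proteins, giving precision 1 and recall 0.75.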

Performance comparison of NetGO 3.0 with its component methods and competing methods
Table 1 reports the test results for NetGO 3.0, NetGO 2.0, GOLabeler, DeepGOWeb, and the component methods of NetGO 3.0. Previous studies have shown that GOLabeler and NetGO 2.0 achieved top performance in CAFA3 and CAFA4, respectively [9,11], and that DeepGOWeb provides accurate protein function prediction by deep learning [5].
We selected Naïve, BLAST-KNN, and Seq-RNN [11] from NetGO 2.0 as three baseline methods. The Naïve method annotates each pair of protein and GO term with a score that equals the probability of the term appearing in the training data. BLAST-KNN assigns GO terms to a protein based on the annotations of its top BLAST hits [9]. Whereas the first two are component methods inherited from both NetGO and NetGO 2.0, Seq-RNN was a new component of NetGO 2.0, designed to extract a deep representation of a protein sequence [11]. As shown in Figure 1 and Table 1, LR-ESM outperformed the baseline methods on all three GO domains. As a replacement for Seq-RNN, LR-ESM achieved better performance. Specifically, in terms of Fmax, LR-ESM achieved 21.6%, 31.3%, and 7.5% improvements over Seq-RNN on MF, BP, and CC, respectively, which indicates the effectiveness of ESM-1b for AFP. Moreover, LR-ESM and LR-InterPro showed comparable performance in all three GO domains (Table 1). Note that, in terms of Smin, LR-ESM outperformed all other component methods and even achieved better performance on MF than NetGO 2.0. Therefore, it is reasonable to construct a more robust model by incorporating LR-ESM into NetGO 2.0.
Furthermore, we compared NetGO 3.0 with GOLabeler, DeepGOWeb, and NetGO 2.0, three high-performance methods in CAFA. As shown in Table 1, NetGO 3.0 outperformed the competing methods. In terms of Fmax and Smin, NetGO 3.0 achieved better performance in all three GO domains. For example, NetGO 3.0 achieved the highest Fmax of 0.378 in BP, which corresponds to 16.0%, 23.9%, and 3.3% improvements over GOLabeler (0.326), DeepGOWeb (0.305), and NetGO 2.0 (0.366), respectively. The results demonstrate that NetGO 3.0 can benefit from protein language models with deep dense embeddings.
To better illustrate the strength of NetGO 3.0, we drew Venn diagrams in Figure 2 to show the overlaps and differences among the prediction results of NetGO 3.0, GOLabeler, and DeepGOWeb. There are three main findings. (1) Although each method can predict distinct GO terms, the prediction results of the three methods overlapped substantially, especially in CC. Specifically, on average 6.96 GO terms per protein were predicted by all three methods in CC, which accounted for 62.5%, 70.1%, and 77.3% of the prediction results of DeepGOWeb (11.14), GOLabeler (9.84), and NetGO 3.0 (9.00), respectively. (2) DeepGOWeb predicted more GO terms but achieved lower performance than the other two methods, indicating that false-positive GO terms are common in its prediction results. For example, DeepGOWeb predicted 21.34 distinct GO terms [...] BP in terms of Fmax, where NetGO 3.0 (0.378) achieved a 16.0% improvement compared with GOLabeler (0.326). This demonstrates that NetGO 3.0 is more accurate and can predict more true-positive terms for query proteins.
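The Naïve baseline described above can be sketched in a few lines: every query protein receives the same score for a GO term, namely the fraction of training proteins annotated with that term. The function name and toy annotations below are illustrative only.

```python
from collections import Counter

def naive_scores(training_annotations):
    """training_annotations: {protein: set of GO terms}. Returns {term: frequency}."""
    n_proteins = len(training_annotations)
    counts = Counter(term for terms in training_annotations.values() for term in terms)
    return {term: count / n_proteins for term, count in counts.items()}

toy_training = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:A"},
    "P3": {"GO:A", "GO:C"},
    "P4": {"GO:B"},
}
toy_scores = naive_scores(toy_training)  # identical for every query protein
```

Because the scores ignore the query sequence entirely, this baseline mainly measures how far a method goes beyond simply predicting frequent terms.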
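The relative improvements quoted for BP can be reproduced from the Fmax values in the text with a one-line formula. The helper name below is illustrative.

```python
def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

netgo3_bp_fmax = 0.378
competitors = {"GOLabeler": 0.326, "DeepGOWeb": 0.305, "NetGO 2.0": 0.366}
improvements = {
    name: round(relative_improvement(netgo3_bp_fmax, value), 1)
    for name, value in competitors.items()
}
```

The resulting percentages are relative gains, not absolute differences in Fmax.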

Performance on specific species (humans and mice)
Species-specific analyses are helpful for researchers studying a particular species. Here, we explored the performance of different AFP methods on two model species, humans and mice. Table 2 and Table 3 show the performance of NetGO 3.0 and NetGO 2.0, as well as the components of both methods, for protein function prediction in humans and mice. We observed that all methods obtained better prediction performance on human proteins than on mouse proteins. For example, LR-InterPro, LR-ESM, and NetGO 3.0 achieved AUPRC values of 0.704, 0.690, and 0.730 on human proteins in MF, whereas the three methods only achieved AUPRC values of 0.609, 0.615, and 0.620 on mouse proteins. Annotation information for different species comes from different databases, which may explain this difference. Moreover, LR-ESM again achieved performance similar to that of LR-InterPro in both species, which suggests that features extracted by ESM-1b are as robust as those from InterProScan across species.
For human and mouse proteins, NetGO 3.0 outperformed NetGO 2.0 in all three GO domains. Specifically, NetGO 3.0 performed better than NetGO 2.0 in human BP prediction, achieving 9.3% and 9.5% improvements in terms of Fmax and AUPRC, respectively. Further, the results highlight the importance of source data and the effectiveness of the protein language model.

Performance comparison over groups categorized by the number of annotations per GO term
We divided GO terms in the test dataset into three groups according to the number of annotations per GO term: 10-30, 31-100, and > 100. Table 4 shows the M-AUPRC computed in each group, where M-AUPRC is a GO term-centric measure obtained by averaging the AUPRC over GO terms. LR-ESM outperformed the other component methods in most cases, which indicates that ESM-1b embeddings are informative. Note that LR-ESM consistently ranked higher than LR-InterPro for all three domains in the first group, especially for BP, where it obtained a 47.8% improvement. This suggests that protein embeddings learned from such a vast amount of training data are effective for AFP.
NetGO 3.0 achieved the best results among all the methods in every group and domain except the first group in BP, and the improvement over NetGO 2.0 was especially large in CC. Specifically, the improvements made by NetGO 3.0 were 28.7%, 8.4%, and 8.2% for the three groups, respectively. Moreover, we collected the CC terms in the second and third layers annotated with more than ten proteins in the test set. As shown in Figure S1, NetGO 3.0 achieved better performance on most GO terms, which suggests that ESM-1b is powerful for predicting CC-related protein functions.
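The grouping and term-centric averaging described above can be sketched as follows. This is a minimal illustration with hypothetical function names. The per-term AUPRC is approximated here by average precision (a step-wise area under the precision-recall curve), which may differ slightly from the estimator defined in File S1, and the annotation counts per term are assumed to be supplied by the caller.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve for a single GO term."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            tp += 1
            total += tp / rank  # precision at each recalled positive
    return total / positives

def group_of(n_annotations):
    """Assign a GO term to one of the three annotation-count groups."""
    if 10 <= n_annotations <= 30:
        return "10-30"
    if 31 <= n_annotations <= 100:
        return "31-100"
    if n_annotations > 100:
        return ">100"
    return None  # terms with fewer than 10 annotations are not grouped

def m_auprc_by_group(per_term, annotation_counts):
    """per_term: {term: (scores, labels)}; annotation_counts: {term: int}."""
    buckets = {}
    for term, (scores, labels) in per_term.items():
        group = group_of(annotation_counts[term])
        if group is not None:
            buckets.setdefault(group, []).append(average_precision(scores, labels))
    return {group: sum(values) / len(values) for group, values in buckets.items()}
```

Averaging per term rather than per protein gives rare GO terms the same weight as frequent ones within each group, which is why this view is informative for the 10-30 group.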

Performance comparison on difficult proteins
Following the CAFA setting, proteins with a BLAST identity of less than 0.6 to any protein in the training data are identified as "difficult proteins" [3]. In the test set, there are 66, 85, and 70 difficult proteins in MF, BP, and CC, respectively. Homology-based methods are expected to struggle to predict the functions of difficult proteins accurately. Table 5 shows the performance of different methods on difficult proteins. As expected, BLAST-KNN, a method that annotates target proteins based on homologous proteins, ranked last in all 9 experimental settings. We found that LR-InterPro and LR-ESM were the two best-performing component methods in this scenario. For example, in terms of Smin, there is only a slight difference between the two methods in the three domains. Among all component methods, LR-ESM and LR-InterPro achieved the best performance in 6 and 3 out of 9 settings, respectively. Once again, NetGO 3.0 proved to be the best method for predicting the functions of difficult proteins.
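Selecting difficult proteins under the criterion above amounts to a simple filter on the best BLAST identity against the training set. The sketch below assumes identities are already available as fractions in [0, 1]. The data layout and function name are hypothetical.

```python
def difficult_proteins(test_ids, blast_hits, cutoff=0.6):
    """Return test proteins whose best identity to any training protein is below `cutoff`.

    blast_hits: {test protein: {training protein: identity in [0, 1]}}.
    Proteins without any hit are treated as having identity 0.
    """
    return [
        protein
        for protein in test_ids
        if max(blast_hits.get(protein, {}).values(), default=0.0) < cutoff
    ]

toy_hits = {
    "Q1": {"T1": 0.95},
    "Q2": {"T1": 0.30, "T2": 0.59},
    "Q3": {},
}
toy_difficult = difficult_proteins(["Q1", "Q2", "Q3", "Q4"], toy_hits)
```

Treating proteins with no BLAST hit as difficult is a design choice of this sketch. Such proteins have no homolog to transfer annotations from, so they fall naturally into the same category.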

Performance comparison on proteins with sequence length longer than 1000 amino acids
We performed a truncation operation for proteins longer than 1000 amino acids so that ESM-1b could generate representations for all proteins in the dataset. Focusing on the performance of each method on these long proteins helps us better understand the advantages and limitations of NetGO 3.0. There are 21, 78, and 26 such test proteins in MF, BP, and CC, respectively. Table 6 shows the prediction results of the component methods, NetGO 2.0, and NetGO 3.0. We found that LR-ESM was no longer one of the best-performing component methods, which indirectly led to NetGO 3.0 performing worse than NetGO 2.0 in MF and BP. Compared with the performance on the entire test set in Table 1, the performance decreased for all methods except Net-KNN. This suggests that function prediction for long proteins is a challenge. Moreover, we compared the prediction performance of NetGO 2.0 and NetGO 3.0 on three unannotated proteins: Q3UZV7, F1QKQ1, and Q2HX28. The sequence lengths of these three proteins are 1028, 1356, and 1409, respectively. As shown in Table S3, NetGO 2.0 achieved a better AUPRC on all three proteins, which indicates that the truncated portions of long sequences are important sources of information and are critical for predicting functions. This further confirms that NetGO 3.0 needs to be improved in handling long sequences, which will be an important direction for future research.

Visualization of the predicted results
We provide more options for visualizing the predicted GO terms to better illustrate the prediction results. Compared with NetGO 2.0, the new web server offers a novel perspective on the results, which can provide more relevant information about the predicted GO terms. Figure 3 shows the new result page of NetGO 3.0, which mainly includes three ways to visualize the predicted results. Although GO terms in the top layers usually achieve higher scores and rank higher, NetGO 3.0 reports the depth of each predicted GO term, which allows users to find specific GO terms in the bottom layers. Note that the color on the result page and the node size in Figure 3D are determined by the predicted confidence score, which can help users better understand the predicted results from a new perspective.

Case study
Finally, we selected a specific protein as input and compared the results obtained by NetGO 3.0 and its competing methods. Ubiquitin-like protein 5 (UniProt ID: Q9FGZ9) is a difficult protein with low BLAST similarity to the training proteins. Table S4 shows the 18 GO terms in BP annotated to protein Q9FGZ9. Figure 4 depicts the directed acyclic graph (DAG) formed by the relationships among these 18 GO terms in GO. As shown in Table S4, BLAST-KNN failed to produce a valid result because homology-based methods are not suitable for predicting the functions of difficult proteins. LR-InterPro and LR-ESM extracted features from raw amino acid sequences and obtained better results than BLAST-KNN. Among the top 20 predicted GO terms, LR-ESM predicted 14 correct function labels, considerably more than the other methods. NetGO and NetGO 2.0 predicted only six correct GO terms, which was not competitive with LR-ESM and NetGO 3.0. The reason for [...] 2.0 failed to predict. Figure 4 illustrates the hierarchy of correctly predicted GO terms, indicating that NetGO 3.0 is able to predict GO terms with less information in the deeper layers. Overall, this typical example demonstrates that the high predictive performance of NetGO 3.0 is closely related to the protein language model.

Conclusion
We have developed NetGO 3.0 to improve the performance of large-scale AFP by incorporating a new component, LR-ESM, which utilizes a protein language model to generate powerful representations of proteins. An interesting direction for future work is to integrate protein structural information into NetGO 3.0 to further enhance the performance of AFP [18][19][20].

Figure 1
Performance comparison on Fmax and Smin
The performance of Naïve, BLAST-KNN, Seq-RNN, LR-ESM, NetGO 2.0, and NetGO 3.0 on the benchmark dataset of NetGO 2.0 over three GO domains is shown. Higher values for Fmax and lower values for Smin indicate better performance. The error bars denote the 95% confidence intervals calculated by bootstrapping with 100 iterations on the test set. MF, molecular function; BP, biological process; CC, cellular component; GO, Gene Ontology; LR, logistic regression; KNN, k-nearest neighbors; BLAST, Basic Local Alignment Search Tool; RNN, recurrent neural networks; ESM, Evolutionary Scale Modeling.

Figure 2
The overlap and difference among the GO terms predicted by GOLabeler, DeepGOWeb, and NetGO 3.0
The Venn diagrams depict the overlap and difference among the GO terms predicted by GOLabeler, DeepGOWeb, and NetGO 3.0 in MF (A), BP (B), and CC (C), respectively. Numbers in the diagrams represent the average number of predicted GO terms per test protein for the three methods.

Figure 3
Visualization of prediction results on the web server
A. Prediction result page of the NetGO 3.0 website. "GO DAG", "Bar plot", and "Bubble plot" are the new interfaces to visualize the predicted GO terms. We also added a new column named "Depth" to show the depth of GO terms in GO analysis. B. The predicted GO terms and their DAGs. C. Bar plot showing the predicted GO terms and their confidence scores. D. Bubble plot showing the predicted GO terms and their depth in GO analysis. DAG, directed acyclic graph.

Figure 4
DAG of GO terms associated with Q9FGZ9 in BP
Each GO term is tagged with the methods that predicted it correctly.

Table 1
Performance comparison of NetGO 3.0 with its components and competing methods on the test set
Note: Naïve, BLAST-KNN, LR-3mer, LR-InterPro, Net-KNN, Seq-RNN, and LR-Text are component methods from NetGO 2.0. LR-ESM is a new component method which replaces Seq-RNN in NetGO 3.0. The underlined numbers indicate the best performance among component methods. The bold numbers indicate the best performance among competing methods. Fmax, the maximum F1-score; AUPRC, area under the precision-recall curve; Smin, the minimum semantic distance; MF, molecular function; BP, biological process; CC, cellular component; LR, logistic regression; KNN, k-nearest neighbors; BLAST, Basic Local Alignment Search Tool; RNN, recurrent neural networks; ESM, Evolutionary Scale Modeling; GO, Gene Ontology.

Table 2
Performance comparison of NetGO 3.0 and NetGO 2.0 as well as their component methods for protein function prediction in humans
Note: BLAST-KNN, LR-InterPro, Net-KNN, and Seq-RNN are component methods from NetGO 2.0. LR-ESM is a new component method which replaces Seq-RNN in NetGO 3.0. The underlined numbers indicate the best performance among component methods. The bold numbers indicate the best performance among competing methods.

Table 3
Performance comparison of NetGO 3.0 and NetGO 2.0 as well as their component methods for protein function prediction in mice
Note: BLAST-KNN, LR-InterPro, Net-KNN, and Seq-RNN are component methods from NetGO 2.0. LR-ESM is a new component method which replaces Seq-RNN in NetGO 3.0. The underlined numbers indicate the best performance among component methods. The bold numbers indicate the best performance among competing methods.

Table 4
Performance comparison over groups categorized by the number of annotations per GO term
Note: BLAST-KNN, LR-InterPro, Net-KNN, and Seq-RNN are component methods from NetGO 2.0. LR-ESM is a new component method which replaces Seq-RNN in NetGO 3.0. The underlined numbers indicate the best performance among component methods. The bold numbers indicate the best performance among competing methods.

Table 5
Performance on difficult proteins
Note: BLAST-KNN, LR-InterPro, Net-KNN, and Seq-RNN are component methods from NetGO 2.0. LR-ESM is a new component method which replaces Seq-RNN in NetGO 3.0. The underlined numbers indicate the best performance among component methods. The bold numbers indicate the best performance among competing methods.

Table 6
Performance comparison on proteins with sequence length longer than 1000 amino acids
Note: BLAST-KNN, LR-InterPro, Net-KNN, and Seq-RNN are component methods from NetGO 2.0. LR-ESM is a new component method which replaces Seq-RNN in NetGO 3.0. The underlined numbers indicate the best performance among component methods. The bold numbers indicate the best performance among competing methods.