ECRECer: Enzyme Commission Number Recommendation and Benchmarking based on Multiagent Dual-core Learning

Enzyme Commission (EC) numbers, which associate a protein sequence with the biochemical reactions it catalyzes, are essential for the accurate understanding of enzyme functions and cellular metabolism. Many ab initio computational approaches have been proposed to predict EC numbers for given input sequences directly. However, the prediction performance (accuracy, recall, precision), usability, and efficiency of existing methods still leave much room for improvement. Here, we report ECRECer, a cloud platform for accurately predicting EC numbers based on novel deep learning techniques. To build ECRECer, we evaluated different protein representation methods and adopted a protein language model for protein sequence embedding. After embedding, we propose a multi-agent hierarchical deep learning framework to learn the proposed tasks in a multi-task manner. Specifically, we used an extreme multi-label classifier to perform the EC prediction and employed a greedy strategy to integrate and fine-tune the final model. Comparative analyses against four representative methods demonstrate that ECRECer delivers the highest performance, improving accuracy and F1 score by 70% and 20% over the state-of-the-art, respectively. With ECRECer, we can annotate numerous enzymes in the Swiss-Prot database that have incomplete EC numbers to their full fourth level. Taking the UniProt protein "A0A0U5GJ41" (1.14.-.-) as an example, ECRECer annotated it with "1.14.11.38", which is supported by further protein structure analysis based on AlphaFold2. Finally, we established a webserver (https://ecrecer.biodesign.ac.cn) and provide an offline bundle to improve usability.


Introduction
With the widespread adoption of high-throughput methods and high-quality infrastructure in biotechnology and bioindustry, the speed of new protein discovery has increased dramatically. However, this has not been followed by a concomitant increase in the speed of protein annotation. For example, 5,241,146 sequences were added to TrEMBL in the UniProt database [1] in the single month of March 2021, while only 521 sequences were reviewed and added to Swiss-Prot in the same period (see Supplemental, SI Appendix FIGURES, Fig. S2). Such a slow speed of protein annotation considerably restricts related research and industrial applications.

Zhenkun Shi et al., Enzyme Commission Number Recommendation and Benchmarking
Among the multiple and complex protein annotation tasks, one of the crucial steps is enzyme function annotation [2,3]. Annotations of enzyme function provide critical starting points for generating and testing biological hypotheses [3]. Current functional annotations of enzymes describe the biochemistry or process by assigning an Enzyme Commission (EC) number. This is a four-part code associated with a recommended name for the corresponding enzyme-catalyzed reaction that describes the enzyme class, the chemical bond acted on, the reaction, and the substrates [4]. Thus, the primary task of enzyme annotation is to assign an EC number to a given protein sequence. However, as the uncertainty of the assignments for uncharacterized protein sequences is high and biochemical data are relatively sparse, both the speed and the quality of enzyme annotation are considerably restricted.
To achieve improved, rapid and intelligent functional annotation, computational methods were introduced to assign or predict EC numbers. The simplest and most commonly used method is multiple sequence alignment (MSA) [5], which can yield an appropriate annotation by using similar sequences. Based on this approach, researchers have developed most major EC databases and profile-based methods for the functional annotation of enzymes [6,7,8,9]. However, these methods cannot perform annotations for novel proteins with no similar sequences, which is generally the case for newly discovered enzymes. To overcome this restriction, researchers introduced machine learning methods, such as SVMs [10], KNN [11], and hidden Markov models [12], for the functional annotation of enzymes. Although these methods can predict EC numbers even if the given protein sequences have no similar references, the prediction speed and precision are not ideal. Since deep learning has delivered powerful results in many areas [13,14,15,16], more researchers are trying to use deep learning methods to predict EC numbers and significantly improve the precision of functional annotation. However, deep learning methods are prone to overfitting due to the unbalanced distribution of training datasets. In EC number prediction, this leads to prediction results with high precision, medium recall, and low accuracy.
Overall, there has been a steady improvement in computational methods for enzyme annotation [9,7,17,2], but several obstacles still exist that have slowed the progress of computational enzyme function annotation. One direct challenge is the lack of publicly available benchmark datasets to evaluate existing and newly proposed models, which makes it troublesome for end-users to choose the best method for their production scenario. Another notable challenge is the lack of an efficient and universal protein sequence embedding method. Thus, researchers have to spend large amounts of time on handcrafted feature engineering to encode the sequence, such as functional domain encoding [18] and position-specific scoring matrix encoding [19], as encoding quality dramatically impacts the performance of downstream applications [20]. The third challenge is the lack of an explicitly designed method to deal with this extreme multi-label classification problem (more than 5,000 EC numbers in UniProt). Thus, obtaining reliable EC number prediction results is not straightforward, and the prediction performance is not ideal. The fourth noteworthy challenge is the usability of existing tools, which needs refinement so that end-users can use them smoothly even with no coding experience.
In this paper, we take a unified approach to address these challenges. For the first challenge, we constructed three standard datasets for benchmarking and evaluation. The datasets contain more than 470,000 distinct labeled protein sequences from Swiss-Prot. To address the second challenge, we introduced cutting-edge ideas from natural language embedding for protein sequence representation. First, state-of-the-art deep learning methods were evaluated and adopted for universal protein sequence embedding [21,22]. Then, we used a feedback mechanism to choose the most suitable method in response to the downstream tasks for optimization. To address the third challenge, we proposed a Dual-core Multiagent Learning Framework (DMLF) for EC number prediction. In DMLF, we formulate EC number prediction as a three-step hierarchical extreme multi-label classification problem. The first step predicts whether a given protein sequence is an enzyme or not. The second predicts how many functions the enzyme can perform, i.e., multifunctional enzyme prediction. The last step predicts the exact EC number for each enzyme function. We use traditional machine learning methods in the first two steps and a novel deep learning-based extreme multi-label classifier in the last step, then use a greedy strategy to integrate these steps to maximize the EC prediction performance. To address the last challenge, we streamlined the construction process and open-sourced our code. Moreover, we published a webserver, so that anyone can annotate EC numbers smoothly in high throughput, whether they have coding experience or not.

Methodology
This section consists of five subsections. We first formulate the enzyme function annotation problem in the first subsection, and then describe the benchmark data construction process in the second subsection. The third subsection describes our proposed DMLF framework for the benchmark tasks. The fourth subsection describes the baselines. In the last subsection, we describe the evaluation metrics.

Problem Formulation
In order to annotate the enzyme function of a new protein sequence, the initial and basic task is to determine whether the given protein is an enzyme. Since there are numerous multifunctional enzymes, the next task to consider is to determine whether the enzyme is monofunctional or multifunctional. If it is multifunctional, the number of functions needs to be classified. After completing the above two tasks, it is necessary to assign an EC number to each function. Based on these considerations, we proposed three basic tasks for the functional annotation of enzymes, as shown below.

Enzyme or Non-enzyme Annotation.
The enzyme or non-enzyme annotation task is formulated as a binary classification problem (Eq. 1): f1 : X → {0, 1}, where X = {x1, x2, ⋯, xn}, n ≥ 1, represents a group of protein sequences, and {0, 1} is the label indicating whether a given protein is an enzyme.

Multifunctional Enzyme Annotation.
Multifunctional enzyme annotation is formulated as a multi-classification problem (Eq. 2): f2 : X → {1, 2, ⋯, k}, where k represents the maximum number of EC numbers for a given protein.

Enzyme Commission Number Assignment.
The enzyme commission number assignment task is also formulated as a multi-classification problem as defined in Eq. 3.

Dataset Description
To address the first challenge, we constructed three standard datasets (Supplemental, SI Appendix Materials and Methods, A. Dataset). Similar to previous work [21,26], these datasets are extracted from the Swiss-Prot database. To simulate real application scenarios as closely as possible, we did not shuffle data randomly. Instead, after data preprocessing (see Supplemental, SI Appendix Materials and Methods A. Preprocessing), we organized the data in chronological order. Specifically, we used a snapshot from Feb 2018 as the training dataset. The training data contains 469,134 distinct sequences in a total of 556,825 records, among which 52.56% are non-enzymes, while the remaining 47.44% are enzymes. The testing data was extracted from the June 2020 snapshot, and sequences that appeared in the training set were excluded. The details are listed in Table 1.
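The chronological split described above can be sketched as follows. This is a minimal illustration with made-up records and an assumed field layout, not the actual Swiss-Prot parsing pipeline:

```python
from datetime import date

# Hypothetical records: (sequence, first_public_date, label).
# The field layout and dates are illustrative only.
records = [
    ("MKT...A", date(2017, 5, 1), "1.1.1.1"),
    ("MGS...L", date(2018, 1, 10), "non-enzyme"),
    ("MKT...A", date(2019, 3, 2), "1.1.1.1"),   # same sequence, later snapshot
    ("MAL...K", date(2020, 4, 20), "2.7.1.1"),
]

SPLIT = date(2018, 2, 28)  # training snapshot cutoff (Feb 2018)

train = [r for r in records if r[1] <= SPLIT]
train_seqs = {r[0] for r in train}
# Test set: later records whose sequences never appeared in training,
# mirroring the exclusion of training sequences from the June 2020 snapshot.
test = [r for r in records if r[1] > SPLIT and r[0] not in train_seqs]
```

Splitting by snapshot date rather than by random shuffling avoids information leaking from "future" annotations into the training set.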

∎ Dataset 1: Enzyme and Non-enzyme Dataset
As listed in Table 2, the training set has 469,134 records in total, 222,567 of which are enzymes and 246,567 non-enzymes. The testing set contains 7,101 records, 3,304 of which are enzymes, and the other 3,797 are non-enzymes. To make the data more inclusive, we did not filter any sequence in terms of length or homology, which differs from previous studies. An enzyme is labeled as 1 and a non-enzyme as 0. More details about the dataset can be found in the Supplemental, SI Appendix Materials and Methods, A. Dataset.

∎ Dataset 2: Multifunctional Enzyme Dataset
The multifunctional enzyme dataset only contains enzyme data (225,871 records). The number of EC categories ranges from 1 to 8. The details of the dataset are listed in Table 3.

Proposed Framework
To address the second and third challenges, the lack of an efficient universal protein sequence embedding method and of a generic method with high EC prediction performance, we proposed the DMLF approach, composed of an embedding core and a learning core. These two cores operate relatively independently. The embedding core is responsible for embedding protein sequences into a machine-readable matrix. The learning core is responsible for solving specific downstream biological tasks (e.g., enzyme and non-enzyme prediction, multifunctional enzyme prediction, and EC number prediction). The overall scheme of DMLF is illustrated in Fig. 2.

∎ Core 1: Embedding
The objective of this core is to calculate the embedding representations for protein sequences. For protein sequence encoding/embedding, recent studies have shown the superior performance of deep learning-based methods compared to traditional methods [23,24]. Accordingly, we included one-hot encoding only to illustrate the difference between these two kinds of embedding in this study. Here, we adopted three different embedding methods to calculate sequence embedding patterns that adequately represent protein sequences. The first is the commonly used one-hot encoding [25]. The second is Unirep [21], an mLSTM "babbler" deep representation learner for proteins; we used the last layer for protein representation. The third is the evolutionary scale modeling (ESM) embedding method [22], a pretrained transformer language model for protein representation; we used the hidden states from the 1st, 32nd, and 33rd layers as protein embeddings.
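As a minimal illustration of the first encoder, a one-hot sketch over the 20 standard amino acids (the Unirep and ESM embeddings come from pretrained models and are not reproduced here; the padding length and the all-zero handling of non-standard residues are our assumptions, not the paper's exact settings):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str, max_len: int = 1000) -> np.ndarray:
    """Encode a protein sequence as a (max_len, 20) one-hot matrix.

    Sequences are truncated/zero-padded to max_len; non-standard
    residues (e.g. X, B) are left as all-zero rows.
    """
    mat = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            mat[pos, idx] = 1.0
    return mat

emb = one_hot("MKTAYIAKQR")
```

In contrast to this sparse positional encoding, the language-model embeddings produce dense per-residue vectors that can be mean-pooled into a fixed-size sequence representation.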

∎ Core 2: Learning
The learning core is specialized to perform specific biological tasks using different agents. In this work, the learning core includes three agents. Agent-1 is a binary classifier that performs enzyme or non-enzyme prediction; it was constructed using KNN [26]. Agent-2 is a multi-classifier that predicts the number of putative functions for a given enzyme; it was implemented using an integrated sequence aligner, a gradient boosting decision tree, and XGBoost. Agent-3 is also a multi-classifier that performs the EC number prediction task. As EC number prediction is an extreme multi-label classification problem (5,852 classes in this benchmark), the performance of traditional multi-label classification methods such as XGBoost, decision trees, and SVMs is abysmal (less than 5% in terms of accuracy). Therefore, we trained a scalable linear extreme classifier (SLICE) [27] to obtain a more reliable classification performance in this study. The details of agent implementation and parameter settings can be found in Supplemental, SI Appendix Materials and Methods C. Models.

∎ Integration, fine-tuning and output
As illustrated in Fig. 2, the final EC number prediction output is an integrated process. As shown in Eq. 4, we formulated this integration as an optimization problem whose objective is to maximize EC number prediction performance in terms of the F1 score, where ag1, ag2, and ag3 are the respective prediction results from Agent-1, Agent-2, and Agent-3, while sa is the predicted result from multiple sequence alignment. We used a greedy strategy to perform this optimization.
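The greedy integration step can be sketched as follows. This is an illustrative stand-in for the optimization in Eq. 4, not the exact published procedure: prediction sources (agents, alignment) are merged one at a time as long as validation F1 keeps improving.

```python
def f1(pred, gold):
    """Micro-F1 over {protein_id: set_of_EC_numbers} dictionaries."""
    tp = sum(len(pred.get(k, set()) & v) for k, v in gold.items())
    p_tot = sum(len(v) for v in pred.values())
    g_tot = sum(len(v) for v in gold.values())
    if p_tot == 0 or g_tot == 0:
        return 0.0
    prec, rec = tp / p_tot, tp / g_tot
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def greedy_integrate(candidates, gold):
    """Greedily merge per-source predictions while validation F1 improves.

    `candidates` maps a source name (e.g. 'agent3', 'alignment') to its
    {protein_id: set_of_ECs} predictions on a validation set.
    """
    chosen, merged, best = [], {}, 0.0
    remaining = dict(candidates)
    while remaining:
        scored = []
        for name, preds in remaining.items():
            # Score the merge of this source into the current ensemble.
            trial = {k: set(v) for k, v in merged.items()}
            for k, v in preds.items():
                trial[k] = trial.get(k, set()) | v
            scored.append((f1(trial, gold), name))
        score, name = max(scored, key=lambda t: t[0])
        if score <= best:
            break  # no source improves F1 any further
        for k, v in remaining.pop(name).items():
            merged[k] = merged.get(k, set()) | v
        chosen.append(name)
        best = score
    return chosen, merged
```

A source that only adds wrong labels lowers precision without raising recall, so the greedy loop leaves it out of the final ensemble.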

Compared Baselines
To evaluate our proposed method comprehensively, we compared it with four existing state-of-the-art techniques with 'GOOD' usability (see Supplemental, SI RELATED WORK): CatFam, PRIAM (version 2), ECPred, and DeepEC.

Evaluation Metrics
To comprehensively evaluate the proposed method and existing baselines, we use 5 metrics for binary classification problems and 4 metrics for multiple classification problems.

Figure 2: DMLF is an explicitly designed dual-core driven framework for EC number prediction. It consists of 2 independent operation units, an embedding core and a learning core. The embedding core is tasked with converting protein sequences into features. The learning core is designed to address the specific biological tasks defined in the problem formulation section. We use different agents to solve different tasks: agent 1 was designed to solve the enzyme or non-enzyme classification task, agent 2 the multifunctional enzyme prediction task, and agent 3 the EC number assignment task.

For the binary classification task, the evaluation criteria include ACC (accuracy), PPV (positive predictive value, precision), NPV (negative predictive value), RC (recall), and the F1 value:

ACC = (TP + TN) / (TP + FP + TN + FN + UP + UN)
PPV = TP / (TP + FP)
NPV = TN / (TN + FN)
RC = TP / (TP + FN + UP)
F1 = 2 · PPV · RC / (PPV + RC)

where TP is the true positive count, i.e., the number of samples correctly identified as positive; FP is the false positive count, i.e., samples wrongly identified as positive; TN is the true negative count, i.e., samples correctly identified as negative; FN is the false negative count, i.e., samples wrongly identified as negative; UP is the number of unclassified positive samples; and UN is the number of unclassified negative samples.
For multiple classification problems, the evaluation criteria included mACC (macro-average accuracy), mPR (macro-average precision), mRecall (macro-average recall), and mF1 (macro-average F1 value):

mACC = (1/N) Σi ACCi,  mPR = (1/N) Σi PPVi,  mRecall = (1/N) Σi Recalli,  mF1 = (1/N) Σi F1i

where N represents the total number of classes, while ACCi, PPVi, and Recalli represent the accuracy, precision, and recall of the i-th class in a one-vs-all mode [28], respectively.
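A pure-Python sketch of the metrics above. Treating the UP/UN terms as counting against accuracy and recall but not precision is our reading of the definitions, not a statement of the paper's exact implementation:

```python
def binary_metrics(tp, fp, tn, fn, up=0, un=0):
    """ACC, PPV, NPV, RC, F1 from confusion counts; unclassified samples
    (up, un) count against accuracy and recall but not precision."""
    total = tp + fp + tn + fn + up + un
    acc = (tp + tn) / total
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    rc = tp / (tp + fn + up) if tp + fn + up else 0.0
    f1 = 2 * ppv * rc / (ppv + rc) if ppv + rc else 0.0
    return acc, ppv, npv, rc, f1

def macro_average(per_class):
    """Macro-average a list of per-class (acc, ppv, npv, rc, f1) tuples
    computed in one-vs-all mode."""
    n = len(per_class)
    return tuple(sum(m[i] for m in per_class) / n for i in range(5))
```

Macro-averaging weights every class equally, which is why rare EC classes drag the macro scores of precision-oriented baselines down so sharply.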

Embedding Core Performance Evaluation
We evaluated five different protein embedding methods, one-hot embedding, Unirep embedding, and ESM embedding with three different layers (0, 32, 33), on our three proposed tasks. We used six machine learning baselines, including K-nearest neighbor (KNN), logistic regression (LR), XGBoost, decision tree (DT), random forest (RF), and gradient boosting decision tree (GBDT), to conduct this evaluation. ESM-32 exhibited the best overall performance among all six baselines regarding all evaluation metrics (see Supplemental Tables S12 and S13). As shown in Fig. 3, in task 1, ESM-32 achieved improvements of 21.67% and 6.03% over one-hot and Unirep in terms of accuracy, and of 27.20% and 7.32% in terms of F1, respectively (see Supplemental Table S11). This experiment suggests that better embedding leads to better learning performance, and that deep latent representations can comprehensively represent the protein sequence. The embedding performance of ESM-32 was better than that of ESM-33, suggesting that a deeper embedding layer is not always better. DMLF can automatically choose the best embedding method based on the downstream tasks, and ESM-32 exhibited the best performance in this work.

Task 1: Enzyme or Non-enzyme Prediction
In this work, the workflow of EC number assignment is: task 1, determine whether the given protein sequence is an enzyme → task 2, if the given protein is an enzyme, predict how many enzyme functions it can perform → task 3, assign an EC number to each enzyme function. According to this workflow, the first benchmarking task is enzyme or non-enzyme prediction. In this task, we trained an integrated binary classification model driven by KNN and sequence alignment. KNN was implemented using scikit-learn, and the alignment was implemented using diamond v2.0.11. As shown in Fig. 4, our method achieves scores of 93.12, 95.25, and 88.99% in terms of accuracy, precision, and recall, respectively (see Supplemental Table S8). Compared with other state-of-the-art tools and techniques, the overall accuracy was greatly improved: for example, DeepEC yielded 74.10%, compared with 93.12% using our algorithm. Many previous methods were designed to obtain high precision while neglecting accuracy, NPV, and recall. For example, DeepEC reaches 94.68% precision while its recall is only 20.83%. Methods that only offer high precision are very likely to miss many new functions; the F1 score might therefore be a better evaluation metric for the EC assignment of real-world proteins.
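The KNN part of this agent can be sketched with a bare-bones numpy implementation (the actual pipeline uses scikit-learn's KNN on ESM embeddings plus DIAMOND for the alignment component; the toy 2-D embeddings below are made up for illustration):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training embeddings
    (Euclidean distance); a minimal stand-in for the scikit-learn
    KNN classifier used by Agent-1."""
    d = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(d)[:k]
    votes = train_y[nearest]
    return int(np.bincount(votes).argmax())

# Toy 2-D "embeddings": label 1 = enzyme, label 0 = non-enzyme.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
label = knn_predict(X, y, np.array([0.95, 1.0]))
```

In the full model, a query with no close neighbor in embedding space can instead fall back to the sequence-alignment branch of the integrated classifier.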

Task 2: Multifunctional Enzyme Prediction
The second benchmarking task we addressed is multifunctional enzyme prediction. The underlying prediction engine is agent 2 (see Fig. 2). In this task, we trained an integrated multiple-classification model driven by sequence alignment and XGBoost.
As shown in Fig. 5, our method was superior to the existing baselines (see Supplemental Table S9). For example, the accuracy and recall of DeepEC were 8.52% and 13.6%, respectively. Moreover, the macro F1 of ECPred, DeepEC, CatFam, and PRIAM-V2 was less than 6%, lower than the 10% accuracy of a random prediction. Hence, the existing methods are notably insufficient when dealing with multifunctional enzyme prediction. The low performance is mainly due to a lack of multifunctional enzyme data (see Table 3). Although our proposed method is 6.3 times better than random prediction in terms of accuracy, the performance is still insufficient and should be further improved in future work.

Task 3: Enzyme Commission Number Prediction
This task corresponds to agent-3 in DMLF. In order to develop a balanced EC number prediction algorithm with high accuracy combined with reasonable precision and recall, we trained an extreme multi-label classification model. Our method achieved 86.91% accuracy with 69% precision and 63.88% recall (Fig. 6), which means that if 100 protein sequences were uploaded for annotation, approximately 87 would be annotated correctly. PRIAM is mainly designed to include more sequences, so its recall is high (78.48%), while its accuracy (3.0%) and precision (20.80%) are very low. DeepEC, ECPred, and CatFam pursue high precision, so their accuracy is very low (less than 7.5%), which means that if we upload 100 protein sequences for annotation, we would obtain only about 7.5 correct annotations on average, while the remaining 92.5 are wrong. Our method thus shows a clear advantage in terms of EC number assignment.

Web Server Implementation
To make the workflow accessible for biologists around the world, we built a web application (https://ecrecer.biodesign.ac.cn/, Fig. 7). End-users can simply upload sequences to our platform, and then click the submit button to trigger the prediction workflow. In general, the whole workflow can be completed in a few seconds. We use Amazon DynamoDB to store job information, and users can track the previous submission records and corresponding status information. Once the analysis is finished, the user can view or download the corresponding results.
For the EC assignment workflow, we use Amazon ECR to store Docker images, which package a set of bioinformatics software, such as diamond, together with in-house Python scripts. We built a scalable, elastic, and easily maintainable batch engine using AWS Batch. This solution takes care of dynamically scaling our compute resources in response to the number of runnable jobs in our job queue. Finally, we used AWS Step Functions to coordinate the components of our application, process messages passed from AWS API Gateway, and invoke the workflows asynchronously. AWS API Gateway was used as the API server to handle HTTP requests and route traffic to the correct backends. The static website is hosted on AWS S3 and sped up using AWS CloudFront.

Case Study and Discussion
When dealing with the EC assignment problem in a daily production scenario, ECRECer offers two optional modes for end-users: a prediction mode and a recommendation mode. In prediction mode, we provide the results with the highest probability, while in recommendation mode, we deliver up to 20 possible EC number annotations ranked by their respective likelihood. Here, we present an up-to-date EC number prediction case to simulate the real-time challenge by conducting EC number assignments in the prediction mode. We collected testing protein sequences from June 2020 to November 2021, encompassing 1,968 records. These data were not employed in the development process of the existing methods or our proposed method, which is in line with the daily production scenario. In this evaluation, we compared our method with the state-of-the-art methods. Interestingly, after numerous literature reviews, we found that this protein has an active site based on PROSITE-ProRule annotation (PROSITE-ProRule: PRU10022), a beta-ketoacyl synthase active site (https://prosite.expasy.org/rule/PRU10022). Therefore, it is very likely that this protein indeed has beta-ketoacyl-acyl-carrier-protein synthase activity. Another example is the iron/alpha-ketoglutarate-dependent dioxygenase AusU (UniProt ID: A0A0U5GJ41). This protein has a two-level EC number in the database (1.14.-.-). When we used ECRECer for EC annotation, it assigned this protein the fourth-level EC 1.14.11.38. This protein was recently integrated into UniProtKB/Swiss-Prot (September 29, 2021). After blasting it against the UniProtKB database, we found that the top 5 reviewed proteins with the highest identities include three verruculogen synthases (Fig. 8a). Taking protein Q4WAW9 as an example, we found that both proteins belong to exactly the same protein family with the same domains (Fig. 8b).
To further validate the results, we compared the structures of A0A0U5GJ41 (AlphaFold2-predicted) and Q4WAW9 (AlphaFold2-predicted and crystal structure). The results showed that these two proteins have highly similar structures (see Supplemental Figs. S3-S6) with a small RMSD (1.104). Therefore, the protein could potentially be annotated as EC 1.14.11.38 as well.
In addition to EC number assignment, another advantage of ECRECer is the recommendation of EC numbers, which makes our tool unique. The recommendation is particularly helpful for the discovery of multifunctional enzymes. To demonstrate the inclusiveness and predictive ability of our proposed method, we conducted EC number prediction on an unreviewed protein family. Corynebacterium glutamicum, the famous industrial workhorse for amino acid production with a current output of over 6 million tons per year (Lee et al., 2016), is increasingly being adopted as a promising chassis for the biosynthesis of other compounds. However, unlike E. coli (1,652 protein sequences with EC numbers out of 4,322 proteins, 38.2%), the protein sequences of Corynebacterium glutamicum are not well annotated: out of 3,305 protein sequences, only 537 have been reviewed and included in the Swiss-Prot database (357 proteins have assigned EC numbers). We used the other 2,768 protein sequences to compare our tool with DeepEC. Our approach was able to assign EC numbers to 1,056 proteins, while DeepEC only assigned 157 (123 EC numbers were identical between DeepEC and ECRECer). Although there is no gold standard to decide which predictions are correct, we believe our algorithm provides a more reasonable prediction, as the resulting proportion of protein sequences with EC numbers is similar to that of E. coli (42% vs. 15.5% in the case of DeepEC). The newly predicted EC numbers for the protein sequences are crucial for further analysis, such as retrieving metabolic reactions for genome-scale modeling.

Conclusion
In this work, we proposed a novel dual-core multiagent learning framework to complete three benchmarking tasks: 1) enzyme or non-enzyme annotation; 2) predicting the number of catalytic functions of a single multifunctional enzyme; and 3) EC number prediction. The method developed in this work has two calculation cores, an embedding core and a learning core. The embedding core is responsible for selecting the best available embedding method among one-hot, Unirep, and ESM to calculate sequence embeddings. The learning core is responsible for completing the specific benchmarking tasks using the best calculated protein sequence embedding as input.
We were guided by two principles in the design of our methods. The first principle is high usability (the tool can be accessed via the web and provides a standalone suite for high-throughput prediction) with relatively balanced prediction performance (achieving the best accuracy with reasonable precision and recall). The second principle is providing comprehensive evaluation metrics with accessible reproduction datasets and source code. To implement the first principle, we proposed DMLF. To implement the second principle, we provided a web server and standalone packages, and opened all the source code, including data preprocessing, dataset buildup, model training, and model testing/evaluation. Experiments on real-world datasets and comprehensive comparisons with existing state-of-the-art methods demonstrated that our tool is highly competitive, has the best performance with high usability, and meets the proposed objectives. Although our tool exhibited the best performance, it still has much room for improvement. For example, the performance of multifunctional enzyme annotation is relatively low, while the accuracy and recall of EC number annotation are less than 90%. Our future work will focus on improving the prediction precision.

Key Points
• A multiagent dual-core learning framework is proposed to predict Enzyme Commission (EC) numbers from protein sequence data.
• A protein language model and an extreme multi-label classifier are adopted to reduce heavy hand-crafted feature engineering and elevate the prediction performance.
• The proposed framework remarkably outperforms the existing state-of-the-art method in terms of accuracy and F1 score, by 70% and 20%, respectively.
• An online service and an offline bundle are provided for end-users to annotate EC numbers in high throughput easily and efficiently.

Data availability
The data underlying this article are available in the article and in its online Supplementary Material. The code of ECRECer, the training data, and the prediction results are available at https://ecrecer.biodesign.ac.cn/.


A commonly used EC number prediction dataset is the EzyPred dataset from Shen and Zhou, published in 2007 [12]. The EzyPred dataset is a two-level EC number dataset that was extracted from the ENZYME database (released May 1, 2007), with a 40% sequence similarity cutoff. This dataset contains 9,832 two-level specified enzymes and 9,850 non-enzymes. The details of this dataset can be found in the published paper [12]. This dataset can only be used to predict two-level EC numbers, and its volume is unsuitable for machine learning. Accordingly, the majority of later studies used a similar approach to extract and construct datasets from Swiss-Prot [11,7]. The typical first step of constructing such a dataset is to obtain the latest reviewed protein data from Swiss-Prot and label the sequences as enzyme or non-enzyme utilizing the protein annotation. However, these principles of dataset construction were explicitly designed for the EC number prediction of monofunctional enzymes and are not suitable for multifunctional enzymes. Moreover, the construction of training and testing datasets using randomly mixed data is not in accordance with the facts and may lead to information leaks.

Beyond that, filtering sequences by length and homology may obscure patterns and other information, which will reduce the learning performance. Therefore, the steps of constructing the dataset in this work were more straightforward. For the multifunctional enzyme prediction dataset, to minimize distractions from non-enzymes and balance the dataset, we excluded the non-enzyme data (Table 3). The remaining enzyme data was labeled based on the number of functions (i.e., 1, 2, ..., 8). After a comprehensive evaluation using machine learning baselines (see Table 5), we adopted KNN as our first agent. The K-nearest neighbor (KNN) method has been widely used in data mining and machine learning applications due to its simple implementation and distinguished performance [14]. As shown in Table 6, when dealing with multifunctional enzyme prediction, the learning performance of KNN is not optimal. Thus, after a comprehensive evaluation, in agent 2 we chose XGBoost as our algorithm for multifunctional enzyme prediction. XGBoost is an implementation of gradient-boosted decision trees that has shown superior performance in many data science problems [3]. As illustrated in Fig. 1, the final EC number prediction output is an integrated process. As shown in SE. 1, we formulated this integrated process as an optimization problem, where ag1, ag2, and ag3 are the predicted results from Agent1, Agent2, and Agent3, respectively, while sa is the predicted result from multiple sequence alignment. The integration and fine-tuning process aims to maximize the optimizing objective, which in this work is the performance of EC number prediction in terms of the F1 score. We used a greedy strategy to finish this optimization.
train_data ← uniprot_sprot-only2018_02.tar.gz
test_data ← uniprot_sprot-only2020_06.tar.gz
extract protein records from the downloaded data ▷ prepare_task_dataset.ipynb # Step 4
  extract protein id, protein name, protein EC number, and protein sequence as seq
  format EC number and seq
  calculate protein attributes ▷ exact_ec_from_uniprot.py
preprocess protein records ▷ prepare_task_dataset.ipynb # Step 6
  drop duplicates by seq
  remove changed seq with same id
  format EC numbers in the standard four-level form: -.-.-.-
  trim EC number and seq strings
get ESM embedding ▷ prepare_task_dataset.ipynb # Step 6.6
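The "format EC number in standard four level" step above can be sketched as padding partial EC codes with "-" placeholders. This is an illustrative helper under our own assumptions, not the repository's actual script:

```python
def format_ec(ec: str) -> str:
    """Normalize an EC code to the standard four-level 'a.b.c.d' form,
    padding missing trailing levels with '-' (e.g. '1.14' -> '1.14.-.-')."""
    parts = [p.strip() for p in ec.strip().split(".") if p.strip()]
    parts += ["-"] * (4 - len(parts))
    return ".".join(parts[:4])
```

Normalizing every label to four levels keeps partially annotated entries (such as the 1.14.-.- example from the case study) comparable with fully specified EC numbers.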