TAXyl: An in-silico method for predicting the thermal activity for xylanases from GH10 and GH11 families

Xylanases are a class of enzymes with numerous industrial applications and are involved in the degradation of xylose polysaccharide, which is present in lignocellulosic biomass. The optimum temperature of enzymes is the indicator of their thermal activity and is an essential factor to be considered when choosing an appropriate biocatalyst for a particular purpose. Therefore, in-silico prediction of this enzymatic attribute is a significant cost and time-effective step in the effort to identify and characterize novel enzymes. The objective of this study was to develop an accurate computational method to predict the thermal activity status of xylanases from glycoside hydrolases families 10 and 11, the most prevalent known xylanase families. Here we present TAXyl (Thermal Activity Prediction for Xylanase), a new sequence-based machine learning method that has been trained using a selected combination of various physicochemical protein features. This ensemble of four supervised learning algorithms discriminates mesophilic, thermophilic, and hyper-thermophilic xylanases based on their optimum temperature with the process of soft-voting. TAXyl’s performance was ultimately evaluated through multiple iterations of six-fold cross-validations, and it exhibited a mean accuracy of ~0.94, F1-score of ~0.91, and MCC of ~0.9. Additionally, the model was tested on previously unseen data and depicted relatively similar performance. To the best of our knowledge, this tool is the most accurate and practical prediction tool currently available and operating on this class of enzymes. TAXyl is freely accessible as a web-service at http://arimees.com/ and provides users with several features to facilitate the characterization of GH10 and GH11 xylanases.

One of the essential attributes of enzymes is their optimum temperature, in which they exhibit their maximum relative activity. An enzyme's optimum temperature the indicator of its thermal activity (3) .
On the one hand, high temperature increases substrates' solubility and bioavailability, accelerates molecular dynamics, and decreases the probability of microbial contamination significantly (4,5). On the other hand, numerous biological processes are carried out at mild or cold temperatures (6) .
Therefore, various research studies have been designed to predict the enzymes' optimum temperature in order to introduce novel appropriate biocatalysts for specific processes.
New enzymes are being discovered at an increased pace, thanks to the advances in sequencing technologies. Unlike culture-dependent methods that are unable to cultivate up to 99% of microorganisms, culture-independent methods such as metagenomics enable the extended exploration of the natural diversity within an environmental sample (7) . Metagenomics is the direct analysis of genetic material found within an environmental sample (8,9) . This field of study is a comparatively new culture-independent method to analyze microbial communities, their functional genes, and phylogenetic properties. Therefore, metagenomics provides access to almost all the genetic material inside an environmental sample. However, experimental functional annotation of newly identified genes is becoming a significant challenge (10,11) . Utilizing an in-silico approach to address this problem can make a notable contribution. Consequently, the availability of metagenomic data is rapidly 4 increasing, hence employing computational methods instead of wet-lab experiments in order to identify new enzymes with specific properties can effectively reduce the costs and make the process much faster.
The primary structure of a protein is one of the most important factors affecting an enzyme's thermal activity, and there is a strong correlation between this functional property and its sequence.
Many studies have used sequence similarity-based methods in order to predict different properties of proteins. However, there are many cases where sequence similarity does not directly correlate with the functional resemblance, as proteins with a nearly similar primary structure can sometimes depict unique properties or functional similarity can be observed among proteins with different amino acid sequences (12). Machine learning approaches have been successfully applied to predict various properties of proteins such as tertiary structure (13) , (14) , function (15) , localization (16) , thermal stability (17) , etc. (18), (19) . These computational methods are capable of learning more complex relationships between the primary structure of proteins and their different properties (20)(21)(22).
Numerous studies have presented in-silico methods for predicting enzymatic attributes.
Discrimination between thermophilic and mesophilic proteins by using machine learning methods was the focus of a study by Gromiha and Suresh (23) . Similarly, Tang et al. used support vector machines (SVM) to develop a two-step method for discriminating thermophilic proteins (24) and amino acid compositions have been the basis of a statistical method for a similar task (25 (27) . AcalPred is another similar study that utilizes SVM to classify acidic and alkaline enzymes based on their primary structure (28) . In another research, Ariaeenejad et al. applied a regression model based on a pseudo amino acid composition (PAAC) (29) to predict the optimum temperature and pH of xylanase in strains of Bacillus subtilis enzymes (30). Genetic Algorithm-Artificial Neural Network 5 (GA-ANN) have been employed for the optimization of xylanase production for industrial purposes (31,32) . In order to find features with the highest correlation with the thermostability of proteins, Kmean algorithm clustering method and a decision tree have been employed (33). Panja et al. found that the prevalence of smaller non-polar amino-acids, more hydrophobicity, and salt-bridges are some shared characteristics among most thermophilic proteins (34) . Moreover, feature selection methods such as recursive feature elimination (RFE) have been previously used by some studies in order to choose best sequence-extracted features. As an instance, Kumar et al. employed RFE with SVM to classify the enzymes' function into different classes and subclasses (35).
The increasing use of new high throughput technologies can rapidly produce enormous amounts of data, while the processes of getting access to the protein's tertiary and quaternary structures are much slower. Therefore, an accurate sequence-based approach with acceptable agility is in demand. The objective of this study was to design and implement a multi-step method for the classification of the thermal activity of xylanases from glycoside hydrolases families 10 and 11 based on their optimum temperature. Since most of the available data in the literature belong to GH10 and GH11 families, we focused on the members of these two protein families. To the best of our knowledge, this is the first time that a combination of different protein descriptors is calculated, selected, and used to train multiple machine-learning algorithms to make predictions on xylanase optimum temperature by an ensemble voting method. We presented TAXyl (Thermal Activity prediction for xylanase), a prediction web-server for the thermal activity of xylanases. The performance of TAXyl was evaluated using multiple cross-validation tests and also on previously unseen data.

6
A new dataset from GH families 10 and 11, which constitutes a considerable ratio of known xylanases, had to be collected. Even though this makes the scope narrower, it can help the final estimation be more accurate and reliable. The National Center for Biotechnology Information (NCBI) database was explored by searching for thermoactive or thermostable xylanases, and 254 results were found. After removing the records without the exact optimum temperature report, the remaining sequences were divided into two groups of families 10 and 11. Afterward, the BRENDA database was explored for xylanases with reported optimum temperature, and newly collected data were added to the previous dataset. The Uniprot website was finally searched with the same strategy, and new samples were collected.
Redundant or highly similar samples were removed using the CD-Hit tool, which clusters highlyhomologous sequences, with a 0.9 cut-off (36) . In some protein sequences, a few amino acids are unknown. These residues are represented by the character "X." Since the existence of such noise could potentially interfere with the learning process and feature extraction tools are designed for 20 amino acid residues, all unknown amino acids were removed from the sequences.
The final dataset consisted of 145 different xylanases from GH families 10 and 11 with optimum temperature ranging from 25°C to 95°C. These samples were labeled accordingly into three different categories: 1) "Non-thermoactive" with optimum temperature below 50°C; 2) "thermoactive" with the optimum temperature between 50°C and 75°C, and 3) "hyper-thermoactive" with the optimum temperature above 75°C. Fig. 1 shows the proportion of each thermal class and GH family in the dataset.

Feature extraction
Using an appropriate set of features is undoubtedly one of the most crucial elements in creating an efficient classifier. Because there is not much precise evidence regarding the most related features to thermal activity, in this study, various protein descriptors were computed using the PyDPI python package (37).
The PyDPI computed protein features are 15 descriptor types, which are from six main groups.
Amino acid composition (AAC), dipeptide composition (2AAC) and tripeptide composition (3AAC), represent their fraction in the protein sequence. Unlike previous feature groups, pseudo amino acid composition (PAAC) and amphiphilic pseudo amino acid composition (APAAC), try to evade missing the sequence-order information and combine that with the composition data (38), (39). Conjoint triad 8 features (CTF) cluster twenty amino acids into several classes based on their dipoles and the number of side chains (40) . CTF considers the properties of an amino acid and its neighboring ones while regarding any amino acid triads as a unit. Other features are calculated by taking structural and physiochemical properties into account. Three autocorrelation descriptors, including normalized Moreau−Broto, Geary, and Moran autocorrelations, attempt to describe the amount of correlation among peptide or protein sequences. Composition, transition, distribution descriptors (CTD), sequenceorder coupling number (SOCN), and quasi-sequence-order (QSO) delineate the distribution pattern of amino acids along a protein sequence in terms of structural and physicochemical attributes.
As a result, a feature vector with 10,074 descriptors was calculated for each protein sequence.
Many of these features were duplicates and were removed afterward. Due to the different ranges of the descriptors, all raw values had to be rescaled into the same range. This transformation was done using MinMaxScaler.

Feature selection
Different protein descriptors were used separately to train the same models and depicted different prediction performances. These individually evaluated features were then combined and submitted to steps of feature selection which resulted in a significant improvement in prediction performmance. All features do not contribute equally to the final prediction. Thus, a process of feature selection is necessary in order to find the most relevant descriptors for optimum temperature classification and to remove ones of less importance.
Filter feature selection methods utilize a statistical measure to rank features based on their score.
For this step, an f-test was applied as the filter method, and the best features were chosen using the SelectKBest method for the next selection step. Recursive feature elimination (RFE) trains a machine learning model using input features and ranks them based on their contribution to the prediction. The RFE method was executed with logistic regression, a classification algorithm, and most relevant features were selected. As a result, the final feature vector was prepared with 512 dimensions.

Model selection and training
A plethora of classification algorithms are currently available. Each algorithm is built based on different theories. In order to find the best methods for our problem, various classification algorithms were tested, including multilayer perceptron (MLP), decision tree, Gaussian Naïve Bayes, Gaussian process, AdaBoost, KnearestNeighbors and support vector machine (SVM), all of which were applied from the sci-kit learn python package (41) Four of these supervised learning algorithms with the best accuracies were chosen. These four classifiers were MLP, SVM, Gaussian process classifier, and Gaussian Naïve Bayes.
An MLP with four hidden layers was constructed to address this classification. SVM classifiers form hyperplanes that categorize inputs into different classes. The hyperplane can be constructed using various kernel functions, including linear, radial basis, and polynomial. Our support-vector machine used a radial basis function (RBF) kernel to predict the protein's thermal activity. Naïve Bayes is based on Bayes theorem and can be applied for this classification problem. This algorithm is relatively fast since it only calculates the probability of each class and the probability of each class given different sets of inputs. Gaussian Naïve Bayes is a popular supervised learning algorithm for dealing with continuous data, and this method was used for this problem (42). Gaussian process classifier is another classification algorithm that was employed with the RBF kernel (43). Afterward, an ensemble classifier was implemented to decide the final output based on soft voting among the four mentioned classifiers (44) . In the process of soft voting, each model returns an array representing the probabilities of each class, and the ensemble classifier decides the final answer based on the weighted average of class probabilities.
The pipeline mentioned above required several parameter tuning steps, all of which were done using the GridSearchCV method. Grid searching is the process of testing the model with various hyperparameters and finding the optimum configuration. The sci-kit learn python package was used several times during this study (41)

Evaluation criteria
Since this is a multi-class classification problem, accuracy, macro-recall, macro-precision, the macro-f1 score, and the Matthews correlation coefficient were used as evaluation metrics. These metrics were calculated using the following formulas: For the evaluation step, samples were randomly split into two subsamples (85% as the training set and 15% as the test set) five times, and each time 20 iterations of the six-fold cross-validation were done on the training set. Subsequently, the models were tested on the unseen subsample (test set). This 11 means that our classifiers were evaluated through 100 iterations of six-fold cross-validations and were tested on unseen data five times to assure that the models are robust. In the six-fold cross-validation step, the training set was randomly split into six equal subsamples, five of which were used as training sets, and one subsample was then used for testing the models with different evaluation metrics. This was done six times, leaving out one subsample each time for testing. Figure 2 is the schematic diagram of the steps of development and evaluation of the current prediction model.

Feature importance analysis and feature selection
Generated protein descriptors encapsulate various molecular and sequential information with different degrees of relevance to the enzyme's thermal activity. In order to obtain a better perception of the relationship between different descriptors and the prediction performance, the same model was 12 trained using different sets of features. A summary of generated features, their feature groups, and their dimensions are presented in Table 1. The average of metrics for different models were calculated and then compared to combined features that went through feature Selection steps. Amino acid composition, Pseudo amino acid composition, dipeptide composition, amphiphilic pseudo amino acid composition, and Quasi sequence order were ranked in descending order by their performances when implemented individually. This implies their order in terms of importance and relevance to thermal activity. A complete representation of feature importance analysis is presented in Table 2 and Fig. 3. Our results showed that a selected combination of different descriptors augments the prediction performance noticeably.
13  In the process of filter feature selection, both chi-2 test and f-test were both used as filter methods, and the f-test showed a slightly better result based on final evaluations. For the filter feature selection, we used the SelectKBest method, and the value of K was chosen to be 3500 based on the better prediction performance after multiple tests. Afterward, 3500 selected features were used in the feature selection step with the RFE method. In the RFE step, logistic regression was employed, and 512 most contributing features were chosen as the final feature vector. Again, the dimension of 512 was opted due to better performance.

Model performance
As it is shown in Table 3, our four models depicted reasonably high performance, achieving acceptable scores at different evaluation metrics. Although all four models mostly depicted agreement on correct predictions, in case of mistake, outputs were diverse. This diversity is because of the different algorithms that each model implements to make the prediction. Therefore, an ensemble method such as Voting Classifier could be a promising synergistic approach to get a higher overall performance since it enables us to exploit multiple learning algorithms for a single prediction. Soft voting refers to the process in which the final output is determined based on the computed probabilities of all models for each class. Since our four models had different capabilities, voting was executed by assigning uneven weights to each model. Reported performance metrics were computed through 100time six-fold cross-validation tests. In each cross-validation (CV) iteration, the dataset was shuffled with different random seeds before splitting. The ensemble method demonstrated a slight improvement, out-performing the most accurate individual model, which was MLP. Table 3 demonstrates the performance of different individual classification methods in comparison to the ensemble voting classifier.

Testing the model on unseen test data
As described, our final model was chosen to be the ensemble classifier due to the achievement of better classification scores. The thermal activity prediction for xylanases (TAXyl) was tested using unseen test data (15% of the initial dataset, which was reserved at the beginning). This step was executed five times, and each time, the data was shuffled entirely. Table 4 shows the comparison of TAXyl evaluation during cross-validation and holdout method tests and Fig. 4 illustrates the TAXyl's performance through multiple cross-validation tests.

Online Server
TAXyl is freely available at http://arimees.com. This web service is capable of getting inputs in the forms of the FASTA file, amino acid sequence, or protein entry of xylanases from GH families 10 and 11 and returns their probable thermal activity status. TAXyl also enables users to download and export the selected features for their inserted protein sequences in CSV format for other machine learning applications. This web service is also accessible from the CBB lab website (http://cbb.ut.ac.ir), under databases and tools sub-menu.

Discussion and conclusion
Xylanases are carbohydrate-active enzymes responsible for the hydrolysis of xylan polysaccharide into xylose. This group of biocatalysts have multiple industrial applications and therefore have high commercial value (1) . Determining the optimum temperature of activity plays an essential role in choosing the appropriate enzymes for a specific task, making this enzymatic property substantially important. Since the advent of next-generation sequencing and its accelerating improvements, getting access to metagenomic data is becoming increasingly easier and more affordable, and the only possible way to analyze these enormous amounts of data is by fast and accurate computational methods.
In this study, we presented a novel method based on an ensemble of machine learning algorithms to predict the optimum temperature of xylanases activity from GH families 10 and 11. Because of the limited number of xylanases with a reported optimum temperature in the literature, we explored different learning methods to find ones with an acceptable interpretation of the data. TAXyl uses sequence-based and length-independent protein descriptors to train four different supervised machine learning algorithms and employs a voting classifier to integrate all individual models to obtain a more accurate prediction. As demonstrated, the ensemble method which benefits from the synergistic combination of various information sources can slightly improve the performance by taking several learning methods into account for a single task of decision making. Furthermore, the voting classifier exhibited less variance in its metrics in comparison to other methods, which stems from its greater flexibility and robustness.
In comparison to similar previous studies, TAXyl out-performs the model developed by Gromiha et al. on the prediction of GH10 and GH11 xylanases by providing the CV accuracy of 94% over 89% (23) . Similarly, our model's performance was higher than the statistical method which was based on amino acid compositions (25) . Although TAXyl and the two-step discrimination model of Tang et al.   19 had a relatively similar accuracy (24) , our model, unlike non of the above, extends the classification ability to the third class which are the hyper-thermoactive enzymes.
Our results indicated the competence of computational methods to address common problems in bioinformatics and the capability of sequence-based and length-independent protein descriptors for training supervised learning algorithms to effectively predict general enzymatic features (26) . It is clear that various protein attributes are related to their structural and functional properties. With a proper multi-step method, it is possible to predict them with minimal cost in time and expenses. We observed that amino acid composition, pseudo amino acid composition and sequence order descriptors are among the most relevant protein features for thermal activity prediction (33) Moreover, our findings indicate that the feature selection steps amended the performance considerably by enabling us to exploit the best descriptors from different feature groups and pruning the trivial ones. As presented in Table 3, all individual models were capable of reasonably accurate classifications. When testing the classifiers separately, MLP and SVM demonstrated a better performance than Gaussian Naïve Bayes and Gaussian process classifier. However, the ensemble classifier depicted a better performance by combining all four classifiers' discrimination ability.
Advancements in sequencing technologies and metagenomics have revolutionized our access to a wealth of sequence data from enzymes with possible industrial applications. In comparison with our previous two studies, in which xylanase enzymes from the metagenomic source were identified and characterized by experimental techniques (10,11) , TAXyl enabled the identification of more putative thermoactive xylanases and enhanced the extended exploration of the metagenomic data as well as validating our two previously characterized enzymes. This tool can be implemented to significantly reduce the number of potential candidates of xylanases with a specific thermal activity profile before engaging the wet-lab experiments. Another potential usage for such tools is in facilitating the 20 engineering of enzymes through directed evolution to obtain biocatalysts with higher thermal stability targeted at particular industrial purposes. In case of sufficient data availability, a possible direction for future works would undoubtedly be developing similar tools to predict the structural, functional, and thermodynamic properties of other enzyme families. The TAXyl web-service is available and provides users with a reasonably accurate approximation of any GH10 or GH11 xylanase thermal activity status.

Supporting Information
Dataset which was used for this study. (.xlsx)