iNP_ESM: Neuropeptide Identification Based on Evolutionary Scale Modeling and Unified Representation Embedding Features

Neuropeptides are biomolecules with crucial physiological functions, and their accurate identification is essential for understanding nervous system regulatory mechanisms. However, traditional analysis methods are expensive and laborious, and the development of effective machine learning models remains an active area of research. In this work, we constructed an SVM-based machine learning neuropeptide predictor, iNP_ESM, by integrating, for the first time, the protein language models Evolutionary Scale Modeling (ESM) and Unified Representation (UniRep). Our model employed feature fusion and feature selection strategies to improve prediction accuracy during optimization. In addition, we validated the effectiveness of the optimization strategy with UMAP (Uniform Manifold Approximation and Projection) visualization. iNP_ESM outperforms existing models on a variety of machine learning evaluation metrics, with an accuracy of up to 0.937 in cross-validation and 0.928 in independent testing, demonstrating superior neuropeptide recognition capability. We anticipate improved neuropeptide data in the future, and we believe the iNP_ESM model will find broader applications in the research and clinical treatment of neurological diseases.


Introduction
Neuropeptides, molecular entities resembling proteins yet smaller in scale, function within the nervous system as either neurotransmitters or neuromodulators. Synthesized and secreted by neurons, these molecules can modulate the functioning of other neurons and various target cells [1–3]. These substances play a critical role in numerous physiological processes, including governing the flexibility and responsiveness of neural circuits [4], influencing emotional states such as mood, anxiety, and depression [2], orchestrating tissue repair and regeneration [5], and contributing to the pathogenesis of certain neurological conditions such as autism spectrum disorder [6].
Neuropeptides are distinguished by significant structural heterogeneity, spanning a wide array of molecular sizes and variants. Furthermore, the concentration of neuropeptides in vivo is exceedingly low, typically within the picomolar to nanomolar range, presenting technical challenges for their detection and quantification [7,8]. Investigating neuropeptides requires advanced analytical methodologies, such as mass spectrometry, nuclear magnetic resonance (NMR) spectroscopy, and immunohistochemistry. These methods necessitate substantial technical expertise and can be both resource-intensive and time-consuming [7–9]. Consequently, the development of robust, interpretable computational frameworks for neuropeptide prediction remains a focal point of ongoing research.
With the advent of neuropeptide repositories, data-centric computational approaches have entered the domain of neuropeptide prediction. In 2019, a machine learning-based neuropeptide prediction model called NeuroPIpred, designed to predict insect neuropeptides, was first reported by Piyush Agrawal et al. [10]. Wang et al. compiled a database called NeuroPep [11], which was followed by the development of many advanced neuropeptide machine learning algorithms. In 2020, Bin et al. developed a dual-layered stacked model called PredNeuroP [12] by combining nine feature descriptors with five different machine learning algorithms for feature optimization. In 2021, Hasan designed a predictor, NeuroPred-FRL [13], in which a feature representation learning approach based on a two-step feature selection method chooses from 11 different encodings and 6 classifiers. Jiang also developed a stacked model, NeuroPpred-Fuse [14], integrating various features and selection methods. In 2023, Wang et al. first utilized the latest dataset, NeuroPep 2.0 [15], adopting a protein language model to extract features and a convolutional neural network to strengthen local features, developing the NeuroPred-PLM model [16]. Despite the promising outcomes of these machine learning models, they are invariably affected by a set of challenges, including, but not limited to, a limited scope of neuropeptide prediction, the incorporation of redundant features, and a notable deficit in model interpretability [10,13,14,16–18].
Owing to advancements in natural language processing (NLP) and deep learning (DL), protein language models (PLMs) have rapidly emerged; these treat individual amino acids and their polymers as tokens, like words in a language [19]. Such models derive fixed-size vectors, called embeddings, to represent a given protein. From a theoretical standpoint, protein language models offer distinct advantages over traditional bioinformatics approaches [20]. Firstly, they obviate the laborious steps of querying, alignment, and training by employing a single forward pass, significantly expediting feature extraction. Secondly, they leverage the vast diversity of protein corpora, comprising billions of sequences, to foster generalizability and optimize performance in downstream predictive tasks [21–27]. By adopting pre-training methods from natural language processing, these models can capture the properties and behaviors of proteins without relying on explicit structural data, achieving remarkable results in protein structure prediction, functional annotation, and the prediction of protein-ligand interactions [28–36].
In the current study, we developed a new machine learning methodology, iNP_ESM, for the identification of neuropeptides. Feature extraction was performed with the protein language models Evolutionary Scale Modeling (ESM) and Unified Representation (UniRep) [21,25], and classification with the Support Vector Machine (SVM) algorithm. We subsequently optimized the model through feature fusion and selection techniques, and the optimized model was named iNP_ESM. Compared to the existing state-of-the-art NeuroPred-PLM, iNP_ESM demonstrated significant improvements in 10-fold cross-validation metrics, achieving an accuracy (ACC) of up to 0.937, a Matthews correlation coefficient (MCC) of up to 0.873, and an F1 score of up to 0.934. In independent testing, iNP_ESM achieved an ACC of up to 0.928, an MCC of up to 0.856, and an F1 score of up to 0.928. We hope our work will address these challenges and advance the development of machine learning models for neuropeptide prediction.

Results
The modeling process is illustrated in Figure 1. Initially, we input the amino acid sequences of neuropeptides into the pre-trained protein language models, obtaining a 1280-dimensional feature from ESM and a 1900-dimensional feature from UniRep. Subsequently, these features undergo fusion and selection before being input into an SVM machine learning model. Finally, after a comparative analysis of performance metrics, an optimized model is developed, designated iNP_ESM.


Identifying the Optimal Baseline Models
To identify the optimal feature encoding method and machine learning algorithm, we combined six different features extracted from neuropeptide sequences with seven different machine learning algorithms, resulting in 42 baseline models. The scores of 10-fold cross-validation for each encoding under the different algorithms are presented in Figure 2. It is evident that, except for the slightly lower score of UniRep with GNB, the ESM and UniRep features consistently outperformed the other features across the various machine learning algorithms. Table 1 compares the accuracy of these baseline models based on 10-fold cross-validation. Within each machine learning algorithm, the top-ranking feature was always ESM or UniRep. Specifically, UniRep achieved the highest scores with KNN and RF, at 0.895 and 0.908, respectively, while ESM achieved the highest scores with SVM, GNB, LDA, LGBM, and LR, at 0.764, 0.906, 0.920, 0.916, and 0.924, respectively. Notably, for both ESM and UniRep, the algorithms that most consistently achieved the best performance were LGBM and SVM. This not only highlights ESM and UniRep as the best encoding methods for extracting neuropeptide features, but also indicates that LGBM and SVM are the most suitable machine learning algorithms for these features. Table S1 supplements this by comparing the independent test accuracy of the baseline models. Here, the UniRep feature achieved the highest scores with KNN and RF, at 0.885 and 0.896, while the ESM feature attained the highest scores across the other six algorithms: 0.777, 0.898, 0.914, 0.902, 0.892, and 0.918. These results demonstrate the superiority of ESM and UniRep in extracting neuropeptide features. Furthermore, we performed parameter optimization for the LGBM and SVM algorithms, with the results shown in Tables S2 and S3.

Table 1 notes: LDA parameters: {'default'}; LGBM parameters: {'n_estimators': 500}; LR parameters: {'penalty': 'l2', 'C': 10, 'max_iter': 1000, 'solver': 'sag'}; RF parameters: {'n_estimators': 500}; SVM parameters: {'kernel': 'rbf', 'gamma': 'auto', 'C': 1, 'probability': True}. The best value in each column is underlined and in bold.
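All 42 baseline comparisons above rest on 10-fold cross-validation. As a minimal, stdlib-only illustration of how such folds can be constructed (a sketch, not the authors' exact implementation):

```python
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Shuffle sample indices and partition them into k roughly equal folds.
    Each fold serves once as the validation set while the rest train."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

# Example: 100 samples split into 10 folds of 10
splits = list(k_fold_indices(100, k=10))
```

Each model is then fit on the 90% training portion and scored on the held-out 10%, and the ten scores are averaged.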

Feature Fusion Optimization
In general, integrating multiple features can capture more comprehensive information from a single dataset, thereby enhancing the robustness of the model. Based on the above-stated baseline models, we fused the features obtained from the 1280D ESM and the 1900D UniRep into a new 3180D feature. We then applied the LGBM and SVM algorithms to this new feature, creating the new model ESM+UniRep_F3180. The results indicated that the models achieved better performance metrics under both algorithms. Figure 3 compares the cross-validation and independent test average scores of the ESM+UniRep_F3180 models with the optimized single-feature models. Tables S2 and S3 provide supplementary details of the 10-fold cross-validation and independent test scores for the ESM+UniRep_F3180 and single-feature models. Comparing the average values, the LGBM model using the fused feature showed improvements in ACC by 0.9% to 1.2%, MCC by 1.8% to 2.4%, Sn by 0.2% to 0.9%, Sp by 1.5% to 1.6%, Pre by 1.4%, F1 by 0.8% to 1.2%, and AUC by 0.3% to 0.6%. For the SVM model with the fused feature, ACC improved by 0.5% to 0.7%, MCC by 1.0% to 1.4%, Sn by 0.9% to 1.4%, and AUC by 0.4%, while Sp did not improve, Pre increased by 0.1%, and F1 by 0.5% to 0.8%. These results demonstrate that the strategy of fusing ESM and UniRep effectively enhances the model's ability to learn neuropeptide features. Additionally, the ESM+UniRep_F3180 model performed better with the SVM algorithm than with LGBM: except for Sn, the average values for ACC, MCC, Sp, Pre, and F1 were all higher. This suggests that SVM is the more suitable machine learning algorithm for the ESM+UniRep feature. Therefore, we selected SVM for constructing models in subsequent feature calculations.
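The fusion step amounts to concatenating the two per-sequence embeddings. A minimal sketch (the vectors here are placeholders, not real embeddings):

```python
def fuse_features(esm_vec, unirep_vec):
    """Concatenate per-sequence embeddings:
    1280-d ESM + 1900-d UniRep -> 3180-d fused feature."""
    return list(esm_vec) + list(unirep_vec)

esm_vec = [0.1] * 1280      # placeholder for a 1280-d ESM embedding
unirep_vec = [0.2] * 1900   # placeholder for a 1900-d UniRep embedding
fused = fuse_features(esm_vec, unirep_vec)
```

The fused vectors are then fed to the classifier exactly as the single-feature vectors were.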

Feature Selection Optimization and Visualization Analysis
Due to the large dimensionality of ESM+UniRep_F3180, we applied feature selection strategies, successfully improving the model's cross-validation metrics. Initially, we input the ESM+UniRep_F3180 features into the LGBM algorithm, leveraging its built-in function to compute and rank the importance of each feature. Subsequently, we conducted a stepwise feature selection with a step size of 5, selecting the top 5, 10, 15, 20, and so on, up to 300 features, and modeled each subset using SVM. SVM was chosen because, as previously observed, it performed better than LGBM with the fused feature. The changes in ACC and MCC scores during cross-validation and independent testing at each optimization step are shown in Figure 4. As the number of features increased, the average ACC and average MCC rose quickly. When the number of features reached 120, the average ACC and average MCC peaked at 0.929 and 0.859, respectively; beyond this point, both metrics showed a declining trend. Therefore, we selected the top 120 features to construct a new model, named ESM+UniRep_F120, using SVM. Table 2 compares the metrics of the SVM-based models after feature fusion and selection. Compared to the single-feature models, the ESM+UniRep_F120 model showed improved cross-validation metrics (ACC increased by 0.5–0.8% and MCC by 0.9–1.7%), validating the effectiveness of our optimization strategy. Additionally, comparing the two fused-feature models, we found that ESM+UniRep_F120 achieved higher cross-validation scores (ACC and MCC increased by 0.3% and 0.6%, respectively), while its independent test scores decreased (ACC and MCC by 0.6% and 1.1%, respectively). We suppose this is because the reduction of redundant features enhanced learning on the training set and improved model robustness, but the loss of a significant number of features slightly reduced adaptability to entirely new data.
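The stepwise search described above can be sketched as follows. Here `score_fn` is a hypothetical stand-in for training an SVM and averaging the cross-validation and independent-test metrics, and the toy score in the example is shaped only to mimic the peak at 120 features:

```python
def stepwise_select(ranked_features, score_fn, max_k=300, step=5):
    """Evaluate a model on the top-k importance-ranked features for
    k = step, 2*step, ..., max_k and return the best-scoring k."""
    best_k, best_score = None, float("-inf")
    for k in range(step, max_k + 1, step):
        score = score_fn(ranked_features[:k])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy score that peaks when exactly 120 features are kept
best_k, _ = stepwise_select(list(range(3180)), lambda feats: -abs(len(feats) - 120))
```

In the real pipeline `ranked_features` is the LGBM importance ranking and the scoring step refits the SVM at each k.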


Discussion
The development of the iNP_ESM model has significantly contributed to the field of machine learning-based neuropeptide identification. By integrating the ESM and UniRep protein language models with an SVM algorithm, we achieved superior performance metrics compared to existing models. Our results demonstrated marked improvements in evaluation metrics in both cross-validation and independent testing. The success of iNP_ESM can be attributed to several key factors. Firstly, the fusion of ESM and UniRep features provided a comprehensive representation of neuropeptide sequences, capturing a broader range of biological properties and interactions. This fusion enhanced the model's ability to distinguish neuropeptides from non-neuropeptides more accurately. Secondly, our feature selection process allowed us to optimize the model by retaining the most informative features while reducing redundancy, particularly when reducing the feature set to 120 dimensions. Our study also highlighted the importance of visualization in model optimization. The use of UMAP for dimensionality reduction and visualization confirmed the superior clustering of neuropeptide features in the fused models, further enhancing interpretability, a common challenge in machine learning models.
Compared to NeuroPred-PLM, which utilizes only ESM representations, our model integrates a more diverse set of features (adding UniRep) while employing a reduced dimensionality (120D). Unlike complex stacked models such as NeuroPred-FRL, we implemented a simpler feature fusion and selection strategy, streamlining the modeling process while achieving superior results. However, the use of reduced features in our model can decrease accuracy in independent tests (Table 3). Achieving a balance between feature reduction and predictive efficiency remains an area for further exploration.
Despite these advancements, several challenges remain. The reliance on high-quality, comprehensive datasets is paramount for training robust models [37–41]. While the NeuroPep 2.0 dataset significantly improved our training process, the availability of more extensive and diverse neuropeptide data would further enhance model accuracy and generalizability. Additionally, there is still a need for more transparent methodologies that can provide deeper insights into the biological significance of the predictive features. Future work should focus on expanding the dataset, incorporating new neuropeptide sequences, and exploring other advanced protein language models to further refine feature extraction.

Dataset
Gathering datasets is a crucial step in machine learning, and we utilized the latest NeuroPep 2.0 dataset [15]. NeuroPep 2.0 contains 11,417 unique neuropeptide records, nearly double that of its initial version. Neuropeptides included in the database must be verified through scientific literature or database annotations. These records are derived from 544 articles and 3688 precursor proteins from the UniProtKB database, covering 924 species and distributed across 81 neuropeptide families. In terms of data processing, our methodology aligns with that of contemporary leading-edge algorithms such as NeuroPred-PLM. Initially, we excluded neuropeptides that do not conform to the specified length criterion of 5 to 100 amino acids. Subsequently, to avoid model overfitting, the remaining dataset was processed with CD-HIT at a 0.9 threshold to remove sequences with more than 90% similarity to other sequences [42]. Following these filtration steps, a total of 4463 neuropeptides were retained. For the independent test, all selected neuropeptides were drawn from the additions made in the NeuroPep 2.0 dataset, with 444 neuropeptides, constituting 10% of the total, chosen at random for this subset. Owing to the absence of experimentally verified negative instances, sequences whose length distribution is similar to that of the positive examples were sourced from UniProt [43], resulting in 4463 negative examples. From this collection, 444 sequences were randomly allocated to the non-neuropeptide control group for the test.
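The length filter and the random test split described above can be sketched as follows (stdlib only; CD-HIT redundancy removal is an external tool and is omitted, and the function names and seed are illustrative):

```python
import random

def filter_by_length(seqs, min_len=5, max_len=100):
    """Keep peptides whose length falls within the stated 5-100 residue window."""
    return [s for s in seqs if min_len <= len(s) <= max_len]

def split_test(seqs, frac=0.10, seed=2020):
    """Randomly set aside roughly `frac` of the sequences as an independent
    test set; the remainder forms the training set."""
    pool = list(seqs)
    random.Random(seed).shuffle(pool)
    n_test = round(len(pool) * frac)
    return pool[n_test:], pool[:n_test]  # (train, test)

# Toy peptides: too short, in range, at the upper bound, too long
kept = filter_by_length(["ACDE", "ACDEF", "M" * 100, "M" * 101])
train, test_set = split_test([f"P{i}" for i in range(100)])
```

The same 10% split logic would also be applied to the UniProt-derived negatives.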

Feature Extraction Methods
In this paper, six encoding methods were utilized to represent neuropeptides for comparison: ESM, UniRep, TAPE_BERT [23], LM, BiLSTM, and Soft Symmetric Alignment embedding (SSA) [44]. The training was conducted on a 16 GB NVIDIA GPU.

ESM
The training of ESM involved unsupervised learning on billions of protein sequences in UniRef50, enabling the model to capture a wide range of biological properties [25,45]. The model produces 1280-dimensional output vectors. ESM makes use of a masked language modeling objective, in which a fraction of the amino acids in a protein sequence are randomly masked and the model is forced to predict them. This task can be represented by the formula

L_MLM = − Σ_{i ∈ M} log p(x_i | x_{\M}),

where x_i represents a masked amino acid, x_{\M} denotes the sequence excluding the masked positions, and M is the set of positions of the masked amino acids.
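The masking step underlying this objective can be illustrated as follows (a hypothetical sketch: the mask fraction, mask token, and seed are placeholders, not ESM's actual training settings):

```python
import random

def mask_sequence(seq, mask_frac=0.15, mask_token="#", seed=0):
    """Randomly mask a fraction of residues (the set M); during training the
    model must recover the original residues at these positions from the
    unmasked remainder of the sequence."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_frac))
    positions = set(rng.sample(range(len(seq)), n_mask))
    masked = "".join(mask_token if i in positions else aa for i, aa in enumerate(seq))
    return masked, sorted(positions)

masked_seq, positions = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
```

The loss above is then the negative log-likelihood the model assigns to the true residues at the returned positions.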
A key component of ESM is the use of the Transformer architecture. The Transformer model is designed as a stack of encoder layers, each comprising a self-attention mechanism and a feedforward neural network. This architecture was chosen because it can structurally capture long-range dependencies and complex relational patterns in the data, which is crucial for understanding the intricate interactions that govern protein function and structure [46]. The self-attention mechanism, which is central to the Transformer model, enables the model to capture complex relationships between various positions and can be simplified by the following formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where Q, K, and V represent the query, key, and value matrices, and d_k is the dimension of the key vectors. This mechanism allows the model to capture global dependencies at every position in the sequence. The ESM model parameters used in this experiment can be obtained from the referenced literature [25].
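The scaled dot-product attention formula can be written out directly. The following is a plain-Python sketch for a single head, without batching or masking:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V,
    written with plain lists of vectors for clarity."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        row = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(row)
    return output

# One 2-d query attending over two key/value positions
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```

Since the query matches the first key more strongly, the output leans toward the first value; a zero query yields the uniform average of the values.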

UniRep
UniRep was trained on millions of unlabeled protein sequences from UniRef50 [21]. The architecture of the UniRep model is the mLSTM recurrent neural network. One of the aims of UniRep training is to increase the prediction accuracy of the next amino acid, leading to decreased cross-entropy loss. This predictive task compels the UniRep model to incorporate representations of "cell state" or "hidden state". For any specific protein sequence, the UniRep model's output comprehensively captures various dimensions of the sequence's structural and functional attributes, thereby providing a holistic characterization of the protein. The core of the mLSTM network is defined by the following equations [29]:

m_t = (X_t W_xm) ⊗ (h_{t−1} W_hm)
f_t = σ(X_t W_xf + m_t W_mf)
i_t = σ(X_t W_xi + m_t W_mi)
o_t = σ(X_t W_xo + m_t W_mo)
C̃_t = tanh(X_t W_xc + m_t W_mc)
C_t = f_t ⊗ C_{t−1} + i_t ⊗ C̃_t
h_t = o_t ⊗ tanh(C_t)

Here, ⊗ represents element-wise multiplication. m_t is the current multiplicative interim state, X_t is the input at the current time step, and the W terms are weight matrices. h_{t−1} is the hidden state from the last time step, and σ is the sigmoid activation function. f_t is the output of the forget gate, which governs how much of the previous memory is retained; i_t is the activation value of the input gate; C_t is the cell state at the current time step, updated from the prior cell state C_{t−1} and the new candidate cell state C̃_t; o_t is the activation of the output gate; and h_t is the current hidden state. The UniRep model parameters can be obtained from the referenced literature [21].
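A toy scalar version of the mLSTM update described above can make the data flow concrete. Every weight matrix is collapsed to a single illustrative scalar `w`, so this shows only the structure of the recurrence, not UniRep's actual parameterization:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlstm_step(x_t, h_prev, c_prev, w=0.5):
    """One scalar step of the mLSTM recurrence (all weights collapsed to w)."""
    m_t = (w * x_t) * (w * h_prev)         # multiplicative interim state
    f_t = sigmoid(w * x_t + w * m_t)       # forget gate: prior memory retention
    i_t = sigmoid(w * x_t + w * m_t)       # input gate
    o_t = sigmoid(w * x_t + w * m_t)       # output gate
    c_cand = math.tanh(w * x_t + w * m_t)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_cand      # updated cell state
    h_t = o_t * math.tanh(c_t)             # new hidden state
    return h_t, c_t

h, c = mlstm_step(x_t=1.0, h_prev=0.5, c_prev=0.2)
```

Running this step over a whole sequence and reading off the final hidden/cell states is, in spirit, how UniRep's fixed-size representation is produced.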

TAPE_BERT
To evaluate the performance of various extraction methodologies, features were also encoded using TAPE_BERT, SSA, LM, and BiLSTM. TAPE (Tasks Assessing Protein Embeddings) is a benchmark testing framework [23]. TAPE utilizes the Pfam database [47] and employs architectures such as Transformer, LSTM, and ResNet to perform large-scale pre-training on protein sequences through self-supervised learning. Experimental results demonstrate that this self-supervised pre-training significantly enhances model performance in tasks such as protein structure prediction, remote homology detection, fluorescence prediction, and stability prediction. TAPE_BERT is based on the BERT architecture, and the formulas describing its principles are the same as those used to describe the Transformer in the introduction of ESM above. The code is available in the paper [23].

SSA, LM, and BiLSTM
SSA (Soft Symmetric Alignment) is a framework for learning protein sequence embeddings [44]. SSA is trained on the Pfam database [47] and uses BiLSTM encoders together with a soft symmetric alignment mechanism to learn structural information from protein sequences. Experiments show that the soft symmetric alignment mechanism significantly improves the accuracy of structure similarity predictions and performs well in tasks such as transmembrane region prediction. The similarity score is calculated as follows [44]:

ŝ = − (1/A) Σ_i Σ_j a_ij ‖z_i − z′_j‖_1,

where z_i and z′_j are the embedding vectors of the two protein sequences, a_ij is an element of the alignment matrix representing the alignment weight between positions i and j of the sequences, and A = Σ_i Σ_j a_ij is the total length of the alignment. Additionally, we extracted the LM and BiLSTM features from the architecture of the SSA embedding for comparison. These codes are available in the paper [44].
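The soft symmetric alignment score can be sketched in plain Python. The combination rule for a_ij below follows our reading of Bepler and Berger [44] (row-wise and column-wise softmaxes over negative L1 distances, combined symmetrically), and the embeddings are toy vectors:

```python
import math

def ssa_similarity(Z1, Z2):
    """Soft symmetric alignment score between two embedded sequences (lists
    of equal-width vectors). Higher (closer to 0) means more similar."""
    n, m = len(Z1), len(Z2)
    # Pairwise L1 distances d_ij = ||z_i - z'_j||_1
    d = [[sum(abs(a - b) for a, b in zip(zi, zj)) for zj in Z2] for zi in Z1]

    def softmax_neg(xs):  # softmax of the negated distances
        lo = min(xs)
        e = [math.exp(lo - x) for x in xs]
        s = sum(e)
        return [v / s for v in e]

    alpha = [softmax_neg(row) for row in d]                              # over j
    beta = [softmax_neg([d[i][j] for i in range(n)]) for j in range(m)]  # over i
    a = [[alpha[i][j] + beta[j][i] - alpha[i][j] * beta[j][i]
          for j in range(m)] for i in range(n)]
    A = sum(sum(row) for row in a)   # total alignment length
    return -sum(a[i][j] * d[i][j] for i in range(n) for j in range(m)) / A

score_same = ssa_similarity([[1.0, 0.0]], [[1.0, 0.0]])  # identical positions
score_diff = ssa_similarity([[1.0, 0.0]], [[5.0, 5.0]])
```

Identical single-position sequences score 0 (perfect alignment), while divergent embeddings score increasingly negative.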

Feature Selection Methods
Feature selection is crucial for effectively constructing learning models through dimensionality reduction, removal of irrelevant or redundant attributes, reduction in training time, and augmentation of model performance and interpretability [56–59]. Typical feature selection methods include Lasso and Ridge regression [60], Random Forests [61], and Gradient Boosting Machines [62]. In previous studies on polypeptide sequences conducted in our laboratory [29], we tested various feature selection methods, including LGBM, ANOVA, and Chi-Squared. Ultimately, we chose the LGBM model for its efficiency, simplicity, and ability to provide clear ranking results [63]. LGBM provides built-in functions to compute split importance: during the construction of its decision trees, LGBM records the feature used for each node split and calculates an importance score for each feature by summing the number of times that feature is selected as a splitting node. We then sort the features by importance in descending order and choose a threshold to filter the most important ones.
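Split importance is, at heart, a count of how often each feature is chosen as a split node across the ensemble. A toy illustration on hand-written trees (the nested-dict tree representation is ours for illustration, not LightGBM's internal format):

```python
from collections import Counter

def split_importance(trees):
    """Count how often each feature index is used as a split across an
    ensemble of trees. Trees are toy nested dicts of the form
    {'feat': i, 'left': subtree_or_None, 'right': subtree_or_None}."""
    counts = Counter()

    def walk(node):
        if node is None:  # leaf
            return
        counts[node["feat"]] += 1
        walk(node.get("left"))
        walk(node.get("right"))

    for tree in trees:
        walk(tree)
    return counts

trees = [
    {"feat": 0, "left": {"feat": 2, "left": None, "right": None}, "right": None},
    {"feat": 2, "left": None, "right": {"feat": 1, "left": None, "right": None}},
]
ranking = [f for f, _ in split_importance(trees).most_common()]
```

Sorting the counts in descending order yields exactly the importance ranking used for the stepwise selection in the Results.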

Feature Visualization Methods
Data dimensionality reduction is a common preprocessing technique that reduces the number of features in the data while keeping most of the vital information intact [64]. In machine learning, dimensionality reduction is frequently coupled with data visualization to facilitate the understanding and interpretation of data and results. Notable dimensionality reduction and visualization techniques include PCA (Principal Component Analysis) [65], t-SNE (t-distributed Stochastic Neighbor Embedding) [66], and UMAP (Uniform Manifold Approximation and Projection) [67,68]. UMAP builds a neighborhood graph of the data in the high-dimensional space and then optimizes a low-dimensional layout that reproduces the structure of this graph. Compared to traditional methods, UMAP handles nonlinear data distributions more effectively, offers greater scalability, and runs faster. In this work, we utilized UMAP for dimensionality reduction and visualization.

Figure 1 .
Figure 1. An overview of the iNP_ESM model. Initially, neuropeptide sequences are input into the protein language models ESM and UniRep, generating 1280D ESM features and 1900D UniRep features for each sequence. Subsequently, these features are combined to form a 3180D fused feature. This fused feature can be directly input into an SVM model. Alternatively, after dimensionality reduction through feature selection to 120 dimensions, the reduced feature can also be input into the SVM model. Following a series of optimizations and performance comparisons, the iNP_ESM model is finalized.


Figure 2 .
Figure 2. Comparison of 10-fold cross-validation metrics for the combination of six feature encoding methods and seven machine learning algorithms. Here, UniRep is represented in dark green, ESM in light green, SSA in light yellow, LM in dark yellow, BiLSTM in dark red, and TAPE_BERT in light red. The machine learning algorithms include (A) GNB, (B) KNN, (C) LDA, (D) LGBM, (E) LR, (F) RF, and (G) SVM.


Figure 4 .
Figure 4. The variation in (A) accuracy and (B) Matthews correlation coefficient with the number of features during the feature selection process for ESM+UniRep_F3180. Here, 10-fold cross-validation metrics are represented in green, independent test metrics in red, and the average of cross-validation and independent test metrics in yellow. LGBM Classifier parameters: {'num_leaves': 32, 'n_estimators': 888, 'max_depth': 12, 'learning_rate': 0.16, 'min_child_samples': 50, 'random_state': 2020, 'n_jobs': 8}.

Additionally, we used UMAP to perform dimensionality reduction and visualization on the training sets of the different features. Comparing Figure 5A,B with Figure 5C,D confirms that the clustering effect of the fused features is superior to that of the individual ESM and UniRep features. Comparing Figure 5C,D demonstrates that the 120D fused feature has neuropeptide recognition capability comparable to the 3180D fused feature but with significantly fewer features. These findings corroborate our previous data analysis. We ultimately named the optimized model ESM+UniRep_F3180 as iNP_ESM_F3180 and the model ESM+UniRep_F120 as iNP_ESM_F120.

Table 1 .
Comparison of 10-fold cross-validation accuracy for six encoding methods across seven machine learning algorithms.

Table 2 .
Comparison of 10-fold cross-validation and independent test metrics for SVM-based models after feature fusion and feature selection.

Table 3 .
Comparison of independent test metrics between iNP_ESM and existing methods.
1 The best value in each column is underlined and in bold.