NCSP-PLM: An ensemble learning framework for predicting non-classical secreted proteins based on protein language models and deep learning

Abstract: Non-classical secreted proteins (NCSPs) refer to a group of proteins that are located in the extracellular environment despite the absence of signal peptides and motifs. They usually play different roles in intercellular communication. Therefore, the accurate prediction of NCSPs is a critical step toward an in-depth understanding of their associated secretion mechanisms. Since the experimental recognition of NCSPs is often costly and time-consuming, computational methods are desired. In this study, we proposed an ensemble learning framework, termed NCSP-PLM, for the identification of NCSPs by extracting feature embeddings from pre-trained protein language models (PLMs) as input to several fine-tuned deep learning models. First, we compared the performance of nine PLM embeddings by training three neural networks, i.e., multi-layer perceptron (MLP), attention mechanism and bidirectional long short-term memory network (BiLSTM), and selected the best network model for each PLM embedding. Then, four models were excluded due to their below-average accuracies, and the remaining five models were integrated to perform the prediction of NCSPs by weighted voting. Finally, the 5-fold cross-validation and the independent test were conducted to evaluate the performance of NCSP-PLM on the benchmark datasets. On the same independent dataset, the sensitivity and specificity of NCSP-PLM were 91.18% and 97.06%, respectively. In particular, the overall accuracy of our model reached 94.12%, which was 7-16% higher than that of the existing state-of-the-art predictors. These results indicate that NCSP-PLM could serve as a useful tool for the annotation of NCSPs.


Introduction
As a fundamental mechanism for intercellular communication, protein secretion occurs in all living organisms and plays an important role in many physiological processes. The majority of secretory proteins contain an N-terminal signal peptide that allows their translocation into the endoplasmic reticulum via the classical secretory system [1]. Nevertheless, several cytoplasmic proteins that lack a known signal peptide have been detected in the extracellular environment and are secreted via the non-classical protein secretion pathway [2]. They are usually described as NCSPs and play diverse roles in various biological processes, including intercellular signaling, immune regulation, tissue repair and regeneration, cell communication and human diseases such as neurodegenerative disorders and cancer [3][4][5].
Accurate identification of NCSPs is important for unraveling the complexity of intercellular communication and the underlying mechanisms involved in the above physiological and pathological processes. Since experimental approaches are often costly and time-consuming, computational methods are required to enable genome-wide annotation of NCSPs with high efficiency and low cost [6]. To date, various machine learning-based methods have been developed for predicting NCSPs, including SecretomeP [7], SecretP [8], NClassG+ [9], PeNGaRoo [10], NonClasGP-Pred [11], ASPIRER [12] and iNSP-GCAAP [13]. For instance, Bendtsen et al. proposed the first tool, termed SecretomeP, for predicting NCSPs in mammals by employing six sequence-based features as the input of a neural network [7]. The SecretP model trained a support vector machine (SVM) to distinguish three types of secretory proteins by using both sequence and structural features [8]. The NClassG+ tool was designed for identifying NCSPs in Gram-positive bacteria; it adopted nested k-fold cross-validation (CV) to select the best models from four different sequence transformation vectors and SVMs with linear, polynomial and Gaussian kernel functions [9]. Recently, Zhang et al. developed a two-layer LightGBM ensemble learning framework, termed PeNGaRoo, for predicting NCSPs in Gram-positive bacteria by extracting three groups of features, i.e., sequence-derived features, evolutionary information-based features and physicochemical property-based features [10]. Moreover, the NonClasGP-Pred model improved the performance of NCSP prediction on the same datasets as PeNGaRoo by handling the potential prediction bias arising from imbalanced data [11]. Additionally, ASPIRER trained a hybrid deep learning-based framework to enhance the identification of NCSPs by combining a whole amino acid sequence-based model and an N-terminal sequence-based model [12]. iNSP-GCAAP utilized the global composition of amino acid properties to encode protein sequences and then adopted the random forest algorithm to perform the prediction of NCSPs, achieving performance superior to that of the other state-of-the-art methods [13].
To the best of our knowledge, the PLM technique has not been systematically tested on the prediction of NCSPs. In this study, we proposed a novel computational approach, termed NCSP-PLM, to identify NCSPs based on their protein sequences by selecting the optimal models from nine different PLM embeddings and three deep learning models. First, for each of these nine embeddings, we trained three neural networks, i.e., MLP, attention mechanism and bidirectional long short-term memory network (BiLSTM), and then picked out the best one. Second, we selected the top five models, whose accuracies were higher than the average accuracy of these nine models. Third, an ensemble classifier was adopted to perform the final prediction of NCSPs using the weighted voting of these five optimal models. Benchmark experiments with the 5-fold CV and the independent test suggested that the proposed NCSP-PLM model outperformed existing tools based on traditional handcrafted features and that the PLM embeddings are particularly useful for NCSP prediction. Figure 1 illustrates the flow chart of the NCSP-PLM model.

Benchmark datasets
The critical first step in developing a robust and efficient classification model is the construction of a high-quality benchmark dataset. In this study, we used the benchmark datasets constructed by Zhang et al. [10] to train and evaluate the proposed model. The training dataset includes 141 positive samples (i.e., NCSPs) and 446 negative samples (i.e., cytoplasmic proteins), and was used to perform the 5-fold CV. In addition, the independent test dataset consists of 34 positive and 34 negative samples, and was employed to compare our model with the other existing tools.
The main reasons why we adopted these datasets are as follows.
(1) All NCSPs were experimentally verified in the literature. (2) The sequence similarity was reduced to 80% to avoid homology bias. (3) There were no overlapping sequences between the training dataset and the independent test dataset.

Pre-trained protein language model embeddings
As protein representations, we directly extracted self-supervised embeddings from pre-trained PLMs without fine-tuning them on the training data. In the present work, nine popular PLMs were adopted, including ProtVec [17], SeqVec [18], ProSE [19], UniRep [20], Tape [21], ESM-1b [22], ProtBERT [23], ProtT5 [23] and ProteinBERT [24]. Given a protein of length L, the size of a PLM embedding is L × F, where F denotes the dimension of the individual embedding for each amino acid. To obtain a fixed-length vector representation, we averaged the embedding matrix over the length L. Table 1 summarizes the nine PLM embeddings used in this study. (1) The ProtVec embedding is the first word vector-based protein representation, which was trained on the Swiss-Prot database [33] through a Skip-gram neural network and generates a 100-dimensional vector [17]. (2) SeqVec was trained on the UniRef50 database [34] by using an architecture composed of a convolutional layer and two BiLSTM layers [18]. (3) The structure of ProSE is a three-layer BiLSTM similar to the SeqVec structure, with the difference that it uses not only the sequence data but also the structural information of the proteins [19]. (4) The UniRep model contains a layer of multiplicative LSTM with 1900 hidden units, which was trained on the UniRef50 database [20]. (5) The Tape model, trained on the Pfam database [36], aims to leverage the power of transformers to capture long-range dependencies and context in protein sequences [21]. (6) The ESM-1b model has 33 transformer layers and was trained on the UniRef50 database by using the masked language modeling objective [22]. (7) The ProtBERT and ProtT5 models are based on two auto-encoder transformer structures, trained on data from the BFD [37] and UniRef databases. The difference between the two models is that ProtBERT uses only the encoder component, while ProtT5 consists of both an encoder and a decoder. (8) Unlike the classic transformers, ProteinBERT is a denoising auto-encoder model and contains both local and global representations [24]. The details of these nine PLM embeddings are also provided in Supplementary Table S1.
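The averaging step that turns an L × F embedding matrix into a fixed-length vector can be illustrated with a short sketch. The arrays below are random stand-ins for PLM outputs (no real model is loaded), and the function name `mean_pool` and the toy dimension F = 4 are illustrative assumptions rather than part of the NCSP-PLM code.

```python
import numpy as np

def mean_pool(per_residue_embedding):
    """Average an (L, F) per-residue embedding over the length axis, giving shape (F,)."""
    emb = np.asarray(per_residue_embedding, dtype=float)
    if emb.ndim != 2:
        raise ValueError("expected a 2-D array of shape (L, F)")
    return emb.mean(axis=0)

# Toy stand-ins for PLM outputs: two proteins of different lengths, F = 4.
rng = np.random.default_rng(0)
short_protein = rng.normal(size=(5, 4))    # L = 5
long_protein = rng.normal(size=(12, 4))    # L = 12

# Both proteins map to vectors of the same dimension F and can be stacked.
pooled = np.stack([mean_pool(short_protein), mean_pool(long_protein)])
print(pooled.shape)
```

Because the average is taken over the length axis, proteins of any length map to a vector of the same dimension F, which gives the downstream networks a fixed-size input.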

Deep learning model architecture
In this study, we adopted three different deep learning architectures, i.e., MLP, attention mechanism and BiLSTM, to process the PLM embeddings and perform the prediction of NCSPs. Figure 2 shows the overall network structures of these three models. We implemented our models by using TensorFlow (1.15.5), and the specific parameters of these deep networks are available in Supplementary Table S2. As shown in Figure 2(a), the MLP was employed as our baseline model, which consists of an input layer, three hidden layers and an output layer. Additionally, we applied batch normalization (BatchNorm) after the hidden layers to mitigate overfitting. In Figure 2(b), an attention layer was introduced before the MLP structure to amplify the influence of key input features. As the output of the attention layer, a weighted feature vector quantifying the importance of the embeddings was obtained and then passed to a dense layer consisting of 512 units. As illustrated in Figure 2(c), we designed a BiLSTM layer with 512 cells before the MLP to process the input PLM embeddings in both forward and backward directions simultaneously. The output of the BiLSTM layer was fed to a flatten layer, followed by two dense layers with 512 cells. To reduce overfitting, we also applied BatchNorm to both the flatten and dense layers.

Imbalanced classification problem solving
The imbalanced proportion of positive and negative samples could affect the prediction accuracy of the classifier. In this study, we explored three approaches to address this issue, i.e., synthetic minority oversampling technique (SMOTE) [38], focal loss [39] and weighted binary cross-entropy (WCE) [40].
SMOTE is an oversampling technique that creates synthetic samples for the minority class on the line segments connecting a sample point and one of its K-nearest neighbors [38]. Focal loss is an improved version of the cross-entropy loss that specifically handles the imbalanced classification problem by assigning higher weights to hard or frequently misclassified instances, while down-weighting the easy instances [39]. WCE is also a variant of the binary cross-entropy loss function that assigns different weights to the positive and negative classes to balance their contributions to the loss function [40]. The weights are usually inversely proportional to the class frequencies, meaning that the weight of the minority class is higher than the weight of the majority class.
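The three strategies can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in NCSP-PLM: the function names are hypothetical, the weight normalization `total / (2 * n_class)` is one common convention for inverse-frequency weights, and `gamma = 2.0` and `alpha = 0.25` are the defaults commonly quoted for focal loss rather than values taken from this study. Only the class counts (141 positive, 446 negative) come from the training dataset described above.

```python
import numpy as np

def inverse_frequency_weights(n_pos, n_neg):
    """Class weights inversely proportional to class frequency (assumed convention)."""
    total = n_pos + n_neg
    return total / (2.0 * n_pos), total / (2.0 * n_neg)

def weighted_bce(y_true, p_pred, w_pos, w_neg, eps=1e-7):
    """Binary cross-entropy with separate weights for the positive and negative classes."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    loss = -(w_pos * y * np.log(p) + w_neg * (1.0 - y) * np.log(1.0 - p))
    return float(loss.mean())

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Cross-entropy scaled by (1 - p_t)^gamma, which down-weights easy examples."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    p_t = np.where(y == 1.0, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1.0, alpha, 1.0 - alpha)
    return float((-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)).mean())

def smote_point(x, neighbor, rng):
    """One synthetic sample on the segment between x and one of its neighbors."""
    lam = rng.random()                               # uniform in [0, 1)
    return x + lam * (neighbor - x)

# Class counts of the training dataset: 141 NCSPs vs 446 cytoplasmic proteins.
w_pos, w_neg = inverse_frequency_weights(n_pos=141, n_neg=446)

# Toy labels and predicted probabilities, for illustration only.
y = np.array([1, 0, 0, 0])
p = np.array([0.3, 0.2, 0.1, 0.4])
print(w_pos, w_neg)
print(weighted_bce(y, p, w_pos, w_neg), focal_loss(y, p))
print(smote_point(np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.random.default_rng(0)))
```

With these counts the minority (positive) class receives the larger weight, so a misclassified NCSP contributes more to the WCE loss than a misclassified cytoplasmic protein.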

Performance assessment
In this study, the 5-fold CV and the independent test were performed to examine the performance of our models for the prediction of NCSPs. In addition, six common metrics were adopted to report the predictive ability of the proposed model [41,42], including sensitivity (SN), specificity (SP), precision (P), accuracy (ACC), F1 and Matthews correlation coefficient (MCC), defined as follows:

$$
\begin{aligned}
\mathrm{SN} &= \frac{TP}{TP + FN}, \\
\mathrm{SP} &= \frac{TN}{TN + FP}, \\
\mathrm{P} &= \frac{TP}{TP + FP}, \\
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\mathrm{F1} &= \frac{2 \times \mathrm{P} \times \mathrm{SN}}{\mathrm{P} + \mathrm{SN}}, \\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},
\end{aligned}
$$

where TP, FP, TN and FN represent the numbers of true positive, false positive, true negative and false negative samples, respectively. Additionally, the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision-recall (PR) curve (AUPRC) were calculated as two further performance metrics for the comparison with existing algorithms.
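As a sanity check, the six threshold-based metrics can be computed directly from confusion-matrix counts. The sketch below is illustrative: the function name `binary_metrics` is hypothetical, and the counts (TP = 31, FN = 3, TN = 33, FP = 1) are not stated in the text but are back-calculated from the reported SN of 91.18% and SP of 97.06% on the 34 positive and 34 negative samples of the independent test dataset.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """SN, SP, P, ACC, F1 and MCC from confusion-matrix counts.

    No guard against zero denominators is included; degenerate confusion
    matrices (e.g. no predicted positives) would need special handling.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    p = tp / (tp + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * p * sn / (p + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SN": sn, "SP": sp, "P": p, "ACC": acc, "F1": f1, "MCC": mcc}

# Counts inferred from the reported SN and SP (an assumption, see the lead-in).
m = binary_metrics(tp=31, fp=1, tn=33, fn=3)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under these inferred counts, the computed values agree after rounding with those reported for NCSP-PLM on the independent test, for example ACC = 64/68 ≈ 0.9412.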

Performance of protein language model embedding with different deep learning models
In this section, we employed three deep learning models, i.e., MLP, attention mechanism and BiLSTM, to compare the performance of nine PLM embeddings for the prediction of NCSPs. For each embedding, three neural networks were trained on the benchmark dataset, resulting in 27 base models. The results of the independent test are shown in Figure 3 and those of the 5-fold CV are illustrated in Supplementary Figure S1. As seen from Figure 3, different embeddings achieved their best ACC values with different deep learning models. Specifically, the UniRep, ESM-1b, ProtBERT and ProteinBERT embeddings identified NCSPs most accurately with the attention mechanism model, with ACC values of 0.7794, 0.8382, 0.7500 and 0.7647, respectively. The MLP models trained on the ProtVec, Tape and ProtT5 embeddings outperformed the corresponding attention mechanism and BiLSTM models, with AUC values of 0.8244, 0.7638 and 0.9325, respectively. The BiLSTM models obtained the highest ACC values (i.e., 0.6912 and 0.7647) when using the SeqVec and ProSE embeddings. Moreover, ProtT5 performed better than the other embeddings in terms of ACC, MCC, F1, AUC and AUPRC.
To further select the optimal models, the best ACC values of the nine models selected from the 27 base models are plotted in Figure 4. The average ACC of the nine models was 0.765. Four embeddings were discarded in the subsequent analysis due to their below-average ACC values. The remaining five models, i.e., ProSE+BiLSTM, ProtT5+MLP, UniRep+Attention, ESM-1b+Attention and ProteinBERT+Attention, were selected to build the ensemble classifier for the identification of NCSPs.
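The two-step selection described in this section (best network per embedding, then discard models below the average ACC) can be sketched as follows. The embedding names `emb_a` to `emb_d` and all ACC values are toy placeholders chosen for illustration; they are not results from this study, and the function names are hypothetical.

```python
def best_network_per_embedding(acc_table):
    """For each embedding, keep the network with the highest ACC."""
    best = {}
    for embedding, acc_per_network in acc_table.items():
        network = max(acc_per_network, key=acc_per_network.get)
        best[(embedding, network)] = acc_per_network[network]
    return best

def drop_below_average(best_acc):
    """Discard models whose ACC is below the mean ACC of all candidates."""
    mean_acc = sum(best_acc.values()) / len(best_acc)
    kept = {model: acc for model, acc in best_acc.items() if acc >= mean_acc}
    return mean_acc, kept

# Toy ACC table (placeholder numbers, NOT the paper's results).
toy_acc = {
    "emb_a": {"MLP": 0.60, "Attention": 0.55, "BiLSTM": 0.58},
    "emb_b": {"MLP": 0.85, "Attention": 0.90, "BiLSTM": 0.80},
    "emb_c": {"MLP": 0.70, "Attention": 0.75, "BiLSTM": 0.80},
    "emb_d": {"MLP": 0.70, "Attention": 0.65, "BiLSTM": 0.60},
}

best = best_network_per_embedding(toy_acc)
mean_acc, kept = drop_below_average(best)
print(sorted(kept))
```

In the toy table, the mean of the four best ACC values is about 0.75, so only the `emb_b` and `emb_c` models are retained for the ensemble.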

Performance of ensemble approaches
In this section, the independent test was performed to assess the performance of the ensemble models, which adopted the soft voting strategy to integrate the outputs of the five optimal base models by assigning different weights. For the sake of simplicity, the weights of ProSE+BiLSTM, UniRep+Attention, ESM-1b+Attention and ProteinBERT+Attention were all set to 1 because these models performed at comparable levels. In contrast, ProtT5+MLP was assigned higher weights to strengthen its influence on the final results because of its remarkable performance. Five metrics, i.e., SN, SP, ACC, MCC and F1, were adopted to evaluate the performance of these models, and the corresponding results are listed in Table 2.
As can be seen from Table 2, all ensemble models achieved better and more stable performance than the corresponding individual models, indicating the effectiveness of the soft voting strategy. Besides, the ensemble model obtained the highest ACC, MCC and F1 values when ProtT5+MLP had a weight of 3. However, the ACC value declined when the weight of ProtT5+MLP was increased beyond 3, indicating that an excessively high weight may lead to overreliance on a single model and thus harm the overall performance.
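Weighted soft voting amounts to a weighted average of the probabilities predicted by the base models. The sketch below assumes that each base model outputs a probability of the positive class and that the final label is obtained with a 0.5 threshold; both are assumptions, as are the function name and the toy probabilities. Only the weight vector (1, 1, 1, 1, 3) follows the best setting described above.

```python
import numpy as np

def weighted_soft_vote(probs, weights, threshold=0.5):
    """Weighted average of per-model positive-class probabilities.

    probs:   array of shape (n_models, n_samples)
    weights: array of shape (n_models,)
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    avg = weights @ probs / weights.sum()
    return avg, (avg >= threshold).astype(int)

# Rows: ProSE+BiLSTM, UniRep+Attention, ESM-1b+Attention,
# ProteinBERT+Attention, ProtT5+MLP. Toy probabilities for three proteins.
toy_probs = [
    [0.4, 0.8, 0.2],
    [0.4, 0.7, 0.3],
    [0.4, 0.9, 0.1],
    [0.3, 0.6, 0.2],
    [0.9, 0.8, 0.1],
]

avg_equal, labels_equal = weighted_soft_vote(toy_probs, [1, 1, 1, 1, 1])
avg_w3, labels_w3 = weighted_soft_vote(toy_probs, [1, 1, 1, 1, 3])
print(labels_equal, labels_w3)
```

In this toy example, the first protein is classified as negative under equal weights but as positive once the ProtT5+MLP row carries a weight of 3, which illustrates how the weight shifts the decision toward the strongest base model.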

Effect of different strategies for handling sample imbalance
In this section, we investigated the effect of three different strategies for solving the data imbalance problem, i.e., SMOTE, focal loss and WCE. Table 3 summarizes the comparison results on the independent test dataset, and the corresponding ROC and PR curves are shown in Figure 5. As shown in Table 3, the SN values were lower than the SP values in all cases, which was caused by the low proportion of positive samples relative to negative samples in the training dataset. In addition, the SMOTE technique unexpectedly performed poorly in terms of ACC and MCC, suggesting that the synthetic examples generated by SMOTE did not retain the specific characteristics of NCSPs. In contrast, the focal loss and WCE techniques, both of which are based on the cross-entropy loss function, markedly improved the model's performance. The WCE method was superior to the focal loss method in terms of all evaluation metrics except SP. Hence, we adopted the WCE method as the final scheme to handle the imbalanced classification in this study.

Comparison with existing methods
To the best of our knowledge, only four computational tools for the identification of NCSPs have been developed on the same training dataset and independent test dataset, namely PeNGaRoo [10], NonClasGP-Pred [11], ASPIRER [12] and iNSP-GCAAP [13]. As mentioned above, these models relied on a variety of handcrafted features to train different supervised learning algorithms for predicting NCSPs. Table 4 presents a comparison of our NCSP-PLM model with these methods using eight evaluation indices. The ROC and PR curves of NCSP-PLM are illustrated in Figure 6. As shown in Table 4, the proposed NCSP-PLM predictor outperformed the listed state-of-the-art methods in terms of SN (0.9118), SP (0.9706), P (0.9688), ACC (0.9412), MCC (0.8839), F1 (0.9394), AUC (0.9758) and AUPRC (0.9623). This indicates that the performance of traditional protein representations can be matched or surpassed by the PLM embeddings for the NCSP prediction task. Additionally, the NonClasGP-Pred tool, which addressed the data imbalance issue by generating ten balanced datasets, achieved balanced SN and SP values. Moreover, the ASPIRER and iNSP-GCAAP models yielded comparable SP values higher than 0.97. However, the SN values of these two methods were lower than 0.65, probably because of the data imbalance.

Conclusions
In this study, we presented a novel approach called NCSP-PLM for predicting NCSPs in Gram-positive bacteria. First, we provided a comparative analysis of nine different PLM embeddings with three deep learning models, and picked out the five optimal base models. Then, we constructed the ensemble learning framework using the weighted soft voting scheme to improve the performance of the proposed model and adopted the WCE technique to handle the data imbalance issue. Finally, benchmark experiments demonstrated that NCSP-PLM performed remarkably well in the NCSP identification task and obtained a significant performance boost over current state-of-the-art methods based on traditional protein feature representations. The source code and all the datasets are freely available at https://github.com/hollymmm/NCSP-PLM.
There are two aspects that highlight the novelty of our model: (1) the knowledge derived from the pre-trained PLMs was extracted as feature embeddings and adopted to predict NCSPs for the first time; and (2) nine PLMs were compared to make the most of their potential for the annotation of NCSPs. In future work, we aim to improve our model along three major avenues. First, to mitigate the risk of overfitting, we will gather additional NCSP samples from published work and build a larger dataset for training our model. Second, we will explore the combined use of multi-view features to enhance the prediction of NCSPs, such as sequence-derived features, PSSM-based features, physicochemical property-based features and PLM-based features. Third, we will provide a user-friendly web server accessible to the public, offering more than just the source code of the model.

Use of AI tools declaration
The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 1. The flow chart of the NCSP-PLM model.

Figure 2. The network structures of three deep learning models. (a) The MLP model processes PLM embeddings through three dense layers. (b) The Attention model adds the attention mechanism before two dense layers. (c) The BiLSTM model uses a flatten layer after the output of BiLSTM, followed by two dense layers.

Figure 4. The line chart shows the ACC values for nine PLM embeddings.

Figure 5. The ROC and PR curves based on three different balancing strategies. (a) ROC curves; and (b) PR curves.

Figure 6. The ROC and PR curves of NCSP-PLM based on the independent test. (a) ROC curves; and (b) PR curves.

Table 1. The summary of the nine PLM embeddings adopted in this study.

Table 2. Performance of the soft voting by using different weights.

Table 3. Effect of three balancing strategies.

Table 4. Performance comparison with existing methods using the independent test.