Identifying Neuropeptides via Evolutionary and Sequential Based Multi-Perspective Descriptors by Incorporation With Ensemble Classification Strategy

Neuropeptides (NPs) are neuromodulators/neurotransmitters that work as signaling molecules in the central nervous system and perform major roles in physiological activities and hormone regulation. Recently, machine learning-based prediction of therapeutic agents has gained the attention of researchers due to its reliable results. However, existing predictors suffer from high execution cost and limited predictive performance. Therefore, the development of a reliable prediction model is highly desirable for effectively predicting NPs. In this study, we present an automatic and computationally efficient model for identifying NPs. The evolutionary information is formulated using a bigram position-specific scoring matrix (Bi-PSSM) and K-spaced bigrams (KSB). Moreover, for noise reduction, a discrete wavelet transform (DWT) is utilized to form highly discriminative Bi-PSSM_DWT and KSB_DWT vectors. In addition, one-hot encoding is employed to collect sequential features from the peptide samples. Finally, a multi-perspective feature set of sequential and embedded evolutionary information is formed. The optimal features are chosen from the extracted features via SHapley Additive exPlanations (SHAP) by evaluating the contribution of each feature. The optimal features are trained via six classification models, i.e., XGB, ETC, SVM, ADA, FKNN, and LGBM. The predicted labels of these learners are then provided to a genetic algorithm to form an ensemble classification approach. Our model achieved predictive accuracies of 94.47% and 92.55% on the training and independent sequences, respectively, which is $\sim$3% higher than existing methods. We suggest that the presented tool will be beneficial and may play a substantial role in drug development and academic research.
The source code and all datasets are publicly available at https://github.com/shahidawkum/Target-ensC_NP.

NPs are smaller and less complex than proteins. Compared to traditional neurotransmitters, NPs have more receptor recognition sites [2]; as a result, NPs are highly selective for a specific target and have minimal side effects. In the immune system, NPs act as neurotransmitters, and in the endocrine system they behave like hormones. NPs play a key part in developmental processes and other biological activities [3]. Many studies indicate that NPs are especially important whenever the nervous system adapts to new challenges such as stress, injury, and drug abuse [4], [5]. NPs modulate neural activity as well as numerous features of non-neuronal cells, e.g., social behavior, food uptake, and energy usage. Matured NPs are retained in closely packed vesicles and released under controlled conditions in response to a stimulus [6], [7]. Binding to a G protein-coupled receptor initiates a signaling pathway [8]. Prepropeptides are the precursors of NPs; they undergo alternative splicing and produce one or many bioactive peptides. To produce functionally active neuropeptides, neuropeptide precursors (NPPs) go through several regulated cleavages. Prominently, these cleavage sites are identified by a cluster of basic amino acids [9]. These NPs have paved the way for the development of novel therapeutic strategies to treat a variety of neurological disorders. In nematodes, approximately 250 neuropeptides have been identified [10]. Echinoderms and cicadas are two well-known sources of neuropeptides; the neurosecretory glands of cicadas are a rich source of neuropeptides [11]. In insects, over 30 different families of NPs have been identified, with diverse roles and structures. The majority of these molecules have an impact on the insect's physiology, although their activities depend on the age and type of species [12].
A unique set of challenges is faced while designing drugs for Alzheimer's disease (AD) [13]. The key biological feature of AD is altered amyloid-beta biochemistry, which is considered one of the potential drug targets for AD. Owing to their low intrinsic toxicity, peptide-based drugs may be a viable option for treating a variety of AD symptoms. Neurological conditions such as stroke, pain, brain tumors, psychiatric disorders, and neurodegeneration are treated with effective and precise peptide-based drugs [14]. Many methods have been developed to identify neuropeptides. Liquid chromatography-tandem mass spectrometry, genetic analysis, and receptor binding assays were the traditional approaches for identifying neuropeptides [15]; however, these experimental processes are very costly and laborious. Machine learning applications have recently proved successful in drug development and discovery for the effective analysis of bioactive peptides. Therefore, considering its significance, several machine learning-based computational models have been developed for the prediction of NPs [9]. Jiang et al. presented a stacking-based ensemble model called NeuroPpred-Fuse for predicting NPs [16]. The peptide samples were represented via six different sequential representation techniques, and feature selection was employed to decrease the size of the hybrid feature vector. The proposed model reported an accuracy of 90.60%. Similarly, Hasan et al. proposed the NeuroPred-FRL predictor for NPs [17], in which numerical descriptors were formulated from the peptide samples via evolutionary, physicochemical, and sequential descriptors. The formulated features were then passed through a two-step feature selection to gather the best features, and the resulting vectors were evaluated via a random forest. Subsequently, Kang et al. presented NeuroPP for the discrimination of NPs [6].
NeuroPP used frequency-based extraction schemes, namely AAC, TPC, and DPC, to represent training samples, and the optimal features were selected via ANOVA-based feature selection. Moreover, Bin et al. used a hybrid feature vector based on binary profiles, composition, and physicochemical properties for the prediction of NPs [18]. The predictive results were examined via several hypothesis learners, and an ensemble learning algorithm was then utilized to further improve them.
The existing computational models were developed via conventional machine learning models to examine the predictive performance of the extracted descriptors. However, these methods were not cost-effective and had low prediction performance. Therefore, it is necessary to develop an automatic and computationally efficient predictor to correctly discriminate NPs from non-NPs. In this work, the evolutionary structure information was explored from the amino acid sequences using a novel bigram-PSSM extended with the discrete wavelet transform (Bi-PSSM_DWT) and k-spaced bigrams extended with the discrete wavelet transform (KSB_DWT). Apart from the evolutionary descriptors, sequential features were also formulated via one-hot encoding. Furthermore, to develop a cost-effective model, we applied the SHapley Additive exPlanations (SHAP) approach to choose the optimal features from the multi-perspective hybrid vector of evolutionary (Bi-PSSM_DWT + KSB_DWT) and one-hot sequential features [19]; SHAP-Boruta interprets the significance of each feature in the multi-perspective vector. To train and evaluate the model, various hypothesis learners were employed, such as ETC [20], SVM [21], [22], ADA [23], XGB [24], FKNN [25], and LGBM [26]. The predicted labels of these individual classifiers were provided to a genetic algorithm (GA) to develop an ensemble classifier [27] and improve the predictive outcomes of the model. The graphical abstract of our proposed model is illustrated in Figure 1.

A. DATASET
To develop an automatic predictor, the selection of a valid training dataset is an essential step. To effectively examine the predictive ability of our predictor, we used the same training samples that were previously presented in the PredNeuroP predictor [17], [18]. Initially, the dataset comprised 5948 laboratory-evaluated positive samples (NPs) drawn from the NeuroPep databank across diverse taxa [1]. Samples whose sequence lengths were greater than 100 residues or fewer than 5 residues were removed. Furthermore, the CD-HIT tool was employed on the remaining sequences with a threshold of 0.9 to eradicate homologous peptide samples whose pairwise similarity exceeded 90%. Hence, 2435 NPs were selected. A similar redundancy-removal procedure was applied to the negative samples (non-NPs), and 2435 non-NPs were finally selected. Moreover, 80% of the whole dataset is used for training (NPs = 1940 and non-NPs = 1940) and the remaining 20% of the peptide sequences are employed for testing. The overfitting and generalization ability of the trained model was measured using the independent sequences, which comprise 495 NPs and 495 non-NPs. Additionally, it was ensured that no sequence of the training data was used in the testing data.

B. FEATURE FORMULATION TECHNIQUES
1) ONE-HOT ENCODING
One-hot encoding is a sparse formulation technique that has been extensively utilized to numerically represent peptide sequences. Unlike other techniques, one-hot encoding produces binary features without affecting the sequence ordering of the amino acids in a peptide sample. Each residue of a peptide sequence is converted to a feature vector of 20 dimensions: the final feature set is generated by assigning "1" to the matched residue and "0" to all other amino acids. The working procedure of one-hot encoding for the peptide sample "APLMGFQHVR" is graphically illustrated in Figure 2. To effectively train machine learning models, feature vectors of equal dimension are required; hence, the lengths of the peptide samples are equalized by adding dummy alphabets (padding) [28]. Following this procedure, the peptide sequences of the whole training dataset are represented in equal length. It was also verified that adding these dummy alphabets has no biological or functional effect on a peptide sequence. In other words, padding allows us to generate a fixed-length input vector for a peptide sample regardless of its original length. Additionally, padding can increase the efficiency of training by allowing the algorithm to process batches of sequences in parallel, rather than processing each sequence individually.
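The encoding-with-padding procedure above can be sketched as follows. This is a minimal illustration, not the paper's exact code: the maximum length of 100 (the dataset's upper length bound) and the all-zero padding rows are assumptions for demonstration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def one_hot_encode(seq: str, max_len: int = 100) -> list[list[int]]:
    """Encode a peptide as a (max_len x 20) binary matrix.

    Each residue becomes a 20-dimensional indicator vector; positions
    beyond the sequence length are padded with all-zero rows, so the
    padding carries no biological signal.
    """
    matrix = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(seq):
            idx = AMINO_ACIDS.find(seq[pos])
            if idx >= 0:          # unknown symbols stay all-zero
                row[idx] = 1
        matrix.append(row)
    return matrix

encoded = one_hot_encode("APLMGFQHVR")
print(len(encoded), len(encoded[0]))       # 100 20
print(encoded[0][AMINO_ACIDS.index("A")])  # 1: first residue is "A"
```

Each sample therefore yields an identically shaped matrix, which is what allows batched training regardless of the original peptide length.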

2) K-SEPARATED BIGRAMS (KSB)
KSB was initially presented by Saini et al., and computes the association among amino acid residues that are non-adjacent in a protein sample. The bigram probabilities are obtained from the sequential evolution probabilities in a PSSM matrix [29], where the k-spaced bigrams are separated by k amino acid residues in the sample, with k representing the positional distance between the bigrams [30]. The complete mechanism can be summarized by the following equation:

$T_{p,q}(k) = \sum_{i=1}^{H-k} R_{i,p}\, R_{i+k,q}, \quad 1 \le p, q \le 20 \quad (1)$

where R represents the PSSM matrix with H rows, H denotes the length of the amino acid sample in matrix R, and the 20 columns correspond to the 20 valid amino acids. Equation (1) gives the transition of the p-th amino acid to the q-th amino acid; collecting all transitions yields a matrix T(k) consisting of 400 features for a single value of k. In the KSB formulation scheme, the protein samples are examined using different values of k, such as k = 1, 2, and 3, as shown in Table 1. However, considering the computational cost and the highest predictive rates of the classifiers, we use the feature vector with k = 2.
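The transition sum in equation (1) can be sketched directly. The toy 4-row, 3-column profile below stands in for a real H x 20 PSSM and is purely illustrative; the function names are ours, not the paper's.

```python
def k_separated_bigrams(R, k=2):
    """KSB transform: T[p][q] = sum_i R[i][p] * R[i+k][q],
    flattened into one feature vector (400-D for a real 20-column PSSM)."""
    H = len(R)
    n_cols = len(R[0])  # 20 for a real PSSM
    T = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(H - k):
        for p in range(n_cols):
            for q in range(n_cols):
                T[p][q] += R[i][p] * R[i + k][q]
    # flatten the n_cols x n_cols transition matrix row by row
    return [T[p][q] for p in range(n_cols) for q in range(n_cols)]

# tiny toy profile instead of a real PSSM
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
features = k_separated_bigrams(toy, k=2)
print(len(features))  # 9 (would be 400 for 20 columns)
```

Running the same function over k = 1, 2, 3 reproduces the per-k feature vectors compared in Table 1.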

3) BIGRAM POSITION-SPECIFIC SCORING MATRIX (BI-PSSM)
Bi-PSSM is an evolutionary feature extraction strategy that computes the intrinsic pattern of protein samples via alignments against various protein families [31]. Along a protein sequence, Bi-PSSM records the occurrence frequencies of the amino acid residues at each particular position [22], [32]. The resultant Bi-PSSM matrix contains negative and positive scores for the amino acid residues: a negative score indicates a low occurrence frequency of the amino acid, whereas a positive score represents a highly frequent amino acid substitution in the alignment [33]. The resultant PSSM feature space $K$ can be shown as:

$K = \begin{bmatrix} k_{1,1} & k_{1,2} & \cdots & k_{1,20} \\ k_{2,1} & k_{2,2} & \cdots & k_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ k_{L,1} & k_{L,2} & \cdots & k_{L,20} \end{bmatrix}$

where $k_{i,j}$ denotes the score of the j-th amino acid at the i-th residue position along the sequence, L is the length of the biological sample, and 20 is the number of standard amino acids.

4) DISCRETE WAVELET TRANSFORM (DWT)
DWT is a transformation filter-based noise compression and elimination approach. DWT divides an input protein signal into two sub-bands, called wavelets [34]: one wavelet contains the high-frequency coefficients, namely the detailed coefficients, and the second wavelet keeps the low-frequency coefficients, called the approximation coefficients [35]. Recent studies show that the low-frequency wavelet is more informative than the high-frequency wavelet. Hence, to extract highly effective information, the low-frequency wavelet is further divided into several levels, as given in Figure 3, where $H_F$ represents the high-frequency coefficients and $L_F$ the low-frequency coefficients. In DWT, the input is divided into several scales (levels); the detailed and approximation coefficients of a signal can be represented by $2^k$, where k denotes the number of decomposed levels. DWT can be formulated as:

$B(s,t) = \frac{1}{\sqrt{s}} \int z(a)\, \psi\!\left(\frac{a-t}{s}\right) da$

where $z(a)$ denotes the input signal and $B(s,t)$ denotes the transform coefficients for a specific position in the wavelet period and signal. $\psi\!\left(\frac{a-t}{s}\right)$ represents the wavelet function, s is the scaling variable, and t denotes the translation variable.
The approximation and detailed coefficients for a signal $z(a)$ can be formulated as:

$W_{i,L}[x] = \sum_{m} c[m]\, S[2x - m]$

$W_{i,H}[x] = \sum_{m} c[m]\, R[2x - m]$

where $c[m]$, S, and R represent the input signal of the peptide sequence, the low-pass filter, and the high-pass filter, respectively, and $W_{i,L}[x]$ and $W_{i,H}[x]$ denote the approximation and detailed coefficients of the input samples, respectively. In this computational model, the Bi-PSSM features are transformed using DWT for signal de-noising. We evaluated DWT up to five levels, generating feature vectors of 260, 520, 780, 1040, and 1300 dimensions for levels 1 through 5, respectively. The predictive results of the model were examined via these five levels of features, among which the level-3 features produced the most significant results. The features of levels 1 and 2 achieved lower performance because they capture less informative patterns than level 3, while levels 4 and 5 also yielded lower results due to redundant motifs that impair model performance. Therefore, we selected DWT up to level 3 to form a novel feature extraction approach called Bi-PSSM_DWT. Further decomposition beyond this level leads to similar and redundant features that may degrade the model.
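The recursive split of the low-frequency band in Figure 3 can be sketched with an unnormalized Haar filter pair (pairwise average as the low-pass, pairwise difference as the high-pass). The paper does not name its exact wavelet, so this is a minimal stand-in for the decomposition scheme, not the actual filter used.

```python
def haar_step(signal):
    """One DWT level: split into approximation (low-freq) and detail (high-freq)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def dwt_features(signal, levels=3):
    """Recursively decompose the approximation band, as in Figure 3,
    and concatenate every level's detail coefficients plus the final
    approximation into one feature vector."""
    feats = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        feats.extend(detail)
    feats.extend(approx)
    return feats

signal = [4.0, 2.0, 6.0, 8.0, 5.0, 1.0, 3.0, 7.0]
print(dwt_features(signal, levels=3))
```

Each added level halves the band that is decomposed, which matches the intuition that levels 4 and 5 mostly add redundant coefficients.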

C. SHAP FEATURE SELECTION
Undoubtedly, determining the biological importance of the formulated numerical descriptors is not an easy task. Classification methods are often called "black boxes" owing to their intricate internal mechanisms, and comprehending the significance of individual features of the extracted space is challenging for a machine learning model [36]. SHapley Additive exPlanations (SHAP) interpretation is a global technique for evaluating the significance of each numerical feature based on aggregations of SHAP values [37]. Interpretable evaluation with a classification model also addresses the issues that arise from the lack of feature directivity [38]. In this paper, XGB is used as the base model for SHAP because its predictive results are higher than those of the other models [39]. The procedure for selecting the optimal features via the SHAP algorithm can be described as follows: initially, we select an objective function K; then, the Shapley value $\delta_D$ of each extracted descriptor $D \in F$ is computed; finally, only the R highest-ranked features are selected, where $R < d = |F|$. The resultant Boruta-SHAP plot showing the high-ranked features is summarized in Figure 4, where each row denotes a ranked feature and each point corresponds to the SHAP value of one instance. Red points indicate high feature values, blue points indicate small feature values, and the abscissa represents the SHAP values. Following the same procedure, the entire model was visualized via SHAP interpretation. A positive SHAP value for a feature drives the prediction towards the NPs class, and a negative SHAP value drives it towards the non-NPs class.

D. ENSEMBLE LEARNING
Ensemble learning is an optimized classification strategy that has been extensively employed for predicting biological sequences via machine learning and deep learning because of its high generalization ability and prediction results. The key objective of ensemble learning is to combine the predicted labels of individual classification algorithms into a model that enhances the predicted outcomes with a minimal error rate. Compared with traditional classifiers, ensemble classification is highly reliable because it reduces the variance arising from the erroneous results of single machine learners. Accordingly, different ensemble learning models have been utilized for the prediction of different biological types, i.e., anticancer peptides [40], subcellular localization [41], antifreeze proteins [42], antiviral peptides [34], recombination spots [27], antifungal peptides [35], nucleosome positioning [43], and enhancer functions [44]. Consequently, we developed a genetic algorithm (GA)-based ensemble learning model to further examine the predicted labels of the individual classifiers obtained via the different extracted feature vectors. The GA is a heuristic learning approach that has been effectively applied in bioinformatics to solve different prediction problems with significant results [45]. A GA-based ensemble model randomly chooses a specific population of chromosomes, and different genetic operators are then employed to obtain the best performance [46], [47].
First, we calculated the prediction labels of five different machine learning models: ETC, LGBM, SVM, XGB, and ADA. All the predicted labels are then provided to the GA to develop an ensemble model as follows:

$EnC_i = ML_1 \oplus ML_2 \oplus ML_3 \oplus ML_4 \oplus ML_5 \quad (6)$

In eq. (6), $EnC_i$ denotes the proposed ensemble model, and $\oplus$ signifies the fusing operator utilized to combine the predicted labels of the single learners. For a machine learning model ML applied to a protein sample R:

$ML_j(R) \in \{C_1, C_2\}, \quad j = 1, \ldots, 5$

where $ML_1, \ldots, ML_5$ represent the individual classifiers and $C_1, C_2$ denote the predicted classes.
Finally, the predictive result of $EnC_i$ using the GA is measured as:

$GA\_EnC_i = Max\left(\sum_{j=1}^{5} W_j \cdot ML_j(R)\right)$

where $GA\_EnC_i$ denotes the GA-based ensemble classification model, Max selects the class with the higher predictive score, and $W_1, \ldots, W_5$ represent the optimal weight adjustments for the individual classifiers.
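The weighted fusion above can be sketched as a weighted majority vote whose accuracy serves as the GA's fitness function. The weights and toy labels below are illustrative assumptions; the actual weights are the ones the GA evolves.

```python
def weighted_ensemble(labels, weights):
    """Fuse one predicted class (0 or 1) per base learner:
    the class with the larger weighted vote sum wins."""
    score = {0: 0.0, 1: 0.0}
    for lab, w in zip(labels, weights):
        score[lab] += w
    return max(score, key=score.get)

def fitness(weights, all_labels, y_true):
    """Accuracy of the weighted vote over all samples; this is the
    quantity a GA chromosome (one weight vector) is scored by."""
    preds = [weighted_ensemble(labs, weights) for labs in all_labels]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# five base learners, three samples (toy values)
all_labels = [[1, 1, 0, 1, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 1, 1, 1]]
y_true = [1, 0, 1]
weights = [0.9, 0.6, 0.4, 0.8, 0.3]
print(fitness(weights, all_labels, y_true))  # 1.0
```

Selection, crossover, and mutation (Table 2) then search the weight space for the vector maximizing this fitness.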

E. FRAMEWORK OF THE PROPOSED MODEL
In this study, we presented an ensemble learning-based model for the prediction of neuropeptides. Initially, the training sequences are formulated using one-hot encoding-based sequential features. In addition, Bi-PSSM_DWT and KSB_DWT are applied to extract embedded evolutionary features from the peptide sequences. We then use a multi-perspective descriptor strategy that combines the feature vectors of the aforementioned methods. The multi-perspective vector consists of 1140 features: 20 features from one-hot encoding, 720 features from Bi-PSSM_DWT, and 400 features from KSB_DWT. The training cost of this hybrid vector would be high due to its dimensionality; therefore, to reduce the computational cost of the proposed approach, we applied XGB-based SHAP feature selection to obtain 158 optimal features from the whole vector. In the next phase, six different machine learning models are trained using the selected features; while training the models, a train/test split ratio of 80:20 is used to divide the dataset. Moreover, to form an ensemble learning model, the predicted labels of the individual classifiers are provided to the genetic algorithm to boost the prediction results. Our predictor reported higher prediction results than existing models on both the training and the independent dataset. The framework of our proposed model is illustrated in Figure 1.

F. PREDICTION MEASUREMENT PARAMETERS
In learning algorithms, several parameters are utilized to evaluate the prediction abilities of a model [48]. These evaluation parameters determine whether the required objectives of a research problem are effectively addressed [22], [25]. Various prediction metrics have therefore been applied in the literature to calculate the predictive results of a machine learning model [49], [50], [51]; however, the choice of suitable metrics depends strongly on the distribution of samples across the classes. To compute the performance rates, a confusion matrix is first generated as a table that records the actual labels of the problem against its predicted labels. From this confusion matrix, the predictive results of our study are examined via the following evaluation parameters.
$ACC = 1 - \frac{NP_{-}^{+} + NP_{+}^{-}}{NP^{+} + NP^{-}}$

$Sen = 1 - \frac{NP_{-}^{+}}{NP^{+}}$

$Spe = 1 - \frac{NP_{+}^{-}}{NP^{-}}$

$MCC = \frac{1 - \left(\frac{NP_{-}^{+}}{NP^{+}} + \frac{NP_{+}^{-}}{NP^{-}}\right)}{\sqrt{\left(1 + \frac{NP_{+}^{-} - NP_{-}^{+}}{NP^{+}}\right)\left(1 + \frac{NP_{-}^{+} - NP_{+}^{-}}{NP^{-}}\right)}}$

where $NP^{+}$ denotes the total positive sequences and $NP^{-}$ the total negative sequences. $NP_{-}^{+}$ represents the false negatives, i.e., positive sequences incorrectly predicted as non-NPs, and $NP_{+}^{-}$ represents the false positives, i.e., non-NP sequences falsely predicted as NPs.
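The parameters above can be computed equivalently from conventional confusion-matrix counts, where tp/tn are the correctly predicted NPs/non-NPs, fn corresponds to $NP_{-}^{+}$, and fp to $NP_{+}^{-}$. The counts in the example are arbitrary toy values.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Return (ACC, Sen, Spe, MCC) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)          # true positive rate
    spe = tn / (tn + fp)          # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spe, mcc

acc, sen, spe, mcc = evaluate(tp=90, tn=85, fp=15, fn=10)
print(f"ACC={acc:.3f} Sen={sen:.3f} Spe={spe:.3f} MCC={mcc:.3f}")
```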

III. RESULTS AND DISCUSSIONS
In this study, the predictive rates of the extracted vectors are examined via a k-fold cross-validation (CV) test, where 10-fold CV randomly divides the features into 10 folds of equal size [49]. Among the 10 folds, 9 folds are employed for model training and the samples of the remaining fold are kept for testing. Additionally, a stratified looping method is used: the training data are randomly split 100 times and the mean results are calculated to achieve reliable outcomes. In the following subsections, the predicted outcomes of the numerically formulated vectors on the training and test samples are reported for the various classification models.
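One stratified fold assignment of the kind described above can be sketched as follows; assigning each class to folds round-robin after shuffling keeps the class balance per fold. The function name and 0/1 labels are illustrative (a real pipeline would typically use a library implementation such as scikit-learn's StratifiedKFold, repeated 100 times as the text describes).

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Return a fold index (0..k-1) for every sample, class-balanced."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                 # randomize within the class
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k         # deal class members round-robin
    return fold_of

labels = [1] * 50 + [0] * 50
folds = stratified_folds(labels, k=10)
# every fold holds 5 positives and 5 negatives
print(sum(1 for i, f in enumerate(folds) if f == 0 and labels[i] == 1))  # 5
```

Training on nine folds and testing on the held-out fold, then rotating, gives the 10-fold CV estimates reported below.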

A. PARAMETER SETTING OF GENETIC ALGORITHM
In the GA, choosing the best parameters is a crucial step towards achieving the maximum predictive outcomes for a machine learning problem. Initially, the chromosomes of the GA were represented in bit-string form, and a population size of 80 was chosen; although a larger population may boost predictive rates, it directly increases the execution time of the model. A tournament-selection approach was used to pick potential parents from the existing population, and a rank scaling parameter was used for the production of offspring. Additionally, an intermediate crossover method with a value of 0.7 was used, and a uniform distribution was selected to mutate the genetic diversity of the next generations. Hence, the GA converged with the optimal parameters; the complete details of the optimal parameter selection are provided in Table 2. With these parameters, the higher predictive rates of the training model given in Figure 5 were achieved.

B. RESULTS ANALYSIS OF CLASSIFICATION MODELS BEFORE FEATURE SELECTION
The predictive outcomes of the learning models using the formulated feature spaces are given in Table 3. The performance of each feature vector is examined by each learner by computing its predictive accuracy (ACC), sensitivity (Sen), specificity (Spe), MCC, and AUC.

C. RESULTS ANALYSIS OF CLASSIFICATION MODELS AFTER FEATURE SELECTION
Previous models have proved that learning models increase their performance via an optimized feature vector [52], [53], [54]. In this regard, the SHAP feature selection approach is implemented to choose the optimal feature space, and the results are summarized in Table 4. The optimal feature vector is then evaluated by the ETC, FKNN, Ada, XGB, LGBM, SVM, and Ensemble-GA classifiers using 10-fold CV. ETC yielded an accuracy of 86.34%, a sensitivity of 87.93%, a specificity of 84.74%, an MCC of 0.73, and an AUC of 0.94. Compared with the ETC classifier, the performance of FKNN and Ada is not satisfactory, while XGB reflected better performance. Onward, LGBM achieved 86.08% accuracy and SVM attained 83.17%. Our proposed ensemble learning model achieved the highest prediction results, with an accuracy of 94.47%, a sensitivity of 97.32%, a specificity of 93.81%, an MCC of 0.91, and an AUC of 0.97. These results confirm that the ensemble strategy can discriminate NPs more accurately.

D. PREDICTION COMPARISON OF OUR MODEL WORK WITH PRESENT STUDIES
The predictive outcomes of our predictor and its comparison with the existing models on the training and independent sets are given in Table 5. On the training sequences, our proposed model achieved an accuracy of 94.47%, a sensitivity of 97.32%, a specificity of 93.81%, an AUC of 0.98, and an MCC of 0.91. Our model reported significant results, improving accuracy by 2.57%, sensitivity by 7.82%, MCC by 0.07, and AUC by 0.02 over NeuroPred-FRL [17]. A prediction model is said to be reliable if it has high generalization power on unseen (independent) data. In this regard, we used an independent dataset to validate the effectiveness of the proposed study; the detailed predictive results on the independent set are also illustrated in Table 5. They show that our proposed predictor obtained higher outcomes than the existing approaches: our model improved accuracy by ∼2.15%, sensitivity by ∼5.64%, MCC by ∼0.06, and AUC by ∼0.02 over NeuroPred-FRL [17], and similarly boosted accuracy by 2.15%, sensitivity by 5.64%, MCC by 0.06, and AUC by 0.01 over NeuroPpred-Fuse [16]. The current study surpassed all other existing tools on all evaluation parameters, demonstrating the efficiency of the proposed approach.

IV. CONCLUSION AND FUTURE INSIGHTS
NPs play critical roles in a variety of biological processes and in the pharmaceutical industry. In this study, a successful attempt has been made at the accurate prediction of NPs using a GA-based ensemble learner, with SHAP interpretation for the selection of optimal features from the heterogeneous feature set. The proposed approach reported remarkable predictive rates compared with the existing machine learning models applied to the prediction of NPs. The improved results of our predictive model are due to several factors, i.e., the suitable sequence formulation techniques, the selection of optimal descriptors using SHAP analysis, and an effective model training algorithm. The achieved results confirm that, owing to its superior discriminative and generalization abilities, our proposed model can play a key role in identifying NPs for drug development. In future work, we will establish a publicly available web server for the proposed method and make further efforts to develop more capable approaches, such as improved feature selection or advanced deep neural networks, to further enhance the predictive results for NPs.

CONFLICTS OF INTEREST
The authors declare no conflict of interest.