PupStruct: Prediction of Pupylated Lysine Residues Using Structural Properties of Amino Acids

Post-translational modification (PTM) is a critical biological process that adds to the diversification of the proteome. Among the many known modifications under study, pupylation has gained attention in the scientific community due to its significant role in regulating biological processes. Traditional experimental detection of pupylation sites is expensive and demands considerable time and resources, so many computational predictors have been developed to address this issue. However, their performance remains limited. In this study, we propose a new computational method, named PupStruct, which uses the structural information of amino acids with a radial basis kernel function Support Vector Machine (SVM) to predict pupylated lysine residues. We compared PupStruct with three state-of-the-art predictors from the literature; PupStruct showed a significant improvement over them on a benchmark dataset, with a sensitivity of 0.9234, specificity of 0.9359, accuracy of 0.9296, precision of 0.9349, and Matthews correlation coefficient of 0.8616.


Introduction
Post-translational modifications (PTMs) are changes to protein composition through the addition of small molecules to specific sites (amino acid residues) on the body of a protein.
While many of these PTMs have been studied extensively, a modification named pupylation [19,20] has attracted much attention in the research community. It is the process whereby prokaryotic ubiquitin-like protein (Pup) [21,22] attaches to substrates, marking them for degradation via an isopeptide bond. Finally, PupStruct was compared with two benchmark predictors ([36] and [38]) and showed significantly improved performance over them. PupStruct is able to predict pupylated lysines with 0.9234 sensitivity, 0.9359 specificity, 0.9296 accuracy, and 0.8616 Matthews correlation coefficient.

Materials and Methods
We propose a computational method named PupStruct which employs eight structural features per residue: accessible surface area, the secondary structure probabilities (helix, strand, and coil), and four backbone torsion angles. The following sections discuss the benchmark dataset, the different features, feature extraction for each lysine, and the support vector machine classifier used for pupylation site prediction.

Protein Dataset
As stated in the introduction, the protein sequences for this study were taken from the PupDB database [44]. It contained 153 protein sequences whose lysine residues are either pupylated or non-pupylated. We examined each protein sequence and labelled each of its lysine residues as pupylated or non-pupylated, which yielded 181 positive (pupylated) and 2290 negative (non-pupylated) lysines for this study. The next section explains the structural features computed from each of the protein sequences.

Accessible Surface Area (ASA)
ASA refers to the area of each amino acid that is accessible to solvent in the protein's 3D configuration [65-67]. Since this value depends on the protein configuration, the predicted ASA value of individual amino acids carries vital information about the protein structure. We executed SPIDER2 [48] on each protein sequence to compute an estimated numeric ASA value for each amino acid. It is worth noting that the predicted ASA value depends entirely on the sequence information, which is what SPIDER2 mainly uses for its computation.

Secondary Structure
This property provides significant information on the local 3D structure of proteins. It can be expressed as each amino acid's probability of belonging to one of three defined local structures, namely helix (ph), strand (pe), and coil (pc), as shown in Figure 1a. Again, we executed SPIDER2 to predict the probability of each amino acid belonging to these three local structures, which yields three discrete numerical vectors [68]. SPIDER2 returns these probabilities as one L × 3 matrix, where L is the protein length and the three columns are the probabilities of the local structures ph, pe and pc. For brevity, we denote this matrix as SSPre [69].

Local Backbone Torsion Angles
The secondary structure gives important information on the local configuration of the amino acids of a protein [70], whereas torsion angles between neighbouring amino acids supplement the predicted ASA and secondary structure with vital information about the local structure of proteins. Since the predicted secondary structure is a discrete output, the backbone torsion angles φ and ψ continuously provide information on local amino acid interactions along the protein backbone [71,72]. Recently, two new angles were introduced based on the dihedral angle θ between three Cα atoms (Cα(i−1)-Cα(i)-Cα(i+1)) and τ, the rotation about the Cα(i)-Cα(i+1) bond [50]. To obtain these four angles, we executed SPIDER2 [49] on every protein sequence and obtained four numerical vectors, namely φ, ψ, θ and τ, illustrated in Figure 1b.
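As an illustration of how these predicted values are consumed downstream, the sketch below parses a SPIDER2-style per-residue output table. The column layout shown (index, residue, secondary-structure state, ASA, φ, ψ, θ, τ, and the three class probabilities) is an assumption for illustration and should be checked against the output of the SPIDER2 version actually used:

```python
# Minimal parser for a SPIDER2-style per-residue prediction file.
# Column layout assumed here (verify against your SPIDER2 version):
#   idx  AA  SS  ASA  Phi  Psi  Theta  Tau  P(C)  P(E)  P(H)

def parse_spd3(lines):
    """Return one dict of structural features per residue."""
    residues = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip header and blank lines
        f = line.split()
        residues.append({
            "aa":  f[1],
            "asa": float(f[3]),
            "phi": float(f[4]), "psi": float(f[5]),
            "theta": float(f[6]), "tau": float(f[7]),
            "pc": float(f[8]), "pe": float(f[9]), "ph": float(f[10]),
        })
    return residues

# hypothetical two-residue example in the assumed layout
example = [
    "#  AA SS  ASA    Phi    Psi   Theta   Tau   P(C) P(E) P(H)",
    "1  M  C  150.9  -91.1  140.8  111.1  -168.9 0.82 0.11 0.07",
    "2  K  H   60.2  -63.5  -42.7   89.4    50.1 0.10 0.05 0.85",
]
feats = parse_spd3(example)
```

Each residue then carries the eight numeric values (ASA, ph, pe, pc, φ, ψ, θ, τ) used as features in the following section.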
Figure 1. Illustration of the secondary structure of a protein and the local backbone torsion angles of amino acids: (a) the helix, strand and coil of a protein (picture source: freepik.com); (b) the torsion angles associated with the protein backbone. The dihedral angles for the different bonds are discussed above.

Feature Extraction for Lysine Residues
Structural features are used to identify pupylated and non-pupylated sites by employing the 6 upstream and 6 downstream amino acids around, and including, the lysine residue K, as shown in Figure 2, giving a window size of 13. The mirroring effect [18,47,73-75] was employed to fill in amino acids when fewer than 6 are available upstream or downstream of the lysine residue, i.e., when the lysine residue is located near the N or C terminus, as shown in Figure 3. To find the best window size, we constructed training datasets using window sizes from 11 to 41 residues and trained the PupStruct predictor; the best result was obtained with a window size of 13.
Let us consider a peptide S, consisting of 6 upstream and 6 downstream amino acids with the lysine residue K in the middle, which can be stated as:

S = A−6 A−5 A−4 A−3 A−2 A−1 K A+1 A+2 A+3 A+4 A+5 A+6

where A−i (for 1 ≤ i ≤ 6) are the upstream and A+i (for 1 ≤ i ≤ 6) the downstream amino acids. Therefore, the peptide consists of 13 amino acids in total, including K. Each peptide S contains either a pupylated or a non-pupylated lysine, so K carries a class label x ∈ {0, 1}: if x = 1 then S contains a pupylated residue, and if x = 0 then S contains a non-pupylated residue. Moreover, each amino acid Ai (for −6 ≤ i ≤ 6; A0 = K) is described by the structural features:

Ai = {ASA, ph, pe, pc, φ, ψ, θ, τ}

It is worth noting that these eight structural features are numeric values, each taking a single value for each amino acid Ai; thus Ai can be expressed as an 8-dimensional feature vector. The numeric values were normalized and then placed in vector form. Each segment S (of 13 amino acids) is therefore represented by 104 structural features (13 amino acids × 8). These structural features are used to predict whether the lysine at the centre of peptide S is pupylated or non-pupylated.

Support Vector Machine for Classification
The support vector machine (SVM) [76,77] is used in both regression and classification applications and features in many state-of-the-art predictors for pupylation sites [35-39,41,43]. The literature shows that the SVM produces a lower prediction error than other classifiers when large numbers of features are considered, as in this study with its 104 features. The SVM has also produced strong results and is widely used across bioinformatics research, including genomics, protein function prediction, protease functional site recognition, chemogenomics, transcription initiation site prediction and gene expression data classification [78-80]. An SVM operates by finding the maximum margin between two hyperplanes that represent the linear boundaries of two different classes. Since the dimension of the hyperplane is determined by the number of features, a kernel function, which can be polynomial, radial or linear, is employed to deal with non-linear boundaries between classes [81-83]. We used the LIBSVM [84] classifier, developed by Chih-Chung Chang and Chih-Jen Lin, on the Matlab platform (The MathWorks, Inc., Natick, Massachusetts, United States) with a radial basis kernel function to determine the margin between pupylated and non-pupylated lysine residues. The radial basis kernel function was fine-tuned with gamma set to 0.5 and the cost value set to 3.
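A minimal sketch of this classifier configuration, assuming scikit-learn's SVC (which wraps LIBSVM) as a stand-in for the original Matlab/LIBSVM setup; the synthetic blobs below are illustrative, not the PupStruct benchmark data:

```python
# RBF-kernel SVM with gamma = 0.5 and cost C = 3, as stated above.
# scikit-learn's SVC (a LIBSVM wrapper) stands in for the Matlab pipeline;
# the two Gaussian blobs are synthetic 104-dimensional toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two separable blobs standing in for non-pupylated (0) / pupylated (1)
X = np.vstack([rng.normal(0.0, 0.3, (40, 104)),
               rng.normal(1.0, 0.3, (40, 104))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="rbf", gamma=0.5, C=3.0)   # K(x, x') = exp(-0.5 * ||x - x'||^2)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

Note that with 104-dimensional inputs an RBF gamma of 0.5 makes the kernel very local, which is one reason the feature normalization described earlier matters.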

Results
A computational method that aims to predict pupylation sites requires a rigorous assessment of its performance. This section discusses the statistical metrics, the evaluation strategy, dataset balancing, and the comparison of the proposed PupStruct with other recent state-of-the-art predictors from the literature.

Performance Measures
In this study, we use five metrics to compare the performance of PupStruct with other state-of-the-art predictors in predicting pupylated and non-pupylated lysines. These five metrics are sensitivity (Sn), specificity (Sp), accuracy (Acc), precision (Pre) and the Matthews correlation coefficient (MCC), which are widely used in the literature [35,37-39,85]. Sensitivity, one of the key measures, evaluates the correctness of identifying pupylation sites: a value of 0 indicates that the predictor cannot detect any pupylated lysine residues (true positives), whereas 1 indicates that it correctly identifies all pupylated lysines. Specificity gauges the predictor's ability to detect non-pupylation sites (true negatives); it also varies between 0 and 1, where 0 indicates that the predictor cannot identify any non-pupylation site and 1 that it identifies all of them. Accuracy is the fraction of all lysine residues, pupylated and non-pupylated, that are classified correctly, ranging from 0 (least accurate) to 1 (most accurate). Precision is the fraction of sites predicted as pupylated that are truly pupylated. The Matthews correlation coefficient (MCC) measures the overall classification quality of the predictor and ranges from −1 to +1, where −1 implies total disagreement between prediction and observation and +1 a perfect positive correlation.
Considering Equations (3)-(7) for each metric, consider a dataset with +P pupylated sites and −P non-pupylated sites, and let +P⁺ be the number of pupylated sites classified correctly (true positives), +P⁻ the number of pupylated sites classified incorrectly (false negatives), −P⁺ the number of non-pupylated sites classified correctly (true negatives) and −P⁻ the number of non-pupylated sites classified incorrectly (false positives). Each metric can then be expressed as:

Sensitivity = +P⁺ / (+P⁺ + +P⁻)  (3)

Specificity = −P⁺ / (−P⁺ + −P⁻)  (4)

Precision = +P⁺ / (+P⁺ + −P⁻)  (5)

Accuracy = (+P⁺ + −P⁺) / (+P⁺ + +P⁻ + −P⁺ + −P⁻)  (6)

MCC = (+P⁺ × −P⁺ − −P⁻ × +P⁻) / √((+P⁺ + −P⁻)(+P⁺ + +P⁻)(−P⁺ + −P⁻)(−P⁺ + +P⁻))  (7)

A perfect predictor would achieve the highest value in all five metrics. At a minimum, sensitivity should be higher when comparing against other predictors: a low sensitivity means the predictor cannot correctly identify pupylated lysine residues and is therefore unfit for lysine pupylation detection.
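The five metrics can be computed directly from the four counts defined above; the confusion-matrix values in the worked example are hypothetical:

```python
# The five metrics in the paper's notation: +P+ = true positives (tp),
# +P- = false negatives (fn), -P+ = true negatives (tn), -P- = false positives (fp).
import math

def metrics(tp, fn, tn, fp):
    sn  = tp / (tp + fn)                    # sensitivity, Eq. (3)
    sp  = tn / (tn + fp)                    # specificity, Eq. (4)
    pre = tp / (tp + fp)                    # precision,   Eq. (5)
    acc = (tp + tn) / (tp + fn + tn + fp)   # accuracy,    Eq. (6)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))  # Eq. (7)
    return sn, sp, acc, pre, mcc

# worked example on a small hypothetical confusion matrix
sn, sp, acc, pre, mcc = metrics(tp=90, fn=10, tn=85, fp=15)
```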

Evaluation Strategy
To accurately evaluate the effectiveness of the PupStruct predictor in terms of the statistical metrics, we used cross-validation. The two most common cross-validation approaches are n-fold cross-validation and the jackknife, in each of which a held-out test set is used for evaluation. The jackknife method is considered the least arbitrary and yields a unique outcome for a dataset [86], but we deployed the n-fold cross-validation scheme for this study since it involves less processing time and is commonly used in the literature [15,35,36,38,39,41]. The n-fold cross-validation technique proceeds in the following steps:

1. Partition the data samples randomly into n parts of roughly equal size, with roughly similar numbers of negative and positive samples in each fold.
2. Take out one fold as the test (validation) set and use the remaining n − 1 folds as training data.
3. Use the training set to fine-tune the parameters of the predictor.
4. Use the test set to compute the five statistical metrics.
5. Repeat Steps 2 to 4 for each of the remaining folds and average each performance metric.
We carried out 6-, 8- and 10-fold cross-validation to evaluate the PupStruct predictor and recorded the results.
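The steps above can be sketched as a stratified fold assignment; the function names are illustrative, and the sample counts mirror the balanced dataset described elsewhere in the paper (181 positives, 188 negatives):

```python
# Pure-Python sketch of the stratified n-fold procedure: split positives and
# negatives separately so each fold keeps roughly the same class balance,
# then rotate one fold out as the test set.
import random

def stratified_folds(pos_ids, neg_ids, n, seed=1):
    """Return n folds, each a list of sample ids with balanced classes."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n)]
    for ids in (list(pos_ids), list(neg_ids)):
        rng.shuffle(ids)
        for i, sample in enumerate(ids):
            folds[i % n].append(sample)   # deal samples round-robin
    return folds

def cv_splits(folds):
    """Yield (train, test) id lists, holding out each fold once."""
    for i, test in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train, test

folds = stratified_folds(range(181), range(181, 369), n=6)
splits = list(cv_splits(folds))
```

Running the same procedure with n = 6, 8 and 10 reproduces the three cross-validation settings reported here.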

Filtering Out the Imbalance Data
Our benchmark dataset comprised 163 protein sequences containing 181 pupylation sites (the positive sample set) and 2290 non-pupylation sites (the negative sample set). The roughly 12-fold difference between the negative and positive sets creates a huge imbalance between the classes. Although it may be biologically realistic for non-pupylated lysines to outnumber pupylated lysines, this imbalance can cause severe bias in machine learning. We applied the k-nearest neighbors cleaning technique [45], which is widely used in the literature, to tackle this problem.
We first set the cut-off value k equal to 12, since the negative-to-positive sample ratio was about 12:1, and eliminated any negative sample that had at least one positive sample within its 12 nearest neighbors. We then increased k until we achieved similar numbers of positive and negative samples; at k = 48 the number of negative samples was reduced to 188. After the filtering process, the filtered negative samples and all positive samples were used in the n-fold cross-validation to evaluate the PupStruct predictor.
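A sketch of this cleaning pass, assuming Euclidean distance in feature space; the clustered toy data below is illustrative only:

```python
# k-nearest-neighbour cleaning: drop any negative sample that has at least
# one positive within its k nearest neighbours; raising k removes more
# negatives and moves the classes toward balance.
import numpy as np

def knn_clean(X_pos, X_neg, k):
    """Return the negatives kept after one cleaning pass with cut-off k."""
    X_all = np.vstack([X_pos, X_neg])
    is_pos = np.array([True] * len(X_pos) + [False] * len(X_neg))
    kept = []
    for i, x in enumerate(X_neg):
        d = np.linalg.norm(X_all - x, axis=1)
        d[len(X_pos) + i] = np.inf          # exclude the sample itself
        nearest = np.argsort(d)[:k]
        if not is_pos[nearest].any():       # no positive among k neighbours
            kept.append(x)
    return np.array(kept)

rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 0.5, (20, 8))                # toy positives
X_neg = np.vstack([rng.normal(0.2, 0.5, (40, 8)),    # negatives near positives
                   rng.normal(5.0, 0.5, (40, 8))])   # negatives far away
kept = knn_clean(X_pos, X_neg, k=12)
```

Negatives sitting inside the positive cluster tend to be removed, while negatives far from any positive survive the pass, which is the intended effect of the cleaning step.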

PupStruct vs. Other Existing Predictors
The proposed PupStruct was compared with two recently proposed predictors, IMP-PUP [36] and PUL-PUP [38]; it can be noted that PUL-PUP also used structural features. Software packages are available for both methods. Since both predictors used the same dataset, many of the proteins may have appeared in their training sets; therefore, their software was re-run and tested on the same test sets used in the PupStruct evaluation process. For PUL-PUP [38], since the provided code did not execute, we extracted the features from the method and used the same training and test sets as PupStruct to calculate its performance. The reported performance is based on the test sets kept aside during the n-fold cross-validation procedure; the test data were not used to adjust the training parameters of the model. Table 1 reports the performance of the predictors. The proposed PupStruct clearly outperforms the benchmark predictors in metrics such as sensitivity, accuracy and MCC: compared with IMP-PUP [36], sensitivity was improved by 14%, accuracy by 11%, specificity by 7%, precision by 9%, and MCC by a significant 21%.
To give more insight into the performance of PupStruct, we generated ROC curves and calculated the average AUC (area under the curve) values for the 6-, 8-, and 10-fold cross-validations, which were 0.910, 0.915 and 0.911 respectively, indicating stable performance. The results of the ROC-AUC analysis are shown in Figure 4. This encouraging result demonstrates the ability of the proposed PupStruct predictor to accurately distinguish pupylated from non-pupylated lysine residues. It appears that the structural information of the amino acids surrounding a lysine provides essential information about whether it is modified. Finally, the SVM classifier with a radial basis kernel function appears to find the maximal separation between the classes when structural characteristics are employed. These structural features, together with the classifier, play a key role in predicting pupylated and non-pupylated lysine residues.

Discussion
We further analyzed each feature group (accessible surface area, secondary structure (helix, strand, and coil), and backbone torsion angles) to gauge its contribution to the predictor. We trained and tested the model with different features and recorded the results for comparison. Initially, we used each group of features for training: ASA, then secondary structure (helix, strand, and coil), then the backbone torsion angles. Next, we used the individual features (ASA, ph, pe, pc, φ, ψ, θ, τ) separately and recorded their individual contributions to the predictor. Finally, we used a combination of the features that contribute the most to the predictor. The results shown in Table 2 are for six-fold cross-validation.
It is clearly observed from Table 2 that ASA and the secondary structure (ph, pe, pc, also known as SSPre) contribute the most to the performance. SSPre contributes the most towards specificity and precision, while ASA contributes the most towards sensitivity and MCC, which are the most important metrics; however, combining the two reduces the performance. A protein's shape is determined by the amino acid sequence of the polypeptide chain. When exposed to the cytosol (the water-based solution in which proteins float) or the lumen (the inside space of a tubular structure), the polypeptide chain adopts localized organization into secondary structure that optimizes the interactions of amino acid side chains with each other and with water; the polypeptide backbone folds into spirals (helices) and ribbons (strands). These properties provide very important information about each amino acid, and extracting the helix, strand, and coil (SSPre) values contributed the most to the performance of PupStruct [87,88]. In the literature, the accessible surface area (ASA) of a protein is regarded as a determining factor in protein folding and stability. ASA is the surface traced around a protein by the centre of a hypothetical solvent sphere rolled over the surface of the molecule, and based on its ASA value an amino acid residue can be classified as buried or exposed; this makes ASA a crucial feature for the performance [88]. Considering individual features only, ASA and coil (pc), followed by τ from the local torsion angles, contribute the most to the predictor. Ultimately, using all the features gave the best result, shown in Table 1, which means that each feature contributes something to the predictor.
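This ablation can be sketched as column selection over the 104-dimensional window vectors; the feature ordering in FEATURES is an assumption made here for illustration:

```python
# Each residue contributes 8 feature columns in a fixed order; selecting a
# named subset for all 13 window positions carves the corresponding columns
# out of the 104-dimensional vectors, which can then be fed to the classifier.
import numpy as np

FEATURES = ["asa", "ph", "pe", "pc", "phi", "psi", "theta", "tau"]

def subset_columns(names, window=13):
    """Column indices for the chosen features at every window position."""
    idx = [FEATURES.index(n) for n in names]
    return [p * len(FEATURES) + i for p in range(window) for i in idx]

X = np.arange(13 * 8, dtype=float).reshape(1, 104)   # one toy sample
X_asa = X[:, subset_columns(["asa"])]                # ASA-only ablation
X_ss  = X[:, subset_columns(["ph", "pe", "pc"])]     # SSPre-only ablation
```

Training the same SVM on each column subset in turn reproduces the per-group comparison summarized in Table 2.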

Conclusions
This study presented a new computational method named PupStruct for identifying pupylation sites in protein sequences. PupStruct utilizes the structural information of the amino acids around the lysine residue and uses the k-nearest neighbour approach to solve the data imbalance issue. The analysis of how much each feature contributes to the predictor provided crucial information for training. Finally, a support vector machine (LIBSVM) with a radial basis kernel function, identifying the maximal separation between pupylated and non-pupylated lysine residues, allowed PupStruct to perform better than the existing predictors.