Gly-LysPred: Identification of Lysine Glycation Sites in Protein Using Position Relative Features and Statistical Moments via Chou's 5 Step Rule

Abstract: Glycation is a non-enzymatic post-translational modification which attaches sugar molecules and residues to a peptide. It is clinically important in numerous age-related, metabolic, and chronic diseases such as diabetes, Alzheimer's, and renal failure. Identifying a non-enzymatic reaction is quite challenging, and manual identification in the laboratory is a very costly and time-consuming process. In this research, we developed an accurate, valid, and robust model named Gly-LysPred to differentiate glycated sites from non-glycated sites. Comprehensive techniques based on position relative features are used for feature extraction. A random forest, combined with some preprocessing and feature engineering techniques, is used to train the computational model. Self-consistency testing, jackknife testing, and cross-validation testing are used to evaluate the model. The overall accuracy of the model under self-consistency, jackknife, and cross-validation testing was 100%, 99.92%, and 99.88%, with MCC of 1.00, 0.99, and 0.997, respectively. A user-friendly web server has also been developed to accommodate the whole procedure. The results suggest that these feature vectorization methods can play a critical role in other web servers developed to classify lysine glycation.


Introduction
Proteins are organic, polymeric, nitrogenous compounds. They are the major structural and functional components of every organism, in the form of enzymes, antibodies, hemoglobin, etc. Proteins provide energy to the body but are not its main source of energy. The athletic population requires a very high consumption of proteins. Every protein is made up of amino acids, and only 20 amino acids are constituents of all types of proteins. Of these 20, 12 amino acids (11 in children) are synthesized by the body and are termed nonessential. The remaining amino acids are termed essential: they are not synthesized by the body and must be consumed in the diet. A lack of any essential amino acid impairs the ability of tissues to grow [1]. For this reason, proper protein intake is also needed by older people. It has been observed that after the age of 40, muscle strength decreases due to loss of muscle mass [2,3]. This loss creates health-related issues such as sarcopenia and osteoporosis, and recovery from them requires adequate protein intake in later years [4]. Post-translational modifications (PTMs) are enzymatic or non-enzymatic reactions of amino acid chains. PTMs affect both the physiological functions and the structure of a protein.
Determining PTMs is essential for elucidating the processes that govern the cellular level, such as cell division, development, and diversity. The term PTM denotes a change in the polypeptide sequence resulting from either the addition or the removal of distinct chemical moieties at amino acid residues [5]. These additions and removals divide PTMs into two broad types: covalent post-translational modification and covalent cleavage of peptide backbones in proteins [6]. Chemical PTMs have been studied for a variety of biochemical changes in many types of proteins; they occur in many combinations or in a signal-dependent manner, help define tertiary and quaternary structures, and control protein activity and function. This evidence helps in understanding the biological events and disease states involving these proteins [7].
Beyond alternative splicing of messenger RNA (mRNA), which is a source of protein diversity, post-translational modifications (PTMs) of proteins further modulate and extend the range of possible protein functions by covalently attaching small chemical moieties to selected amino acid residues. Over 200 types of PTMs have been recognized; they affect many aspects of molecular and cellular function, including metabolism, signal transduction, and protein stability [8].
Several studies on PTMs have concentrated on specific forms and their relation to protein function, with phosphorylation being the most actively investigated PTM type [9,10]. PTMs can adversely impact biological cellular functions such as metabolism, signal transduction, and protein stability. These chemical modifications include phosphorylation, glycosylation, methylation, acetylation, and ubiquitination. An understanding of PTMs is therefore important in the study of cellular biology for disease treatment and prevention [11].
Lysine is an essential amino acid, meaning that it is not produced in the human body and must be obtained from the diet to maintain body functions. It is abundant in poultry, meat, and milk [12]. Lysine is required for many biological functions; notable roles include the production of connective tissues such as bone, skin, collagen, and elastin, and the synthesis of carnitine, through which fatty acids are converted into energy for healthy growth and development in offspring. It also supports immune function, notably with observed antiviral activity [13]. Under hyperglycemic conditions, non-enzymatic glycosylation begins in the body and becomes an important mechanism of protein modification, leading to conformational changes and malfunction of proteins [14]. A Schiff base and then an Amadori product are formed when a free amino group reacts with the carbonyl group of a reducing sugar, resulting in glycation of the protein. Proteins modified in this way are converted into a varied group of compounds called advanced glycation end-products (AGEs) [15].
This complete process is depicted in Fig. 1. AGEs form when amino acids in collagens undergo glycation, and they accumulate as glycation proceeds. Mass spectrometry has identified partial fructose-hydroxy-lysine glycations at each of the helical-region cross-linking sites of type I collagen, which are elevated in tissue from a diabetic mouse model [16]. That study also supports a proposed association between glycation and collagen thickening. The observed reduction in collagen extractability from diabetic tissue is attributed to the introduction of intermolecular AGE cross-links. The very small effect on collagen solubility upon pepsin digestion, compared with acid extraction, supports the addition of inter-triple-helical cross-links in diabetic mouse tendon. Accumulated AGE levels have been associated with both improvement and decline [17]. Glycation is the post-translational modification of proteins by reducing sugars and the α-carbonyl products of their degradation [18]. Amadori products can also degrade to form carbonyl compounds such as methylglyoxal, or undergo further degradation, oxidation, reduction, and condensation reactions, leading to irreversible AGE formation [19]. Diabetes, diabetic nephropathy, and other diseases are associated with the accumulation of glycation products [20].
Non-fluorescent protein cross-links such as the methylglyoxal-lysine and glyoxal-lysine dimers change the structure and functional properties of proteins, which harmfully affects cellular uptake. AGEs arise under normal physiological conditions, but their formation is accelerated when the calcium level is high [21]. Reactive oxygen species (ROS) increase the glycation of enzymes and of the structural components of the connective tissue matrix and basement membrane [22].
The identification of post-translational modifications in proteins has long been a critical issue and remains so to date. Initially, experienced data scientists did not trust computational methods because of the notorious local minima problem [27,28]. Gradually the situation changed, and the development of sequence-based and structure-based detection tools, such as X-ray crystallography and NMR, got under way [28][29][30][31][32][33][34]. Mass spectrometry, radioactive labeling, matrix statistics, vector projection, and several affinity-based methods were used to predict distinct PTM sites, e.g., ubiquitination, phosphorylation, and glycosylation. These methods were costly, laborious, and time-consuming [35][36][37][38][39][40][41][42][43]. After all these techniques, the trend shifted towards data science techniques for making predictions.
Figure 1: Chemical process of lysine glycation
Various data science techniques have been used to develop prediction servers, including but not limited to: artificial neural networks [44][45][46][47][48], backpropagation neural networks [47,48], support vector machines [49][50][51][52][53][54], hidden Markov models [55,56], logistic regression [57], Bayesian theory [58], consensus sequences [59], nearest neighbor [60], and random forest [61]. Currently, the focus is shifting from these data science techniques to a mixture of machine learning with feature processing and deep learning. For example, GANNphos [23] and DeepPhos [62] are two predictors of phosphorylation, and pDeep2 [63], DeepUbi [64], and many others [65] are likewise based on deep learning techniques. The same tendencies occurred in the prediction of lysine glycation. Initially, lysine glycation was also predicted by costly and time-consuming methods such as mass spectrometry [40], matrix statistics [41], and vector projection [42,43]. In the machine learning era, the initial predictor was GlyNN, developed in 2006 using the ANN technique on a dataset of 126 non-glycated (negative) and 89 glycated (positive) lysine sites from 20 proteins [24]. In 2015, another computational tool named PreGly, based on the composition of k-spaced amino acid pairs (CKSAAP) feature extraction technique, was used to predict lysine glycation with a dataset similar to that of GlyNN [25]. In 2016, the focus turned to different feature extraction techniques combined with machine learning models.
A new predictor, Gly-PseAAC, was developed by combining the support vector machine (SVM) algorithm with position-specific amino acid-based features, using a rationalized dataset consisting of 446 non-glycated and 223 glycated sites from the CPLM databank [26]. From 2017 to 2018, several further predictors were developed that combine a machine learning approach with feature extraction techniques to improve on the performance of earlier predictors: BPB_GlySite (a combination of the SVM algorithm and the Bi-Profile Bayes (BPB) based feature extraction technique) [66], Glypre (combining SVM with multiple features such as an index, position amino acid, CKSAAP, and conservation) [67], iProtGly-SS (using structure-based and sequence-based features) [68], GlyStruct (a combination of structural properties of amino acid residues and a support vector machine) [69], and MDS_GlySitePred (a combination of SVM and the multidimensional scaling feature extraction technique) [70]. Lysine glycation is a complex, multistep process, and identifying glycated lysines in the laboratory is a time-consuming, operator-dependent, and labor-intensive task. To overcome these issues, a computational model is developed for lysine glycation prediction with increased accuracy and efficiency. This computational model follows Chou's 5-step rule [71][72], which is depicted in Fig. 2.
In Chou's first step, data collection, a stringent and reliable dataset is developed for training and testing the model. In the second step, features are extracted from the dataset sequences after some preprocessing, and the sequences are converted into vectors using position relative incidence and statistical moments. In the third step, the learning model, machine learning models such as random forest are trained, and the most robust model is chosen to make predictions. The fourth step concerns the evaluation and validation of the model using measures such as accuracy, specificity, and sensitivity. In the last step, a web server is developed and made publicly accessible to end-users.

Materials and Methods
This section describes the overall techniques used for the predictor: dataset collection, data processing, and the learning model. In the first phase, the dataset is collected from UniProt, a well-known, publicly available online database. In the second phase, feature vectors are generated using statistical methods. For the learning model, two to three models are trained and the one with the highest accuracy is chosen.

Benchmark Dataset
A stringent and reliable dataset is the basis of a computationally accurate and statistically robust predictor. A noisy dataset degrades the classifier's robustness and casts doubt on the predicted accuracy [73]. The dataset is curated from UniProt (https://www.uniprot.org/) and consists of 1287 positive sites and 1300 negative sites. CD-HIT is used to remove redundant sequences with >= 60% similarity from the dataset.

Feature Extraction
Formulating biological sequences as a vector or a discrete model is one of the most critical problems in computational biology. Different techniques have been used in the past for this purpose, such as composition of k-spaced amino acid pairs (CKSAAP) [68], position-specific amino acid-based feature extraction [69], the Bi-Profile Bayes (BPB) based feature extraction technique [70], multiple features such as an index, position amino acid, CKSAAP, and conservation [71], structure-based sequence features [72], structural properties of amino acid residues [73], and the multidimensional scaling feature extraction technique [74].
The focus on feature extraction is due to the nature of machine learning models: they cannot handle sequence samples directly, so the dataset must be in vector form [74]. To resolve the issue PseAAc [75] is

Statistical Moments Calculation
This approach is used to describe the dataset quantitatively. Moments of different orders represent different properties of the data: some are used to evaluate the size of the data, while others indicate its eccentricity and orientation. Statisticians and mathematicians have described several well-known moments based on distribution functions and polynomials [77,78]. In this study, Hahn, central, and raw moments are considered.
Hahn moments are computed using Hahn polynomials [79] and are location and scale variant. Central moments are computed with respect to the centroid [80,81] and give location-invariant measures of the mean, variance, and asymmetry. Raw moments give location-variant measures of the mean, variance, and asymmetry of the probability distribution of the data.
These specific statistical moments provide sensitive information about sequence order, whereas scale-invariant moments are less appropriate for this purpose. The data are thereby represented by quantified values [82].
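To make the distinction between raw and central moments concrete, the following is a minimal Python sketch that computes both for a two-dimensional matrix of numeric values. It uses the textbook definitions of two-dimensional moments; how a sequence is encoded into such a matrix, the function names, and the choice of maximum order are assumptions for illustration and are not taken from the method itself. Hahn moments are omitted because they require the Hahn polynomial recurrence.

```python
def raw_moment(matrix, i, j):
    """Raw moment of order (i, j): sum of p^i * q^j * value over all cells."""
    return sum((p ** i) * (q ** j) * value
               for p, row in enumerate(matrix)
               for q, value in enumerate(row))


def central_moment(matrix, i, j):
    """Central moment of order (i, j), taken about the centroid of the matrix.

    Assumes the matrix has a nonzero total so that the centroid is defined.
    """
    total = raw_moment(matrix, 0, 0)
    p_bar = raw_moment(matrix, 1, 0) / total
    q_bar = raw_moment(matrix, 0, 1) / total
    return sum(((p - p_bar) ** i) * ((q - q_bar) ** j) * value
               for p, row in enumerate(matrix)
               for q, value in enumerate(row))


def moment_features(matrix, max_order=2):
    """Concatenate raw and central moments for all orders with i + j <= max_order."""
    orders = [(i, j)
              for i in range(max_order + 1)
              for j in range(max_order + 1 - i)]
    raw = [raw_moment(matrix, i, j) for i, j in orders]
    central = [central_moment(matrix, i, j) for i, j in orders]
    return raw + central


if __name__ == "__main__":
    example = [[1, 2], [3, 4]]  # hypothetical numeric encoding
    print(moment_features(example))
```

Because central moments are measured from the centroid, shifting the data within the matrix leaves them unchanged, whereas raw moments change with location.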

Determination of Position Relative Incidence Matrix (PRIM)
Protein sequence order information is the basis of any mathematical model for predicting protein behavior. The relative positions of amino acids are an essential part of a protein's physical attributes, so it is vital to quantify the relative positions of the amino acids in the polypeptide chain. PRIM extracts this information and forms a 20 × 20 matrix with 400 coefficients.

Determination of Reverse Position Relative Incidence Matrix (RPRIM)
The effectiveness and accuracy of machine learning algorithms depend mainly on the exactness and thoroughness with which the relevant features of the data are extracted. Machine learning algorithms are capable of uncovering and learning blurred, obscure, and hidden features in data. Within a polypeptide chain, the PRIM matrix extracts information related to the relative positioning of amino acids. RPRIM works similarly to PRIM but in the reverse direction, which helps to discover more obscure patterns within the polypeptide chains. RPRIM is also a 20 × 20 matrix with 400 coefficients. Dimensionality reduction of RPRIM is performed by calculating statistical moments (central, raw, and Hahn moments), which converts the 400 coefficients into 24 coefficients.
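One way to picture PRIM and RPRIM is the following Python sketch. It rests on an assumed definition: entry (i, j) accumulates the positions of residue j measured relative to the first occurrence of residue i, counting only occurrences after that first position, and RPRIM applies the same computation to the reversed sequence. The alphabetical ordering of the 20 one-letter codes and the function names are also hypothetical choices made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def prim(sequence):
    """Return a 20 x 20 matrix of accumulated relative positions (assumed definition)."""
    matrix = [[0] * 20 for _ in range(20)]

    # Record where each residue type first appears in the sequence.
    first_seen = {}
    for position, residue in enumerate(sequence):
        if residue in INDEX:
            first_seen.setdefault(residue, position)

    # For each residue type i, add up the offsets of all later residues j.
    for residue_i, start in first_seen.items():
        row = matrix[INDEX[residue_i]]
        for position in range(start + 1, len(sequence)):
            residue_j = sequence[position]
            if residue_j in INDEX:
                row[INDEX[residue_j]] += position - start
    return matrix


def rprim(sequence):
    """Same computation on the reversed sequence."""
    return prim(sequence[::-1])


if __name__ == "__main__":
    m = prim("AKC")
    print(m[INDEX["A"]][INDEX["C"]])  # offset of C from the first A
```

Residue types that never occur leave their row at zero, so short peptides produce sparse matrices, which is one motivation for compressing the 400 coefficients with statistical moments.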

Frequency Matrix Determination
Frequency represents the distribution of amino acid residues in the primary structure of the sequence. A frequency matrix is used to measure these frequencies.
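As a small illustration, residue frequencies can be counted as in the Python sketch below. The alphabetical ordering of the 20 one-letter codes and the function name are assumptions for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def frequency_vector(sequence):
    """Count how often each of the 20 amino acids occurs in the sequence."""
    counts = Counter(sequence)
    return [counts[aa] for aa in AMINO_ACIDS]


if __name__ == "__main__":
    print(frequency_vector("AKKA"))
```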

Accumulative Absolute Position Incidence Vector (AAPIV) Generation
The frequency matrix captures only the accumulated frequency, or compositional information, of amino acid residues in the polypeptide chain. It does not consider the positions of the residues, and AAPIV is generated to capture this positional information.

Reverse Accumulative Absolute Position Incidence Vector (RAAPIV)
Like RPRIM, RAAPIV is used to uncover hidden and obscure features in the data. RAAPIV reverses the primary structure string and then extracts the AAPIV features.
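The relationship between AAPIV and RAAPIV can be sketched in Python as follows. The sketch assumes that each AAPIV element is the sum of the 1-based positions at which the corresponding residue occurs; this reading of "accumulative absolute position", the residue ordering, and the function names are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def aapiv(sequence):
    """Sum the 1-based positions of each amino acid (assumed definition)."""
    totals = dict.fromkeys(AMINO_ACIDS, 0)
    for position, residue in enumerate(sequence, start=1):
        if residue in totals:
            totals[residue] += position
    return [totals[aa] for aa in AMINO_ACIDS]


def raapiv(sequence):
    """Reverse the primary structure string, then extract AAPIV features."""
    return aapiv(sequence[::-1])


if __name__ == "__main__":
    print(aapiv("AKK"), raapiv("AKK"))
```

Two sequences with identical composition but different residue order give the same frequency vector yet different AAPIV and RAAPIV vectors, which is the information the frequency matrix alone cannot provide.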

Operating Algorithm
The random forest algorithm, developed by Leo Breiman [84], is used to predict lysine glycation. It performs classification using an ensemble of decision (classification) trees and has been employed in many biological problems. Ensemble learning over decision trees allows the algorithm to learn and predict both simple and complex classifications accurately. According to its inventors, random forest does not require extensive fine-tuning of parameters and provides excellent performance with default parameters [61][85][86][87]. Decision trees are the foundation of the algorithm, and combining them improves accuracy because each tree is built on a random subset of the feature vectors [88][89][90][91]. The feature vectors of proteins (comprising the statistical moments, PRIM, RPRIM, frequency matrix, AAPIV, and RAAPIV features) are propagated down the trees to train the model in a supervised manner, and each output is assigned to one of two classes (positive or negative) by analyzing the leaf occupancy, as shown in Fig. 3. Accuracy is calculated from the predictions of the random forest.
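The ensemble idea can be illustrated with a deliberately small Python sketch. It is a toy stand-in, not Breiman's full algorithm and not the implementation used in this work: each "tree" is a one-split decision stump trained on a bootstrap sample using one randomly chosen feature, and the forest predicts by majority vote. All function names, the number of trees, and the example data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Pick the threshold and orientation on one feature with best training accuracy."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for left_label in (0, 1):
            predictions = [left_label if row[feature] <= threshold else 1 - left_label
                           for row in rows]
            correct = sum(p == y for p, y in zip(predictions, labels))
            if best is None or correct > best[0]:
                best = (correct, feature, threshold, left_label)
    return best[1:]  # (feature, threshold, left_label)


def predict_stump(stump, row):
    feature, threshold, left_label = stump
    return left_label if row[feature] <= threshold else 1 - left_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each restricted to one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap with replacement
        feature = rng.randrange(len(rows[0]))              # random feature subset of size 1
        forest.append(train_stump([rows[i] for i in sample],
                                  [labels[i] for i in sample],
                                  feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.9, 0.8], [0.8, 0.9], [0.7, 0.7]]
    y = [0, 0, 0, 1, 1, 1]
    forest = train_forest(X, y)
    print(predict_forest(forest, [0.05, 0.05]), predict_forest(forest, [0.95, 0.95]))
```

An odd number of trees avoids tied votes in a two-class problem. A production model would use full decision trees and an established library rather than this sketch.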

Accuracy Measures
Evaluation is an important step in developing a solution to any problem. Testing techniques are used to evaluate the anticipated accuracy of a new model [92]. The results obtained from this experiment are presented in this section.

Metrics Formulation
Several metrics exist for comparing the performance of supervised algorithms [93,26]. The most important and common measures are accuracy (Acc) for overall accuracy, sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC) for stability, all computed from the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts [94]. These measures are defined as:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is the ratio of correct predictions to the total number of instances. It ranges from 0 to 1, and a higher Acc value represents higher performance.
$$\mathrm{Sn} = \frac{TP}{TP + FN}$$
Sensitivity is the true positive rate of a classifier and indicates how well the classifier correctly predicts lysine glycation. It also ranges from 0 to 1, and a higher sensitivity value represents higher performance.
$$\mathrm{Sp} = \frac{TN}{TN + FP}$$
Specificity is the true negative rate of a classifier and indicates how well the classifier correctly predicts non-glycated peptides. It also ranges from 0 to 1.
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
MCC ranges from −1 to +1, representing negative and positive correlation, respectively. TP is the count of glycated peptides that are correctly predicted by the classifier, TN is the count of non-glycated peptides that are correctly classified by the classifier, FP is the count of non-glycated peptides incorrectly predicted as glycated, and FN is the count of glycated peptides incorrectly predicted as non-glycated.
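The four measures can be computed directly from the confusion-matrix counts, as in the Python sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, and the counts in the example are made up rather than taken from the experiments.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Acc, Sn, Sp, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```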

Test Methods
After choosing appropriate metrics to evaluate the classifier, test methods for scoring these metrics are also needed. Three methods are most commonly used in statistics to evaluate a predictor: self-consistency, K-fold cross-validation (subsampling), and jackknife testing [95]. In self-consistency testing, the same dataset is used to train and test the model. Self-consistency testing yields an accuracy, sensitivity, specificity, and MCC of 100%, 100%, 100%, and 1.0, respectively. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are also used to evaluate the model's performance. The ROC curve plots the sensitivity (true positive rate) against the false positive rate (1 − specificity) over all possible decision thresholds. The AUC summarizes the performance of the predictor, and a value closer to 1 indicates better performance [26,54]. The ROC curve for self-consistency testing is given in Fig. 4.
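For intuition about the AUC, the Python sketch below computes it through its pairwise-ranking interpretation: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. This is equivalent to the area under the ROC curve. The function name and the example scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical classifier scores and true labels.
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

The quadratic loop is adequate for a sketch; for large datasets a sort-based computation is preferable.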
In the absence of an obvious validation set for checking the appropriateness of the proposed method, cross-validation testing is used. In cross-validation (subsampling), the dataset is divided into k distinct folds, and k is kept constant during the test process. The process is repeated k times, with each fold used once for testing, and the accuracy is calculated for each iteration. Finally, the average of all the calculated accuracies is used as the cross-validation result. We performed k-fold cross-validation with k = 10. Averaged over the 10 folds, the accuracy, sensitivity, specificity, and MCC are 99.88%, 99.84%, 99.74%, and 0.997, respectively. The results of 10-fold cross-validation are depicted in Fig. 5.
Detailed per-fold results of the 10-fold cross-validation are presented in Tab. 1.
In jackknife testing, one instance from the dataset is selected for testing and all the remaining instances are used to train the model. In other words, for a dataset of size N, N−1 instances are used for training and the remaining one is used to test the model trained on them. In the same way, every instance is tested without being part of the training dataset [96]. Jackknife testing has been widely used by many investigators to examine the quality of various predictors [97][98][99]. Jackknife testing yields 99.92% accuracy, 99.8% specificity, 100% sensitivity, and 0.99 MCC. Detailed results of the jackknife accuracy are depicted in Fig. 6.
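The splitting logic behind k-fold cross-validation and jackknife testing can be sketched in Python as follows. Jackknife testing is treated here as the special case of k-fold with k equal to the dataset size. The function names, the contiguous fold layout, and the pluggable `fit` and `predict` callables are assumptions for illustration; in practice the data would normally be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if fold < n % k else 0) for fold in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def jackknife_indices(n):
    """Leave-one-out splits: each instance is the test set exactly once."""
    return k_fold_indices(n, n)


def evaluate(fit, predict, rows, labels, splits):
    """Average the per-split accuracy of a classifier given as fit/predict callables."""
    accuracies = []
    for train, test in splits:
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, rows[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    # Trivial majority-class classifier, used only to exercise the splits.
    fit = lambda X, y: max(set(y), key=y.count)
    predict = lambda model, row: model
    rows, labels = [[0]] * 6, [1, 1, 1, 1, 1, 1]
    print(evaluate(fit, predict, rows, labels, k_fold_indices(len(rows), 3)))
    print(evaluate(fit, predict, rows, labels, jackknife_indices(len(rows))))
```

Jackknife testing trains N separate models, so its cost grows with the dataset size, whereas 10-fold cross-validation trains only 10.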

Comparison with Existing Methods
The proposed solution is compared with the existing classifiers GlyNN [24], PreGly [25], Gly-PseAAC [26], and iProtGly-SS [68]. The 10-fold cross-validation results of all predictors are compared in Tab. 2.

Web Server
A user-friendly and easily accessible web server has been developed for end-users, initially on a local host, where they can input their sequences and check whether a sequence contains lysine glycation or not. The interface of the under-construction web server is displayed in Fig. 7. Completing this web server is future work, and it will be established with some new concepts.
Figure 6: Accuracy of jackknife testing (2586 folds)

Conclusion
Glycation is a type of non-enzymatic PTM which attaches sugar molecules and residues to a peptide. It is clinically important in numerous chronic, age-related, and metabolic diseases such as diabetes, Alzheimer's, and renal failure. The bulk of the dataset is first used to train the network. The method uses the Position Relative Incidence Matrix, Reverse Position Relative Incidence Matrix, Frequency Matrix, Accumulative Absolute Position Incidence Vector, and Reverse Accumulative Absolute Position Incidence Vector for feature extraction. The experimental results show that the presented methodology provides higher throughput and accuracy than previous predictors. In this research, using Chou's 5-step rule, we developed a model named Gly-LysPred, based on random forest (RF), for distinguishing lysine glycation sites from non-glycation sites. It saves considerable time and money and is not operator dependent. Verification and validation were performed with self-consistency, 10-fold cross-validation, and jackknife testing. The overall accuracy of the model under self-consistency, jackknife, and cross-validation testing was 100%, 99.92%, and 99.88%, with MCC of 1.00, 0.99, and 0.997, respectively. Compared with existing approaches, this method is more accurate, more cost-effective, and higher in throughput for the identification of lysine glycation sites.