LAIPT: Lysine Acetylation Site Identification with Polynomial Tree

Post-translational modification plays a key role in the field of biology. Experimental identification methods are time-consuming and expensive. Therefore, computational methods to deal with such issues overcome these shortcomings and limitations. In this article, we propose a lysine acetylation site identification with polynomial tree method (LAIPT), making use of the polynomial style to demonstrate amino-acid residue relationships in peptide segments. This polynomial style was enriched by the physical and chemical properties of amino-acid residues. Then, these reconstructed features were input into the employed classification model, named the flexible neural tree. Finally, some effect evaluation measurements were employed to test the model’s performance.


Introduction
Post-translational modification (PTM) is one of the most significant processes in the field of biology. More than 650 types of post-translational modification were reported across several decades of efforts. Among these types of post-translational modification, several modifications have the ability to reverse their processes. PTM provides a fine-tuned control of protein function in various types of cells in the field of disease research and drug design [1][2][3][4]. For example, the well-known tumor suppressor p53 is subject to many post-translational modifications, which have the ability to alter its localization, stability, and other related functions, thus ultimately modulating its response to various forms of genotoxic stress [5][6][7][8][9][10]. Therefore, p53 drives both the activation and repression of a large number of promoters, which ultimately define its tumor suppressor abilities. This tumor suppressor is a critical transcription factor in the field of post-translational modification [11]. With these reversible modifications, protein structures change and their functions are enriched to some degree. As one of the most typical and classical reversible types of modification, lysine acetylation was reported about half a century ago [1,2]. Acetylation occurs on the ε-amino group of lysine residues; it was noted that three enzymes take part in this process. Whereas lysine deacetylases (KDACs) remove the acetyl groups of proteins, lysine acetyl transferases (KATs) transfer the acetyl group across proteins [3][4][5][6]. Considering the key role of lysine acetylation in several diseases and novel drug designations, a great deal of experimental approaches were proposed and introduced to identify the acetylation sites of lysine residues in protein sequences. These experimental approaches, including radioactivity chemical methods, chromatin immune precipitation (ChIP), and mass spectrometry, play their roles in various degrees [7,8]. Unfortunately, these experimental methods can hardly meet the need of identifying sites, and they are time-consuming and expensive. Considering this issue, effective identification methods,

Comparison with Other Features
In order to evaluate the performance of the polynomial form features, several state-of-the-art methods were chosen for comparison, including binary encoding, amino acid composition (AA composition), grouping AA composition, physico-chemical property, k nearest neighbor features, and secondary tendency structure. The details of these comparisons are shown in Tables 1-3.

Comparison with Other Models
In order to more objectively evaluate the performance of the proposed feature description and employed classification model, we compared it with several state-of-the-art methods, including DBD (Breakdowns of B)-Threader, iDNA-Prot, and other similar tools in the field of sequence classification and post-translational modification. Details of these comparisons are shown in Tables 4-6. In order to show the proposed model's stability and generalization, we utilized the ROC (receiver operating characteristic curve) curve to show the classification results. Meanwhile, some cross-validation methods (fourfold, sixfold, eightfold, and 10-fold) were also utilized. The detailed ROC curves for each species are shown in Figures 1-3.

Performance Using Differences Bandwidths
In this work, the bandwidths of sliding windows played a significant role in the feature size. On the one hand, the lack of a bandwidth can waste computational resources and result in ineffective feature description. On the other hand, different species may have unique bandwidths in this classification model. Therefore, we tested bandwidths ranging from 21 to 31, with an interval of 2. Detailed results for each of these bandwidths in the selected species are shown in Table 7. In order to more objectively show the results, we compared them with other machine learning methods, including SVM, NN, and RF. From the above table, we can easily determine that the most appropriate bandwidths for Homo sapiens, Mus musculus, and Escherichia coli were 25, 25, and 23, respectively Furthermore, the FNT model performed better than the three other machine learning methods in the majority of measurements among these bandwidths.

Performance of Polynomial Feature Description
In this section, we discuss the parameter selection of polynomial feature description. The three proposed feature description methods constitute five parameters, one of which is described by Equation (1), two of which are described by Equation (2), and two of which are described by Equation (3). The three proposed methods were compared, taking into account the coefficients a 1 and a 2 , and the constants b 1 , b 2 , and c. We defined a 1 and a 2 in the range [−10, 10], and three constants in the range [−100, 100], to test the performance of the employed classification method. We determined that the most appropriate parameters of the three proposed features were as follows: c = 57.6, a 1 = 4.1, a 2 = −2.7, b 1 = 27.1, and b 2 = 67.1. The three proposed features performed differently. The abovementioned classification models were also used for comparison. In order to reduce the usage of unnecessary computational resources, the most appropriate bandwidths determined previously were used. The Details of the results are shown in Table 8.

Materials and Methods
Because of the ubiquity and universality of lysine acetylation at the protein level, we can find several acetylated proteins in various databases, including NCBI (National Center for Biotechnology Information), Uniprot, and other related proteomics databases. In this study, we selected about 30,000 protein sequences, which contain more than 111,200 acetylation sites among them [49]. These proteins could be extracted from the Protein Lysine Modification Database (PLMD) version 3.0 [50]. PLMD is one of the most well-known and commonly used post-translational modification site databases, and it contains more than 20 types of lysine modification in more than 170 species at the protein level. Generally, this database can be treated as the largest available acetylation database; thus, it was employed as the benchmark dataset in this work. Unfortunately, overestimation may be one of the most significant limitations when using machine learning. In order to overcome this shortcoming, CD-HIT (Cluster Database at High Identity with Tolerance) was utilized to remove some homologous sequences [51][52][53][54]. In this work, we utilized a threshold of 40% similarity with this tool. Following this process, we obtained 59,532 proven acetylated modification sites from 20,527 protein sequences. These protein data were used to construct the training, testing, and independent datasets. During this classification process, we defined the proven acetylated sites as positive samples and the non-proven modifications as negative samples. Detailed information of the employed datasets is shown in Table 9, and details with regards to the construction of datasets are shown in Figure 4. Because of the ubiquity and universality of lysine acetylation at the protein level, we can find several acetylated proteins in various databases, including NCBI (National Center for Biotechnology Information), Uniprot, and other related proteomics databases. In this study, we selected about 30,000 protein sequences, which contain more than 111,200 acetylation sites among them [49]. These proteins could be extracted from the Protein Lysine Modification Database (PLMD) version 3.0 [50]. PLMD is one of the most well-known and commonly used post-translational modification site databases, and it contains more than 20 types of lysine modification in more than 170 species at the protein level. Generally, this database can be treated as the largest available acetylation database; thus, it was employed as the benchmark dataset in this work. Unfortunately, overestimation may be one of the most significant limitations when using machine learning. In order to overcome this shortcoming, CD-HIT (Cluster Database at High Identity with Tolerance) was utilized to remove some homologous sequences [51][52][53][54]. In this work, we utilized a threshold of 40% similarity with this tool. Following this process, we obtained 59,532 proven acetylated modification sites from 20,527 protein sequences. These protein data were used to construct the training, testing, and independent datasets. During this classification process, we defined the proven acetylated sites as positive samples and the non-proven modifications as negative samples. Detailed information of the employed datasets is shown in Table 9, and details with regards to the construction of datasets are shown in Figure 4.   In this work, we employed the general dataset as the training and testing datasets. In order to evaluate the generalization and stability, we employed three species incorporating lysine acetylation sites as the independent datasets.
After constructing the available datasets, some peptide segments were extracted from the whole protein sequences. In order to reduce the unnecessary usage of storage space and computational resources, some peptides with a central lysine residue were extracted in this work. We made use of sliding windows to extract peptide segments with a size of 2n + 1 [55], where n is the length of the upstream or downstream fragment, and 1 is the position of the central lysine residue in the segment. In this work, the length of the upstream fragment was equal to that of the downstream fragment, and n ranged from 10 to 15. Thus, the whole length of the sliding window was between 21 and 31. In the next section, we discuss the performances of the various selected lengths of sliding window.

Encoding of Protein Fragments
Several different types of features for quantifying biological sequences were presented across many years of protein research, such as amino-acid composition, position special scoring matrix, physico-chemical properties, and other related features [56][57][58] These features can demonstrate sequence information in various aspects, and they play various roles in protein sequence analysis. However, few features can demonstrate the relationships of amino-acid residues. In this paper, each peptide was treated as a sample. According to biological concepts, neighboring amino-acid residues present both coordination and individual functions. On this basis, we tried utilizing some of these functions to describe the relationships in this work.
We propose a polynomial method to describe the relationships between the central lysine residue and the neighboring amino-acid residues. Several forms of polynomial styles exist, such as the constant form, linear function form, quadratic function form, cubic function form, and so on. For example, we show the curves of these four forms in Figure 5.
In this work, we employed the general dataset as the training and testing datasets. In order to evaluate the generalization and stability, we employed three species incorporating lysine acetylation sites as the independent datasets.
After constructing the available datasets, some peptide segments were extracted from the whole protein sequences. In order to reduce the unnecessary usage of storage space and computational resources, some peptides with a central lysine residue were extracted in this work. We made use of sliding windows to extract peptide segments with a size of 2n + 1 [55], where n is the length of the upstream or downstream fragment, and 1 is the position of the central lysine residue in the segment. In this work, the length of the upstream fragment was equal to that of the downstream fragment, and n ranged from 10 to 15. Thus, the whole length of the sliding window was between 21 and 31. In the next section, we discuss the performances of the various selected lengths of sliding window.

Encoding of Protein Fragments
Several different types of features for quantifying biological sequences were presented across many years of protein research, such as amino-acid composition, position special scoring matrix, physico-chemical properties, and other related features [56][57][58] These features can demonstrate sequence information in various aspects, and they play various roles in protein sequence analysis. However, few features can demonstrate the relationships of amino-acid residues. In this paper, each peptide was treated as a sample. According to biological concepts, neighboring amino-acid residues present both coordination and individual functions. On this basis, we tried utilizing some of these functions to describe the relationships in this work.
We propose a polynomial method to describe the relationships between the central lysine residue and the neighboring amino-acid residues. Several forms of polynomial styles exist, such as the constant form, linear function form, quadratic function form, cubic function form, and so on. For example, we show the curves of these four forms in Figure 5. In Figure 5, L1, L2, L3, L4, and L5 follow Equations (4)-(8), respectively.
From Figure 5, we can easily determine that both L2 and L4 are even functions, while the other curves are odd functions. Considering that the upstream and the downstream fragments played the same role in the selected peptide segments, the even functions were selected for this work; therefore, we utilized three types of functions. The first one was the constant function, whereby all amino-acid residues in the peptide segments have the same influence, as described in Equation (1). The second function followed Equation (2), and the third function followed the Equation (3).
where the parameters a 1 , a 2 , b 1 , b 2 , and c 1 were optimized in this work. It was noted that both Equations (1) and (2) could hardly be described as linear functions. Thus, the center of the last two functions was designated as the origin point, i.e., the classified modification sites in the peptide segments. Regions to the left and right part of this origin point were designated as the upstream and the downstream segments, respectively. The influence of each neighboring amino-acid residue is defined below.
According to Equation (1), the relationship between a neighboring amino-acid residue and the central lysine is shown in Equation (9).
where influ1 contains 2n + 1 elements in each sample, and c 1 is the relationship between each amino-acid residue in the selected peptide segment. In this function, every amino-acid residue has the same influence; thus, the amino-acid composition can be regarded as a special form of this style. According to Equation (2), the relationship between the neighboring and central residues are shown in Equation (10).
where influ2 also contains 2n + 1 elements, and each value of influ2 follows the discrete values of Equation (2) and has the range [−n, n].
According to the Equation (3), the relationship between two amino-acid residues is shown in Equation (11).
in f lu 3 = [a 1 n 2 + b 1 , · · · a 1 + b 1 , b 1 , a 1 + b 1 · · · , a 1 n 2 where he influ3 also contains 2n + 1 elements, and each value of influ3 follows the discrete values of Equation (11) and has the range [−n, n]. After demonstrating the fundamental relationship of amino-acid residues within the classified peptide, the next step was to enrich the related properties of amino-acid residues. In this step, physical, chemical, evolutional, structural, and other related information was enriched using the three styles proposed above.

Physico-Chemical Properties
Physico-chemical properties are widely and successfully utilized in the identification of protein post-translational modifications, including ubiquitination, phosphorylation, and others [59,60]. These properties can help determine the fundamental characteristics of proteins in several aspects. One of the most well-known and widely utilized databases is AAIndex [61,62], which contains a great deal of physico-chemical and biochemical information for each amino-acid residue and some amino-acid compositions. The latest version of this database describes 544 properties of amino acid residues. Among these properties, following previous efforts and research [62], we selected several of them, which are listed in Table 10.
Considering the abovementioned elements, we minimized the presence of useless information; therefore, the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the measurements in this work. AA composition of ejecta of single-spanning proteins 13 QIAN880101 Weights for alpha-helix at the window position of −6

Prediction Algorithm
The computational identification of modification sites focuses on classification models in the field of machine learning. In this thesis, we employed machine learning models, including the flexible neural tree. We employed three machine learning methods for the three elements in the classification. The first element involved the bandwidth of the sliding windows in the classified peptide segments, the second involved the parameters of polynomial feature description, and the third involved the selection of different combinations. Therefore, the classification model was designed to deal with these three elements; the detailed outline of this algorithm is demonstrated in Figure 6. Considering the abovementioned elements, we minimized the presence of useless information; therefore, the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the measurements in this work.

Prediction Algorithm
The computational identification of modification sites focuses on classification models in the field of machine learning. In this thesis, we employed machine learning models, including the flexible neural tree. We employed three machine learning methods for the three elements in the classification. The first element involved the bandwidth of the sliding windows in the classified peptide segments, the second involved the parameters of polynomial feature description, and the third involved the selection of different combinations. Therefore, the classification model was designed to deal with these three elements; the detailed outline of this algorithm is demonstrated in Figure 6.  The flexible neural tree (FNT) was proposed by Chen [63,64], and it can be treated as an alternative tree neural network. Therefore, this model can be utilized to deal with the issues of classification and prediction in the field of machine learning. The typical structure of an FNT is shown in Figure 7.
The flexible neural tree (FNT) was proposed by Chen [63,64], and it can be treated as an alternative tree neural network. Therefore, this model can be utilized to deal with the issues of classification and prediction in the field of machine learning. The typical structure of an FNT is shown in Figure 7. From the above figure, we can easily determine that the model contains three types of layers-the input layer, the hidden layer, and the output layer. The network function of this model is shown in Equations (12) (13) where wj is the weight of the j-th input element, and yj is the j-th element of the input sample. Both mi and ni are parameters in this network.

Performance Measurements
Some well-known methods exist in the field of machine learning for evaluating performance measurements,. In this work, some typical measurements, including sensitivity, specificity, accuracy, F1 scores, and Matthew's correlation coefficients (MCCs) [65,66], of the identified modification sites were used. Furthermore, the AUC [67] was also employed to test the performance of imbalanced classification problems, whereby the negative sample size was much bigger than the positive sample size.
In this classification problem, samples can be defined as two types-positive samples and negative samples. Positive samples refer to peptide segments where the central lysine is acetylated, while negative samples refer to peptide segments where the central lysine is not. According to the definitions of the classified samples, there can be four outcomes. If a positive sample is classified as true, this can be deemed a true positive (TP). If a positive sample is classified as false, this can be deemed a false positive (FP). Following this concept, a negative sample classified as true is a true negative (TN), and a negative sample classified as false is a false negative (FN). According to the number of TP, TN, FP, and FN, we can easily obtain measures of sensitivity, specificity, accuracy, F1 scores, and MCC. From the above figure, we can easily determine that the model contains three types of layers-the input layer, the hidden layer, and the output layer. The network function of this model is shown in Equations (12) and (13).
where w j is the weight of the j-th input element, and y j is the j-th element of the input sample. Both m i and n i are parameters in this network.

Performance Measurements
Some well-known methods exist in the field of machine learning for evaluating performance measurements. In this work, some typical measurements, including sensitivity, specificity, accuracy, F1 scores, and Matthew's correlation coefficients (MCCs) [65,66], of the identified modification sites were used. Furthermore, the AUC [67] was also employed to test the performance of imbalanced classification problems, whereby the negative sample size was much bigger than the positive sample size.
In this classification problem, samples can be defined as two types-positive samples and negative samples. Positive samples refer to peptide segments where the central lysine is acetylated, while negative samples refer to peptide segments where the central lysine is not. According to the definitions of the classified samples, there can be four outcomes. If a positive sample is classified as true, this can be deemed a true positive (TP). If a positive sample is classified as false, this can be deemed a false positive (FP). Following this concept, a negative sample classified as true is a true negative (TN), and a negative sample classified as false is a false negative (FN). According to the number of TP, TN, FP, and FN, we can easily obtain measures of sensitivity, specificity, accuracy, F1 scores, and MCC.
Speci f icity = TN TN + FP (16) where P is the number of positive samples and N is the number of negative samples. Nevertheless, in Equations (14)- (18), there is a lack of intuitiveness, and they can hardly be described as easy to understand for the majority of researchers in the field of biology. The interpretation of MCC in particular is not at all intuitive in this form, although this measurement plays a key role in the evaluation of the classification model's stability. Therefore, we made use of the concept based on Chou, proposed at the beginning of this century. In this concept, the total number of positive samples can be defined as N + , and the total number of negative samples can be defined as the N − . Then, the number of misclassified positive samples can be treated as the N + − , and the number of misclassified negative samples can be treated as the N − + . With this definition, TP, TN, FP, and FN can be described in Equations (19)- (22).
The interpretations of each performance metric in Equations (23)- (27) are far more intuitive and easier to understand for biological researchers. For instance, when samples can be correctly classified, whereby all positive samples are classified as true and all negative samples are classified as false, we get N − + = 0 and N + − = 0, and the sensitivity and specificity are both equal to 1. Meanwhile, the accuracy is equal to 1 and MCC is also equal to 1 in such a situation. On the contrary, if all positive samples are classified as false and all negative samples are classified as true, N − + and N + − are both equal to 1, and the sensitivity and specificity are both equal to 0. Furthermore, the accuracy is equal to 0, and the MCC is equal to −1 in this situation. In a random classification issue, N − + = 0.5N − and N + − = 0.5N + . Thus, the accuracy is equal to 0.5 and MCC is equal to 0 in this situation. This definition method has several advantages [68][69][70][71]; however, utilizing these five measurements can hardly meet required performance in a scenario of imbalanced classification. Therefore, we made use of ROC and precision recall. ROC can be shown by the relationship between the true positive rate (TPR) and the false positive rate (FPR) in the classification. Meanwhile, precision recall can be demonstrated by the relationship between the precision and recall.

Conclusions
In this article, we proposed a lysine acetylation site identification with polynomial tree method (LAIPT), making use of the polynomial style to demonstrate the amino-acid residue relationships in peptide segments. The polynomial style was enriched by the physico-chemical properties of amino-acid residues. Then, these reconstructed features were input into the employed classification model, named the flexible neural tree. Finally, some effect evaluation measurements were employed to test the model's performances. We demonstrated that the three employed species modification sites constituted unique feature descriptions. In the future, we hope to determine more useful forms of feature description and to utilize effective classification models to deal with them. We hope that the algorithm described herein has the ability to deal with other types of protein post-translational modification sites in various species.

Conflicts of Interest:
The authors declare no conflicts of interest.