iCar-PseCp: identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC

Carbonylation is a posttranslational modification (PTM or PTLM), where a carbonyl group is added to lysine (K), proline (P), arginine (R), and threonine (T) residue of a protein molecule. Carbonylation plays an important role in orchestrating various biological processes but it is also associated with many diseases such as diabetes, chronic lung disease, Parkinson's disease, Alzheimer's disease, chronic renal failure, and sepsis. Therefore, from the angles of both basic research and drug development, we are facing a challenging problem: for an uncharacterized protein sequence containing many residues of K, P, R, or T, which ones can be carbonylated, and which ones cannot? To address this problem, we have developed a predictor called iCar-PseCp by incorporating the sequence-coupled information into the general pseudo amino acid composition, and balancing out skewed training dataset by Monte Carlo sampling to expand positive subset. Rigorous target cross-validations on a same set of carbonylation-known proteins indicated that the new predictor remarkably outperformed its existing counterparts. For the convenience of most experimental scientists, a user-friendly web-server for iCar-PseCp has been established at http://www.jci-bioinfo.cn/iCar-PseCp, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. It has not escaped our notice that the formulation and approach presented here can also be used to analyze many other problems in computational proteomics.

In vivo, PTM is one of the most efficient biological mechanisms for regulating physiology as well as for expanding the genetic code. But when body's well-designed proteolysis or other repair systems are overwhelmed by excess reactive oxygen species (ROS) [18], the oxidative stress may occur [18], weakening the damage-repairing ability. This may also bring about varieties of PTMs on proteins, including nitration, carbonylation, sulfhydration and glutathionylation [19]. Among these PTMs, the protein carbonylation has been used as a biomarker for severe oxidative protein damage due to its relative early formation, stability, and irreversibility [20,21]. Actually, protein carbonylation is an early stage of diseases induced by external oxidative stress, aging and obesity [22,23]. It may cause numerous major human diseases, including Alzheimer's disease, diabetes, Parkinson's disease, chronic renal failure, chronic lung disease, sepsis and so forth [24,25]. Therefore, the information of carbonylation sites in proteins is indispensable not only for in-depth understanding many important biological processes but also for precisely aiming targets in developing effective drugs against the aforementioned diseases.
Mass spectrometry is one of the most common techniques to analyze the carbonyl level of a protein and determine its carbonylation sites [26,27]. So far four types of amino acid residues have been found more prone to carbonylation; they are lysine (K), proline (P), arginine (R), and threonine (T) [24,[28][29][30]. But it would take much longer time and need more labors to utilize the conventional experimental techniques alone to determine the carbonylation sites in proteins [27,31]. Facing the rapid growth of biological sequences, we are challenged to develop automated methods as a complimentary approach to experimental methods.
Actually, some investigators have made efforts to do so. Maisonneuve et al. [29], based on their spectrometry analysis, proposed some empirical rules to identify the hot spots of carbonylation. Recently, Lv et al. [32] and Xu et al. [33] developed two different bioinformatical tools to predict the protein carbonylation sites. These methods did have contribution in stimulating the development of this area. Since the topic's importance as well as the urgency of demanding more powerful high throughput tools in this area, further efforts aiming at prediction of protein carbonylation sites are definitely needed.
Here, we are to develop a new and more powerful predictor by (1) using the Monte Carlo sampling approach to optimize the training dataset, (2) incorporating the vectorized sequence-coupling model into the general PseAAC, and (3) installing the random forest (RF) algorithm to operate the prediction system.
As shown in many recent relevant papers [11,12,14,17,[34][35][36][37][38][39][40], to establish a biological sequence-based statistical predictor that not only can be easily used by most experimental scientists to get their desired results but also can inspirely stimulate theoretical scientists to create various other prediction methods, we should observe the Chou's 5-step rules or guidelines [41]: (1) benchmark dataset preparation; (2) mathematical representation of biological sequence samples; (3) calculation algorithm; (4) cross-validation; (5) web-server establishment. Below, let us to address the five guidelines oneby-one. To match the rubric style of the Oncotarget journal, however, the order in addressing them may be changed.

RESULTS AND DISCUSSION
A novel web-server predictor and its user guide A new and more powerful predictor, called iCar-PseCp, has been established for predicting the protein carbonylation sites. Moreover, to maximize users' convenience, the point-to-point instructions are given below.
(1) Click the web-server at http://www.jci-bioinfo. cn/iCar-PseCp, the top page of the iCar-PseCp will be prompted on your computer screen ( Figure 1).
(2) In the input box (Figure 1), enter your query protein sequences, which can be done by either typing or copying/pasting manner. The entered query protein sequences should be in the FASTA format. Not familiar with FASTA? Just click the button of Example.
(3) You can see the prediction results by clicking the Submit button. If you use the Sequence_K in the Example window as the input and check on the K button, after 15 seconds or so since your submitting, you will see the following on your screen: Sequence_K contains 9 K residues, of which 5 are predicted to be of carbonylation site and they are at the sequence positions 2, 14, 41, 68 and 95. If you use the Sequence_P as the input and check on the P button, you will see: Sequence_P contains 10 P residues, of which 5 are of carbonylation site and at positions 95, 122, 142, 145, and 149. If you use the Sequence_R as the input and check on the R button, you will see: Sequence_R contains 8 R residues, of which 3 are of carbonylation site and at the positions 14, 41, and 75. If you use the Sequence T as the input and check on the T button, you will see: Sequence_T contains 7 T residues, of which 1 is of carbonylation site and at the positions 14. Compared with experimental observations, the above (9 + 10 + 8 + 7) = 34 predicted results contain no false positive result ( N -+ = 0 ) but 5 false positive results ( ) N + − = 5 , which are the 2nd and 13th K residues in sequence_K, the 142th and 145th P residues in sequence_P, and the 75th R residue in sequence_R. In other words, the total number of carbonylation sites involved in the above predictions is N + = 3 + 3 + 1 + 1 = 8, while the total number of noncarbonylation sites investigated is N − = 6 + 7 + 6 + 6 = 25. Substituting these data into Eq.9, we have Sn = 100%, Sp = 80.00% and Acc = 84.80%, and MCC = 0.7018, quite consistent with the rates reported in Table 1 via the rigorous cross validation on the 250 benchmark proteins.
(4) If you have a lot of query protein sequences and need a lot of computational time, you can choose to use the batch prediction. To do so, just use the Browse button to select the desired file (in FASTA format of course) and follow the online instruction.
(5) The benchmark dataset used in this study is available by clicking the button of Supporting Information on the top of Figure 1.
(6) To see the key papers used to develop this server, just click on the button of Citation. www.impactjournals.com/oncotarget

Result comparison and analysis
The success scores achieved by the iCar-PseCp predictor via the 10-fold target cross validation for K-, P-, R-, and T-type carbonylation are shown in Table 1. Meanwhile, the corresponding rates by PTMPred [33] and CarSpred [32] are also listed there. As we can see from Table 1, compared with its counterparts, although the Acc values obtained by the iCar-PseCp are within the ± 4%, its Sn and Sp values are more than 20% and 5-9% higher than those by PTMPred and CarSpred, indicating that the results predicted by the previous methods [33][34] contain much more false negative and positive events. Particularly, the MCC values achieved by iCar-PseCp are about 2 or 3 times higher than those of its counterparts, indicating that the new proposed predictor is significantly more stable.
Graphical approach is a useful vehicle for analyzing complicated biological systems as demonstrated by a series of previous studies (see, e.g., [42][43][44][45][46][47][48][49][50][51]. Here, to provide an intuitive comparison, the graph of Receiver Operating Characteristic (ROC) [52,53] was utilized to show the advantage of iCar-PseCp over the PTMPred [33] and CarSpred [32]. In Figure 2 the red and green graphic lines are the ROC curves for the PTMPred and CarSpred, respectively; while the blue graphic line for the proposed predictor iCar-PseCp. The greater the area under the AUC is, the better the predictor will be [52][53]. As we can see from Figure 2, the area under the blue curve is remarkably greater than that under the red or green line, once again indicating that the proposed predictor is indeed much better than PTMPred and CarSpred predictors. Therefore, iCAR-PseCp will become a very useful bioinformatics tool for relevant basic research and drug development as well.
Why can the proposed method enhance the prediction quality so significantly? First, the coupling effects among the amino acids around the carbonylation sites are taken into account via the conditional probability approach, which has been proved to be indeed very useful in a series of previous studies [57][58][59][60]. Second, the predictor is trained by a balanced benchmark dataset via Monte Carlo sampling, and hence many false prediction events as occurring in the cases of PTMPred [33] and CarSpred [32] trained by very imbalanced and skewed datasets can be completely avoided.  The area under the curve of Figure.2; the greater the AUC value is, the better the corresponding predictor will be [52,53].
For facilitating description later, the Chou's peptide formulation was adopted. It was used for studying enzyme specificity [57], signal peptide cleavage sites [70], hydroxyproline and hydroxylysine sites [8], methylation sites [7], nitrotyrosine sites [9], protein-protein interaction [71], and protein-protein binding sites [72]. According to Chou's scheme, a potential carbonylation site-containing peptide sample can be generally expressed by where the symbol  denotes the single amino acid code K, P, R, or T, the subscript ξ is an integer, R −ξ represents the ξ-th upstream amino acid residue from the center, the R +ξ the ξ-th downstream amino acid residue, and so forth. The (2ξ + 1) -tuple peptide sample P ξ (  ) can be further classified into the following two categories: if its center is a carbonylation site other wis se where P ξ + ( )  denotes a true carbonylation segment with K, P, R, or T at its center, P ξ − ( )  a false segment with K, P, R, or T at its center, and the symbol ∈ means "a member of" in the set theory.
In literature the benchmark dataset usually consists of a training dataset and a testing dataset: the former is used for training a model, while the latter for testing the model. But as pointed out in a comprehensive review [73], there is no need to artificially separate a benchmark dataset into the two parts if the prediction model is examined by the jackknife test or subsampling (K-fold) cross-validation since the outcome thus obtained is actually from a combination of many different independent dataset tests. Thus, the benchmark dataset  ξ �  ( ) for the current study can be formulated as The detailed procedures to construct the benchmark dataset are as follows. (1) As done in [74], slide the (2ξ + 1) -tuple peptide window along each of the aforementioned 230 + 20 = 250 protein sequences used by [32], and collected were only those peptide segments that have K, P, R, and T at the center. (2) If the upstream or downstream in a protein sequence was less than ξ or greater than L−ξ where L is the length of the protein sequence concerned, the lacking amino acid was filled with a dummy residue X. (3) The peptide segment samples thus obtained were put into the positive subset  ξ + ( )  if their centers have been experimentally annotated as the carbonylation sites; otherwise, into the negative subset  ξ − ( )  . (4) Using the CD-HIT software [75], the aforementioned samples were further subject to a screening procedure to winnow those that had ≥ 30% pairwise sequence identity to any other in a same subset.
Note that the length of peptide samples and their number thus generated would depend on the ξ value. But preliminary tests had indicated that it would be most promising when ξ = 7 or the sample's length was 2ξ + 1 = 15. Accordingly, hereafter we only consider the case of ξ = 7; i.e., the samples with 15 amino acid residues. Thus, the benchmark datasets thus obtained for  ξ= ( ) 7 K ,  ξ= ( ) 7 P ,  ξ= ( ) 7 R , and  ξ= ( ) 7 S are given in Supporting Information S1, S2, S3, and S4, respectively. Listed in Table 2 is a summary of their sizes.

Incorporate sequence-coupled information into general pseudo amino acid composition
With the avalanche of biological sequence generated in the post-genomic age, one of the most important problems in computational biology is how to formulate a biological sequence with a discrete model or a vector, yet still considerably keep its sequence order information or essential feature. This is because all the existing machinelearning algorithms can only handle vector but not sequence samples, as elaborated in [15].
Because it has been widely and increasingly used, recently three powerful open access soft-wares, called 'PseAAC-Builder' [78], 'propy' [79], and 'PseAAC-General' [92], were established: the former two are for generating various modes of Chou's special PseAAC; while the 3rd one for those of Chou's general PseAAC [41], including not only all the special modes of feature vectors for proteins but also the higher level feature vectors such as "Functional Domain" mode (see Eqs.9-10 of [41]), "Gene Ontology" mode (see Eqs.11-12 of [41]), and "Sequential Evolution" or "PSSM" mode (see Eqs.13-14 of [41]). Inspired by the successes of using PseAAC to deal with protein/peptide sequences, three web-servers [94][95][96] were developed for generating various www.impactjournals.com/oncotarget feature vectors for DNA/RNA sequences. Particularly, recently a powerful web-server called Pse-in-One [97] has been developed that can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the need of users' studies.
According to the general PseAAC [41], the peptide sequence of Eq.1 can be formulated as ( ) In Eq. ( ) R are of non-conditional probability since the right neighbor of R −1 and the left neighbor of R +1 are always  (namely Lys, Pro, Arg, or Thr, respectively).
All these probability values can be easily derived from the positive training subsets taken from Supporting Information S1, S2, S3, and S4, respectively as done in [98]. Likewise, the components in Eq.6 are the same as those in Eq.5 except for that they are derived from the corresponding negative training subsets therein.

Expanding positive samples by Monte Carlo approach
As we can see from the Supporting Information S1, S2, S3, and S4, the negative subset  ξ − ( )  in each of them is much larger than its corresponding positive one  ξ + ( )  in number of samples. Although this might reflect the real world in which the non-carbonylation sites are always the majority compared with the carbonylation ones, a predictor trained by such a highly skewed benchmark dataset would inevitably have the bias consequence that many carbonylation sites might be mispredicted as non-carbonylation ones. Therefore, it is important to find an effective approach to minimize this kind of bias consequence. To realize this, we adopted the Monte Carlo simulation [99,100] to expand the samples of positive subset. The concrete procedures are as follows.
Step 2. For simplicity, let us formulate the probability thus obtained according to the alphabetical order of the single-letter code of the 20 native amino acids (note that the dummy amino acid X introduced in the Benchmark Dataset section was treated as the 21st amino acid); i.e., Step 3. Generate a random number  between 0 and 1; if then the k-th amino acid is drawn for an expanded positive sample at its i-th subsite. For example, if k = 2 and i = −7, then the amino acid thus drawn should be C for the left 1st sequence position (cf. Eq.1); if k = 19 and i = −6, then the amino acid drawn should be W for the left 2nd sequence position; if k = 20 and i = +7, then the amino acid drawn should be Y for the right last sequence position; and so forth.
Step 4. Repeat the above steps until the number of positive (the original plus the expanded) samples is the same as the negative samples.
At first glance, the rationale of the above Monte Carlo sampling procedure seems like a circular argument. But it is correct as elucidated in [54]. Particularly, these expanded positive samples were used only for training a model but not used for testing it, as well be further discussed later.

Random forests algorithm
The random forests (RF) algorithm is a powerful algorithm and has been used in many areas of computational biology (see, e.g. [11, 12, 71, 72, 101−104]). The detailed procedures of RF and its formulation have been very clearly described in [105], and hence there is no need to repeat here.
For the current study, all the involved peptide samples were converted into a 14-D (dimensional) vector according to Eq.4, and then entered into the RF operation engine as the input. And the output would indicate whether the center residue  of the query peptide is a "carbonylation site" or "non-carbonylation site". Note that, in using the current prediction method, one must observe the self-consistency principle: if the center residue of a query peptide is  = K then the corresponding training data must be taken from  ξ= ( ) 7 K if the center residue of a query peptide is  = P, then the training data must be taken from and  ξ= ( ) 7 P ; and so forth (see Eq.3).
As pointed out in the Introduction section, one of the keys in establishing a useful predictor is how to properly evaluate its anticipated success rates. To realize this, we need to consider the following two things: one is what metrics or scales should be adopted to quantitatively measure its prediction quality; the other is what validation method should be utilized to calculate or derive the metrics values. Below, we are to address the two problems.

A set of four metrics
The following four metrics are usually used in literature to measure the quality of binary classification: (1) overall accuracy or Acc; (2) Mathew's correlation coefficient or MCC; (3) sensitivity or Sn; and (4) specificity or Sp (see, e.g., [106]). Unfortunately, the conventional formulations for the four are not intuitive and that most experimental scientists feel difficult to understand them, particularly for the one of MCC. Interestingly, by using the Chou's symbols and derivation in studying signal peptides [107], the aforementioned four metrics can be easily converted into a set of following equations [5,35]: where N + represents the total number of carbonylation sites investigated whereas N − + the number of true carbonylation sites incorrectly predicted to be of non-carbonylation site; N − the total number of the non-carbonylation sites investigated whereas N + − the number of non-carbonylation sites incorrectly predicted to be of carbonylation site.
According to Eq.9, it is crystal clear to see the following. When N − + = 0 meaning none of the true carbonylation sites are incorrectly predicted to be of noncarbonylation site, we have the sensitivity Sn = 1. When N − + = N + meaning that all the carbonylation sites are incorrectly predicted to be of non-carbonylation site, we have the sensitivity Sn = 0. Likewise, when N + − = 0 meaning none of the non-carbonylation sites are incorrectly predicted to be of carbonylation site, we have the specificity Sp = 1; whereas N + − = N − meaning that all the non-carbonylation sites are incorrectly predicted to be of carbonylation sites, we have the specificity Sp = 0. When N − + = N + − = 0 meaning that none of carbonylation sites in the positive dataset and none of the non-carbonylation sites in the negative dataset are incorrectly predicted, we have the overall accuracy Acc = 1 and MCC = 1; when N − + = N + and N + − = N − meaning that all the carbonylation sites in the positive dataset and all the non-carbonylation sites in the negative dataset are incorrectly predicted, we have the overall accuracy Acc = 0 and MCC = −1; whereas when N − + = N + /2 and N + − = N − /2 we have Acc = 0.5 and MCC = 0 meaning no better than random guess. Therefore, using Eq.9 has made the meanings of sensitivity, specificity, overall accuracy, and Mathew's correlation coefficient much more intuitive and easierto-understand, particularly for the meaning of MCC, as concurred recently by many investigators (see, e.g., [14,16,38,39,71,72,[108][109][110][111][112][113]).
Note that, however, the set of equations defined in Eq.9 is valid only for the single-label systems. For the multi-label systems whose emergence has become more frequent in system biology [114][115][116] and system medicine [117], a completely different set of metrics are needed as elaborated in [118].

Target cross-validation
With a good set of metrics to measure the predictor's quality, the next thing to consider is what kind of validation method should be adopted to calculate the metrics values.
The following three cross-validation methods are often used in statistics to derive the metrics values for a predictor: independent dataset test, subsampling (or K-fold cross-validation) test, and jackknife test [119]. Among these three, however, the jackknife test is deemed the least arbitrary that can always yield a unique outcome for a given benchmark dataset as elucidated in [41] and demonstrated by Eqs.28-32 therein. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [84][85][86][87][120][121][122][123][124][125][126][127]). However, to reduce the computational time, in this study we adopted the K-fold cross-validation, as done by most investigators with SVM and random forests algorithms as the prediction engine.
When conducting the K-fold cross-validation for the current predictor iCAR-PseCp, however, some special consideration is needed. This is because a dataset, after expanding by Monte Carlo sampling, may contain many hypothetical positive samples. It would be fine to use such an expanded dataset to train a prediction model, but certainly not for validation. This is because the validation should be made on a testing dataset that only contains experiment-confirmed samples without any added hypothetical samples [14,104]. To ensure this, a special cross-validation, the so-called target cross-validation [113], has been introduced here. During the target cross-validation process, only the experiment-confirmed samples are picked out from the testing dataset for validating and scoring [11]. The detailed procedures of the target K-fold cross-validation (without losing the generality, let us consider K = 10) can be described as follows.
Step 1. Before expanding the positive samples, both the original positive and negative subsets were randomly divided into 10 parts with about the same size. For example, for  ξ= ( ) 7 K in Supporting Information S1, after such evenly division we have and    where the symbol  means that the divided 10 datasets are about the same in size, and so are their subsets.
Step 2. One of the 10 sets, say  ξ= ( ) ( ) 7 1 K , was singled out as the testing dataset and the remaining nine sets as the training dataset.
Step 3. Based on the training dataset, use Eqs.4-6 to derive the sequence-coupled information. Also, based on the same training dataset, use Monte Carlo sampling to expand its positive subset making it have the same size as the negative subset.
Step 4. Use the sequence-coupled information and the expanded training dataset obtained in Step 3 to train the model and perform the prediction for each of the samples in the testing dataset.
Step 5. Repeat Steps 2-4 until all the 10 divided sets had been singled out one-by-one for testing validation.
Step 6. Substituting the average scores obtained from the above 10-round tests into Eq.9 to calculate Sn, Sp, Acc, and MCC.
It is crystal clear to see from the above steps that the validation was made only for experiment-confirmed samples, and that none of information from the testing datasets was ever used to train the predictor.

CONCLUSIONS
The iCar-PseCp predictor is a new bioinformatics tool for identifying the carbonylation sites in proteins. Compared with the existing predictors in this area, its prediction quality is much better, with remarkably more stability and less false predictions. For the convenience of most experimental scientists, we have provided its webserver and a step-by-step guide, by which users can easily obtain their desired results without the need to go through the detailed mathematics. The reason of including them in this paper is for the integrity of the new prediction method, and that these techniques, such as sequence-coupled approach and Monte Carlo sampling, may be of use as well in developing other tools in computational biology.
We anticipate that iCar-PseCp will become a very useful high throughput tool, or at the very least, a complementary tool to the existing methods for predicting the protein carbonylation sites.