iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins

As one of the most important and universal posttranslational modifications (PTMs) of proteins, S-nitrosylation (SNO) plays crucial roles in a variety of biological processes, including the regulation of cellular dynamics and many signaling events. Knowledge of SNO sites in proteins is very useful for drug development and basic research as well. Unfortunately, it is both time-consuming and costly to determine the SNO sites purely based on biological experiments. Facing the explosive protein sequence data generated in the post-genomic era, we are challenged to develop automated vehicles for timely and effectively determining the SNO sites for uncharacterized proteins. To address the challenge, a new predictor called iSNO-AAPair was developed by taking into account the coupling effects for all the pairs formed by the nearest residues and the pairs by the next nearest residues along protein chains. The cross-validation results on a state-of-the-art benchmark have shown that the new predictor outperformed the existing predictors. The same was true when tested by the independent proteins whose experimental SNO sites were known. A user-friendly web-server for iSNO-AAPair was established at http://app.aporc.org/iSNO-AAPair/, by which users can easily obtain their desired results without the need to follow the mathematical equations involved during its development.

Actually, many efforts have been made to identify the SNO sites with experimental approaches, such as BST (biotin switch assay) (Jaffrey et al., 2001), SNOSID (Derakhshan, Wille & Gross, 2007;Greco et al., 2006), and SNO-RAC (Forrester et al., 2009). Although considerable knowledge about the SNO sites could be obtained by these methods, it is both time-consuming and laborious by means of the experimental approaches alone. Facing the explosion of protein sequences generated in the post genomic era, we are challenged to develop computational method for fast and reliably identifying the SNO sites in proteins.
Recently, several computational methods have been proposed in this regard (Li et al., 2011;Li et al., 2012;Xue et al., 2010;Xu et al., 2013). Each of these methods has merit and did play a role in stimulating the development of this area. However, they also each have their own limits. For example, by incorporating the position specific amino acid propensity into the general form of pseudo amino acid composition (Chou, 2001a) or Chou's PseAAC (Lin & Lapointe, 2013), the authors in a recent article (Xu et al., 2013) presented a predictor called iSNO-PseAAC, which can yield higher success rates than the other existing methods for predicting SNO sites. However, in the iSNO-PseAAC predictor, only the position propensity of each of the constituent amino acids was considered without taking into account any of their correlation. In other words, all the amino acids in the proteins were treated independently. However, in the real world, they are not independent of each other but bear some sort of correlation. And incorporating the correlation effects could really improve the prediction quality accordingly, such as in identifying the peptide cleavage sites by signal peptidase (Chou, 2001d), investigating the specificity of GalNAc-transferase (Chou, 1995), predicting the protein cleavage sites by HIV-protease (Chou, 1993), as well as using the information thus obtained to develop peptide-drugs against HIV/AIDS and SARS (Du, Sun & Chou, 2007;Du et al., 2005;Gan et al., 2006;Shen & Chou, 2008) based on Chou's distorted key theory (Chou, 1996). Motivated and encouraged by these studies, here we are to develop a new method for identifying the protein SNO sites by incorporating some sequence correlation effects.
As shown by a series of recent publications (Chen et al., 2013;Chen et al., 2012b;Xiao et al., 2013) and summarized in a comprehensive review (Chou, 2011), to establish a really useful statistical predictor for a sequence-based system, one needs to engage the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to engage these procedures one by one.

Benchmark dataset
In this study the benchmark dataset was derived from the S-nitrosylated database (version 1.0) (Chen et al., 2010) at http://dbsno.mbc.nctu.edu.tw/, from which 1,530 proteins in human and mouse species and their SNO sites were downloaded. The corresponding peptide fragments for these SNO sites were derived from UniProt database (release 2012 08). To facilitate description later, let us adopt Chou's formulation for peptides here that was used for studying signal peptide cleavage sites (Chou, 2001c;Chou, 2001d). According to the formulation, a peptide with cysteine located at its center ( Fig. 1) can be written as where the subscript ξ is an integer, R −ξ represents the ξ -th downstream amino acid residue from cysteine (C), R ξ the ξ -th upstream amino acid residue, and so forth (Fig. 2). Peptides with the profile of Eq. (1) can be further classified into the following two categories: (1) SNO peptide if its center is a SNO site; (2) non-SNO peptide if its center is a non-SNO site, as can be formulated by where ∈ represents "a member of " in the set theory. After some preliminary trials and also considering the practice of previous investigators (Li et al., 2011;Li et al., 2012;Xue et al., 2010;Xu et al., 2013), we choose ξ = 10 to construct the benchmark dataset for P of Eq. (1). If the upstream or downstream in a protein was less than 10, the lacking residues were filled with the dummy code Z. The peptides thus obtained are subject to a screening procedure to winnow those that have ≥40% sequence identity to any other. Finally, we obtained 2,381 SNO peptides and 11,755 non-SNO peptides. Now let us construct the training or learning dataset S L as defined by where ∪ represents the "union" in the set theory, S + L contains 2,300 samples randomly picked from the aforementioned 2,381 SNO peptides, while S − L 2,300 samples randomly picked from the 11,755 non-SNO peptides. For readers' convenience, the 2,300 peptide sequences in the positive learning dataset S + L and 2,300 peptide sequences in the negative learning dataset S − L , along with their sequence positions (sites) in the parent proteins coded in "UniProt IDs", are given in Supplemental Information S1.
Moreover, for the purpose of demonstration later, let us also construct an independent dataset S T given by where S + T contains the remaining 81 samples in the aforementioned 2,381 SNO peptides, while S − T contains 100 samples randomly picked from the 11,755 non-SNO peptides but none of them occurs in S − L . Likewise, the 81 peptide sequences in the positive testing dataset S + T and 100 peptide sequences in the negative testing dataset S − T are given in Supplemental Information S2.
According to a recent review (Chou, 2011), the general form of Chou's PseAAC for a protein or peptide P can be formulated by where T is the transpose operator, while is an integer to reflect the vector's dimension. The value of as well as the components ψ u (u = 1,2,..., ) in Eq. (5) will depend on how to extract the desired information from the protein or peptide sequence. Below, let us describe how to extract the useful information from the learning dataset S L to define the peptide samples via Eq. (5) for the current study.
Since the length of each peptide in the training dataset S L is 21 (cf. Supplemental Information S1), Eq. (1) for P can be simplified to a more convenient form given by P = R 1 R 2 ...R 9 R 10 R 11 R 12 ...R 20 R 21 (6) where R 11 = C and R i (i = 1,2,...,21;i = 11) can be any of the 20 native amino acids or the dummy code Z as defined above. Hereafter, let us use the numerical codes 1,2,3,...,20 to represent the 20 native amino acids according to the alphabetic order of their single letter codes, and use 21 to represent the dummy amino acid Z. Accordingly, the number of possible different dipeptides will be 21 × 21 = 441, and the number of dipeptide subsite positions on the sequence of Eq. (6) will be (21 − 2 + 1) = 20. Now, let us introduce the following 441 × 20 matrix Z 0 , the so-called PSDP (positionspecific dipeptide propensity) matrix to define the component of Eq. (5) where the element and In Eq. (8), is the occurrence frequency of the i-th dipeptide (i = 1,2,...,441) at the j-th subsite on the sequence of Eq. (6) (or the j-th column in the positive learning dataset S + L ) that can be easily derived using the method described in (Chou, 2001d) from the sequences in Supplemental Information S1; while F − 0 (D 0 i |j) is the corresponding occurrence frequency but derived from the negative learning dataset S − L . In order to extract more information, let us expand the propensity matrix from the dipeptide (or the residue pair formed by the nearest residues) to the pair formed by the next nearest amino acid residues (Fig. 3). Since the number of possible such amino acid pairs is still 21 × 21 = 441, but the number of their subsite positions on the sequence of Eq. (6) is reduced to (21 − 3 + 1) = 19, the corresponding position-specific propensity matrix should be given by where the element where D 1 i has the same meaning as D 0 i in Eq. (9) but instead of dipeptide it represents the pairs of amino acids separated by one residue between them along a protein sequence. Likewise, F + 1 (D 1 i |j) and F − 1 (D 1 i |j) also have the similar meaning as F + 0 (D i |j) and F − 0 (D i |j) in Eq. (8), and can be easily derived from the sequences in Supplemental Information S1 as well.
Now, let us define a new matrix Z by merging Z 0 and Z 1 ; i.e., where the symbol ⊕ represents the orthogonal sum (Chou & Shen, 2007). Thus, the peptide P of Eq. (6) can be uniquely defined via the general form of PseAAC (cf. Eq. (5)) with its dimension = 20 + 19 = 39 and its u-th component given by where R u is any residue in the u-th position of the peptide P (cf. Eq. (6)).

PREDICTION ALGORITHM
Suppose P + and P − are the standard vectors or norms for the peptide sequences in S + L and S − L , respectively. And they are defined by where where N + is the total number of SNO peptides in the learning dataset, and ψ + u,k the u-th component for the k-th SNO peptide in the PseAAC space (cf. Eqs. (5) and (13)); whereas N − and ψ − u,k have the same meanings but are for the non-SNO peptides. For a query peptide P as formulated by Eq. (5), suppose D(P,P + ) is its similarity to the norm of SNO peptides, and D(P,P − ) its similarity to the norm of non-SNO peptides, as formulated by Thus, the prediction rule for the query peptide P can be formulated as P ∈ SNO peptide, if D(P,P + ) > D(P,P − ) non-SNO peptide, otherwise.
If there was a tie between D(P,P + ) and D(P,P − ), the query peptide would be randomly assigned between the SNO peptide and non-SNO peptide categories. However, this kind of tie case rarely happened and actually never happened in our study. The predictor established via the above procedures is called iSNO-AAPair, where "i" stands for the 1st character of "identify", while "AAPair" means that the amino acid coupling effects were taken into account within the pairs formed by the nearest residues as well as the pairs formed by the next nearest residues along the peptide sequence.

Figure 4 A flowchart showing the prediction process of iSNO-AAPair.
A flowchart of the predictor is given in Fig. 4 to illustrate how iSNO-AAPair was working during the process of prediction.

RESULTS AND DISCUSSION
How to objectively evaluate the performance of a predictor and how to make it easy to access by public (Chou & Shen, 2009) are two important factors that are directly associated with its application value. Below, let us address these problems.

Four different metrics for measuring the prediction quality
In literature the following metrics are often used for examining the performance quality of a predictor where TP represents the number of the true positive; TN, the number of the true negative; FP, the number of the false positive; FN, the number of the false negative; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Mathew's correlation coefficient. To most biologists, however, the four metrics as formulated in Eq. (18) are not quite intuitive and easier-to-understand, particularly for the Mathew's correlation coefficient. Here let us adopt the formulation proposed recently (Chen et al., 2013;Xu et al., 2013) in terms of the Chou's symbol (Chou, 2001d); i.e., where N + is the total number of the SNO peptides investigated while N + − the number of the SNO peptides incorrectly predicted as the non-SNO peptides; N − the total number of the non-SNO peptides investigated while N − + the number of the non-SNO peptides incorrectly predicted as the SNO peptides (Chou, 2001b).
It can be clearly seen from Eq. (19) that when N + − = 0 meaning none of the SNO peptides were incorrectly predicted to be a non-SNO peptide, we have sensitivity Sn = 1. When N + − = N + meaning that all the SNO peptides were incorrectly predicted to be the non-SNO peptides, we have sensitivity Sn = 0. Likewise, when N − + = 0 meaning none of the non-SNO peptides was incorrectly predicted to be the SNO peptide, we have specificity Sp = 1; whereas N − + = N − meaning all the non-SNO peptides were incorrectly predicted as the SNO peptides, we have specificity Sp = 0. When N + − = N − + = 0 meaning that none of SNO peptides in the positive dataset and none of the non-SNO peptides in the negative dataset was incorrectly predicted, we have overall accuracy Acc = 1 and MCC = 1; when N + − = N + and N − + = N − meaning that all the SNO peptides in the positive dataset and all the non-SNO peptides in the negative dataset were incorrectly predicted, we have overall accuracy Acc = 0 and MCC = −1; whereas when N + − = N + /2 and N − + = N − /2 we have Acc = 0.5 and MCC = 0 meaning no better than random prediction. As we can see from the above discussion based on Eq. (19), the meanings of sensitivity, specificity, overall accuracy, and Mathew's correlation coefficient have become much more intuitive and easier-to-understand.
It is instructive to point out that the set of metrics as given in Eq. (18) or Eq. (19) is valid only for the single-label systems as in the current case. For the multi-label systems whose emergence has become increasingly frequent in system biology (Chou, Wu & Xiao, 2011;Chou, Wu & Xiao, 2012) and system medicine (Chen et al., 2012a;Xiao et al., 2013), a different set of metrics as defined in Chou (2013) is needed.

Cross-validation to evaluate the anticipated success rates
In statistical prediction, the following three cross-validation methods are often used to evaluate the anticipated accuracy of a predictor: independent dataset test, subsampling (K-fold cross-validation) test, and jackknife test (Chou & Zhang, 1995). However, as elucidated by a review article (Chou, 2011), among the three cross-validation methods, the jackknife test is deemed the least arbitrary and most objective because it can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictor (see, Chen & Li, 2013;Khosravian et al., 2013;Mei, 2012;Mohabatkar et al., 2013;Mohabatkar, Mohammad Beigi & Esmaeili, 2011;Wan, Mak & Kung, 2013;Zhang et al., 2008b). However, to reduce computational time, here let us adopt the 10-fold cross-validation to examine the prediction accuracy as done by many investigators for PTM sites prediction with SVM (Chang et al., 2009;Kim et al., 2004;Wong et al., 2007;Xu et al., 2013). The cross-validations were performed 50 times for different subsampling combinations, followed by averaging their outcomes. The outcomes thus obtained on the benchmark dataset S L (cf. Supplemental Information S1) for the four metrics as defined in Eq. (19)  indicating that the accuracy is quite high for all the four metrics.

Independent dataset test
As a demonstration to show how the current predictor is used for practical application, let us use the iSNO-AAPair predictor trained by the data in S L (Eq. (3)) to predict the peptides in S T (cf. Eq. (4)). As mentioned in the Materials and Methods section, the independent dataset S T contain 81 SNO and 100 non-SNO peptides (cf. Supplemental Information S2). To avoid the memory bias, none of the peptide in S T occurs in S L ; i.e., S L ∩ S T = ∅, where the symbols ∩ and ∅ represent "intersection" and "empty set" in the set theory, respectively. The results thus obtained are given below   et al. (2012) and that in Li et al. (2011) were not listed because the former had no web-server and latter's web-server did not work. b The method proposed in Xue et al. (2010) where the threshold parameter was set at "medium" to get its highest overall accuracy. c The method proposed in Xu et al. (2013). indicating that the results obtained by the independent dataset test are quite consistent with those by the 10-fold cross-validation, particularly for the overall accuracy Acc and the Mathew's correlation coefficient MCC.

Comparison with the other methods
Among the existing methods for identifying the SNO sites in proteins, the web server for the method proposed in Li et al. (2011) did not work, andthe method in Li et al. (2012) had no web-server at all. Therefore, the comparison was made among the following three methods: GPS-SNO (Xue et al., 2010), iSNO-PseAAC (Xu et al., 2013, and the current iSNAO-PseAAPair. Listed in Table 1 are the corresponding results obtained by the aforementioned three methods on the independent dataset test S T (cf. Supplemental Information S2), respectively. As we can see from Table 1, the overall accuracy (Acc) achieved by iSNO-AAPair was remarkably (about 30%-35%) higher than those by its counterparts GPS- SNO (Xue et al., 2010) and iSNO-PseAAC (Xu et al., 2013). Furthermore, iSNO-AAPair was also superior to its counterparts in the other three metrics (Sn, Sp, and MCC). Particularly for MCC, the rate achieved by iSNO-AAPair was significantly (about 30%-55%) higher than those by its counterparts, indicating that the high accuracy achieved by iSNO-AAPair was not an artifact but a true result, and hence it would be much more stable, consistent, and reliable in practical applications.
Also, in practical applications, the input should be entire protein sequences. To avoid memory bias, let us randomly pick 14 protein sequences whose experimental SNO sites are known but none of them occurs in the training dataset S L . The sequences of such 14 proteins as well as SNO site (red) and non-SNO site (blue) are given in Supplemental Information S3. The detailed results by the three methods in identifying the SNO sites for the 14 independent proteins are given in Supplemental Information S4. For clarity, these results are summarized in Table 2 from which we can see that iSNO-AAPair outperformed iSNO-PseAAC and GPS-SNO not only in the overall accuracy Acc, but also in MCC, indicating iSNO-AAPair not only performed better but also more stable than its counterparts.  It is anticipated that iSNO-AAPair may become a useful vehicle for identifying the SNO sites in proteins, or at the very least play an important complementary role to the existing predictors in this area.

Web server
For the convenience of the vast majority of biological scientists, a web-server for iSNO-AAPair was established. Here, let us give a step-by-step guide on how to use the web-server to get the desired results without the need to follow the mathematic equations that were presented just for the integrity in developing the predictor.
Step 1. Open the web server at http://app.aporc.org/iSNO-AAPair/ and you will see the top page of the predictor on your computer screen, as shown in Fig. 5. Click on the Read Me button to see a brief introduction about iSNO-AAPair predictor and the caveat when using it.
Step 2. Either type or copy/paste the query protein sequences into the input box shown at the center of Fig. 5. The input sequence should be in the FASTA format. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box. For more information about FASTA format, visit http://en.wikipedia.org/wiki/Fasta format.
Step 3. Click on the Submit button to see the predicted result. For example, if you use the query protein sequences in the Example window as the input, after clicking the Submit button, you will see on your screen the predicted SNO site positions and the corresponding sequences segments with the form as formulated by Eq. (1). All these results are fully consistent with the experimentally verified results. It takes about a few seconds for the above computation before the predicted results appear on the computer screen; the greater number of query proteins and the longer each sequence, the more time is usually needed.
Step 4. As shown on the lower panel of Fig. 5, you may also choose the prediction by entering your desired input file via the "Browse" button. The input file should also be in FASTA format but can contain as many protein sequences as you want.
Step 5. Click on the Citation button to find the relevant papers that document the detailed development and algorithm of iSNO-AAPair.
Step 6. Click on the Data button to download the benchmark datasets used to train and test the iSNO-AAPair predictor. Caveats. To obtain the predicted result with the anticipated success rate, the entire sequence of the query protein rather than its fragment should be used as an input. A sequence with less than 50 amino acid residues is generally deemed as a fragment.