Due to the importance of protein phosphorylation in cellular control, much research has been undertaken to predict kinase-specific phosphorylation sites. Our previous work, KinasePhos 1.0, incorporated a profile hidden Markov model (HMM) with the flanking residues of kinase-specific phosphorylation sites. Herein, a new web server, KinasePhos 2.0, incorporates support vector machines (SVM) with the protein sequence profile and the protein coupling pattern, a novel feature for identifying phosphorylation sites. The coupling pattern [XdZ] denotes the coupling of amino acid types X and Z that are separated by d amino acids. The differences or quotients of the coupling strength $C_{XdZ}$ between the positive set of phosphorylation sites and the background set of whole protein sequences from Swiss-Prot are computed to determine the number of coupling patterns used for training SVM models. Evaluated by k-fold cross-validation and Jackknife cross-validation, the average predictive accuracies for phosphorylated serine, threonine, tyrosine and histidine are 90, 93, 88 and 93%, respectively. KinasePhos 2.0 performs better than other previously developed tools. The proposed web server is freely available at


INTRODUCTION
Protein phosphorylation, an important reversible mechanism of post-translational modification, is involved in many essential cellular processes including cellular regulation, cellular signal pathways, metabolism, growth, differentiation and membrane transport (1). Phosphorylation of substrate sites at serine, threonine and tyrosine residues of eukaryotic proteins is performed by members of the protein kinase family. Additionally, phosphorylation on histidine plays an important role in signal transduction in prokaryotes, where it is carried out by two-component histidine kinases (2). It is estimated that one-third of proteins are phosphorylated and that around half of the kinome is disease- or cancer-related according to chromosomal mapping (3). Experimental identification of kinase-specific phosphorylation sites on substrates in vivo and in vitro is the foundation for understanding the mechanisms of phosphorylation dynamics and is important for biomedical drug design (4). However, these experiments are often time-consuming, labor-intensive and expensive. Therefore, in silico prediction of phosphorylation sites with high predictive performance could be a promising strategy for conducting preliminary analyses and could greatly reduce the number of potential targets that need further in vivo or in vitro confirmation.
With the recent exponential increase in protein phosphorylation sites identified by mass spectrometry (MS), much research has been undertaken to identify kinase-specific phosphorylation sites. Our previous work, KinasePhos 1.0, incorporated a profile hidden Markov model (HMM) for identifying kinase-specific phosphorylation sites, with an overall predictive accuracy of approximately 87% (5,6). NetPhos (7) developed neural networks to predict phosphorylation sites on serine, threonine and tyrosine residues; however, it cannot provide information on the kinases involved. NetPhosK (8) applied an artificial neural network algorithm to predict phosphorylation sites specific to 17 protein kinase (PK) groups. DISPHOS (9) took advantage of position-specific amino acid frequencies and disorder information to improve the discrimination between phosphorylation sites and non-phosphorylation sites. Scansite 2.0 (10) identified short protein sequence motifs that are recognized by modular signaling domains, phosphorylated by protein serine/threonine or tyrosine kinases, or that mediate specific interactions with protein or phospholipid ligands. PredPhospho (11) predicts phosphorylation sites for four major protein kinase families (CDK, CK2, PKA and PKC) and four protein kinase groups (AGC, CAMK, CMGC and TK), with predictive accuracies of 83-95 and 76-91%, respectively. GPS (12,13) is a group-based phosphorylation site predicting and scoring platform that clustered 216 unique protein kinases into 71 groups. PPSP (4) developed an approach based on Bayesian decision theory for accurately predicting potential phosphorylation sites for around 70 protein kinase groups.
This work proposes a kinase-specific phosphorylation site prediction server which incorporates support vector machines (SVM) with two features, i.e. protein sequence profiles surrounding the modified sites and coupling patterns surrounding the modified sites. The coupling pattern of proteins was first used for analyzing protein thermostability (14). In this work, we incorporate the protein coupling pattern as a feature for training computational models for identifying phosphorylation sites. After evaluating the computational models by k-fold cross-validation and Jackknife cross-validation, the overall predictive accuracy of KinasePhos 2.0 is approximately 91%, which is better than that of the previous version and of other previously developed tools. The details of the proposed method and its predictive performance are described below.

Table 1. When the flanking sequences (positions −4 to +4) of the phosphorylation sites (position 0) are visualized graphically as sequence logos (17), the conservation of amino acids around the phosphorylation sites can be observed. The 9-mer sequences (−4 to +4) of kinase-specific phosphorylation sites are extracted and constructed as training sets. Table S1 (see Supplementary Data) summarizes the statistics of the 60 kinase-specific phosphorylation site groups in the constructed data set.

Feature extraction
To avoid overestimating the predictive performance, redundant training sequences should be discarded. After the construction of a non-redundant training set of kinase-specific phosphorylation sites, two features are extracted: the sequence surrounding the catalytic sites and the coupling pattern surrounding the catalytic sites. For the sequence surrounding the catalytic sites, the 9-mer sequences (−4 to +4) of kinase-specific phosphorylation sites are encoded in three ways: BLOSUM62 profile encoding (the corresponding row number of amino acids in the BLOSUM62 matrix), reduced alphabet (sparse encoding with fewer letters) (18) and 20-dimensional vector (each amino acid is mapped to a 20-dimensional vector), as given in Table S2. Amino acids have a great variety of properties, such as mass, polarity and hydrophobicity, so many groupings are possible (19). With hydrophobicity (20), for instance, the 20 amino acids are reduced to three classes: polar (R,K,E,D,Q,N), neutral (G,A,S,T,P,H,Y) and hydrophobic (C,V,L,I,M,F,W).
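The reduced-alphabet and 20-dimensional encodings described above can be sketched as follows. This is a minimal illustration, not the server's implementation: the class labels P/N/H, the function names and the example 9-mer are hypothetical, and the sparse encoding is assumed to be a plain one-hot vector per position.

```python
# Hydrophobicity classes as listed in the text; the labels P/N/H are illustrative.
HYDROPHOBICITY_CLASSES = {
    "P": "RKEDQN",   # polar
    "N": "GASTPHY",  # neutral
    "H": "CVLIMFW",  # hydrophobic
}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map each residue to its class label.
RESIDUE_TO_CLASS = {
    aa: label
    for label, members in HYDROPHOBICITY_CLASSES.items()
    for aa in members
}


def reduce_window(window):
    """Rewrite a peptide window in the 3-letter reduced alphabet."""
    return "".join(RESIDUE_TO_CLASS[aa] for aa in window)


def sparse_encode(window, alphabet):
    """One-hot encode every position of `window` over `alphabet` (assumed encoding)."""
    vector = []
    for symbol in window:
        vector.extend(1 if symbol == letter else 0 for letter in alphabet)
    return vector


window = "RRASSVEEL"                          # hypothetical 9-mer, site at the center
reduced = reduce_window(window)               # 3-class reduced alphabet
reduced_vec = sparse_encode(reduced, "PNH")   # 9 positions x 3 classes
full_vec = sparse_encode(window, AMINO_ACIDS) # 9 positions x 20 amino acids
print(reduced, len(reduced_vec), len(full_vec))
```

Under these assumptions the reduced alphabet shrinks the input from 180 to 27 dimensions for a 9-mer, which may help when a kinase group has only a few training sites.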
The coupling pattern surrounding the catalytic sites is extracted from the flanking sequences of kinase-specific phosphorylation sites. Let [XdZ] denote the coupling pattern of amino acids X and Z that are separated by d amino acids. Since the protein sequence is directional, the sign of d is determined by the relative positions of X and Z. For example, as shown in Figure 1, if a coupling pattern [R3Q] occurs in the training set, the coupling pattern [Q-3R] also occurs. Herein, we do not consider coupling patterns with a negative d. Let N(XdZ) be the number of occurrences of the coupling pattern [XdZ] in the training sequences. The conditional probability $R_{XdZ}$ is

$$R_{XdZ} = \frac{N(XdZ)}{N(Xd\cdot)},$$

where $N(Xd\cdot) = \sum_{Y} N(XdY)$ and $Y \in \{\text{20 types of amino acid}\}$. The coupling strength $C_{XdZ}$ between X and Z of the pattern [XdZ] is given by

$$C_{XdZ} = \frac{R_{XdZ}}{P(Z)},$$

where P(Z) is the probability of the occurrence of amino acid Z. If $C_{XdZ} > 1$, then X and Z are positively correlated with respect to the distance d, and they are negatively correlated if $C_{XdZ} < 1$.
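The counting behind the coupling strength can be sketched in a few lines. The sketch assumes that R_XdZ is N(XdZ) divided by the sum of N(XdY) over all Y, that C_XdZ is R_XdZ divided by P(Z), that "separated by d amino acids" means Z sits d + 1 positions after X, and that P(Z) is estimated from residue frequencies in the same sequence set. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def coupling_counts(sequences, max_d):
    """Count N(XdZ): X followed by Z with d residues in between (d >= 0 only)."""
    counts = Counter()
    for seq in sequences:
        for i, x in enumerate(seq):
            for d in range(max_d + 1):
                j = i + d + 1  # assumed reading of "separated by d amino acids"
                if j < len(seq):
                    counts[(x, d, seq[j])] += 1
    return counts


def coupling_strengths(sequences, max_d):
    """Return C_XdZ = R_XdZ / P(Z) for every observed pattern (x, d, z)."""
    counts = coupling_counts(sequences, max_d)

    # N(Xd.) is the sum of N(XdY) over all amino acids Y.
    totals = Counter()
    for (x, d, _z), n in counts.items():
        totals[(x, d)] += n

    # P(Z) estimated from residue frequencies in the same sequences (assumption).
    residues = Counter("".join(sequences))
    total_residues = sum(residues.values())

    strengths = {}
    for (x, d, z), n in counts.items():
        r_xdz = n / totals[(x, d)]
        p_z = residues[z] / total_residues
        strengths[(x, d, z)] = r_xdz / p_z
    return strengths


toy_sequences = ["RAAQ", "RAAK"]  # hypothetical toy data
counts = coupling_counts(toy_sequences, max_d=2)
strengths = coupling_strengths(toy_sequences, max_d=2)
print(strengths[("R", 2, "Q")])
```

In the toy data, R is followed two residues later by Q in half of the cases while Q makes up one-eighth of all residues, so the pattern [R2Q] has a coupling strength above 1 and counts as positively correlated.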
The differences of the coupling strength $C_{XdZ}$ between the training set of phosphorylation sites and the background set, which is extracted from all 9-mer sequences centered at serine, threonine, tyrosine and histidine residues in Swiss-Prot protein sequences, are computed and used to determine the number of coupling patterns used for SVM training. A larger difference of $C_{XdZ}$ means that the coupling pattern [XdZ] is a more important feature for separating the training set from the background set; therefore, the cutoff on the difference of the coupling strength $C_{XdZ}$ between the training set and the background set should be tuned to determine the number of coupling patterns used to train an SVM model. Each coupling pattern is one feature dimension in the SVM. For instance, when the cutoff on the difference of $C_{XdZ}$ between the training set and the background set is set to 1.5, about 400 coupling patterns exceed the cutoff; thus, the number of dimensions used for SVM training is about 400, equal to the number of selected coupling patterns.
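The cutoff-based selection can be illustrated with a short sketch. Several details here are assumptions rather than the authors' procedure: patterns missing from the background are given a strength of 0, the difference is taken one-sided (training minus background), and each selected pattern is turned into a binary presence feature. All names and numbers are hypothetical.

```python
def select_patterns(train_strength, background_strength, cutoff):
    """Keep patterns whose C_XdZ(training) - C_XdZ(background) exceeds `cutoff`."""
    selected = []
    for pattern, c_train in train_strength.items():
        # Patterns absent from the background are treated as strength 0 (assumption).
        difference = c_train - background_strength.get(pattern, 0.0)
        if difference > cutoff:
            selected.append(pattern)
    return sorted(selected)


def has_pattern(window, pattern):
    """True if X is followed by Z with d residues in between anywhere in `window`."""
    x, d, z = pattern
    return any(
        window[i] == x and window[i + d + 1] == z
        for i in range(len(window) - d - 1)
    )


def pattern_features(window, patterns):
    """One dimension per selected pattern; binary presence is an assumed feature value."""
    return [1 if has_pattern(window, p) else 0 for p in patterns]


# Hypothetical coupling strengths keyed by (X, d, Z).
train_strength = {("R", 2, "S"): 3.2, ("P", 0, "S"): 1.1, ("K", 1, "T"): 2.0}
background_strength = {("R", 2, "S"): 1.0, ("P", 0, "S"): 0.9}

selected = select_patterns(train_strength, background_strength, cutoff=1.5)
features = pattern_features("RRASSVEEL", selected)
print(selected, features)
```

Raising the cutoff shrinks the feature space, so the cutoff acts as a tunable parameter alongside the SVM parameters.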

Model creation and evaluation
This work incorporates support vector machines (SVM) with the protein sequences and the profiles of coupling patterns for training the predictive models for kinase-specific phosphorylation site prediction. A public SVM library, LIBSVM (21), is applied for training the predictive models, and the radial basis function (RBF) is selected as the SVM kernel function. In general, the experimental kinase-specific phosphorylation sites are defined as the positive set, while all other residues (S, T, Y or H) in the phosphorylated proteins are regarded as the negative set. K-fold cross-validation is used to evaluate the predictive performance of the models trained from the large data sets, including PKA, PKC and MAPK, and Jackknife cross-validation is applied to models trained from data sets smaller than 30. The positive and negative sets are balanced so that their sizes are equal during the cross-validation processes. The cross-validation is performed 30 times. The following measures of predictive performance of the trained models are defined: Precision (Prec) = TP/(TP + FP), Sensitivity (Sn) = TP/(TP + FN), Specificity (Sp) = TN/(TN + FP) and Accuracy (Acc) = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

Note that the sum of serine, threonine, tyrosine and histidine sites in Swiss-Prot is not equal to 6832, because several phosphorylation sites are located on other kinds of residues. *The entries which contain residues annotated as 'phosphorylation' in the 'MOD_RES' field are extracted, and the entries annotated as 'by similarity', 'potential' and 'probable' are excluded.
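The evaluation protocol can be sketched without any SVM code. The sketch below assumes the standard confusion-matrix definitions of precision, sensitivity, specificity and accuracy, balances the sets by random down-sampling, and treats Jackknife cross-validation as the special case where the number of folds equals the number of samples. Function names and the example counts are hypothetical.

```python
import random


def balance(positives, negatives, seed=0):
    """Down-sample the larger set so both sets have equal size."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k interleaved folds; k == n gives Jackknife."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def performance(tp, fp, tn, fn):
    """Standard measures; raises ZeroDivisionError if a denominator is zero."""
    return {
        "Prec": tp / (tp + fp),
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + fp + tn + fn),
    }


pos, neg = balance(list(range(3)), list(range(100, 105)))
splits = list(k_fold_indices(10, 5))
jackknife = list(k_fold_indices(4, 4))
scores = performance(tp=45, fp=5, tn=40, fn=10)  # hypothetical counts
print(scores)
```

Repeating the balancing step with a different seed for each of the repeated cross-validation runs would give a different random negative set each time; whether the original work resampled in this way is not stated.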
Moreover, several parameters of the models, including the cutoff on the differences of coupling strengths, the SVM cost value and the SVM gamma value, are optimized to maximize the predictive accuracy. Finally, the parameters of the trained model with the highest predictive accuracy in each data set were selected and used to provide the prediction service on the web.
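One simple way to carry out such an optimization is an exhaustive grid search, sketched below. The search strategy, the grid values and the `evaluate` callable are all assumptions for illustration; in practice `evaluate` would train an SVM with the given parameters and return its cross-validated accuracy, whereas here a toy stand-in is used so that the sketch runs on its own.

```python
from itertools import product


def grid_search(evaluate, cutoffs, costs, gammas):
    """Return the (cutoff, cost, gamma) triple with the highest score from `evaluate`."""
    best_params, best_accuracy = None, float("-inf")
    for cutoff, cost, gamma in product(cutoffs, costs, gammas):
        accuracy = evaluate(cutoff, cost, gamma)
        if accuracy > best_accuracy:
            best_params, best_accuracy = (cutoff, cost, gamma), accuracy
    return best_params, best_accuracy


def toy_evaluate(cutoff, cost, gamma):
    """Hypothetical stand-in for cross-validated accuracy; peaks at (1.5, 8, 0.125)."""
    return 1.0 - 0.1 * abs(cutoff - 1.5) - 0.01 * abs(cost - 8) - abs(gamma - 0.125)


best_params, best_accuracy = grid_search(
    toy_evaluate,
    cutoffs=[1.0, 1.5, 2.0],        # hypothetical coupling-strength cutoffs
    costs=[2, 8, 32],               # hypothetical SVM cost values
    gammas=[0.03125, 0.125, 0.5],   # hypothetical RBF gamma values
)
print(best_params, best_accuracy)
```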

PREDICTION PERFORMANCE
To find the best predictive performance of the SVM models in each kinase-specific group, SVM models trained with various features, namely the coupling pattern (CP), the sequence, and the combination of coupling pattern and sequence, are evaluated by cross-validation. As shown in Figure 2, the average precision (Prec), sensitivity (Sn), specificity (Sp) and accuracy (Acc) of the SVM models trained with the various features are calculated for phosphoserine, phosphothreonine, phosphotyrosine and phosphohistidine. Two methods are used to extract the coupling patterns: 'CP difference' and 'CP ratio'. 'CP difference' is the coupling strength of the training set minus the coupling strength of the background set, and 'CP ratio' is the coupling strength of the training set divided by the coupling strength of the background set. For the sequence profile feature, several coding methods are used for encoding the amino acids surrounding the phosphorylation sites: reduced alphabet (3-classes, 7-classes and 8-classes), BLOSUM62 profile encoding and 20-dimensional vector. Because the average predictive performance for kinase-specific phosphorylation sites with a small training set may be overestimated, the SVM models of kinase-specific groups with fewer than 20 training sequences are not considered. As Figure 2 shows, the average predictive accuracies of the models trained with coupling patterns (CP difference or CP ratio) for phosphoserine, phosphothreonine, phosphotyrosine and phosphohistidine are 86, 93, 88 and 93%, respectively. Overall, the SVM models trained with coupling patterns, whose accuracy is close to 90%, perform better than the SVM models trained only with sequence profiles (Seq).
When the best-performing coupling pattern feature (CP ratio) and sequence feature (7-classes) are combined, the average predictive accuracy of the SVM models trained with the combined features for phosphoserine is 89%, which is slightly better than that of the SVM models trained only with coupling patterns. However, the average predictive performance of the SVM models trained with the combined features for phosphothreonine, phosphotyrosine and phosphohistidine is close to that of the SVM models trained only with coupling patterns. The overall predictive accuracy of the SVM models trained with the combined features of coupling patterns and sequences is close to 91%. In addition, the method of KinasePhos 1.0 is evaluated on the data set constructed in this work. Its average predictive accuracies for phosphoserine, phosphothreonine, phosphotyrosine and phosphohistidine are 84, 88, 84 and 83%, respectively.
After the SVM models are trained with the various features, the most accurate model for each kinase-specific group is selected and used to implement the prediction server. Table S3 presents the trained features, SVM cost value, SVM gamma value, precision, sensitivity, specificity and accuracy of the selected models for the 37 kinase-specific groups with at least 20 experimentally verified phosphorylation sites. In the column of trained features, the value in parentheses after the coupling pattern (CP) is the cutoff on the difference or quotient of the coupling strength between the training set and the background set. The average predictive accuracies for phosphoserine, phosphothreonine, phosphotyrosine and phosphohistidine are 90, 93, 88 and 93%, respectively.

WEB INTERFACE
After evaluating the trained models for identifying kinase-specific phosphorylation sites, the model with the highest predictive accuracy for each data set was selected. Users can submit their uncharacterized protein sequences and select the kinase-specific models for predicting phosphorylated serine, threonine, tyrosine or histidine. Although only the 37 kinase groups containing at least 20 experimental phosphorylation sites were used to evaluate the predictive performance, the web server provides 60 predictive models for the kinase-specific groups with at least 10 experimental phosphorylation sites. As depicted in Figure 3, the web server reports the predicted phosphorylation sites and the catalytic protein kinases involved. To reveal the characteristics of the phosphorylation sites, including the phosphorylated residues and surrounding sequences, the training phosphorylation sites and the sequence logos constructed for each protein kinase are also provided graphically on the web interface. Moreover, users can download the predicted results in tab-delimited format for further analyses. The web server can accurately and efficiently predict the kinase-specific phosphorylation sites in the input protein sequences.

DISCUSSIONS AND CONCLUSION
The models trained with various features, including sequence profiles and coupling patterns, were evaluated by 5-fold and Jackknife cross-validation; the predictive performance of the models trained with coupling patterns is better than that of the models trained with sequence profiles. In general, previous work on phosphorylation site prediction, including our previous work (KinasePhos 1.0), focused on serine, threonine and tyrosine residues. Herein, KinasePhos 2.0 for the first time considers phosphohistidine from Phospho.ELM and Swiss-Prot, which contain one and 42 phosphorylated histidines, respectively.
Moreover, the proposed web server is compared with several previously developed phosphorylation prediction tools, such as DISPHOS (9), PredPhospho (11), GPS (12,13), PPSP (4) and KinasePhos 1.0 (5,6). As given in Table 2, the number of kinases, the sensitivity and specificity of prediction and the overall predictive performance of these tools are compared. GPS, PPSP, PredPhospho, KinasePhos 1.0 and the proposed method all support the identification of kinase-specific phosphorylation sites. Although only the kinase groups containing at least 20 experimental phosphorylation sites were selected to evaluate the average predictive performance, the web server of KinasePhos 2.0 provides predictive models for 60 kinase-specific groups with at least 10 experimental phosphorylation sites. Because the average predictive performance of GPS and PPSP for serine, threonine and tyrosine cannot be obtained, the predictive performance for three representative kinases, PKA, PKC and CK2, is compared. As given in Table 2, the predictive performance of KinasePhos 2.0 for the three representative kinases is comparable with that of PredPhospho, GPS, PPSP and KinasePhos 1.0. In particular, KinasePhos 2.0 provides a predictive model for phosphohistidine, whose predictive accuracy is 93%. The overall predictive accuracy of the proposed method for the kinase-specific groups with at least 20 phosphorylation sites is 91%. However, as given in Table S4, the overall predictive accuracy for the kinase groups with fewer than 20 experimental phosphorylation sites is 94%.
Protein structural properties, such as accessible surface area (ASA) and secondary structure, can be considered in the future to improve the predictive performance of the models. For instance, ASA may be used to reduce the number of false-positive predictions of phosphorylation sites located in buried regions. However, the number of experimental phosphorylation sites located in protein regions with known structure in PDB (22) is small for each kinase-specific group. Although ASA and secondary structure can be predicted by published tools such as RVP-net (23) and PSIPRED (24), respectively, the predictive performance for phosphorylation sites may be affected by the accuracy of the predicted structural properties.

AVAILABILITY
The web server of KinasePhos 2.0 will be continuously maintained and updated. The web server is now freely available at http://KinasePhos2.mbc.nctu.edu.tw/