Systematic Analysis of Protein Phosphorylation Networks From Phosphoproteomic Data*

In eukaryotes, hundreds of protein kinases (PKs) specifically and precisely modify thousands of substrates at specific amino acid residues to faithfully orchestrate numerous biological processes, and reversibly determine the cellular dynamics and plasticity. Although over 100,000 phosphorylation sites (p-sites) have been experimentally identified from phosphoproteomic studies, the regulatory PKs for most of these sites still remain to be characterized. Here, we present a novel software package of iGPS for the prediction of in vivo site-specific kinase-substrate relations mainly from the phosphoproteomic data. By critical evaluations and comparisons, the performance of iGPS is satisfying and better than other existed tools. Based on the prediction results, we modeled protein phosphorylation networks and observed that the eukaryotic phospho-regulation is poorly conserved at the site and substrate levels. With an integrative procedure, we conducted a large-scale phosphorylation analysis of human liver and experimentally identified 9719 p-sites in 2998 proteins. Using iGPS, we predicted a human liver protein phosphorylation networks containing 12,819 potential site-specific kinase-substrate relations among 350 PKs and 962 substrates for 2633 p-sites. Further statistical analysis and comparison revealed that 127 PKs significantly modify more or fewer p-sites in the liver protein phosphorylation networks against the whole human protein phosphorylation network. The largest data set of the human liver phosphoproteome together with computational analyses can be useful for further experimental consideration. This work contributes to the understanding of phosphorylation mechanisms at the systemic level, and provides a powerful methodology for the general analysis of in vivo post-translational modifications regulating sub-proteomes.

Protein kinase (PK) 1 -catalyzed phosphorylation is one of the most important and ubiquitous post-translational modifications (PTMs) of proteins. This process temporally and spatially modifies ϳ30% of all cellular proteins and plays a crucial role in regulating a variety of biological processes such as signal transduction and the cell cycle (1)(2)(3). The human genome encodes 518 PK genes (ϳ2% of the genome), with different PKs showing distinct recognition specificities; each PK modifies only a limited subset of substrates, thereby guaranteeing the fidelity of cell signaling (1)(2)(3). It is accepted that short linear motifs (SLMs) around the phosphorylation sites (p-sites) provide primary specificity (2, 4 -6), and a variety of additional contextual factors, including co-localization, coexpression, co-complex, and physical interaction of the PKs with their targets, contribute additional specificity in vivo (7)(8)(9)(10). Aberrances of PKs or key substrates disrupt normal function, rewire signaling pathways, and are implicated in various diseases and cancers (3,11). In this regard, the identification of kinase-specific p-sites and the systematic elucidation of site-specific kinase-substrate relations (ssKSRs) would pro-vide a fundamental basis for understanding cell plasticity and dynamics and for dissecting the molecular mechanisms of various diseases, whereas the ultimate progress could suggest potential drug targets for future biomedical design (8 -10).
Conventional experimental identification of ssKSRs, performed in a one-by-one manner, is labor-intensive, time-consuming and expensive. There are only 3508 known kinasespecific p-sites in the 1390 proteins collected in the Phospho.ELM 8.2 database (released in April 2009) (12). In 2005, Ptacek et al. detected more than 4000 in vitro kinasesubstrate relations (KSRs) in Saccharomyces cerevisiae using protein chip technology, although the exact p-sites were not determined (13). Recently, rapid advances in phosphoproteomics have provided a great opportunity to systematically assess phosphorylation (1, 14 -21). State-of-the-art highthroughput mass spectrometry (HTP-MS) techniques have the ability to detect thousands of p-sites in cells or tissues in a single experiment (1,14,16,22). We have collected 145,646 eukaryotic p-sites, primarily from these large-scale assays (supplemental Table S1); the regulatory PKs for 97.6% of these sites remain to be characterized.
Alternatively, the in silico prediction of ssKSRs can generate useful information for subsequent experimental manipulation. In 2001, Yaffe et al. developed the SLM-based software Scansite for the prediction of ssKSRs directly from protein primary sequences (7). Later, the strategy was employed in a variety of kinase-specific predictors (23), including our group-based prediction system (GPS) program (24). These tools may guarantee partially correct predictions for in vitro phosphorylation, but they are far from being adequate for in vivo hits because the contributions of various contextual factors cannot be neglected. To address this problem, Linding et al. developed a predictor of NetworKIN by combining an SLM-based approach with network contextual information to predict in vivo ssKSRs, and a potential in vivo human phosphorylation network (HPN) was modeled by annotating the phosphoproteomic data (8,9).
In this work, we developed a software package of iGPS (GPS algorithm with the interaction filter, or in vivo GPS) mainly for the prediction of in vivo ssKSRs.Eukaryotic PKs were classified into a hierarchy with four levels: group, family, subfamily, and single PK (3). Based on the hypothesis that similar PKs recognize similar SLMs, we selected a predictor in GPS 2.0 (24) for each PK and directly predicted the potential PKs for the un-annotated p-sites from the phosphoproteomic studies. Consequently, protein-protein interaction (PPI) information was used as the major contextual factor to reduce over 95% potentially false-positive hits. The performance of iGPS was shown by critical evaluations and comparisons to be promising for the accurate prediction of in vivo ssKSRs. Based on the prediction results of iGPS, we modeled eukaryotic protein phosphorylation networks (PPNs) and observed that phosphorylation regulation changes dramatically over the course of evolution, with poor conservation at both the site and substrate levels. This observation is consistent with previous studies (17,25). Furthermore, we combined a new multidimensional separation approach using reversed-phase-reversed-phase liquid chromatography (RP-RPLC) (22), with HTP-MS and a new data process platform of ArMone (26) to conduct a large-scale phosphorylation analysis of the human liver. Totally, 9719 p-sites of 2998 substrates were identified from 10,644 non-redundant phosphopeptides. The potential ssKSRs were predicted for the human liver phosphoproteome, whereas further statistical analysis suggested that 60 and 67 PKs preferentially regulate more or fewer p-sites in the human liver PPN (p valueϽ0.01). A number of results are consistent with previous observations, whereas other predictions can be useful for further experimental manipulation.
Performance Evaluation-As previously described (24), four standard measurements, including sensitivity (Sn), specificity (Sp), accuracy (Ac), and Mathew correlation coefficient (MCC) were defined as below: To evaluate the proportion of correct hits among the total predicted PKs, we defined the kinase precision (Kpr). Given m real p-sites, if we predict Y i PKs for the i site, with X i being the experimentally verified PKs, the Kpr is defined as: As previously described (24), we adopted large-scale precision (Lpr) to estimate the proportion of correct hits from the large-scale prediction of the PK information. Given n potential p-sites, suppose that p positive hits are phosphorylated by a PK or PK cluster under a certain threshold with a calculated false positive rate (FPR) value. Then the theoretically maximal false positive hits will be n * FPR, if all of the n site are real negative sites. Thus, the minimal proportion of correct predictions can be calculated as: Furthermore, the above equation was extended to estimate the average precision for a large data set prediction. Given n potential p-sites, suppose that we predict p 1 , p 2 , …, and p k positive hits for k PKs, respectively, the theoretically maximal false positive hits for each PK will be n * FPR j (j ϭ 1, 2, …, k). Then the total precision can be calculated as: Although the Lpr does underestimate the accuracy of large-scale predictions, nevertheless, it still showed the power and utility of the iGPS algorithm.
Liver Sample Preparation and Phosphopeptide Enrichment-The study was approved by the Institutional Review Board of Eastern Hepatobiliary Surgery Hospital (Shanghai, China). Informed consent was obtained from patients enrolled in this study. The human liver tissues are the non-tumor liver tissues Ն2 cm outside the hepatic hemangiomas removed by surgical operation. The liver tissues have been checked by histopathological examination to exclude the presence of invading or microscopic metastatic tumor cells.
As previously described (15), human liver tissues were lysated in 20 ml ice-cold homogenization buffer consisting of 8 M urea, 1% Triton X-100 v/v, 65 mM DTT, 1 mM EDTA, 0.5 mM EGTA, 1 mM PMSF, 200 l protease inhibitor mixture (Sigma), phosphatase inhibitors (1 mM sodium fluoride, 1 mM sodium orthovanadate, 1 mM ␤-glycerophosphate, 10 mM sodium pyrophosphate), and 40 mM Tris-HCl at pH 7.4. After being resuspended in the denaturing buffer containing 8 M urea and 50 mM Tris-HCl (pH 8.2), the proteins were reduced by DTT at 37°C for 2 h and alkylated by iodoacetamide in the dark at room temperature for 40 min. Then the solutions were diluted to 1 M urea with 50 mM Tris-HCl and trypsin was added, with the weight ratio of trypsin to protein at 1/25, and incubated at 37°C overnight. All of the resulting peptide solution was stored under Ϫ80°C.
The phosphopeptides were enriched from the digest of human liver lysate by Ti 4ϩ -IMAC microspheres (16). Briefly, peptide mixtures which were first incubated with the Ti 4ϩ -IMAC microsphere suspension (10 mg ml Ϫ1 in 80% ACN, 0.1% TFA) for 30 min then were washed with a solution containing 50% ACN, 6% TFA and 200 mM NaCl, followed by washing with 30% ACN/0.1% TFA. Finally, the enriched phosphopeptides were eluted with 10% NH 3 ⅐H 2 O and dried by vacuum centrifugation.
Multidimensional Separation of Phosphopeptides and Mass Spectrometry Analysis-The enriched phosphopeptides were redissolved in mobile phase A (25 mM ammonium formate (NH 4 FA) aqueous buffer, pH 7.5) and then were loaded onto the first dimensional separation column (250 mm ϫ 4.6 mm I. D. column packed with 5 m Hypersil GOLD aQ C18, Thermo). Mobile phase B was 25 mM NH 4 FA in water/acetonitrile (1: 9) and the gradient elution was performed with 0%-10% B (0 -80 min) and 10%-35% B (80 -90 min). A total of 90 fractions (one fraction per each minute) from the first dimensional separation were collected and then divided into two groups: an early group (fractions 1-45) and later group (fractions 46 -90). Then every two fractions with an equal time interval (fractions 1 and 46, 2 and 47, and so on) were mixed as described previously (22). Finally, half the total number of 45 fractions were lyophilized and submitted to the second dimensional RP-RPLC separation.
The HPLC system consisted of a degasser and a quaternary surveyor MS pump (Thermo Finnigan, CA). The capillary separation column was prepared as previously described (32). Briefly, the capillary was manually pulled to a fine point of ϳ 3 m with a flame torch and then was packed with C18 aQ beads (5 m, 120 Å) in a homemade pneumatic pressure cell using a slurry packing method. The lyophilized fractions from the first dimension were resuspended in 10 L of 0.1% FA solution, and 2 L of the sample were then manually loaded onto the column with each sample replicated three times. The mobile phase A was 0.1% FA in water and B was 0.1% FA in acetonitrile; gradient elution was performed with 3-25% B in 90 min at a flow rate after splitting 200 nL/min.
The MS analysis was performed on LTQ-Orbitrap mass spectrometer (Thermo, San Jose, CA) with a resolution of 100000 at m/z 400. The temperature of the ion transfer capillary was set at 200°C. The spray voltage was set at 1.8 kV and the normalized collision energy was set at 35.0%. The detection of phosphopeptides was performed with the mass spectrometer set for a full scan MS followed by three data-dependent MS2 events. Subsequently, the MS3 spectrum was automatically triggered when a neutral loss event of 97.97, 48.99, or 32.66 Da (loss of H 3 PO 4 for the ϩ1, ϩ2 and ϩ3 charge ions, respectively) was detected among the three most intense peaks in MS2. The target ion setting was 5e5 for the Orbitrap, with a maximum fill-time of 500 ms. MS2 scans were acquired in the LTQ with a target ion setting of 3e4 and a maximum fill-time of 100 ms. The dynamic exclusion function was set as follows: repeat count 2, repeat duration 30 s, and an exclusion duration of 60 s.
Database Search and Data Analysis-The peak list files for MS2 and MS3 spectra were extracted by Extract_msn.exe in Bioworks 3.3 using default settings. The MS2 and MS3 spectra were searched with SEQUEST (version 2.8) against a composite database containing both the original human IPI protein database (ipi.HUMAN.v3.17.fasta, including 60234 entries, http://www.ebi.ac.uk/IPI/IPIhuman.html) and its reversed complement. Trypsin was set as the specific proteolytic enzyme with fully enzymatic and up to two missed cleavages were allowed. The mass tolerance for the precursor ion was set as 50 ppm and 0.8 Da for the fragment ion. Carbamidomethylation (ϩ57.02146 Da) on cysteine was set as fixed modification, whereas oxidation (ϩ15.99452 Da) on methionine, phosphorylation (ϩ79.96633 Da) on serine, threonine, and tyrosine were set as variable modifications. For the searching with MS3 data, ␤-elimination of phosphoric acid (-18.010565 Da) on serine and threonine residues were also selected as variable modifications.
The database search results were processed with the software suite of ArMone, which was recently designed for the management and analysis of phosphoproteome data (26). The assignment of psites from identified phosphopeptides was determined by the Ascore algorithm (33), which was also implemented in ArMone (26). Based on the classification filtering strategy (34), the identified phosphopeptides were classified into four groups (supplemental Table S4, S5, S6, and S7). The mass spectra without significant neutral loss or without consecutive MS3 spectra were defined as the NoNeutral class (supplemental Table S5). The group of mass spectra with significant neutral loss was further separated into three classes, such as MS2/ MS3 (MS2/MS3 pair can generate the same phosphopeptide assignment, supplemental Table S4), NeutralMS2 (Phosphopeptide exclusively generated from the MS2 spectra, supplemental Table S6) and NeutralMS3 (Exclusively generated from the MS3 spectra, supplemental Table S7) classes. Based on the different characteristics of four classes identifications, different filtering strategies were adopted to achieve the false discovery rate (FDR)Ͻ1% (FDR ϭ 2N d /N, in which N is the number of peptide matches with scores above the cut-off and N d was the number of matches to decoy sequences). For MS2/MS3 identifications, DeltaCn'mϾ0.1, Xcorr'sϾ0.63; for NoNeutral, Delta-CnϾ0.1, XcorrϾ2.6, 3.2, 4.2 and 4.8 for ϩ1, ϩ2, ϩ3 and ϩ4 charge states respectively; for NeutralMS2, DeltaCnϾ0.1, XcorrϾ2.5, 3.8, 4.7 and 5 for ϩ1, ϩ2, ϩ3 and ϩ4 charge states respectively; for Neu-tralMS3, DeltaCnϾ 0.1, XcorrϾ2.2, 3.5, 4.5 and 4.3 for ϩ1, ϩ2, ϩ3 and ϩ4 charge states respectively.
All of the mass spectra with matched ion information of identified unique phosphopeptides were generated by the batch drawing module of Armone (26) and exported in .html format with hyperlinks to the spectrum images. The annotated spectra can be accessed at the publicly accessible database Tranche (https://proteomecommons. org/tranche/), using the following hash: XnGGs9eaZCdxxw6gb I6RXpfeϩ0EKutSSQL7Ue3zcwrGG4lMY44oXJAr0BϩmYzJmWUou2 bfn0B6ojsAhSBvaRTAIX4G4AAAAAAAADnQϭϭ.

Development of iGPS for the Prediction of in vivo ssKSRs-
In a previous study (24) we developed the software GPS 2.0, which could predict the kinase-specific p-sites for 408 PKs in humans. We estimated the FPR by randomly generating PSP(7, 7) peptides based on the real frequencies of amino acids in the eukaryotic proteomes. The high, medium, and low cut-off values were chosen based on the FPRs of 2%, 6%, and 10% for serine/threonine kinases (STKs), and 4%, 9%, and 15% for tyrosine kinases (TKs). With these high thresholds, we directly predicted 170,593 ssKSRs for 12,219 unannotated p-sites from a total of 13,254 mammalian sites (ϳ14 PK groups per site) (24). Although the coverage rate was very high at ϳ92.19%, it is strongly suspected that a large proportion of the predicted results might be false positive hits, because in vivo numerous contextual factors only permit a small number of PKs to specifically reach their substrates. Moreover, the GPS 2.0 program used a hierarchical structure with four levels, whereas only 39,540 kinase-substrate interactions for 7813 p-sites (a coverage rate of ϳ58.9%) were predicted at the single PK level (24). In addition, in our current data set, there are 35,711 nonmammalian p-sites (ϳ24.5%). Annotation of the potential PK information for these sites continues to be a formidable challenge.
To address these issues, we first hypothesized that eukaryotic PKs classified in a same group, family or subfamily would recognize similar consensus motifs/patterns of substrate modification, although the recognition similarities would differ in extent for the different PK clusters. Thus, if one site was predicted to be phosphorylated by any PK group, family, or subfamily, we assumed that all the PKs in the same cluster would phosphorylate the site. Second, the "kiss-then-farewell" (KTF) model was adopted (24,35). Thus, the PPI information was used as the major contextual factor to reduce false positive predictions (supplemental Table S19 and supplemental Fig. S1). Although some proportion of "kisses" might be slight and transient and thus not detected in standard PPI screenings, the interaction information would be expected to significantly enrich PK substrates, at least for Aurora-B (24) and PKA (35).
Based on the two hypotheses, we developed a novel software package of iGPS mainly for the prediction of in vivo ssKSRs from eukaryotic phosphoproteomic data (supplemental Fig. S2). The full computational procedure is shown in Fig. 1. First, we collected 28,457 phosphorylated substrates containing 145,646 p-sites from five eukaryotic organisms ( Fig. 1 and supplemental Table S1). The GPS 2.0 algorithm was used for the prediction of ssKSRs, whereas a low threshold was chosen with a FPR of 10% for STKs, and 15% for TKs, respectively (24). The original GPS 2.0 software contained 213 subpredictors for 144 STK and 69 TK clusters (24). However, because of the training data limitation, not all predictors showed an accurate performance. To ensure the accuracy of our predictions, we selected 56 STK-and 21 TKspecific predictors in GPS 2.0 for most of the PKs (supplemental Table S8). To further reduce the number of false-positive hits at the substrate level, the predicted and experimentally identified physical interactions between PKs and substrates were combined and used as the major contextual filters ( Fig. 1 and see supplemental Experimental Procedures). To visualize potential PPNs, the orientation was simply defined as Kinase -Ͼ Substrate. Because a number of substrates might be PKs, the orientation could also be Kinase A -Ͼ Kinase B (A phosphorylates B) or Kinase A Ͻ-Ͼ Kinase B (A and B reciprocally phosphorylate each other) (Fig. 1).
The PPI Information Can be an Efficient Filter to Reduce Potentially False Positive Hits-Although a number of contextual factors are believed to contribute additional specificity beyond the phosphorylation motif (7-10), here we only adopted one major factor of PPI information. Whether this single factor is able to significantly improve the prediction accuracy remains to be tested. Here, we collected a testing data set by obtaining from Phospho.ELM 8.2 (12), with 3508 kinase-specific p-sites in 1390 substrates (supplemental Table S20). The regulatory PKs of 3485 p-sites (99.3%) were identified in low-throughput experiments at a high level of confidence (supplemental Table S2). We calculated the prediction performances under three conditions: without PPI (No PPI), with both STRING and experimental PPI (STRING and Exp. PPI), and with experimental PPI information alone (Exp. PPI). The results for several typical predictors are shown in Table II (Table II). This shows that the PPI information can greatly decrease the number of potential false-positive PKs for p-site annotation. In addition, predictions suggest that regulatory PKs for the psites may not be fully identified. Thus, we mixed these sites together with other un-annotated p-sites for further analysis.
Extensive evaluations were performed for the prediction of 145,646 eukaryotic p-sites in 28,457 substrates (supplemental Table S10). Without the PPI information, we directly predicted 4,001,298 ssKSRs for 127,697 p-sites of all of the species, with a coverage rate of 87.7% (Table III and supplemental Table S11). When PPI information was used, although the coverage rates were reduced to 30.4% and 11.0% for the total PPIs and the experimental PPIs, respectively, the potential ssKSRs were also decreased to 186,922 (4.67%) and 34,873 (0.87%) (Table III and supplemental Table S11). In this regard, the potentially false positive predictions were greatly reduced through the PPI contextual filter. Because the coverage rate with the exclusively experimental PPIs is too low, both experimental and predicted PPIs were used for further analysis. Although the coverage rate can be enhanced if the threshold is relaxed, we adopted the current parameters for iGPS to ensure a high degree of confidence.
Comparison of iGPS with NetworKIN-In 2007, Linding et al. presented a pioneering study by introducing contextual factors to reduce potentially false positive hits for the prediction of ssKSRs (8,9). The NetworKIN 1.1 contains 21 PK classifiers to predict kinase-specific p-sites for 108 PKs (8,9), whereas the NetworKIN 2.0 beta version can predict ssKSRs for 123 PKs with 59 PK classifiers (unpublished). Here, we compared iGPS with both NetworKIN 1.1 and NetworKIN 2.0 beta for 12 PK groups, including 8 STK groups and 4 TK groups (Table I). To avoid any bias, the same testing data set was adopted for iGPS, NetworKIN 1.1 and NetworKIN 2.0 beta. Besides the known p-sites in Phospho.ELM 9.0 (12, 28), we additionally curated kinase-specific p-sites from the scientific literature. The nonredundant testing data set contains 1280 substrates with 2894 p-sites for the 12 PK groups (Table I).
We fixed the Sp values of iGPS so as to be similar to the two predictors, and then compared the Sn values (Table IV). In NetworKIN 1.1, the PK classifiers of CMGC/CDK/CDC2, Other/AUR/AUR-A, TK/ABL and TK/Syk were not available (8,9). Thus, we chose the cdk5 predictor for CMGC/CDK/CDC2, whereas the performances of Other/AUR/AUR-A, TK/ABL and TK/Syk were not compared. Except AGC/AKT, Atypical/PIKK/ ATM and Other/CK2, iGPS outperformed NetworKIN 1.1 for up to 6 PK groups (Table IV). For example, when the Sp value was 93.28%, the Sn values of iGPS and NetworKIN 1.1 for AGC/PKA were 57.56% and 32.20%, respectively (Table IV). Also, when the Sp value was 96.97%, the Sn of iGPS (56.76%) was much better than NetworKIN 1.1 (12.16%) for TK/EGFR (Table IV). For NetworKIN 2.0 beta, it only showed superiority for Other/CK2, whereas the performance of Atypical/PIKK/ATM was reduced and similar with iGPS (Table IV). The iGPS generated much better performances on the remaining 10 PK groups (Table IV). Taken together, we proposed the performance of iGPS is generally better than NetworKIN.
Computational Modeling and Analysis of Eukaryotic PPNs from the Phosphoproteomic Data-For the five eukaryotic phosphoproteomes, we predicted 186,922 ssKSRs among  (Table III and supplemental Table S11). The number of predicted phosphorylated proteins and p-sites for each PK were summarized in supplemental Table S12. For example, the PPNs in S. cerevisiae ( Fig. 2A) and H. sapiens (Fig. 2B) were modeled and shown. The human PPN contains 113,923 ssKSRs among the 380 PKs and 4140 targets for the 22,817 p-sites, with an average of 5.0 upstream PKs per p-site and 98.7 substrates per PK (Table III and Fig. 2B). Using pairwise comparison we found only 653 (3.1%) and 1,291 (0.7%) conserved ssKSRs in S. cerevisiae and H. sapiens, respectively (supplemental Table S13). Thus, eukaryotic phosphorylation is poorly conserved at the site and substrate levels. And this result is consistent with a previous analysis (17,25). By comparison, we found only one KSR of the kinase BUR1/CDK9, which phosphorylates the transcription elonga-tion factor protein Spt5, to be conserved in all five eukaryotic phosphoproteomes (Fig. 3). In S. cerevisiae, BUR1 might phosphorylate Spt5 at the single site of S136 in its N terminus, whereas its ortholog, CDK9, can phosphorylate up to 11 sites in humans (Fig. 3). We observed that fly Spt5 and S76, along with mouse and human Spt5 S19, are also phosphorylated (Fig. 3). Although the p-sites are not in the same position after sequence alignment, the Spt5 N terminus phosphorylation by BUR1/CDK9 might be a conserved mechanism. By sequence alignment, we observed that only the two p-sites of T806 and T814 in human are conserved in all five species (Fig. 4). From these results, it is proposed that segment around the last KOW (Kyprides, Ouzounis, Woese) domain might be the hotspot (from S666 to T814 in humans) for phosphorylation by CDK9 (Fig. 3). Furthermore, recent progress suggests that the transcription elongation factor Spt5 can be in vivo phosphorylated by BUR1 to recruit the polymerase-associated factor (PAF) complex in yeast (36,37). The in vitro experiments  indicate that the C-terminal repeat domain (CTD) of yeast Spt5 is potentially responsible for modification, whereas the in vivo verification still remains to be obtained (36,37). However, the yeast Spt5 CTD is not conserved in other species. Whether Spt5 CTD in higher eukaryotes is phosphorylated by CDK9 remains to be verified in vivo. In this regard, BUR1/ CDK9 might phosphorylate Spt5 in a quite complicated manner. The predictions might be useful for further experimental investigation. Systematic Analysis of the Human Liver Phosphoproteome and PPN-The liver is the largest internal organ in the human body. Beyond digestion, it plays a variety of essential roles in cell proliferation/differentiation, catabolism/metabolism, embryonic development, detoxification, drug pharmacokinetics, and so on (38). In 2002, the Human Liver Proteome Project (HLPP) was started as the first initiative for the understanding organ-or tissue-specific proteomes in humans (38). The recent completion of proteome, transcriptome, CHIP and massively parallel signature sequencing (MPSS) studies from the Chinese human liver proteome project (CNHLPP) revealed that tens of thousands of proteins/genes are regulated in the human liver (39,40). Determining how these proteins/genes are temporally and spatially modulated is still a great challenge, and the functional roles of PTMs in the liver remain to be elucidated.
In this work, we conducted a global phosphorylation analysis of the human liver, and experimentally identified 10,644 nonredundant phosphopeptides (Fig. 5A). The excellent per-formance of the new RP-RPLC approach on the in-depth phosphorylation analysis was shown by the uniform distribution of the identified phosphopeptides throughout the first dimension (Fig. 5B). In contrast with our previous strategy that only phosphopeptides identified by both MS2 and MS3 in the MS2/MS3 class were retained (15,16), the classification filtering strategy (34) generated a 31.8% increase in phosphopeptide identification by incorporating phosphopeptides identified exclusively from only the MS2 or MS3 spectra (Fig.  5C). In our results, there were 7214 singly (67.8%), 2403 doubly (22.6%), and 1027 triply (9.6%) phosphorylated peptides (Fig. 6A). Multiply phosphorylated peptides occupied over 30% of the total identifications (Fig. 6A), and this result is consistent with other large-scale phosphoproteomic studies (1,14). The phosphopeptides were mapped to the UniProt benchmark sequences, and totally 9719 p-sites were identified from 2998 substrates. The distribution of phosphoserine (pS), phosphothreonine (pT) and phosphotyrosine (pY) sites is 85.3% (8,294), 12.9% (1,249) and 1.8% (176) respectively (Fig. 6B), and the result is similar with a previous study although different samples were used (1). More details on the phosphopeptide analysis are present in supplemental Results and supplemental Figs. S3-S6.
By comparing with 60,816 known p-sites collected from heterogeneous resources, we observed that 5818 (59.9%) p-sites with 5063 pS, 664 pT and 91 pY sites have been reported previously (Fig. 6C). The high percentage of known p-sites indicates that p-sites identified in this study are of high confidence. Using iGPS, we predicted a potential PPN containing 12,819 potential ssKSRs among 350 PKs and 962 substrates for 2633 p-sites from the human liver phosphoproteome, with a coverage rate of 27.1% (supplemental Table  S14). From the top 10 PKs with the most predicted p-sites, we observed that up to 6 PKs belong to the CMGC PK group, such as Erk1, CDK2, Erk2, and GSK3A, p38a and CDC2 (supplemental Table S14). Although these PKs were predicted to phosphorylate more p-sites, we speculated whether some PKs preferentially modify more or fewer p-sites in the liver PPN against the whole human PPN. To address this problem, we performed the Yates' chi-squared ( 2 ) test with the 2 ϫ 2 contingency table method (41) (for more details see supplemental Experimental Procedures). Totally, we observed that 60 PKs significantly modify more p-sites, whereas 67 PKs preferentially modify fewer p-sites (p value Ͻ 0.01, see supplemental Table S15). The top 10 PKs with significantly overor under-represented p-sites in the human liver PPN were shown in Table V, respectively. Previously, experimental studies revealed that AKT family PKs play a predominantly regulatory role in regulating the hepatic gluconeogenesis (42,43). In the results, we observed that the p-sites of AKT1 are significantly over-represented with an enrichment ratio of 1.55 (p value ϭ 3.04E-19) (Table V). Also, it was reported that the CK2 activity increases after hepatectomy or laparatomy (44). Thus, the significantly enriched p-sites for CK2 might be attributed to the enhanced activity (E-ratio ϭ 1.87, p value ϭ 1.46E-35, Table V). Furthermore, ribosomal S6 kinases (RSKs) can be activated by hepatotoxin CCl4 on liver injury (45). In the results, the p-sites of up to five members of AGC/RSK group such as MSK1, RSK4, MSK2, p70S6K and RSK1 are significantly enriched (Table V). Because of the relatively low abundance of tyrosine phosphorylation, the number of identified pY sites from large-scale studies will be limited without specific enrichment through anti-pY antibodies (14). However, in our data set, the distribution of pY in the whole human phosphoproteome is 22.6% (Fig. 6D), with an order of magnitude higher than in liver (1.8%) (Fig. 6B). Thus, it can be expected that the p-sites of most TKs will be under-represented, because tyrosine-specific strategies were not used in our analysis. Indeed, no TKs were detected with over-represented p-sites, whereas 57 TKs (85.1%) preferentially modify fewer p-sites (Table V, supplemental Table S15). DISCUSSION In the post-genomic era, the dissection of the functional complexity and diversity of the proteome has emerged as an urgent challenge. In particular, proteins are transiently and dynamically regulated by hundreds of PTMs in vivo, which adds a dimension of functional complexity. As one of the most essential PTMs, phosphorylation has attracted considerable attention for its functional importance (1)(2)(3). Investigating phosphorylation at the systemic level can help in the under-standing of its molecular mechanisms and regulatory activity (8 -10). Rapid progresses in phosphoproteomics using phosphopeptide enrichment and HTP-MS techniques have detected tens of thousands of potential in vivo p-sites with high confidence (1,14,22). However, deeper analysis of these un-annotated p-sites to allow elucidation of ssKSRs in eukaryotes is lacking and at present is hampered by limited computational methods. In contrast with previous studies that focused exclusively on humans (8,9,17), here we designed a general and integrative approach to predict in vivo ssKSRs in five eukaryotic species. In iGPS 1.0, the GPS 2.0 algorithm was used to predict potential PKs for un-annotated p-sites (24), and both experimentally identified and pre-predicted PPI information was adopted for further filtering of false-positive hits. Extensive evaluations and comparisons suggest the prediction performance to be promisingly accurate and better than NetworKIN (8,9) (Table IV).
With this powerful tool, we systematically predicted potentially ssKSRs and modeled PPNs from eukaryotic phosphoproteomic data. The total predictive coverage is 30.4% (44,290/145,646), which is a great amount of information for experimentalists. Among the top 10 PKs with the most p-sites in the five eukaryotic phosphoproteomes, we observed up to 33 PKs (66.7%) that belong to the CMGC group (supplemental Table S12). The PKs in the CMGC group are implicated in the cell cycle/cell division (e.g. CMGC/CDK) and signal transduction (e.g. CMGC/MAPK and CMGC/GSK) pathways (from the GO annotations in the UniProt database), which is con- sistent with the major roles of phosphorylation (1)(2)(3)17). Three potential hypotheses may be offered to interpret this observation. First, the prediction might be influenced by the GPS 2.0 algorithm at the p-site level such that better performance can generate more kinase-specific p-sites (24). In GPS 2.0, the Ac, Sn, Sp, and MCC of CMGC/MAPK are 86.05%, 91.21%, 85.94%, and 0.2950, respectively, whereas the performance of AGC/GRK is 92.46% (Ac), 94.05% (Sn), 92.37% (Sp), and 0.5999 (MCC) (24). Although the accuracy of AGC/ GRK is much better than CMGC/MAPK, no GRK members are included in the top 10 PKs with the most p-sites in any organism. Also, although the Ac, Sn, Sp, and MCC of Atypical/PIKK/ATM are 94.47%, 100.00%, 94.38%, and 0.4451, respectively, the human ATM is not contained in the top 10 PKs with the most p-sites. Thus, this result is not caused by a bias from the GPS 2.0 prediction. Second, the number of PPIs might influence the prediction at the substrate level such that more PPIs lead to a greater number of predicted substrates. We counted the number of PPIs for each PK, and only 14 CMGC PKs (28%) belong to the top 10 PKs with the most PPIs in the five species (supplemental Table S16). Again, although the number of GSK3A interacting proteins in humans is only the 28th in rank (supplemental Table S16), it is one of the top 10 PKs with the most p-sites in this study (supplemental Table S12). Although the number of ATM binding proteins ranks 8 th in humans (supplemental Table S16), it is not included in the top 10 PKs with the most p-sites (supplemental Table S12). In this regard, this observation is not caused by a bias from the PPI filter. Finally, this prediction might reflect the bona fide status that most of the p-sites were phosphorylated and regulated by the CMGC group PKs. In addition, by analyzing the human liver phosphoproteome, we observed a similar result that 6 of the top 10 PKs with the most p-sites belong to CMGC PKs (supplemental Table S14). Taken together, it is proposed that CMGC PKs play a predominant role in regulating cellular phosphorylation. Although CMGC PKs play a general role for the phosphorylation, several PKs in a distinct sample might preferentially modify more or fewer p-sites to ensure the precise regulation. By the statistical analysis and comparison of predicted results of human liver and whole PPNs, we observed that a considerable number of PKs significantly regulate more or fewer p-sites in human liver PPN (supplemental Table S15). Beyond the results that are consistent with previous analyses, our study suggested that a number of PKs, such as CLK1, PKN2, and CK1d, also play a potentially important role in the human liver PPN (Table V). In 2007, Villen et al. experimentally identified thousands of p-sites from a 21-day-old mouse liver (14). By collecting 6089 p-sites in 2209 mouse liver proteins (14), we predicted 4502 ssKSRs among the 308 PKs and 543 proteins for 1176 p-sites, with a coverage rate of 19.3% (supplemental Table S17). However, we only detected 13 and 9 PKs with significantly over-or under-represented p-sites with the Yates' chi-squared ( 2 ) test (p value Ͻ 0.01, supple-mental Table S18) (41). And the statistical significance is much lower against the result in the human liver PPN (supplemental Table S15). In this regard, we proposed that our results might be more useful for further studying hepatic functions in H. sapiens.
Our approach can be generally used to identify potential in vivo ssKSRs in eukaryotes. The total predictive coverage is 30.4% (44,290/145,646) (Table III), which is a great amount of information for experimentalists. We anticipate that more efficient contextual filters will be integrated into this method over time to improve its prediction ability. This study can serve as a starting point for the general analysis of the various PTM-regulating proteomes, not limited to phosphorylation.