A Curated Resource for Phosphosite-specific Signature Analysis

Signaling pathways are orchestrated by post-translational modifications (PTMs) such as phosphorylation. However, pathway analysis of PTM data sets generated by mass spectrometry (MS)-based proteomics is typically performed at a gene-centric level because of the lack of appropriately curated PTM signature databases and bioinformatic tools that leverage PTM site-specific information. Here we present the first version of PTMsigDB, a database of modification site-specific signatures of perturbations, kinase activities and signaling pathways curated from more than 2,500 publications. We adapted the widely used single sample Gene Set Enrichment Analysis approach to utilize PTMsigDB, enabling PTM Signature Enrichment Analysis (PTM-SEA) of quantitative MS data. We used a well-characterized data set of epidermal growth factor (EGF)-perturbed cancer cells to evaluate our approach and demonstrated better representation of signaling events compared with gene-centric methods. We then applied PTM-SEA to analyze the phosphopro-teomes of cancer cells treated with cell-cycle inhibitors and detected mechanism-of-action specific signatures of cell cycle kinases. We also applied our methods to analyze the phosphoproteomes of PI3K-inhibited human breast cancer cells and detected signatures of compounds inhibiting PI3K as well as targets downstream of PI3K (AKT, MAPK/ERK)


In Brief
Pathway analysis of PTM data sets is typically performed at a gene-centric level because of the lack of appropriately curated PTM signature databases. We have developed a PTM signatures database (PTMsigDB) providing curated phosphorylation signatures of kinases, perturbations and signaling pathways to enable site-specific PTM signature enrichment analysis (PTM-SEA). Application of PTM-SEA to phosphoproteomes of several cell lines perturbed with growth factors, cell cycle inhibitors, or a specific PI3K inhibitor demonstrated the potential of our site centric approach to study dysregulated pathways in cancers.

Graphical Abstract
Identifying signaling pathways that are dysregulated in diseases such as cancer is crucial for understanding the molecular mechanisms underlying the disease and ultimately for developing better treatments. Post-translational modifications (PTMs) 1 of proteins play a key role in practically every cellular process by regulating activity, localization and interaction of proteins. Mass spectrometry (MS)-based proteomics facilitates profiling of tens of thousands of PTM sites in a single experiment, most importantly phosphorylation, acetylation and ubiquitylation. Because of tremendous improvements in mass spectrometry technologies as well as sample processing workflows, large-scale analyses of PTMs have now become feasible for entire patient cohorts. This is exemplified by recent Clinical Proteomics Tumor Analysis Consortium (CPTAC) landmark studies of breast and ovarian cancer, in which pivotal findings were derived from phosphoproteomic analyses of up to 174 patient tumor samples (1,2). For example, regulation of specific phosphorylation pathways revealed a novel breast cancer subtype that was unique to the phosphoproteomes of the tumors and could not be observed in RNA, DNA or protein space (1). Such analyses require well-annotated molecular signaling pathways and ideally would incorporate the role of specific modified sites rather than gene products alone. However, databases annotating biological pathways such as KEGG (3), Reactome (4) or MSigDB (5) are gene-centric and do not capture information for individual PTM sites. Therefore, pathway analysis of acquired phosphoproteomes typically involves collapsing discrete measurements of PTM sites on proteins into a single measurement represented by e.g. the mean or median of corresponding sites, which are mapped to their respective gene symbols. This procedure allows the resulting data set to be used for gene-centric pathway analysis, sacrificing the additional information provided by individual PTM sites (Fig. 1) and likely diluting the signal represented in the phosphoproteome data. The information loss is es- . Protein A exemplifies the presence of two protein isoforms carrying an isoform-specific phosphorylation site. B, Gene-centric pathway analysis typically involves combining fold changes of multiple sites mapping to the same gene symbol by calculating a center value of abundance (mean or median) or by choosing a single site characterized by high degree of variance (thus information) across a sample cohort. Additional information provided by multiple phosphorylation sites on a single protein as well as different sites on protein isoforms are not considered. Resulting gene-centric expression matrix can then be queried against gene-centric pathway databases such as Reactome, KEGG or MSigDB, in which each pathway is represented as a collection of gene symbols. C, Site-centric pathway analysis takes all quantified phosphorylation sites into account and requires a database of pathways annotated using individual phosphorylation sites. For that purpose, we developed PTMsigDB, annotating signatures of pathways, kinases and perturbations at the level of sites. Each site is annotated with the direction of regulation in a signature, i.e. whether its abundance is decreased or elevated, exemplified by blue and red arrows, respectively. Quantitative, site-centric phosphoproteomic data can be directly queried against PTMsigDB to identify signatures of phosphosites that correlate with annotated signature sets in PTMsigDB.
Here we present the first iteration of PTMsigDB, a collection of PTM site-specific signatures that have been assembled and curated from published data sets (Fig. 1). The current version is comprised exclusively of phosphorylation signatures but serves as a foundation for signatures of other PTMs like ubiquitination or acetylation to be added in the future. Although PTMsigDB enables kinase signature analysis similar to published tools described above, it additionally includes many curated signature sets of PTM sites mined from Phos-phoSitePlus (13), Netpath (14) and WikiPathways (15) that have been i) quantitatively characterized in specified perturbation studies; ii) are known substrates of kinases; or iii) are known to be activated/deactivated in signaling pathways. A unique feature of PTMsigDB is the annotation of each PTM site with the direction of regulation, i.e. increased or decreased abundance upon a specific perturbation or in the context of signaling through a canonical pathway. We further present an extension of the widely used single sample Gene Set Enrichment Analysis (ssGSEA) (16,17) that enables PTM Signature Enrichment Analysis (PTM-SEA) illustrated here for MS-based phosphoproteomics data sets.
We first assessed the utility of our approach using a wellcharacterized, literature-derived phosphoproteome data set of epidermal growth factor (EGF)-perturbed HeLa cells (18) and compared PTM-SEA to regular gene-centric ssGSEA. We next applied PTM-SEA to phosphoproteomes of human cell lines treated with different cell cycle inhibitors that have been profiled in the Library of Integrated Network-Based Cellular Signatures (LINCS) (19,20). Finally, to demonstrate the value of PTM-SEA in perturbation studies targeting clinically relevant cancer pathways, we applied our tools to analyze a deep phosphoproteome of a PI3Ka-inhibited breast cancer cell line.

EXPERIMENTAL PROCEDURES
Assembly of PTMsigDB-The foundation of PTMsigDB is PTM sites curated by PhosphoSitePlus (PSP). All sites present in PTMsigDB were either directly derived from, or subsequently mapped to PSP. Each site in PTMsigDB is annotated with the PSP site group id, flanking amino acid sequence (Ϯ 7 amino acids) and UniProt accession number (Table I, supplemental Table S1). Each site in our database is annotated with the reported direction of regulation upon a specific perturbation or in a molecular pathway. We note that we do not capture functional annotations of PTM sites in PTMsigDB, such as activating or inactivating effect on the modified protein.
PTMsigDB is stored in Gene Matrix Transposed (GMT) format, a simple flat file format utilized by the gene-centric molecular signature database MSigDB (16,21). Signature sets were assembled using three different modes of curation: i) entirely manual specification by domain experts (NetPath, WikiPathways, PSP kinase-substrate sig-natures); ii) semi-automatic specification (PSP perturbation signatures); and iii) fully automatic specification (LINCS perturbation signatures). Each signature is a unique collection of confidently localized PTM sites. Because sites are single amino acid residues, PTMsigDB requires single site-centric data matrices, and multiply phosphorylated peptides must be resolved into single site-centric data.
Although in this manuscript we focus on human signature sets, we have also started to derive signatures for mouse and rat (Table II,  supplemental Table S2). Below we describe in more detail how signature sets of different categories have been assembled.
Signature Extraction from PSP Perturbation Data-Perturbation signatures were derived from PTM sites manually curated from literature describing the phosphoproteomic impact of perturbations such as small molecules and growth factors. The initial database represented results from 2,483 perturbation studies. Many studies employed common perturbations (for instance, the same small molecule) with different experimental conditions (such as different cell lines, dosages or time points). To manage the bias inherent in such differential representation of perturbations we assembled "consensussignatures" that combined results consistently reported across different studies of a common perturbation. For each perturbation P we extracted all PTM sites that were increased or decreased in abundance by P. For each PTM site S belonging to the P-altered set we then required a minimum of two independent studies that reported S to be regulated by P in the same direction (Fig 2A). Requiring consistency between two independent studies balanced the competing interests of retaining large numbers of signatures and PTM sites per signature and maximizing the likely relevance of such signatures to novel perturbation data sets. Each site in this category is annotated by the Pubmed identifier (PMID) of the corresponding publication in the Description field of the GMT file.
Kinase-Substrate Signatures from PSP-The file Kinase_Substrate_ Data set contains a list of experimentally-determined kinase substrates and metadata curated from the literature by PSP and was downloaded from their website (https://www.phosphosite.org/static-Downloads) on March 21, 2018. Kinase-substrate signatures were parsed from the downloaded file using an in-house developed Rscript. Briefly, all annotated substrate sites for each kinase were extracted and a minimum of five distinct sites was required to form a kinase signature. The script was used to parse kinase signatures for human, mouse and rat.
Signatures of Molecular Signaling Pathways from NetPath-Annotation of pathways with phosphorylation was performed manually by domain experts and was based on 909 publications. All sites were first mapped to PSP and were annotated by flanking amino acid sequence and site group id. Only sites that could be mapped to PSP and that were consistently annotated with either increased or decreased levels were retained. Each site in this category is annotated by either the Pubmed identifier (PMID) or by a string concatenating first author and publication year of the corresponding manuscript. This information can be found the Description field of the GMT file.
Signatures of Molecular Signaling Pathways from WikiPathways-All human, mouse and rat specific pathways in gpml format were downloaded from (https://www.wikipathways.org/index.php/ Download_Pathways) on March 20, 2018. An in-house developed R-script was used to parse each file representing a single pathway and to extract information about annotated phosphorylation sites. A minimum of five distinct sites was required to form a signature. All signature sets were exported in GMT format. Perturbation Signatures from LINCS Database-Perturbation data derived from the P100 assay (19) were downloaded from Panora-maWeb (https://panoramaweb.org/project/LINCS/P100/begin.view) (22) on August 30, 2017. We retrieved level 4 data (L4) which were sample-normalized (columns), QC'd and plate-normalized (rows) comprising a total of 189 perturbations and 12 human cell lines. Data for the P100 assay were acquired in PRM or in DIA mode and analyzed using Skyline (23). A more detailed description of the different data processing steps, including the analysis of DIA data, are described in (24). We first normalized P100 probes to the DMSO control experiment, which was present in each 96 well plate, by subtracting the average log2 DMSO ratio followed by Z-score transformation (per sample column). To derive signatures for a specific perturbation we considered probes with Z-scores more extreme than Ϯ 2 per sample column observed in at least two samples (replicates or cell lines) measured on the same 96 well plate. Perturbations measured on multiple plates were combined into a single perturbation signature set. All metadata (plate number, cell line, dose, etc.) for each site of a signature can be found in the Description field. All sites were mapped to PSP and were annotated by flanking amino acid sequence and site group id. A minimum of five distinct sites was required to form a signature and signature sets were exported in GMT format.
Single Sample Gene Set Enrichment Analysis (ssGSEA)-Gene Set Enrichment Analysis (GSEA) (16) is a tool for determining enrichment of gene sets using genes differentially expressed between two phenotypes measured in a sufficient number of replicates. The approach to apply GSEA on a single, pre-ranked data vector representing gene abundance, especially from a single sample, has been introduced in (17) and is typically referred to as the signature projection method or single sample GSEA (ssGSEA). Briefly, expression values of genes in one sample are rank-ordered and an enrichment score for a signature S is calculated based on the difference in the Empirical Cumulative Distribution Functions (ECDFs) of genes in S and the set of all genes. This ECDF difference between signature S and remaining genes (⌬ECDF) is calculated by walking down a rank-sorted list of genes, increasing a running-sum statistic when a gene belongs to S, and decreasing the running-sum statistic if the gene does not belong to S. The enrichment score ES reflects the degree to which genes in S are overrepresented at the top or bottom of the ranked-ordered list. Two methods have been proposed to calculate ES: the original GSEA publication (16) described a weighted Kolmogorov-Smirnov-like statistic defined as maximal distance between both ECDFs: To account for different signature sizes, and correlations between gene set and the data set, a normalized enrichment score (NES) is calculated by permuting the genes in the sample using a user-defined number of permutations n (typically n ϭ 1000): Permutations are also used to calculate a p value for a given NES of a signature S: Calculated p values can be further corrected for multiple hypothesis testing. ssGSEA 2.0 and PTM-SEA Specific Modifications-We obtained the R program for ssGSEA analysis developed by the original authors and re-implemented parts of the software to optimize execution speed (support of multiple CPU cores). We further modified the handling of missing values in the data set more robust by discarding them before calculating pathway enrichment scores and implemented support for GCT (Gene Cluster Text) v1.3 files which enable storing data and metadata in a single text file. Import, export and manipulation of GCT files was implemented using the cmapR R-package (https:// github.com/cmap/cmapR). We further extended the generated output of the original implementation of ssGSEA by utilizing GCT v1.3 metadata fields. For instance, for each scored gene set information on the number of scored genes and total percentage of genes scored set are included in the results files.
We extended the regular ssGSEA algorithm to enable site-specific, directional ( Fig 2B) PTM signature enrichment analysis by incorporating directionality into the scoring scheme ( Fig 2C). Each part of a directional signature set is scored separately by first calculating enrichment scores (ES) of PTM sites annotated with increased abundance (ES u ). If the signature set contains PTM sites that were annotated with decreased abundance, a separate enrichment score (ES d ) for those sites is calculated. Both scores are combined into a single enrichment score by calculating: ES ϭ ES u Ϫ ES d . The resulting combined score reflects a measure of correlation between annotated sites in a signature and the abundance ranks of the signature in the actual data: positive values indicate positive correlation; negative values indicate anti-correlation. Normalized enrichment scores and p values are calculated on the combined score based on a user-defined number of permutations applied to a single sample column.
PTM-SEA is Available on GenePattern (25), a powerful platform to deploy and run software and entire analysis pipelines in a web browser, and can be accessed using the following URL: https:// tinyurl.com/PTM-SEA-GP. Source code, documentation and instructions to run ssGSEA2.0/PTM-SEA on a local computer and on GenePattern are available on GitHub and can be accessed using the following URL: https://github.com/broadinstitute/ssGSEA2.0.
Preprocessing of Data Sets for Site-specific PTM-SEA-The input to the PTM-SEA R program is a single site-centric data matrix, m, stored in GCT format and a signature set database in GMT (Gene Matrix Transposed) format. Each row in m represents a single phosphorylation site confidently localized to a specific amino acid residue, with measured abundances across samples specified in columns in m. Multiple phosphorylation sites detected on the same peptide must be converted into separate site-specific entities for every site. Although some proteomics software packages, such as MaxQuant (26), readily produce single site-centric PTM reports, the use of other software packages might require additional preprocessing steps. To resolve the abundance of phosposites that have been measured on different phospho-proteoform peptides we used the abundance measured on the monophosphorylated version. If no monophosphorylated peptide version has been observed, we reported the abundance of the peptide carrying the least number of phosphate groups. Because we cannot disambiguate the abundance of multiple phosphosites measured on the same peptide in general, each site was assigned the measured abundance of the corresponding peptide and reported as separate rows in m. Lastly, if a site was measured on two or more equivalent peptides (e.g. one site exclusively detected on two doubly phosphorylated peptides), we reported the average abundance of corresponding peptides. The general PTM-SEA workflow, including necessary pre-processing steps and description of data file formats is depicted in supplemental Fig. S1.
Parameters for PTM-SEA and ssGSEA2.0 -Abundance levels of proteins/genes or PTM sites were first rank-normalized (sample. norm.type ϭ "rank"). Rank-normalized profiles were further Z-scored (correl.type ϭ "z.score") and weighted (weight ϭ 0.75). To obtain signature enrichment scores we calculated the difference of the ECDFs of the sites part of the signature and of all sites (statistic ϭ "area.under.RES") as described in (17). Normalized enrichment scores (NES) and p values were calculated using 10,000 permutations (nperm ϭ 10000). We note that 1,000 permutations were typically enough to obtain stable p values. Although p values were calculated on a sample-by-sample basis, we further corrected permutationderived p values for multiple hypothesis testing using the method proposed by Benjamini and Hochberg (27) using the entire data set (global.fdr ϭ 'TRUE'). Table S2 from the manuscript by Sharma et al. (18). The authors used the MaxQuant software suite (26,28) to obtain label-free intensity estimates of phosphosite abundances. We converted absolute intensities into relative ratios by dividing each phosphosite by its average intensity measured in six replicates of DMSO-treated control samples and subsequently applied log2 transformation. We performed the conversion to ratio data to intrinsically compare the effect of EGF and nocodazole treatment to the DMSO control experiment. We note that this conversion was not required to apply PTM-SEA, as PTM-SEA can also be directly applied to non-ratio data. All phosphosites with localization probability Ͼ 0.75 and a DMSO ratio observed in at least one sample were considered in further analyses (n ϭ 27,307).

Ablelin et al. Drug Perturbation Data Set-
The results of the discovery data described in (19) were downloaded from MassIVE using accession number MSV000079524. We downloaded the datafile Supp_DataDet1_P100_Discovery.gct and re-formatted the GCT id column to match the UniProt-centric format of PTMsigDB, as described above (supplemental Fig. S1). The resulting GCT file could be readily used for PTM-SEA using parameters described above. Because this discovery data set was used to derive the targeted P100 probes (from which we curated LINCS signature sets) we removed all P100 signatures from PTMsigDB prior to analysis to avoid a potential contamination of training and test data sets.
Cluster Analysis-All cluster analyses were based on hierarchical clustering using average linkage and Euclidean distance metric. Heat-maps were generated in R using the pretty heatmap package (pheatmap) or in Morpheus (https://software.broadinstitute.org/morpheus).

CCLE PI3K Perturbation Experiment-
Cell Culture and Treatment-T47D cells were cultured in RPMI medium with 10% FBS and 1% Penicillin/streptomycin. Cells were plated in 15 cm plates at a final density of 3.25e6 and 2e6, respectively, in a total volume of 29 ml, and incubated at 37°C under standard culture conditions. Forty-five hours (approximately) after plating, 1 ml of BYL719 diluted in RPMIϩFBSϩP/S medium was added to the treated plates to achieve a final concentration of 1uM, with 1 ml of DMSO-dilution in RPMIϩFBSϩP/S medium added to the control-plates. Plates were then placed back in the 37°C incubator. Plates were o harvested at 6-h and 24-h time points. For each time point, BYL719-treated and DMSO-treated plates were removed from the incubator in batches of 2, placed on ice, and washed once with ice-cold PBS. Cells were then scraped off the plate in fresh ice-cold PBS and placed in 15 ml conical tubes on ice. The process was repeated in batches of 2 until all plates for that specific time point and treatment were collected into a single 15 ml conical tube. Cells were then spun at 1200 RPM in a 4°C centrifuge for 10 min. PBS was removed by aspiration, and the cell pellet was resuspended in ice cold PBS for a third wash. The cell suspension was then distributed, proportionally, into individual Eppendorf tubes intended for proteomics profiling, RNAseq and immunoblotting analysis. Eppendorfs were spun at 3000rpm in a 4°C table top centrifuge, the PBS supernatant was manually removed, and cell pellets were snap frozen in a 70% ethanol/dry ice bath. Frozen pellets were stored at Ϫ80C°until processing for proteomics experiments. Time course treatments were performed in technical triplicates, each performed on different days.
Lysis and Digestion-Cell pellets were lysed in 50 mM Tris-HCl buffer (pH 8.0) containing 8 M Urea, 75 mM NaCl, 1 mM EDTA, protease and phosphatase inhibitors as previously described (29). Protein concentration was determined using a bicinchoninic acid (BCA) protein assay (Pierce, Thermo Fisher Scientific, Bremen, Germany). Following reduction with 5 mM dythiothreitol (DTT) for 30min at room temperature (RT) and alkylation with 10 mM iodoacetamide for 45min at RT in the dark, urea concentration was reduced to 2 M and samples were digested with Endoproteinase LysC (Wako Chemicals, Richmond, VA) at an enzyme-to-substrate ratio of 1:50 for 2 h at 30°C followed by sequencing grade trypsin (Promega, Madison, WI) at an enzyme-to-substrate ratio of 1:50 overnight at 30°C. Reactions were quenched the next day with formic acid added to a final concentration of 1%. The samples were desalted using tC18 SepPak 200 mg cartridges (Waters, Milford, MA). Briefly, the cartridges were conditioned with 3 ml of Acetonitrile (ACN) followed by 3 ml of 50% ACN containing 0.1% formic acid (FA). After equilibrating cartridges with 3 ϫ 4 ml of 0.1% trifluoroacetic acid (TFA)/water, samples were loaded and washed with 3 ϫ 3 ml of 0.1% TFA/water followed by 1 ϫ 3 ml of 1%FA/water. Samples were eluted with 2 ϫ 1.5 ml 0.1% FA/50% ACN, frozen at Ϫ80°C and dried down by vacuum centrifugation.
Labeling-For the quantitative phosphoproteomics experiment, 10 samples were labeled with TMT 10-plex reagent (Lot Number: PE201534A) (Thermo Fisher Scientific) following the manufacturer's protocol. Briefly, 1 mg of each sample was reconstituted in 1 ml of HEPES buffer at pH 8.5. Eight milligram of each of the TMT10 reagents was reconstituted in 410l of ACN and added to the corresponding sample (Table III). After incubation at RT with shaking for 1 h the reactions were quenched by adding 80l of 5% hydroxylamine to each and incubating at RT with shaking for 15min. The samples were combined and dried down by vacuum centrifugation. The combined sample was reconstituted in 0.1% FA/3% ACN and desalted using tC18 SepPack 500 mg cartridge as described above. Eluate was dried down completely.
Phosphoptyrosine (pY) Antibody Enrichment-The sample was reconstituted in 3 ml of immunoaffinity purification (IAP) buffer consisting of 50 mM MOPS at pH 7.2, 10 mM sodium phosphate, and 50 mM NaCl. One aliquot of pY1000 antibody beads (Cell Signaling Technologies, Danvers, MA) was washed 3 times with 1.5 ml IAP buffer. Washed antibody beads were added to the sample and incubated at 4°C with end-over-end rotation for 1 h. After spinning down the antibody beads, the flow-through was collected to be desalted for the deep phospho-profiling. The beads were further washed 4x in 1.5 ml of ice-cold PBS, and pY-containing peptides eluted with 2 ϫ 50l of 0.15% TFA, desalted using C18 stage tips as described before (29) and dried down.
Flow-through was desalted using tC18 SepPack 500 mg cartridge as described above and eluate was dried down.
Basic Reversed Phase (bRP) Fractionation of Peptides-Desalted flow-through sample was fractionated on Agilent 1100 HPLC system using Zorbax 300 Extend-C18 4.6 mm ID x 250 mm length column with buffers A and B of 5 mM ammonium formate pH 10.0, in 2 and 90% ACN, respectively. Sample was reconstituted in 900l of buffer A and 2 injections of 425 l were performed. Sample was fractionated at a flow rate of 1 ml/min with the following gradient: 0 -16% buffer B for 6min, 16 -40% buffer B for 60min, 40 -44% buffer B for 4min, 44 -60% buffer B for 5min and isocratic hold at 60% B for 14min. One-minute fractions were collected throughout the entire gradient for both injections into the same 96-deep well plate (GE Healthcare, Chicago, IL). Collected fractions were pooled into 24 fractions using a concatenation strategy as described before (29). Fractions were acidified with FA for a final concentration of 0.1%. 150 l (5%) from each of the fractions was transferred into an autosampler vial and dried down for LC-MS/MS analysis of the proteome. The remaining 95% of the 24 fractions was dried down for detection and quantification of phosphopeptides.
Immobilized Metal Affinity Chromatography (IMAC) of the Fractions-Ni-NTA Superflow Agarose beads (Qiagen, Venlo, The Netherlands) were used for IMAC enrichment of the peptides as described previously (29). Briefly, an appropriate amount of bead slurry for 24 fractions was transferred to an Eppendorf tube and washed with HPLC water. Beads were stripped of Ni by incubating with 100 mM EDTA for 30 min at RT with end-over-end rotation and iron was added by incubating with 10 mM iron chloride (FeCl 3 ) for 30min at RT with end-over-end rotation. Beads were washed 3ϫ in 1 ml HPLC water and resuspended in an appropriate volume of 1:1:1 (v/v/vol) ratio ACN/Methanol/0.01% Acetic acid for aliquoting of 30 l of slurry (corresponding to 10l of beads) for each fraction. Dried-down fractions were reconstituted in 1 ml of binding buffer (0.1% TFA/80% ACN) and added to the beads. Samples were incubated for 30 min at RT with gentle shaking. After spinning down the beads, the supernatant consisting of unbound peptides was removed and beads reconstituted with phosphopeptides in 200 l of binding buffer. Beads were transferred to stage tips for elution and desalting of the phosphopeptides as previously described (29). Eluted phosphopeptides were dried down completely.
LC-MS/MS Analysis of pY and IMAC Samples-pY and IMAC samples were resuspended in 9 l of 0.1% FA/3% ACN and 4 l of each was analyzed on a Q Exactive Plus mass spectrometer (Thermo Fisher Scientific) coupled to Easy-nLC 1000 liquid chromatography system (Thermo Fisher Scientific) with buffers A and B of 0.1% FA in 3 and 90% ACN, respectively. Peptides were separated at a flow rate of 200 nl/min on 75 m ID picofrit columns (New Objective, Woburn, MA) packed in-house to 22-25 cm length with Reprosil-Pur C18-AQ 1.9 m beads (Dr. Maisch GmbH, Ammerbuch-Entringen, Germany) and heated to 50°C. IMAC samples were separated with a solvent B ramp of 2 -6% in 1min followed by a linear gradient of 6 -30% in 84min for a total run time of 110 min. Eluted peptides were analyzed in data-dependent mode acquiring MS1 scan at 70,000 resolution with automatic gain control (AGC) set to 3e6 followed by MS/MS on the 12 most abundant ions with resolution of 35,000, AGC of 5e4 with maximum inject time of 100ms, isolation width of 1.6Da with 0.3Da offset, normalized collision energy (NCE) set to 31(optimized for the specific QE Plus), peptide match function set to preferred, isotope exclusion function enabled, and dynamic exclusion set to 30 s. pY samples were separated in 2 ϫ 4 l injections of the sample with a 154min LC-MS/MS method where linear gradient of 6 -35% solvent B was performed in 120min. MS parameters were as described above other than the maximum inject time for MS/MS being set to 120ms and dynamic exclusion set to 5 s.
CCLE Perturbation Data Analysis-Raw mass spectrometry data was analyzed using Spectrum Mill Workbench v5.1 (pre-release) (Agilent Technologies, Santa Clara, CA) as described before (29). Briefly, extracted MS/MS spectra were searched against human UniProt protein sequence database (downloaded on 12/28/2017) complemented with 264 commonly observed lab contaminants (65,068 database entries). Enzyme definition was defined as "trypsin allow P" for full tryptic specificity and up to five missed cleavages were allowed. Carbamidomethylation of cysteines was set as a fixed modification together with TMT10 isobaric labels at lysine residues (N termini would be considered regardless if it was TMT-labeled). Acetylation of protein N termini, oxidation of methionine and phosphorylation of serine, threonine and tyrosine were set as variable modifications with a precursor MHϩ shift range of 0 to 272 Da. Database spectral matching was performed by Spectrum Mill search engine using Ϯ20 ppm mass tolerance for precursor ions and product ions. Peptidespectrum matches (PSMs) were automatically designated as confidently assigned using the Spectrum Mill autovalidation module employing a target-decoy search strategy to control false discovery rate (FDR). Individual PSMs were rolled up to phosphorylation sites that were separately FDR-controlled resulting in an estimated FDR Ͻ1.3% at the level of individual phosphorylation sites. Phosphorylation sites that could be unambiguously assigned to specific serine, threonine and tyrosine residues as automatically determined by Spectrum Mill software (1) were considered for further analyses (n ϭ 24,035). TMT reporter ion intensities were corrected for isotope impurities using a previously described strategy (30) integrated into Spectrum Mill. TMT reporter ion ratios for BYL719/DMSO treatments of the first two replicates were calculated by matched replicates (e.g. BYL719 R1 / DMSO R1 ) ( Table III). The third replicate did not have a paired control experiment and the average reporter ion intensity of both DMSO replicates was used as denominator. TMT reporter ion ratios were log2 transformed and median-centered for subsequent analysis.
Experimental Design and Statistical Rationale-The PI3K perturbation experiment comprised a single cell line (T47D, n ϭ 1), analyzed at 6h and 24h time points, of inhibitor (BYL719) and paired control   TABLE III  TMT-10 reporter ion channel layout for PI3Ka inhibitor experiment   Channel  126  127N  127C  128N  128C  129N  129C  130N  130C  131   Pert  DMSO  DMSO  DMSO  DMSO  BYL719  BYL719  BYL719  BYL719  BYL719  BYL719  Time  6h  24h  6h  24h  6h  24h  6h  24h  6h  24h  Replicate  R1  R1  R2  R2  R1  R1  R2  R2  R3  R3 (DMSO) treatments. The BYL719 treatments were measured in three biological replicates, whereas the control experiments were measured in two biological replicates to accommodate the entire experiment in a single TMT-10 plex (Table III). BYL719/DMSO ratios were calculated and median-normalized as described above. To combine the triplicate measurements in each time point into a single readout as input for PTM-SEA, we employed a moderated one-sample t test using the limma R-package. This appropriately accounted for the variance observed across replicates captured in a p value. The resulting vector of p values was log-transformed and multiplied by the sign of the average log2 reporter ion ratio: s ʦ ͕site͖:p s ϭ Ϫ 10 log 10 ͑ p value͒ sign͑log͑ fold change͒͒ The transformed p values were used as input for PTM-SEA. We did not rank-normalize the transformed p values (sample.norm. type ϭ 'none') and set the weight parameter to 1 (weight ϭ 1) to incorporate the magnitude of the transformed phosphosite p values into the calculation of enrichment scores and p values for each signature.

RESULTS
Signature Sets in PTMsigDB-We developed PTMsigDB, a curated phosphosite-centric resource of molecular signatures of perturbations, kinase activity and molecular pathways. The foundation of PTMsigDB is PhosphoSitePlus (PSP) (13), a comprehensive systems biology resource for PTMs, which provides high-quality curation and annotation of PTMs at the individual residue level. We define a collection of PTM sites, whose levels are collectively regulated in a curated pathway or upon a perturbation, as a signature set. To derive signature sets, and to build up PTMsigDB, we mined PhosphoSitePlus (PSP), NetPath (NP), WikiPathways (WP) and the LINCS databases. Depending on the source of a signature, the curation process was either entirely manual (WP, NP), semi-automated (PSP kinase signatures, PSP perturbation signatures) or fully automated (LINCS). Manually curated signature sets representing molecular pathways were assembled by human domain experts. For signatures sets derived in a semi-automated or fully automated manner, we required that consistent results (site and direction of change) be reported by more than one independent study (see Methods for details) (Fig 2A,  supplemental Fig. S2A).
Signature sets in PTMsigDB can be separated into three categories: 1) Perturbation signatures derived from treatment of cells with perturbations such as small molecules or growth factors (PSP, LINCS); 2) Signature sets of molecular signaling pathways (NP, WP); and 3) Kinase-substrate signatures (PSP). The different categories contain different numbers of signature sets, associated proteins and PTM sites (Table IV, supplemental Fig. S2B). In total, PTMsigDB contains 491 signature sets comprising 8029 phosphorylation sites on 2650 proteins. To assess the proportion of sites of known functional relevance, we compared sites in PTMsigDB to regulatory PTM sites in PSP, a set of functionally annotated sites that regulate downstream cellular processes, molecular functions and protein-protein interactions that we refer to as a "gold standard" of annotated PTM sites. Of 6,685 human phosphosites in the PSP regulatory site collection, 70% were present in signatures of PTMsigDB, emphasizing the functional relevance of most phosphorylation sites in PTMsigDB (supplemental Fig.  S2C).
Kinases are key players in molecular signaling and a single kinase can play an essential role in multiple different pathways. Depending on the essentiality of a kinase, its substrate site can be assigned to several molecular pathways or perturbation signatures in PTMsigDB. To assess the degree of essentiality of phosphosites in PTMsigDB, we determined for each site the number of signatures they contribute to. Whereas the most redundant, and hence essential, sites occur in about 20% of all signatures, the median redundancy across all phosphosites was 2.5 signatures. Interestingly, six of the 10 most redundant PTM-sites in our database were T-x-Y activation loops of several mitogen activated protein kinases (MAPKs) underlining their essential role in a large number of molecular signaling pathways (Table V). Although this observation is consistent with gene-centric databases (MAPK3 is the most redundant gene symbol in MSigDB 5.1), in PTMsigDB we capture the increased phosphorylation levels of the activation loop (MAPK3 Thr-202 and Tyr-204), and so provide a direct readout of enzyme activation rather than merely elevated gene expression levels of MAPK3 as represented in MSigDB.
Tyrosine phosphorylation of proteins presents the key mechanism to propagate extracellular signals, mediated by growth factor ligands, into the cell via receptor tyrosine kinases (RTKs), a class of transmembrane proteins which induce intracellular signal cascades and ultimately initiate cell proliferation (31). Compared with its low frequency in the human proteome (2.6%), tyrosine sites were overrepresented in PTMsigDB (20.4%) (supplemental Fig. S2D), once again emphasizing the functional relevance of phosphorylation sites in PTMsigDB.
A unique advantage of PTMsigDB over other pathway databases is the annotation of each PTM site with its reported direction of change upon a specific perturbation or signaling event. For instance, stimulation with growth factors results in an activation of specific signaling cascades; thus, PTM sites involved in these cascades are typically up-regulated, whereas inhibitory perturbations such as kinase inhibitors result in down-regulation of affected PTM sites. Although most signature sets in PTMsigDB were unidirectional (i.e. solely consisted of either up-or downregulated sites (Fig 2B), about 30% of signature sets were bidirectional and contained sites with both increased and decreased levels. However, most bidirectional signature sets were dominated by sites demonstrating up-regulation. For instance, requiring an arbitrarily chosen minimum of five PTM sites per category (increased/decreased abundance) resulted in 28 signatures that were distinctly bidirectional ( Fig 2B). With ongoing collection of high-quality quantitative MS-based PTM data and associated curation efforts, we expect to see more site-specific phosphorylation signatures of a bidirectional nature. To account for the additional information provided by annotated direction of abundance changes, we adapted the scoring scheme of the widely used single sample Gene Set Enrichment Analysis (ssGSEA) algorithm (17) (see Methods). The resulting signature score reflects the degree of correlation between signature and data set. Positive signature scores indicate positive correlation, scores around zero indicate no correlation and negative scores indicate anti-correlation between signature and data (Fig 2C). To reduce experimental noise we require that each site be consistently reported by at least two independent studies (consensus signatures). Experimental noise can be introduced by use of different cell systems (e.g. in vivo versus in vitro), different protocols (e.g. dosage of perturbation) or different technologies (e.g. low-throughput versus high throughput). B, Bar chart depicting the distribution of unidirectional (all sites of a signature are annotated with decreased OR increased abundance levels) and bidirectional (sites of a signature include both decreased AND increased abundance levels) signature sets. C, Modified scoring scheme of the single sample Gene Set Enrichment Analysis (ssGSEA) algorithm enables scoring of directional signature sets. For each direction a running-sum statistic (y axis) is calculated by walking down the ranked list of PTM sites in the data set (x axis). An enrichment score (ES) reflected by the area under the resulting curve is calculated for each direction (ES u for "up" sites and ES d for "down" sites) separately and combined by subtracting ES d from ES u . D, Heatmap depicting the overlap between phosphorylation sites commonly affected by kinase-inhibitors. The similarity matrix is based on the number of shared phosphosites between signatures. The upper and lower triangular matrices are normalized by the total number of sites of signatures listed in rows and columns, respectively. PTMsigDB includes phosphosite signatures of 45 kinase inhibitors (KI), a class of small molecules that revolutionized cancer therapies by enabling targeted inhibition of pathways dysregulated in specific cancers. Whereas most KI signature sets were assembled automatically from the LINCS database (see Methods), fifteen of these KI signatures sets were assembled semi-automatically from a set of manually curated phosphorylation sites derived from PSP. These signature sets comprised inhibitors targeting ABL, EGFR, mTOR, cMET, MEK, PI3K, ROCK and SRC kinases and were assembled from 131 phosphorylation sites with only moderate overlap (Fig 2D). To group KIs according to their phosphosite signatures we constructed a distance matrix based on the number of phosphosites shared between KI signatures which we subsequently subjected to hierarchical clustering. Interestingly, KIs targeting PI3K (LY294002, wortmannin) and MEK (PD98059, U0126) clearly cluster together whereas signatures of KIs targeting other kinases such as EGFR and mTOR were not characterized by high overlap in phosphosites pointing to different downstream effects introduced upon kinase inhibition. Although the vast majority of phosphosites from KI signatures were annotated with decreased abundance levels upon KI treatment, 17 out of 131 sites were annotated with increased abundance levels including Y861 on PTK2 (erlotinib) andY977 on PLCG1 (imatinib). A list of all KI signature sets, including fully-automatically derived signatures, can be found in supplemental Table S3.
Modified Amino Acid Residues in PTMsigDB Are Robustly Represented Across Databases and Organisms-Gene-centric pathway databases such as KEGG, Reactome and MSigDB use gene symbols as standardized primary identifiers (32). However, there is no standardized nomenclature for protein accession numbers. Protein database resources such as UniProt (33), NCBI RefSeq (34) or Ensemble (35) have distinct accession number schemes, which makes it difficult to compare proteins across databases. Moreover, underlying gene models of a given protein, such as translation start sites (TSS) or exon-intron boundaries, can vary across databases and between different database releases. This directly affects the representation of modified protein residues in terms of protein accession and residue number. Approaches to circumvent this issue include the use of amino acids flanking the modified residue as a unique identifier. However, these flanking residues can vary between isoforms or can be altered by non-synonymous mutations when searching customized sequence databases (36). PSP has developed a robust classification scheme providing unique identifiers for modification sites (site group ID) combining homologous and orthologous sites within proteoforms and between species (13). Every site in PTMsigDB was either directly derived from, or first mapped to PSP, thereby ensuring an unambiguous identifier across proteoforms and species for every PTM site. To ensure a high degree of compatibility to phosphorylation data sets generated by different software packages and searched against different protein sequence databases, PTMsigDB represents signatures using three different identifiers for phosphorylation sites: i) PSP site group ID; ii) UniProt; iii) Flanking sequence (Table I). Although the PSP site group ID provides an unambiguous representation of PTM sites within protein families and across species (13), using this type of identifier restricts the analysis to PTM sites present in PSP. We generally recommend using the flanking sequence as site identifier, because these are more invariant to updates made to protein sequence databases. The database format was kept generic and can be readily used to incorporate signatures of other PTMs beyond phosphorylation.
PTM-SEA Captured Underlying Signaling Cascades in EGFand Nocodazole-treated HeLa Cells-To assess our approach in the context of known biology we applied PTM-SEA to a recent high quality study of the well-characterized phosphoproteome of HeLa cells (18). In this study the authors evaluated the phosphoproteome dynamics of HeLa S3 cells stim- ulated with epidermal growth factor (EGF) and arrested in mitosis by nocodazole treatment. Because signatures of both EGF and nocodazole perturbation are part of PTMsigDB, this study provided a useful benchmark data set to assess the utility of our approach. The authors report Ͼ50K phosphorylation sites, of which Ͼ36K sites could be confidently localized to a specific residue and so were used in subsequent analysis. We first converted the label-free intensity values into ratios to the control experiment (DMSO) and subsequently applied PTM-SEA to project phosphosites to signature sets. PTM-SEA showed strongest enrichment of the EGF and nocodazole signatures in PTMsigDB in replicate samples of corresponding treatments (FDR Ͻ 1%) (Fig 3A). In addition, we detected cell cycle-related signatures in the nocodazole treated samples, which was used to mitotically arrest cells. For instance, signatures of cell cycle kinases CDK1, CDK2 and Aurora B were enriched. Among depleted signatures were kinases involved in DNA damage response (ATR, ATM) and the EGFR pathway. Samples treated with EGF showed enrichment of signatures of other growth factors (fibroblast growth factor, FGF1; hepatocyte growth factor, HGF), signatures of different cytokines (tumor necrosis factor, TNF; erythropoietin; EPO, and interleukin IL1B, IL2 IL33 pathway, IL11 pathway), signatures of signaling pathways (e.g. MAPK signaling pathway; thymic stromal lymphopoietin pathway, TSLP; AGE/RAGE pathway) and kinases acting downstream of these pathways (ERK2, p90RSK). No consistent enrichment or depletion of signatures was observed in the DMSO-treated control samples. Focusing solely on kinase enrichment scores, we compared PTM-SEA to KSEA (KSEA App) (7,9) which demonstrated good agreement between both scoring approaches (supplemental Fig. S3). Taken together, these results demonstrated that PTM-SEA captured underlying signaling cascades stimulated by EGF or nocodazole treatment, beyond kinase activity signatures.
PTM-SEA Increased Enrichment of Expected Signaling Signatures-We then used the same data set to compare the performance of PTM-SEA's site-centric approach (in which individual sites were considered) to a gene-centric pathway analysis (in which multiple sites on the same gene were combined), and assessed how well each approach captured the underlying signaling events upon EGF and nocodazole treatment. To enable a direct comparison of both approaches we generated a gene-centric version of PTMsigDB in which signatures were represented as non-redundant sets of gene symbols without annotated regulatory direction, mimicking conventional gene-centric pathway databases like MSigDB. Additionally, in the HeLa data set described above we collapsed abundance of multiple sites mapping to the same gene symbol to their average DMSO normalized abundance. The resulting gene-centric expression matrix was projected to gene set enrichment scores by application of ssGSEA using the same parameters as described above. We used hierarchical clustering (supplemental Fig. S4A, S4B) and subsequent silhouette analysis to assess how well site-centric phospho signatures and gene-centric pathways captured signaling events introduced upon stimulation with EGF and cell cycle arrest by nocodazole. Given a distance matrix D and number of clusters K, silhouette coefficients represent a measure of intra-cluster similarity and inter-cluster distance for each clustered item i, and range between Ϫ1 and 1. A value close to 1 indicates similarity of item i with all other items in the same cluster and low similarity to items assigned to neighboring clusters, whereas negative values indicate closer similarity of item i to items in neighboring clusters. Silhouette scores were calculated for each sample (replicate measurements of EGF, nocodazole and DMSO treatments) using K ϭ 3 as the cluster number to match the number of different treatments. Importantly, silhouette scores from site-centric clustering were consistently higher than those from gene-centric analysis (Fig 3B). In addition, hierarchical clustering of gene-centric enrichment scores into three cluster did not consistently group samples of the same treatments together (Fig 3B, supplemental Fig.  S4B), whereas the site-centric analysis clearly separated different treatments and clustered together samples of EGF, nocodazole and control treatment (supplemental Fig. S4A). Although enrichment scores of the nocodazole signature were specifically enriched in the nocodazole-treated samples for both approaches, the extent of enrichment was more significant in the site-centric approach (FDR ϭ 0.001) compared with the gene-centric approach (FDR ϭ 0.08) (Fig 3C). Similar results were observed for the EGF signature (supplemental Fig. S4C, S4D) demonstrating better representation of the signature using individual phospho sites rather than collapsed gene-centric measures.
Site-specific Approach Outperforms Gene-specific Scoring-We next wanted to assess whether scoring each site-localized modified amino acid residue in a protein provides any benefit over simply scoring multiple phosphorylation events on single proteins (non-site-specific). The latter approach would enable a pseudo site-centric analysis of PTMdata sets using gene-centric databases and ssGSEA for which a substantial number of well-curated, gene-centric pathways provided by MSigDB, KEGG, Biocarta or Reactome are available. Instead of matching exact residue positions between data and database, gene symbols of multiply phosphorylated proteins were represented multiple times in calculation of enrichment scores corresponding to the number of observed phosphorylation events. The resulting bias toward higher weights of longer proteins that carry more phosphorylation sites was intrinsically addressed by calculation of normalized enrichment scores (NES). Each phosphosite was represented by the gene symbol mapping to the corresponding protein, which enabled the use of gene-centric pathway databases. Because of the resulting redundancy in repeated gene symbols we refer to this approach as gene-centricredundant. To enable the scoring of redundant lists of gene symbols, we extended the ssGSEA algorithm accordingly and applied gene-centric-redundant ssGSEA to the EGF/nocodazole treated HeLa data set described above. To enable a direct comparison to the site-centric and gene-centric approaches, we used the gene-centric version of PTMsigDB and performed the same silhouette analysis based on hierarchical clustering as described above (supplemental Fig. S5). Like the gene-centric analysis, hierarchical clustering of gene-centricredundant enrichment scores did not clearly separate the different treatments and resulted in a mixed cluster containing DMSO-and nocodazole-treated samples (supplemental Fig.  S5A). As a result, silhouette scores were consistently lower compared with site-centric analysis (supplemental Fig. S5B) even for the correctly assigned EGF cluster pointing to a less stable clustering.
Focusing on nocodazole and EGF signatures alone, which were used as positive controls (see previous section), the gene-centric-redundant approach showed better enrichment of the EGF signature (FDR: 0.001) (supplemental Fig. S5C) and nocodazole signature (FDR: 0.001) (supplemental Fig.  S5D) compared with the gene-centric EGF (FDR: 0.013) ( Fig  3C) and nocodazole signature (FDR: 0.083) (supplemental Fig.  S4D). Like in the site-centric analysis, the gene-centric-redundant approach detected an enrichment of the EGF and nocodazole signatures at an FDR of 0.001 in the corresponding treatments, therefore achieving comparable performance.
In summary, results obtained by the gene-centric-redundant approach outperformed the gene-centric approach but did not result in a stable clustering of samples according to treatments compared with site-centric enrichment analysis. We therefore conclude that scoring the exact residue number to derive signature enrichment scores best represented the underlying signaling events introduced by EGF and nocodazole stimulation. Because availability of site-specific signature sets in PTMsigDB is still limited compared with gene centric database such as MSigDB we note that the gene-centricredundant approach might be employed to score genecentric signatures, in particular for modifications other than phosphorylation. A systematic assessment of this approach however is beyond the scope of this manuscript.
A table of all scored signatures derived from the site-centric, gene-centric and gene-centric-redundant approach can be found in supplemental Table S4.
PTM-SEA Facilitates Detection of Differential Kinase Activity Signatures of Cell Cycle Inhibitors-We applied PTM-SEA to perturbed phosphoproteomes of three human cancer cell lines representing breast cancer (MCF7), prostate cancer (PC3) and leukemia (HL60). Cell lines were treated with 23 different chemically active compounds as part of the Library of Integrated Network-Based Cellular Signatures (LINCS). These data were collected to derive a reduced representation of phosphorylation signatures (19) and were therefore acquired in single-shot LC-MS/MS mode without sample fractionation, resulting in relatively shallow phosphoproteome coverage (1.3K p-sites per LC-MS/MS run). Despite the resulting bias toward detection of highly abundant phosphorylated peptides, several signatures present in PTMsigDB were readily observed, demonstrating the feasibility of our approach in data sets comprising ϳ1K phosphosites (Fig. 4, supplemental Table S5). To avoid potential contamination of the test data set (single shot data used to design the P100 assay) with training data (PRM/DIA data from application of the P100 assay), signatures derived from the P100 assay were removed from PTMsigDB prior to this analysis. The strongest signatures consistently observed across all three cell lines were detected in perturbation experiments involving cell cycle inhibitors and were predominantly comprised of cyclin kinases (CDK1, CDK2). The large number of substrate sites known to be phosphorylated by CDK1 (n ϭ 487) and CDK2 (n ϭ 341) facilitated the scoring of these signatures in relatively shallow phosphoproteome data sets. Kinase activity of CDKs is largely driven by binding to small regulatory proteins called cyclins to form a CDK-cyclin complex representing the active kinase. Different CDKs bind to different cyclins whose concentration varies across the cell cycle. Depending on the cell cycle phase in which the inhibitor is active, PTM-SEA detected distinct signatures of CDK activity that clearly correlated with concentration of Cyclin A and B (Fig. 4). For instance, treatment with Irinotecan, which inhibits DNA replication specifically during S phase when the concentration of Cyclin A is low, led to negative enrichment of the CDK1 and CDK2 signatures, indicating inhibition. Positive enrichment of the CDK1 signature was detected after treatment with Paclitaxel, which disrupts microtubules resulting in a cell cycle arrest in prometaphase during mitosis. During this phase the concentration of cyclin B has already increased, thereby, leading to high activity of CDK1. We also observed increased activity of CDK2, which regulates events in prophase right before prometaphase. Other inhibitors resulted in decreased CDK activity either by inhibiting CDK2 directly (GW8510) or by inhibiting DNA synthesis (anisomycin). These results demonstrated the ability of our approach to detect and quantify signatures of kinase activities in perturbed cell systems that clearly correlated with the mode of action (MoA) of the respective compounds.
Signatures of PI3Ka Inhibition in Human Breast Cancer Cells Readily Detected by PTM-SEA-As part of the cancer cell line encyclopedia (CCLE) project (37), which aims to provide genetic and pharmacological characterization of Ͼ1K human cancer cell lines, our laboratory has started to generate deep phosphoproteomes of selected cell lines treated with clinically relevant kinase inhibitors targeting dysregulated pathways in the respective cancers. Here we applied PTM-SEA to the PI3K-inhibited (PI3Ki) phosphoproteome of T47D human breast cancer cells to characterize signaling events effected upon kinase inhibition. To that end, T47D cells were treated with the PI3Ka inhibitor BYL719 for 6 h and 24 h, using DMSO as control treatment in both time points. Experiments were performed in triplicate employing chemical peptide labeling (TMT-10) for sample multiplexing. Off-line basic pH reversed phase fractionation of phospho-enriched peptides prior to analysis by liquid chromatography coupled to high performance tandem mass spectrometry (1,38) resulted in the identification of 38,587 phosphorylation events in a single TMT-10 plex. 23,937 phosphorylation sites could be unambiguously assigned to a specific amino acid residue and were used in subsequent analyses (supplemental Table S6). Notably, phosphosite ratios normalized for measured protein expression levels showed high correlation to unnormalized phosphosite ratios demonstrating negligible effects of protein expression changes on phosphosite abundances (supplemental Fig. S6).
We applied PTM-SEA to phosphosite ratios (BYL719/ DMSO treatment) to calculate enrichment scores for signature sets in PTMsigDB. Hierarchical clustering of signature scores calculated in individual experiments closely grouped triplicate measurements of both time point treatments together demonstrating high reproducibility of signature scores within replicate measurements of each time point (supplemental Fig.  S7). To obtain a single enrichment score for each time point while appropriately accounting for biological variance across replicate measurements, we applied a one sample moderated t test to replicates of each time point, and used the signed, log-transformed p values as input to PTM-SEA (see Methods for details). In total we detected 52 significant signatures (FDR Ͻ 0.05) after 6 h and 56 signatures after 24 h of PI3K inhibition (Fig. 5). Although a BYL719 signature was not represented in PTMsigDB we detected signatures of several PI3K inhibitors that were positively enriched upon drug treatment. In the 6 h time point wortmannin, a nonspecific covalent inhibitor of PI3K, and IPI145 (duvelisib), a PI3K delta and gamma inhibitor, were significantly enriched (Fig 5A). In the 24 h time point two additional signatures of PI3K inhibitors, dactolisib and LY294002, were significantly enriched ( Fig 5B) because of longer inhibition of PI3K. These signatures were semi-automatically (wortmannin, LY294002) and fully-automatically (IPI145, dactolisib) curated, validating the parameters used in the automatic curation pipelines to extract perturbation signatures from the LINCS and PSP databases (see Methods). We further observed signatures of MEK and mTOR inhibitors downstream of PI3K signaling positively enriched in both time points. Besides positively enriched KI signatures we observed an enrichment of the protein kinase CK2 signature (CK2A1), after 24 h treatment, pointing to an elevated activity of the kinase which was found to be dysregulated in many cancers (39, 40) (supplemental Fig. S8A).
Among signatures that showed negative enrichment upon PI3K inhibition were AKT1, MEK1, mTOR, two isoforms of Raf kinase (ARAF, BRAF) and MAPK/ERK pathways downstream of PI3K, and associated kinases such as ERK1, ERK2, JNK1, p90RSK and p70S6K. Further downstream of PI3K we observed decreased activity of CDK (CDK1, CDK2, CDK4, CDK5, CDK6) and aurora (AurA, AurB) kinases which are essential regulators of the cell cycle, thereby highlighting the effect of PI3K inhibition on the inactivation of the cell proliferation machinery. These cell cycle-related signatures were more pronounced in the 24h time point (Fig 5B). We also observed decreased activity of the Leptin-induced pathway (supplemental Fig. S8B) which has been reported as a potential prognostic marker for breast cancer (41). PTMsigDB contains a PI3K-AKT pathway signature obtained from WikiPathways that we expected to observe with negative enrichment levels upon BYL7109 treatment. We detected the lowest enrichment of this pathway in the 24 h time point and the signature score did not pass statistical significance (FDR ϭ 0.13, supplemental Table S7). However, further inspection of FIG. 5. Signatures of PI3K inhibition in breast cancer cells. T47D cells were treated with the PI3Ka inhibitor BYL719 and DMSO as control for 6h and 24h and deep phosphoproteomes were acquired resulting in ϳ24K phosphosites localized to a specific Ser/Thr/Tyr residue. A, Volcano plot depicting enrichment of phosphoproteome signatures in the 6h time point. The x axis represents the normalized enrichment score (NES) between DMSO (left side) and drug treatment (right side). The size of the dots scale with the relative number of scored phosphorylation sites in a signature. The gray area contains signatures that did not significantly change upon drug treatment (permutation-based FDR Ն 5%). PTM signature sets of inhibitors are annotated with an "i" after the drug target. B, Volcano plot depicting phospho signatures after treatment for 24 h. C, Schematic representation of the PI3K-AKT-mTOR and Ras-Raf-MEK-ERK pathways. Circles indicate kinases, boxes indicate specific kinase inhibitors. Highlighted in colors are significant phospho signatures shown in the volcano plots in A and B. The pathway representation is based on Fig. 1 in (44). the phosphorylation sites in the PI3K-AKT pathway signature demonstrated a clear trend toward the inactivation of the pathway (supplemental Fig. S8C). A complete list of detected signature sets, together with signature enrichment scores and FDR-corrected p values can be found in (supplemental Table  S7).
In summary, these findings were concordant with known cancer biology and covered a substantial part of the canonical PI3K-AKT-mTOR and Ras-Raf-MEK-ERK pathways that are affected by PI3K inhibition (Fig 5C). Our results demonstrated the potential of site-specific PTM signature enrichment analysis in interpreting perturbation studies that attempt to decipher the molecular mechanism of clinically relevant kinase inhibitors. DISCUSSION We describe an approach to perform phosphosite-specific signature analysis (PTM-SEA) based on a curated database of phosphosite-specific signatures (PTMsigDB). Curation and maintenance of such a database requires knowledge from domain experts, which in this study are represented by curators from PhosphoSitePlus, NetPath and WikiPathways. We consider PTMsigDB to be the foundation of a more comprehensive resource for PTM site-specific signatures of pathways and perturbations, rather than a mature and completely developed database. We aim to further extend PTMsigDB when more curated, site-specific molecular pathways are made available by repositories such as PSP. Moreover, we envision that this curation process will become a community effort in which researchers studying PTMs in particular pathways will contribute to curating these pathways at the level of PTM sites. Resources like WikiPathways that are maintained by the scientific community recently started to include PTM site-specific pathway annotations and provide a promising source for future expansion of PTMsigDB.
We have demonstrated the potential of PTMsigDB in analyzing MS-based phosphoproteomic data sets derived from perturbation studies involving EGF, cell cycle inhibitors and a specific PI3K inhibitor. The relatively small total number of signature sets (ϳ490) and unequal representation of perturbations, kinases and signaling pathways still limits the true potential of a PTM site-centric database. Our approach performs best when applied to perturbation data sets in which we expect a clear molecular phenotype. Subtle abundance differences in unperturbed baseline phosphoproteomes are much more challenging to detect with the current selection of available signature sets in PTMsigDB, especially because of the underrepresentation of molecular signaling pathways. Curation of the latter requires domain experts as well as the availability of detailed metadata accurately describing the biochemical experiments that were conducted to study signaling events in a pathway. However, these metadata (such as dosage of stimulant, cell system and time points of treatment) are often buried in respective subsections of manuscripts rather than stored together with the data itself, making curation of pathways prohibitively and unnecessarily difficult.
About 40% of signatures in PTMsigDB were derived from perturbation studies in which pathways were specifically activated (e.g. via growth factors) or inhibited (e.g. via kinase inhibitors). The cellular responses upon perturbation treatment are represented in perturbation signatures in PTMsigDB. Many of the respective phosphosites are of low stoichiometry and even after enrichment for modified peptides are of too low abundance to be sequenced by single-shot LC-MS/MS analysis. Although kinase-substrate signatures in PTMsigDB consist of 46 sites on average, this number is substantially lower for perturbation signatures (16 sites on average). Therefore, deeper phosphoproteome coverage is required to detect signatures beyond kinase activity, which typically involves extensive sample fractionation.
PTMsigDB contains signature sets for 45 kinase inhibitors (KI) and, to our knowledge, presents the first set of KI signatures curated at the level of phosphorylation sites. Although it is well known that KIs often target more than one protein, a systematic characterization of clinically relevant KIs and their targets was released only recently (42). The availability of molecular signatures of KIs curated at the PTM-site level enables further systematic screening for potential off-target effects of certain KIs. Though the number of KI signatures is still relatively small in the current version of PTMsigDB, this category of signatures is expected to be especially valuable for the scientific community.
Phosphosite-specific analysis of kinase substrate relationships is not an entirely new concept and it has been successfully employed by several groups (6, 7, 10 -12). Instead of using the kinase expression as proxy for its activity, the abundance profiles of phosphorylation sites that are known substrates of the respective kinase are monitored. In this study we detected differential signatures of cyclin-dependent kinase (CDK) 1 and 2 in three human cancer cell lines upon treatment with different cell cycle inhibitors. Most importantly, treatment with paclitaxel resulted in increased activity of CDK1/2 whereas other inhibitors showed decreased kinase activity. Paclitaxel is an antineoplastic chemotherapy medication that targets tubulin by stabilizing the microtubule polymer preventing it from disassembly. The resulting inhibition of the mitotic spindle function blocks the progression of mitosis in which the concentration of cyclin A is high. Cyclin A specifically binds to CDK1, and the resulting complex is in an active state phosphorylating a multitude of different substrates that were identified by PTM-SEA.
Application of PTM-SEA to the deeply fractionated phosphoproteome of PI3K inhibited human breast cancer cells detected signatures that covered a substantial part of the canonical PI3K-AKT-mTOR and Ras-Raf-MEK-ERK pathways inhibitory signatures of kinases downstream (MEK, AKT, ERK) of PI3K. We also detected increased activity of the catalytic subunit ␣ of casein kinase 2 (CK2a), a ubiquitous protein kinase known to be dysregulated in multiple cancers (39). Attenuation of PI3K/Akt signaling upon selective inhibition of CK has been reported previously (43). Our detection of higher activity levels of this kinase after inhibition of the PI3K pathway is an interesting observation for which the rationale remain unclear. Importantly, signatures of several PI3K inhibitors were enriched upon treatment and demonstrated decreased phosphorylation levels of sites that are part of these signatures, thereby supporting the selection of phosphosites for these inhibitors in PTMsigDB. Not surprisingly, the strongest effects of PI3K inhibition were observed in the cell proliferation machinery, exemplified by significantly reduced activity of mitotic kinases (CDKs and Aurora kinases).
With ongoing curation efforts to create PTM site-specific signatures spanning more molecular signaling pathways in PTMsigDB, we expect to achieve a more fine-grained view on signaling events disrupted by targeted inhibition of cancerrelevant pathways. We present here a uniform method and guidelines to the community to create these PTM signatures. In cases where no site-specific information is available yet, our gene-centric-redundant ssGSEA approach provides an alternative to standard gene centric approaches.
Acknowledgments-We thank Alexander R. Pico from the Gladstone Institutes and cofounder of WikiPathways for constructive discussions around PTMsigDB.

DATA AVAILABILITY
The original mass spectra for the PI3K perturbation experiment may be downloaded from MassIVE (http://massive. ucsd.edu) using the identifier: MSV000083128. The data is directly accessible via ftp://massive.ucsd.edu/MSV000083128.