Systematic analysis of the intersection of disease mutations with protein modifications

Background Perturbed posttranslational modification (PTM) landscapes commonly cause pathological phenotypes. The Cancer Genome Atlas (TCGA) project profiles thousands of tumors allowing the identification of spontaneous cancer-driving mutations, while Uniprot and dbSNP manage genetic disease-associated variants in the human population. PhosphoSitePlus (PSP) is the most comprehensive resource for studying experimentally observed PTM sites and the only repository with daily updates on functional annotations for many of these sites. To elucidate altered PTM landscapes on a large scale, we integrated disease-associated mutations from TCGA, Uniprot, and dbSNP with PTM sites from PhosphoSitePlus. We characterized each dataset individually, compared somatic with germline mutations, and analyzed PTM sites intersecting directly with disease variants. To assess the impact of mutations in the flanking regions of phosphosites, we developed DeltaScansite, a pipeline that compares Scansite predictions on wild type versus mutated sequences. Disease mutations are also visualized in PhosphoSitePlus. Results Characterization of somatic variants revealed oncoprotein-like mutation profiles of U2AF1, PGM5, and several other proteins, showing alteration patterns similar to germline mutations. The union of all datasets uncovered previously unknown losses and gains of PTM events in diseases unevenly distributed across different PTM types. Focusing on phosphorylation, our DeltaScansite workflow predicted perturbed signaling networks consistent with calculations by the machine learning method MIMP. Conclusions We discovered oncoprotein-like profiles in TCGA and mutations that presumably modify protein function by impacting PTM sites directly or by rewiring upstream regulation. The resulting datasets are enriched with functional annotations from PhosphoSitePlus and present a unique resource for potential biomarkers or disease drivers. Electronic supplementary material The online version of this article (10.1186/s12920-019-0543-2) contains supplementary material, which is available to authorized users.


Background
Recent breakthroughs in next-generation sequencing technologies have been accompanied by large-scale initiatives such as the TCGA network profiling large numbers of tumors [1] or the 1000 Genomes Project cataloging human genetic variations [2]. Concordantly, DNA sequencing is becoming part of routine clinical care [3], and initiatives such as the Obama Precision Medicine program [4] aim to profile patients or healthy individuals at the molecular level via sequencing or genotyping, hereby entering the era of personalized genomics and medicine.
While these advances have outpaced our ability to functionally characterize the plethora of molecular information, recent studies showed the statistical significance of perturbed signaling and altered transferase activities at a high level [5,6]. These results are consistent with the established classification of PTM-mediated pathways as hallmarks in cancer and other diseases [7]. Consequently small molecule inhibitors targeting kinases such as HER2 [8], RAF [9], PI3K [10], or MEK [11], have been prime targets of drug development for decades, and several of them have advanced into the clinic. In addition to kinases, recent sequencing efforts revealed other significantly altered transferases, which have been subsequently pursued as targets, including histone methyltransferase EZH2 for the treatment of myelodysplastic syndrome and cutaneous T cell lymphoma [12,13].
However, to fully understand the impact of mutations on PTMs and pathways, in particular at the substrate level, the integration of genomic with proteomic data is required [14]. Recent studies therefore thought to identify phosphorylation network-attacking mutations in cancer cell lines [15] or to determine significantly phospho-mutated proteins and pathways in tumors [16,17]. These analyses gave important insights into perturbed signaling, but focused on the interplay between somatic cancer mutations and phosphosites.
Here we extended previous approaches by expanding the panel of cancer types for the identification of somatic driver mutations, and focused on hotspot mutations instead of taking the entire mutation load into account. In addition we analyzed disease-associated single nucleotide polymorphisms (SNPs) from the population, and investigated other PTM types such as ubiquitylation, acetylation, and methylation. We further applied Scansite [18], a widely used method for the prediction of upstream kinase regulation, to identify rewired signaling networks. Our hypotheses on specific impacted PTM sites are backed up by functional annotations in PhosphoSitePlus [19] to form a unique resource for further investigation.

Mutation analyses
Hotspot mutation scores (ΔS) were calculated for each protein as described [39]: Where n is the total number of mutations, k is the number of different mutation types, n i is the number of occurrences for mutation i, and f i is the frequency of mutation i or n i /n.
For the clustering and comparison of frequencies of amino acid changes each missense alteration type (from one amino acid to any of 19 others) was counted in both the somatic and germline datasets, and categorized by the unmodified wild-type amino acid or by the impacted PTM class. The resulting count matrix was normalized and used as input to create heatmaps with the R package pheatmap version 1.0.10 (https://cran.r-project.org/pack-age=pheatmap). Default parameters (complete hierarchical clustering and Euclidean distance) were used for row-and column-wise clustering. Expected mutations on PTM sites were calculated by taking the product of the number of observed mutations on the unmodified amino acid and the proportion of those amino acid residues that are a PTM site in the human proteome.

Prediction of altered upstream kinase regulation
Somatic hotspot and germline SNP mutations within 5 residues of a PTM site were compiled, and mutated flanking sequences (+/− 7 residues) were derived. Scansite 4.0 [18] was used to calculate kinase-binding scores corresponding to wild type or mutated flanking sequences at minimum stringency. DeltaScansite scores were defined as the difference between the Scansite scores for mutated and wild type flanking sequences. A second set of scores corresponding to the same wild type and mutated flanking sequences was calculated using RMIMP (version 1.2) [40]. Predictions were matched for rewiring events for which both methods provided scores.

Integrating PTM sites with disease variants
We retrieved PTM sites from PhosphoSitePlus (https:// www.phosphosite.org), somatic mutations from thousands of TCGA tumors from cBioPortal, and disease-associated SNPs from Uniprot and dbSNP (Materials and Methods). Altogether, we collected 348,570 PTM sites on 18,154 human proteins. The most frequent modification sites included 234,058 phosphorylation sites, 62,216 ubiquitylation sites, 22,712 acetylation sites, and 15,872 methylation sites (Fig. 1a). The TCGA dataset contained 481,370 somatic missense mutations from 4440 tumors across 15 cancer types (Fig. 1b). Filtering the dbSNP dataset for disease-associated human variants, which result in missense alterations at the protein level, yielded a set of 18, 511 non-redundant mutations on 2532 proteins linked with more than 3000 different diseases.

Identification of somatic hotspot mutations reveals potential cancer drivers
While the set of germline variants exclusively contained disease-associated SNPs and hence did not require further filtering, only a fraction of somatic mutations in the TCGA dataset are tumorigenic. To distinguish between passenger and driver mutations, we determined recurrent mutations and calculated entropy-based 'hotspot mutation scores' [39] reflecting the preferred occurrence of specific point mutations in a protein.
Hotspot mutations are likely cancer-driving, and present the most characteristic feature of oncoproteins [41]. As expected, known cancer proteins BRAF, PIK3CA, KRas, Akt1, IDH1, and NRas showed the highest hotspot mutation scores in the TCGA dataset ( Fig. 2 and Additional file 1). However, many other proteins, whose contributions to oncogenesis are unknown or not fully understood, also revealed hotspot mutations. For example, splicing factor U2AF1 showed a recurrent mutation (S34F) in leukemia and lung adenocarcinoma resulting in the 7th highest hotspot score in our analysis. Similarly, phosphoglucomutase-like protein 5 (PGM5) had a hotspot mutation (I98V) in stomach cancer and the 10th highest score.
By definition, hotspot scores at the protein level correlate with the degree of recurrence of corresponding mutations. The presence of at least one recurrent mutation found in three or more tumors was sufficient for a protein to gain a hotspot score significantly higher compared to proteins with less recurrent mutations (p < 0.01 based on Mann-Whitney-Wilcoxon test) (Fig. 2b). Using this cutoff to enrich for oncogenic mutations yielded a set of non-redundant 1783 hotspot mutations on 1369 proteins ( 100,000  [42], but most cases are not fully understood. Notably the number of tumors per cancer type varied from 65 (chromophobe renal cell carcinoma) to 817 (breast cancer), so that the ranking of hotspot mutations by frequency was biased towards cancer types with larger cohorts. We therefore included cancer type-specific scores in Additional file 1.  Additionally, p53 contained both the largest total number and highest density of unique hotspot mutations, with 96 unique hotspot alterations on p53 found in the dataset.

Deleted PTM sites in diseases
Having filtered and characterized somatic hotspot mutations and disease-associated mutations from dbSNP, we set out to identify mutations directly affecting PTM sites. For this analysis, we postulate that previously reported PTM events would indeed be present in a tissue in the absence of mutations. First, we mapped TCGA mutations and PTM sites to the corresponding protein sequences and investigated the overlap. We found 49 somatic hotspot mutations in TCGA, destroying 45 PTM sites in 35 proteins (Additional file 4). This overlap between PTM sites and somatic mutations was significantly larger than expected (p < 0.01 based on a twotailed Fisher exact test). While p53 (7 PTM sites) (Fig. 4a), CTNNB1 (4 PTM sites) (Fig. 4b), and hnRNP U (2 PTM sites) showed multiple impacted sites, all other proteins had only one lost PTM site including known cancer proteins such as EGFR, APC1, BRAF, or RAF1. A total of ten affected PTM sites have been associated with specific biological processes or molecular functions. For example, phosphorylation of protooncoprotein beta-catenin on S33, S37, and T41 by GSK-3 beta is known to target beta-catenin towards degradation, and CK1-mediated phosphorylation of S45 functions as a gatekeeper for this process [43]. Mutations of these sites primarily occur in endometrial but also other cancer types including lung cancer. The destruction of these regulating PTM sites presumably prohibits the degradation process leading to continuous and oncogenic activity of beta-catenin. While the loss of these PTM sites on beta-catenin are consistent with an oncogenic model, other cases are more complex. For example, PRMT1-mediated methylation of EGFR on R222 has been shown to enhance binding to EGF and subsequent receptor dimerization and signaling activation [44]. However, the loss of this PTM site, found in 1.4% of glioblastoma samples, would be consistent with a tumor-suppressing role of the mutation. Overall, most cases, such as the mutation on growth factor receptorbound protein 10 (T422 M), found in three tumors, have been detected by mass spectrometry without functional characterization, forming a candidate set for further characterization.
We also examined the set of disease-associated germline SNPs, and identified 420 genetic variants overlapping with 402 PTM sites on 276 proteins (Additional file 5). A total of 73 proteins showed two or more overlaps between PTM sites and germline mutations. Lamin A/C (20 overlaps), CTNNB1 (10 overlaps), and p53 (9 overlaps) showed the largest number of intersects. The observed number of SNP mutations on K-acetylation, T-phosphorylation and Y-phosphorylation sites was significantly greater than expected (p < 0.01 based on a two-tailed Fisher exact test) (Fig. 5a). A total of 48 destructed PTM sites have been functionally characterized by previous studies (Additional file 5). Among these, the best-described PTM site is S32 on NF-kappa-B inhibitor alpha (IkB-alpha). Overall, 127 references have described the functional impact of phosphorylation of IkBalpha at S32 leading to proteasome-mediated degradation and consequent activation of NF-kappa-B/Rel transcription factors via translocation from the cytosol to the nucleus [45]. Mutation on this residue has been shown to be associated with autosomal dominant anhidrotic ectodermal dysplasia and T cell immunodeficiency [46]. As observed for the overlap with somatic mutations, most dbSNP-overlapping PTM sites have been experimentally validated without functional characterization including PTM sites on cancer proteins. For example, phosphorylation of RAF1 on T310 has been validated in six highthroughput experiments. While the function of this PTM site is unknown, variation on the residue (T301A) is associated with childhood-onset dilated cardiomyopathy [47].

Mimicked phosphorylation sites in diseases
While the destruction of PTM sites in diseases might imply the loss of tumor-suppressing functions in the cell, constant activation of PTM sites points to the promotion of oncogenic processes. Focusing on phosphosites, we scanned the data for residues mutated to the negatively charged amino acids aspartic (D) and glutamic acid (E). This was based on the idea that diseases might  utilize the same trick that scientists use in experiments to mimic constantly active phosphosites [48]. While none of the overlapping somatic hotspot mutations contained mimicking alterations, eleven SNPs intersected with phosphosites in their wild type form and mimicked them in diseases (Additional file 6). Interestingly, all of these were phosphorylated tyrosines mutated to aspartic acid. Strikingly, phosphorylation of non-receptor-type protein tyrosine phosphatase SHP-2 on Y62 has been detected in 2115 mass spectrometry experiments, but not functionally characterized to the best of our knowledge. The overlapping mutation (Y62D), however, has been associated with Noonan Syndrome [49].
In addition to the identification of phosphosites that are phosphorylated in the wild type state and mimicked in mutated form, we identified 31 somatic hotspot and 1230 germline mutations that mutated to aspartic and glutamic acidirrespective of whether they present PTM sites in their wild type origin. Besides well-known cases such as BRAF (V600E), KRas (G12D, G13D) or PIK3CA (G118D), we derived previously uncharacterized hotspot mutations such as SF3B1 (K700E) found in eight breast tumors and one leukemia sample.
We also looked into the distributions of each kind of missense mutation on PTM sites compared to unmodified residues. Clustering of these frequencies showed that mutation patterns for PTM sites were similar to their unmodified counterparts in the TCGA (Additional file 7) and the dbSNP sets (Fig. 5b).

Mutations triggering rewiring of signaling networks
While mutations on central PTM sites trigger their loss or in some cases even mimic their active states, mutations in their flanking regions could rewire regulation by changing the substrate sequence motif. Scansite uses scoring matrices derived from peptide library experiments to identify short protein sequence motifs [18]. For the first time we applied Scansite on wild type and mutated flanking regions of PTM sites to derive 'Delta-Scansite' score as a measure of rewiring (Materials and Methods). We also compared the results with the machine learning method MIMP [40].
In the TCGA dataset, Delta-Scansite and MIMP calculated 149 matching rewiring scores for 87 mutations on 83 flanking regions across 73 proteins. Among these paired scores, 138 (93%) agreed in sign ( Fig. 6a and Additional file 8 A). For example, both Delta-Scansite and MIMP predicted a loss phosphorylation of T284 on p53 by Aurora B due to the alteration R282W found in 30 tumors across multiple cancer types. Aurora Bmediated phosphorylation on T284 has been shown to compromise p53 transcriptional activity [50].
Similarly, both methods predicted a loss phosphorylation of T254 on MTF1 by Akt1. Dephosphorylation of T254 by phosphatase PP2A PR110 has been shown to regulate MTF1 activity [51]. To our knowledge the corresponding kinase, however, has not been reported. Furthermore, in addition to mutations leading to direct loss of regulating PTM sites of beta-catenin as described above, both approaches predicted reduced phosphorylation on S33 by GSK, triggered by three hotspot mutations on a flanking residue that is itself a phosphosite (S37C, S37F, and S37A). These mutations were found in primarily endometrial cancer tumors. In the SNP dataset, 590 scores were calculated for 318 mutations on 283 sites across 188 proteins (Additional file 8 B). In total, 494 (84%) of these scores agreed in sign (Fig. 6b). For example, the S112 phosphosite on PPAR-gamma plays a role in cell differentiation, growth, and transcription as described by multiple studies curated in PhosphoSitePlus, and the flanking SNP on P113Q has been associated with obesity [52]. Both our approach and MIMP predicted a loss of MAPK and CDK binding to S112, but a gain of ATM Kinase binding, indicating a possible rewiring event.
While most predictions were concordant between Del-taScansite and MIMP, we also observed contradictory predictions. For example, MIMP predicted that mutations of arginine to histidine on the − 3 position relative to AKT1 substrate sites induce loss of phosphorylation by AKT1. In contrast DeltaScansite predicted gain of phosphorylation. The canonical sequence motif for AKT substrates requires an arginine on the − 3 position. However, AKT has been indeed reported to potentially phosphorylate SP1 at T679 despite histidine on the − 3 position [53]. Thus it is difficult to determine which prediction method is correct in such cases.
Altogether we found that DeltaScansite and MIMP predictions are consistent. The rewiring events suggested by these methods provide a unique resource for studying perturbed signaling in diseases.

Concluding remarks and future plans
The convergence of our knowledge about missense mutations and PTMs has opened up a new approach for analyzing and understanding the interplay between disease mutations and cellular signaling networks. This interplay provides a unique framework for investigating pathogenesis initiated by missense mutations, and conversely for understanding the cellular processes and signaling networks influenced by the posttranslational status of a modification site.
We have analyzed the union of somatic mutations in 15 different cancer types, disease-associated germline mutations, and PTMs. To our knowledge this is the first study to include acetylation, methylation, ubiquitylation, and other non-phosphorylation PTM sites in the analysis of the intersection with mutations. In fact we found more than one hundred non-phosphorylation PTM sites overlapping with disease mutations. In addition to disease-associated germline variants we included somatic cancer mutations. Distinguishing between passenger and driver mutations based on recurrence revealed a number of somatic hotspot mutations previously not linked with tumorigenesis. While mutations on central PTM sites presumably result in the destruction or even mimicking of their active states, mutations in the flanking sequence motif could rewire regulation as predicted by Scansite and MIMP.
We discovered mutations that may impact posttranslational signaling, modifying protein function and network dynamics. Our datasets serve as unique resources for potential biomarkers or disease drivers. The concordance between DeltaScansite and MIMP makes clear that altered upstream regulation can be estimated in-silico and extended for any PTM type in the future.