Pan‐Cancer landscape of protein activities identifies drivers of signalling dysregulation and patient survival

Abstract Genetic alterations in cancer cells trigger oncogenic transformation, a process largely mediated by the dysregulation of kinase and transcription factor (TF) activities. While the mutational profiles of thousands of tumours have been extensively characterised, the measurements of protein activities have been technically limited until recently. We compiled public data of matched genomics and (phospho)proteomics measurements for 1,110 tumours and 77 cell lines that we used to estimate activity changes in 218 kinases and 292 TFs. Co‐regulation of kinase and TF activities reflects previously known regulatory relationships and allows us to dissect genetic drivers of signalling changes in cancer. We find that loss‐of‐function mutations are not often associated with the dysregulation of downstream targets, suggesting frequent compensatory mechanisms. Finally, we identified the activities most differentially regulated in cancer subtypes and showed how these can be linked to differences in patient survival. Our results provide broad insights into the dysregulation of protein activities in cancer and their contribution to disease severity.


Table of contents Supplementary Results 3
Appendix Figure S1 -Sample size and Pearson's correlation between phosphorylation levels and corresponding protein abundances. 6 Appendix Figure S2. Lists of kinase-substrate associations compiled in this study. 7 Appendix Figure S3. Validation of kinase-substrate sources and kinase activity estimates in the cancer samples. 8 Appendix Figure S4. Effects of genomic alterations on protein abundances. 9 Appendix Figure S5. Effects of genomic alterations on protein activities. 10 Appendix Figure S6. Impact of recurrent mutations on the MAPK/ERK signalling transduction pathway and phosphatases differentially expressed between BRAFV600E-mutated samples with high and low activity of BRAF. 11 Appendix Figure S7. Examples of associations between the mutational status of TFs and their activities. 12 Appendix Figure S8. Correlation of genetic associations metrics derived from Pan-cancer and tissue levels analyses. 13 Appendix Figure S9. Projection of protein activities in low-dimensional spaces and kinase-TF associations. 14 Appendix Figure S10. Kinase activity regulation in tumours and perturbed human conditions. 15 Appendix Figure S11. Examples of kinase and TF activities associated with the overall survival of cancer patients. 16 Appendix Figure S12. Cross-correlation matrix of tumour sample kinase-TF activity signature. 17 Appendix Figure S13. Heatmap of clinical features over-representation. 18 Appendix Figure S14. Most consistently deregulated kinase and TF activities in each cluster. 19 Appendix Figure S15. Mechanistic hypotheses to connect the highlighted kinases and TFs of each cancer cluster. 20

Multi-omic based stratification of tumour samples
We clustered tumour samples according to kinase and transcription factor activities.
To do so, we built a cross correlation matrix of tumour samples (by correlating their combined signature of batch corrected kinase and TF normalised enrichment scores (NES)) (Appendix Figure S12). Based on visual inspection of the cross-correlation matrix, as well as information from applying silhouette and "within sum of square" methods, we decided to stratify tumour samples into 8 distinct groups (Methods).
We performed an over-representation analysis of clinical features in each cluster (Appendix Figure S13) to find functional similarities between samples that have different annotations. For example, we found enrichments for ovarian cancer (cluster 1), stage I lung cancer (cluster 2), serous/mucinous ovary and colon cancers (cluster 3), colon cancer (cluster 4) and breast cancer (cluster 5) samples. We found enrichments also for immunity infiltration states: Cluster 6 is marginally enriched in CD8+ inflamed samples and cluster 7 in inflamed (CD8-) mesenchymal and infiltrating lobular carcinoma.

TF and kinase activity characterization of tumour sample clusters
To characterise which kinases and TFs are the most consistently deregulated in each cluster, the mean of each protein activity was divided by its corresponding standard deviation, within each cluster respectively (Appendix Figure S14). There are several activity enrichments that are consistent with prior observations. For example, cluster 1 and 3 (over-represented with high grade serous ovarian cystadenocarcinoma (SOC)) show increased activity of ARID1A, which is frequently mutated with loss of function (but not loss of protein expression) in ovary cancers (Yachida et al, 2020). IRF1 down-regulation in cluster 1 is coherent with finding that its over-expression in SOC is associated with better prognosis and survival (Cohen et al, 2014). Additionally, NFKB was found to have a tumour suppressor role in low grade SOC, while we observe a coherent down-regulation of NFKB1, RELA and SPI1 in cluster 1. Cluster 3 (also over-represented in high grade SOC) displays a different protein dysregulation profile. Cluster 3 also shows an up-regulation of KDM5B activity, previously found to be associated with high grade SOC and poor prognosis (Wang et al, 2015). TF4AP is a transcription factor associated with SOC proliferation activated down-stream of oestrogen signalling, and is also up-regulated in cluster 3 (O'Donnell et al, 2005).
In cluster 2 (lung adenocarcinoma (LUAD) over-represented) BHLHE40 is consistently up-regulated, which is coherent with reports showing its up-regulation in lung and esophageal carcinoma (Kiss et al, 2020). E2F6 activity, whose transcript levels are often elevated in LUAD, was also found to be up-regulated. Like cluster 2, cluster 5 is also over-represented with LUAD but also in breast cancer (BRCA) and shows a different protein dysregulation profile than cluster 2. Cluster 5 shows a high activity of ZEB2, a promoter of epithelial to mesenchymal transition (EMT), metastasis and resistance in LUAD and BRCA (Duan et al, 2016;Cui et al, 2019;Li et al, 2017). Coherently, MF2C activity is also inhibited in cluster 5. MF2C is known to be specifically inhibited by miR-223 (secreted by stromal cells) to increase invasion and migration of LUAD (Alečković & Kang, 2015).
Cluster 4 (over-represented with colorectal cancer (COREAD)) shows a high activity of BCL6, a novel potential therapeutic target of COREAD (Sun et al, 2020). Cluster 6 (over-represented in kidney cancer (ccRCC)) is characterised by a down-regulation of SMARCC2 activity. SMARCC2 is a core subunit of the tumour suppressor complex SWI/SNF and is emerging as a mechanism driving tumour development in murine models and ccRCC patients (Nargund et al, 2017;Agaimy et al, 2018). Cluster 7 (over-represented by CD8-inflamed tissues) shows a high activation of NFKB1. Interestingly, RUNX1 is also up-regulated, although it usually suppresses inflammation through NFKB1 inhibition (Bellissimo et al, 2020). JUN is also up-regulated in cluster 7, which is coherent with its role in the regulation of inflammation in epithelial tissues (Schonthaler et al, 2011). Cluster 8 (over-represented with uterus cancer patients) shows an up-regulation of BCL3, MAZ and MTA2. BCL3 and MTA2 have been found to be up-regulated in uterus cancer (Lin et al, 2020) while MAZ dysregulation seems to be under-studied in cancer.

Systematic exploration of mechanistic hypotheses to connect dysregulated kinases and TF in tumour clusters
After highlighting the most consistently deregulated kinases and TFs in each cluster, we sought to investigate the mechanistic links that could connect them. In particular, we searched which kinases could best explain the downstream deregulated TFs. For that we contextualised a prior knowledge network (PKN) obtained from Omnipath using the top deregulated kinases and TFs of each cluster with CARNIVAL (Methods) (Dugourd et al, 2021;Liu et al, 2019). The 8 causal reasoning networks generated had an average of 75 edges (54 s.d.). In each network, we can explore how given kinases are activating/inhibiting downstream signalling proteins. Kinases and TFs are connected by coherent causal interactions through intermediate signalling proteins (proteins for which there were not enough measurements to estimate their activities). Thus, the contextualised causal links also help us to hypothesise the potential activity of intermediate nodes (Appendix Figure S15).
In cluster 1, IRF1, RELA and SPI1 appear to be down-regulated as a consequence of the coordinate up-regulation of PRKCQ and down-regulation of GSK3B, through the down-regulation of SRC and JUN.
In cluster 2, the up-regulation of BHLHE40 and MAZ can be explained by an upstream up-regulation of the PAK2 kinase, the subsequent up-regulation of TP53, and the coordinated down-regulation of CDK1. PAK2 is commonly found over-expressed and activated in cancer (Ye & Field, 2012).
In cluster 3 we found an interesting signalling cascade in the form of SNAI2 inhibition downstream of the HIPK2 kinase. This can be particularly interesting because SNAI2 is another marker of aggressiveness and chemotherapy resistance in ovarian cancer, especially to cisplatin treatment (Fan et al, 2020), while HIPK2 over-expression is associated with sensitivity to paclitaxel (Li et al, 2010).
In cluster 4, the increased activity of NCOR2, BCL6 and MNT can be explained by the upstream activation of MAPK14. Notably, MAPK14, NCOR2 and BLC6 are all well known to promote survival and resistance to stress in tumour cells, while suppressing their growth (Battaglia et al, 2010;Grossi et al, 2014;Cardenas et al, 2017). This cascade is also coherent with the predicted down-regulation of CHUCK downstream of MAP3K8.
In cluster 5, the down-regualtion of RBPJ and SPI1 combined with the up-regulation ZEB2 can be explained by the activation of HDAC1 downstream of CSNK2A2 kinase. The activation of HDAC1 seems to be relevant for cancer progression (Cao et al, 2017), which is highlighted here by its ability to inhibit senescence pathways on one hand (through SPI1 inhibition) while promoting metastasis on the other hand (through ZEB2 activation) (Delestré et al, 2017).
In cluster 6, the down-regulation of MYOD1, SMARCC2, TFAP4 and PBX2 can be explained by the combined effect of MAPK1 down-regulation and PRKCZ up-regulation. The MAPK1 down-regulation could be mediated by the downstream inhibition of CREBBP. This is particularly relevant considering that CREBBP inhibition can either negatively or positively regulate proliferation and invasion of different types of kidney cancer cell lines . The network also shows RELA as an activated intermediate of PRKCZ, which is a potential therapeutic target in ccRCC (Peri et al, 2013).
In cluster 7, we can see that NFKB1, JUN, ETS1, SPI1 and RUNX1 are connected in a cascade. These up-regulated proteins are coherent with the over-representation of inflamed samples in this cluster. The combined action of PRKG1 and CDK5 seems to be facilitating the activation of NFKB1 and the rest of the cascade. In particular, NFKB1 is known to be a very important molecular effector of IFNG (Pfeffer, 2011). Coherently, IFNG is one of the most important secreted chemokines of CD8 independent immune response in cancer (Pluhar et al, 2015). These results suggest a cluster of tumour samples where the anti-tumoral response of the host is particularly active.
In cluster 8, we can see that the activation of BCL3 can be explained as a direct consequence of CDK5 activity. The deregulation of MAZ and HOXB13 can be explained as a consequence of the activity of CDK1 through CSNK2A1 and YY1. YY1 has been studied as a potential drug target in HPV-induced uterus cancer (He et al, 2011).

Appendix Figure S1 -Sample size and Pearson's correlation between phosphorylation levels and corresponding protein abundances.
(A) Number of samples by cancer dataset and data type.

(B) Distribution of the correlations between protein abundances and phosphorylation changes for protein-phosphosite pairs (number of pairs beneath the boxplots) across
all cancer samples. Given the sparseness of the (phospho)proteomics data, we selected the protein-phosphosite pairs with protein/phosphorylation measures in at least 1% (n > 10) of the total cancer samples. Left: non-regressed-out phosphorylation data. Right: protein regressed-out phosphorylation data (Methods). (C) Representation of the same data as (B) by cancer dataset. Correlations were calculated for those protein-phosphosite pairs with protein/phosphorylation measures in at least 10% (n > 5) of the samples of each dataset. A small amount of correlation between phosphosites and proteins remains as the regression was done across all of the dataset.  Figure 1C. Kinase activities were re-estimated in cancer samples after removing the kinase auto-regulatory phosphosites from the kinase targets. The boxplots show the distribution of the Pearson's correlation between kinase activities and phosphosite quantifications that mapped to the same kinase, with (n = 118) and without (n = 743) annotation (activating) in PhosphoSitePlus.

(D)
Pearson's correlation between the CPTAC MS-based and the TCGA RPPA-based phosphosite quantifications, for the same phosphosite pair (n = 12) and others (n = 132). A P-value from a Wilcoxon rank sum test is shown. (E) Fraction of kinases and TFs classified as highly variable across the tumour samples. Kinases and TFs with absolute activity measures higher than 1.75 and 3.89 (96.7 th percentiles), respectively, in at least 5% of the samples were classified as highly variable. (F) Percentage of phosphosites in TFs significantly and not significantly correlated with the corresponding TF activities, stratified by their functional score. Analysis based on 1,183 phosphosites mapping to 178 TFs (n > 10).
Appendix Figure S4. Effects of genomic alterations on protein abundances.
(A) Comparison of the distribution of the correlations (Pearson's r) between the CNV levels (GISTIC2) and the mRNA and protein abundances (log2 fold-changes). Correlations were calculated for those genes with CNV, mRNA and protein quantifications in at least 10 samples (11,624 genes). The experimental batch was regressed-out from the mRNA and protein quantification data before computing the correlations (Methods). (B) Same as (A) by tissue and experimental study. P-values < 2.2e-16 in all cases (Wilcoxon rank sum test). The number of genes is indicated in the plot. (C) Protein abundance distribution between mutation types from the CPTAC tumours. One sample may have multiple mutations in the same protein. Therefore, we selected the sample-protein pairs that were exclusive of each mutation type to prevent the cases where different mutations in the same protein and sample have the same protein abundance. The outliers (defined as the data points beyond Q1-1.5*IQR and Q3+1.5*IQR, where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range) were removed from the distributions for representation purposes. The number of protein quantifications (including outliers) is shown at the left of each boxplot. The P-values from a two-sample T-test comparing each distribution with the background (no mutation) are shown at the right. All data points (including outliers) were used to calculate the P-values. (D) Same as (C) for the cancer cell lines from the NCI60 and CRC65 panels.
Appendix Figure S5. Effects of genomic alterations on protein activities.
(A) Distribution of kinase and TF activities between mutation types from the CPTAC tumours. Only the sample-protein pairs that were specific of each mutation type were selected to prevent the cases where different mutations in the same protein and sample have the same protein activity. The outliers (defined as the data points beyond Q1-1.5*IQR and Q3+1.5*IQR, where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range) were removed from the distributions for representation purposes. The number of protein activity quantifications (including outliers) is shown at the left of each boxplot. The P-values from a two-sample T-test comparing each distribution with the background (no mutation) are shown at the right. All data points (including outliers) were used to calculate the P-values. (B) Same as (A) for the cancer cell lines from the NCI60 and CRC65 panels. (C) Scatterplots between the -log10 SIFT score (x-axis) of missense mutations and the activity of kinases and TFs (y-axis) from the CPTAC tumours. The linear regression line and the Pearson correlation coefficient, with the respective P-value, are shown. Cases where the same sample had multiple missense mutations in the same gene were removed to prevent the assignment of the same protein activity to different SIFT scores. (D) Same as (C) for the cancer cell lines from the NCI60 and CRC65 panels. (E) Same as (C) for the CPTAC tumours with higher purity (greater than 0.8). The purity score provides information about the degree of immune infiltration and was calculated from the gene expression data using the ESTIMATE algorithm. (F) Volcano plot showing the associations between the BRAF V600E mutation and the activity of kinases. The x-axis contains the mutation coefficient (effect size) and the y-axis the adjusted P-values. Highlighted are kinases from the MAPK/ERK signalling pathway and CDK1/7 (significantly associated with the BRAF V600E mutation). (G) Differential activity of the CDK1 and CDK7 kinases between samples with and without BRAF V600E mutation. The x-axis separates the cancer samples by mutation status (1 if mutated and 0 otherwise) and the y-axis contains the kinase activities. The outliers were removed from the distributions for representation purposes. The number of quantifications (including outliers) are shown beneath each boxplot. A P-value from a Wilcoxon rank sum test comparing both distributions is shown. All data points (including outliers) were used to calculate the P-values.
Appendix Figure S6. Impact of recurrent mutations on the MAPK/ERK signalling transduction pathway and phosphatases differentially expressed between BRAF V600E -mutated samples with high and low activity of BRAF.
(A) The x-axis contains highly recurrent mutations in cancer -BRAF V600E , KRAS G12C and KRAS G12D . The y-axis shows the mutation coefficients (effect sizes) from multiple linear regression models fitted to predict the activities of kinases from the MAPK/ERK signalling transduction pathway using the mutations as predictors. The mutation coefficient P-values and predicted kinases are colour coded. (B) The x-axis is the log2 fold-changes comparing the gene expression of phosphatases between BRAF V600E -mutated samples with high and low activity of BRAF. Phosphatases with increased expression in samples with low BRAF activity have positive log2 fold-changes. The y-axis is the -log10 adjusted P-values from a Wilcoxon rank sum test. Marked in red are the phosphatases with a significant difference (FDR < 15%) between the two sample groups and that tend to have higher expression levels in low BRAF activity samples.
Appendix Figure S7. Examples of associations between the mutational status of TFs and their activities.
Related to the main Figure 2B. The x-axis represents the TFs and the y-axis the activities. The colours stratify the samples by their mutational status in the respective TFs. The number of quantifications are shown beneath each boxplot. The P-values from Wilcoxon rank sum tests comparing both distributions are shown.