Credentialing individual samples for proteogenomic analysis

KICH, Kidney chromophobe; KIRC, Kidney renal cell lower LIHC, Liver

Introduction being isolated and analyzed at different centers. The higher correlation between proteins and RNAs in cell lines where sample management and preanalytic conditions could be carefully controlled compared to patient samples suggested that preanalytic conditions may contribute to decreased correlations between protein and RNA in patient samples. Further, in TCGA study, RPPA and RNA-seq were analyzed from separate but contiguous samples and thus heterogeneity in the tumor cells or in the amount and function of stromal cells could contribute to the decreased correlations observed.
The pattern of protein-mRNA correlations across TCGA samples showed significant association with tissue of origin (p<1x10 -16 , Figure 1C). While unsupervised clustering did not identify clear subgroups based on the patterns of cis protein-mRNA correlation ( Figure 1D), three gynecological cancers -breast cancer (BRCA), uterine corpus endometrial carcinoma (UCEC) and serous ovarian carcinoma (OV) -and two lung cancers -lung squamous cell carcinoma (LUSC), lung adenocarcinoma (LUAD) co-clustered.
Thyroid carcinoma (THCA), prostate adenocarcinoma (PRAD), pancreatic adenocarcinoma (PAAD), brain lower grade glioma (LGG) and cholangiocarcinoma (CHOL) had markedly lower concordance, with average cis protein-mRNA correlation <0.2. In THCA, PRAD, PPAD and LGG, ~50% of the total proteins measured by RPPA were correlated with cis mRNAs at levels comparable to the correlation with trans mRNAs (trans represents random pairs throughout), indicative of very low concordance (Supplementary Figure 1A). In contrast, OV and testicular germ cell tumors (TGCT) had higher concordance than other lineages (median R = 0.44 for OV and 0.40 for TGCT). Indeed, for ovarian samples, the cis protein-mRNA correlation was similar between tumors and cell lines (Supplementary Figure 1B). This could represent decreased heterogeneity of ovarian tumor samples or differences in preanalytic conditions. The large tumor size of most ovarian cancers may have facilitated selection of samples with similar characteristics for both RNA and protein analyses.
We then examined features that contribute to cis protein-mRNA concordance. In pan-cancer analysis and analysis of individual tumor types, dynamic range (characterized by interquartile range) of protein expression was significantly associated with protein-mRNA correlation in all tissues except for THCA (Supplementary Figure 2). The dynamic range of mRNA amounts showed similar associations albeit to a lesser extent. Of note, the median expression level of protein or mRNA was not associated with protein-mRNA correlation. A high dynamic range, particularly of proteins, may decrease challenges with heterogeneity or preanalytic characteristics and increase the ability to detect mRNA-protein correlations.

Identification of correlation-based markers for sample quality control of RPPA data
Analysis of distribution of the cis total protein-mRNA correlations identified a 2-component Gaussian mixture model (Supplementary Figure 3A). This suggested the existence of a subset of highly correlated mRNA and proteins that could potentially be used for quality control of individual samples for integrated transcriptomic and proteomic investigation in terms of preanalytic characteristics. Therefore, we analyzed the set of proteins that were most highly correlated with cis mRNAs, and examined the utility of protein-mRNA correlation-based (PMC) scores as a metric for sample quality control. Across distinct tumor types, the sets of highly correlated proteins and mRNAs exhibited substantial differences consistent with alternative mechanisms of regulation of protein levels including dynamic range in different tissues. In the 21 cancer types with >100 samples available for analysis, the number of highly correlated protein-mRNA pairs (Pearson's R > 0.6) ranged from 0 to 40 (Figure 2A, Supplementary Table   1). Analysis of diseases such as PRAD, colon and rectum adenocarcinoma (CORE), PAAD, stomach adenocarcinoma (STAD), kidney renal clear cell carcinoma (KIRC), THCA, and LGG yielded few highly correlated protein-mRNA pairs, which may indicate limited power of this approach in these tumor lineages or alternatively marked challenges with preanalytic characteristics in these samples. Therefore, we integrated RNA-seq and RPPA data from 13 cancer types with >100 samples and in which >15 genes had highly correlated protein-mRNA levels, naming this the TCGA13 data set, and developed PMC markers in this pan-cancer cohort.
In TCGA13 data set, we performed a leave-one-out cross-validation analysis to identify a set of consensus PMC markers (Supplementary Figure 3B). Data of 12 tumor types were pooled as the training set and the remaining tumor type was left out as the test set. In the training set, highly correlated cis total protein-mRNA levels (Pearson's correlation > 0.6) were used to select PMC markers. Then a PMC score, defined as the Pearson's correlation across PMC markers for paired mRNA and protein data, was assigned to each sample in the training set. To determine the cutoff threshold for high quality samples, in the training set, permutation test was performed to estimate the empirical distribution of PMC scores on random protein and mRNA pairs. Samples with significantly higher PMC score than random pairs at FDR of 0.1 were designated high quality (PMC-pass). The same PMC markers and cutoff were then applied to the test set to estimate PMC score and to identify PMC-pass samples. The training-test procedure was applied to each disease type.
To assess the cross-validation performance, we compared two metrics, cis total protein-mRNA correlation for individual genes (within-gene correlation) and for individual samples (within-sample correlation), in the set of protein-mRNA pairs that were not selected as PMC markers, aka test protein-mRNA set, between PMC-pass and PMC-fail samples. In training and test sets, the two types of correlations were both significantly higher in PMC-pass compared to PMC-fail samples, which demonstrated that the leave-one-out analysis identified markers that allow robust improvement in protein-mRNA correlation across lineages (Supplementary Figure 4,Supplementary Table 2). In comparing PMC markers in the 13-fold cross-validation, substantial consensus was observed ( Figure 2B).
While 34 genes were identified as PMC markers in at least one analysis, 30 genes were selected in >9 disease types and were defined as consensus PMC markers (Supplementary Table 3). Compared to the complete set of antibodies in RPPA, protein abundance of consensus PMC markers have a significantly greater dynamic range while exhibiting a similar median level (Supplementary Figure 5), which is consistent with the observed association between dynamic range and ability to detect protein-mRNA correlations.

Consensus PMC markers identified samples with preanalytic challenges
Based on the consensus PMC markers, we computed PMC scores on paired samples and on random protein-mRNA sample pairs. At FDR of 0.1, PMC scores of >0.386 were designated as PMC-pass ( Figure   2C). Overall, 82.2% samples passed the test across cancer type. The pass rate ranged from 66.5% to 93.3% in different diseases ( Figure 2D, Supplementary Table 4). To evaluate the performance of consensus PMC markers, we investigated within-gene and within-sample correlation, in test protein-mRNA pairs (genes that were assayed by RPPA and RNA-seq and were not selected as PMC markers).
Correlation of random protein-mRNA pairs (trans within-gene correlation) and permutated sample pairs (trans within-sample correlation) were used as controls ( Figure 3A-H, Supplementary Table 4). In TCGA13 pan-cancer data set, within-gene and within-sample correlation were significantly higher in PMC-pass samples (within-gene: p=3.04x10 -10 , within-sample: p=1.21x10 -5 ), while no difference was observed in the control analyses. Next for individual disease types, we assessed within-gene and withinsample correlation as well as correlation of two types of trans protein-mRNA correlation. Again, while all tumor types had significantly higher cis within-gene protein-mRNA correlation and all except for SARC samples had significantly higher cis within-sample correlation in PMC-pass samples, no difference in trans correlation was observed in any tumor type.
Then we examined the relationship between PMC score and metrics that have previously been used for quality assessment in RPPA and RNA-seq data ( Figure 3I-L, Supplementary Table 4). Total protein content defined as the overall levels across all antibodies in the RPPA set was adopted by TCGA study as a quality control measurement for RPPA data [1]. Tumor purity estimates based on somatic DNA copy number aberrations has been used as an indicator of tumor content to evaluate quality of DNA and RNA samples in TCGA [28]. In the pan-cancer dataset, total protein content and purity were both significantly higher in PMC-pass than PMC-fail samples (total protein content: p=0.00138, purity: p=0.0186). Within individual diseases, total protein content and purity were substantially higher in PMC-pass samples in 9/13 and 6/8 diseases respectively.
Consensus PMC markers developed from pooled patient cohorts provide a more comprehensive characterization of the complexity and diversity of cancer patients. Therefore, we predicted they would have more robust performance in identifying samples with favorable features than markers developed in a disease-specific context. To quantitatively compare the performance of consensus and diseasespecific markers, we selected benchmarks of low-quality samples based on within-sample correlation, total protein content and purity. Samples with low level (<5% quantile) of the three measurements were considered likely be low-quality samples. With these criteria, Receiver Operating Characteristic (ROC) analysis demonstrated that the consensus PMC score was able to detect three types of low-quality  Table 5). ROC analysis of ability to predict samples with low within-sample correlation provided AUC >0.75 in 11/13 tumor types. ROC analysis of ability to predict samples with low total protein content and purity provided AUC>0.65 in 10/13 and 7/9 tumor types respectively.
Collectively, PMC analysis with 30 consensus marker genes and a cutoff of 0.386 following a proposed computational pipeline ( Figure 4) could be used to identify samples with poor protein-mRNA correlation, low total protein content and low purity. The performance of consensus PMC markers was robust across 13 cancer types and was comparable or superior to markers developed for individual disease types.

PMC markers for sample quality control of MS data
We then assessed the performance of PMC analysis in a mass spectrometry (MS)-based protein analysis platform. The NCI CPTAC group employed MS with isobaric tags for relative and absolute quantitation (iTRAQ) to characterize global protein expression patterns in a subset of tumors from TCGA. We investigated protein-mRNA linkages in MS-based proteomic data from two cohorts, CPTAC BRCA and OV samples.
In the CPTAC BRCA study, MS data was generated for 111 breast samples, including 105 breast tumors, 3 tumor replicates and 3 normal breast control samples [9]. In more than half of the samples, MS detected protein products from 9204 genes, including 8688 genes with mRNA data available. In comparing protein profiles generated by MS and RPPA, median correlation of mRNA levels with cis total proteins quantified by RPPA for the 146 proteins assessed was 0.416 and by MS for the 8688 proteins assessed was 0.289 ( Figure 5A). Overall, 118 proteins were measured by both RPPA and MS and provide comparable protein-mRNA concordance, including 28/30 RPPA consensus PMC markers ( Figure 5B).
Intriguingly, unlike the antibody-based RPPA platform, low average mRNA or protein level was associated with poor protein-mRNA correlation, while the impact of protein dynamic range was more limited compared to that for RPPA correlations and was weakly and inversely associated with protein-mRNA correlation in MS analysis (Supplementary Figure 7A-C). We assessed protein-mRNA concordance of 28 RPPA consensus PMC markers that were detected by the orthogonal MS platform ( Figure 5B). Among them, 15 mRNA-protein pairs had highly correlated mRNA and protein levels as quantified by MS (R>0.6); another 8 pairs were moderately correlated (R>0.4); the other 5 pairs had low but statistically significant correlation (p<0.05).
We then identified MS-based PMC markers in CPTAC BRCA data following the same computational pipeline as RPPA analysis (Figure 4). 1068 genes were identified as having highly correlated protein and mRNA levels (R>0.6, Supplementary Table 6). Functional annotation clustering analysis showed no bias towards specific signaling pathways. The top clusters included oxidation-reduction process, mitochondrion, nucleotide binding, and cell-cell adhesion (Supplementary Table 7). PMC scores were computed for each sample and showed no significant difference across intrinsic subtypes or clinical/pathological subpopulations (Supplementary Table 8).
For most MS datasets, the number of protein-mRNA pairs assessed exceeds the sample size by one or two magnitudes. Therefore, instead of sample-based permutation, we employed supervised and unsupervised learning to estimate the cutoff for PMC-pass samples. In the CPTAC BRCA set, we adopted a supervised approach and used two types of low quality measurements as benchmarks. First, 28 samples had been identified by the CPTAC group to have skewed protein distribution which was potentially attributed to protein degradation [9]. Second, unsupervised clustering of the remaining tumors and three normal breast samples identified 10 tumor samples that clustered with the normal samples (Supplementary Figure 8). Given that none of the 10 tumors were characterized as normal-like by intrinsic subtype profiling by RNA-Seq, co-clustering with normal tissues was considered a potential indicator of low tumor content due to different parts of the tumor being analyzed for protein and RNA.
Thus, in total, we defined 38 likely low-quality samples by these two criteria. PMC scores were determined using the 1068 MS-based markers. ROC analysis demonstrated that samples with low PMC scores were significantly enriched with the two classes of low-quality samples (Area Under Curve = 0.823) ( Figure 6A-B). The cutoff threshold of PMC score was determined to allow 80% sensitivity to identify high-quality samples, and identified 42/108 samples that fail the test, including 19 samples with skewed protein distribution and 9 samples that co-clustered with normal breast tissues. We then assessed within-gene and within-sample cis protein-mRNA correlations in PMC-pass and PMC-fail samples. PMC markers were removed from the total gene set generating a test set of 7620 protein-mRNA pairs. Within-gene and within-sample correlation were both significantly higher in PMC-pass samples (within-gene: p<1x10 -16 , within-sample: p=4.08x10 -7 ), while the two types of trans correlations were not different between pass and fail samples ( Figure 6C-F). Of the 108 tumor samples assessed, 63 samples were PMC-pass for both RPPA and MS data (dual PMC-pass). For the set of proteins quantified by both RPPA and MS that were not defined as either RPPA or MS PMC markers, inter-platform concordance of the proteins was significantly higher in the dual PMC-pass samples (median Pearson's R=0.291) than in samples with at least one PMC-fail measurement (median Pearson's R=0.166, p=0.02, Figure 6G). Similar to RPPA data, the PMC-pass set trended towards enrichment of samples with higher tumor purity ( Figure 6H, p=0.066).
We further tested whether RPPA-based consensus PMC markers could be employed to credential MS samples. Based on the 28/30 RPPA-based consensus markers present in CPTAC BRCA data, a RPPA-based PMC score was computed for each sample. The RPPA-based score provided moderate correlation with MS-based score (R=0.497, Figure 7A) with a reasonable ability to identify low quality samples determined by MS-based markers (AUC=0.772, Figure 7B-C). Moreover, ROC analysis demonstrated that the RPPA-based score was able to detect samples with skewed protein distribution and normal-like samples (AUC=0.688, Figure 7D). Similar to MS-based markers, within-gene and within-sample cis correlations, but not trans correlations, were remarkably higher in RPPA-based PMC-pass samples and proteins quantified by MS and RPPA trended towards higher concordance between the two platforms in dual PMC-pass samples. (Figure 7E-I). Analysis of purity estimates showed no significant difference in PMC pass and fail samples ( Figure 7J). Importantly, the RPPA and MS analysis was performed on different portions of the tumor potentially contributing to the differences in scores.
We further tested the PMC score analysis in an independent MS data set from the CPTAC OV study.  Figure 9F). Additionally, a higher purity score was associated with the PMCpass set (p=0.0176, Supplementary Figure 9G). Therefore, in two independent MS data sets, samples with better correlated global protein-mRNA levels, higher purity and more concordant inter-platform proteomic quantification were identified by the PMC analysis.
Taken together, the PMC score appears to be a useful and robust metric to identify samples that are not well suited to proteogenomic analysis using either RPPA or MS data sets. The described PMC analysis approach could be applied to protein/mRNA data sets that have been generated, or even could serve as a low-cost solution to prequalify samples to be assayed by high-throughput technologies in proteogenomic studies (Figure 4).

Discussion
Recent large-scale multi-omics initiatives have characterized thousands of tumors expanding our understanding of information flow from DNA aberrations to RNA to protein. While emerging evidence is consistent with a complex linkage between mRNA and protein levels, there is a pressing need to minimize the effects of technical confounding factors and to evaluate sample quality for integrated proteogenomic studies. Further an ability to do this prior to performing sample preparation for RNA-Seq or MS would greatly decrease the costs and increase the information content obtained from such studies.
To address this issue, we analyzed RNA-seq and RPPA data across 31 tumor types from TCGA and Intriguingly, different factors, likely independent of preanalytic characteristics, intrinsic to the analytic platform used likely contribute to protein-mRNA correlation in RPPA and MS sample sets. In RPPA data, dynamic range is significantly associated with protein-mRNA correlations, highlighting dynamic range as a critical feature to test for utility of antibodies in RPPA. While for MS, assay sensitivity is a more important determinant, as low mRNA or protein amounts are associated with remarkably worse correlations. Indeed, previous MS and RPPA comparisons have shown that the antibody based RPPA platform has a greater sensitivity for at least a subset of proteins [32].
We reasoned that variation in correlations in a single sample for mRNA-protein pairs that were highly correlated across a sample set could be used to "qualify" individual samples as high quality including having similar tumor and stroma composition for proteogenomic analysis. Indeed, we were able to identify a set of proteins and RNAs that were highly correlated suggesting that protein amount is driven primarily by translation of existing RNAs. These protein-mRNA pairs were enriched for molecules that participate in normal cellular processes, a concept that the Human Proteome Map Project has also observed in normal human tissues [33,34]. If this concept holds across multiple data sets, it would serve to identify samples that should be discarded from the analysis improving reliability of the data. We analyzed global protein-mRNA correlation as a metric for quality of paired samples because of its representative of two classes of preanalytic challenges: protein or RNA quality of individual samples and structural heterogeneity of matched sample pairs. Additional measurements such as total protein content and DNA-based tumor purity scores can also provide indirect information of sample quality. As expected, when we applied this approach to samples with matched RNA-seq and RPPA or MS data, samples with high PMC scores were associated with other correlates of high quality.
A lack of concordance in protein abundance measured by MS and RPPA has been observed in several proteomic studies [1,7], which may be attributed to protein degradation, biological heterogeneity in the samples assayed by the two approaches or systematic difference between the two technologies. Our data revealed that it could be markedly improved by assessing samples credentialed by PMC analysis, which suggests that some of the discrepancies between the platforms are due to differences in sample quality or in composition of the samples assessed.
Protein-mRNA correlations represent an aggregation of distinct sources of variability. In addition to dependence of protein and mRNA levels on preanalytic challenges, samples from different tissues or experimental batches and/or genes with markedly different dynamic range may exhibit opposite trends, thus leading to incorrect conclusions -an effect similar to Simpson's paradox [23,35]. In analysis of TCGA RPPA data set, a pan-cancer cohort that represents substantial diversity in tissue types and patient resources, we accounted for these potential confounding factors using computational approaches. To assess the effects of tissue types, we adopted a leave-one-out cross-validation approach and demonstrated the efficacy of PMC score both in the pan-cancer data set and within individual disease types. To address the challenges of batch effect, RBN-normalization was applied to all samples to ensure that dynamic range and median levels were comparable across batches. Dynamic range varies remarkably across different protein and mRNA amounts, which potentially would generate high acrossgene correlation even when within-gene correlation is low [35]. Therefore, we assessed both withingene and within-sample correlation in the test protein-mRNA set, and designed control approaches with trans protein-mRNA correlations. The data showed that when accounting for these factors, for most tumor types and samples, the PMC-pass set has superior performance in predicting high-quality samples.
In theory, PMC analysis is a simultaneous evaluation for sample quality of RNA and protein data as well as sample heterogeneity. Aberrations in any one of the components (i.e. quality of RNA or protein or sample heterogeneity) would result in a low PMC score and the approach is unable to distinguish the different causes of low PMC scores. RNA-seq data was used as the gold standard in our analysis for three reasons. First, the robustness of RNA-seq has been extensively validated. Second, specific to TCGA data set, a large proportion of mRNA samples were quantified by both microarray and RNA-seq and yielded highly concordant results [36]. Third, in TCGA13 dataset, >99.8% of samples have RNA integrity score (RIN) >7, and 50.9% of samples have RIN >8.5, suggesting that there was limited degradation of RNA in the TCGA sample set.
As described, PMC was validated for use after RNA and protein analysis has been performed. However, given the costs of each approach, it would be optimal to be able to prequalify samples. This could be done by analyzing a relatively small set of RNA and proteins (the consensus PMC marker is only 30 RNA protein pairs) using approaches such as Nanostring with barcoded antibodies[37] that allows concurrent analysis of RNA and protein or Nanostring combined with a Luminex or Multiple Reaction Monitoring (MRM) or similar platform (Figure 4). This could help to identify samples that are likely to provide high quality proteogenomics data. A more formal PMC analysis could be done following the generation of RNA-Seq and MS or RPPA data to eliminate any samples that failed later in the analysis pipeline.
The PMC score has some limitations. First, its performance varies in different tissues. In several tumor types, the overall protein-mRNA correlation is low potentially due to antibody set surveyed on the RPPA platform ( Figure 2B), therefore, they were not included in our analysis. Second, PMC analysis based on RPPA consensus markers in RPPA data sets requires detection of 30 RNA and proteins for an efficient analysis. Third, a low PMC score can occur due to tissue heterogeneity or poor-quality protein or RNA analysis. PMC cannot distinguish between these mechanisms. Fourth, a surrogate quality control approach of analyzing the 30 PMC consensus markers on Nanostring or a similar platform that can assess both RNA and protein has not yet been developed. Fifth, setting cut-off values for high quality remains a challenge as it does with any continuous variable. When applying PMC analysis in MS data, the cutoff threshold was determined based on the assumption that the majority of samples are of high quality and provide high protein-mRNA correlation, so that the PMC score of matched sample pairs is significantly better than that of randomized pairs. Unfortunately the CPTAC BRCA MS data set had a limited sample size and many samples had evidence for preanalytic challenges including "skewed" quantification and likely normal contamination in the protein data set [9]. Nevertheless, it appears that the PMC score approach has the potential to greatly enrich the quality of proteogenomic analyses.
In conclusion, this study shed light on an untapped issue in quality assessment of proteogenomic studies through large-scale multi-platform transcriptomic and proteomic analysis, and provides a novel objective metric, PMC score, to simultaneously qualify paired RNA and protein samples for proteogenomic studies.