An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer*

Improvements in mass spectrometry (MS)-based peptide sequencing provide a new opportunity to determine whether polymorphisms, mutations, and splice variants identified in cancer cells are translated. Herein, we apply a proteogenomic data integration tool (QUILTS) to illustrate protein variant discovery using whole genome, whole transcriptome, and global proteome datasets generated from a pair of luminal and basal-like breast-cancer-patient-derived xenografts (PDX). The sensitivity of proteogenomic analysis for singe nucleotide variant (SNV) expression and novel splice junction (NSJ) detection was probed using multiple MS/MS sample process replicates defined here as an independent tandem MS experiment using identical sample material. Despite analysis of over 30 sample process replicates, only about 10% of SNVs (somatic and germline) detected by both DNA and RNA sequencing were observed as peptides. An even smaller proportion of peptides corresponding to NSJ observed by RNA sequencing were detected (<0.1%). Peptides mapping to DNA-detected SNVs without a detectable mRNA transcript were also observed, suggesting that transcriptome coverage was incomplete (∼80%). In contrast to germline variants, somatic variants were less likely to be detected at the peptide level in the basal-like tumor than in the luminal tumor, raising the possibility of differential translation or protein degradation effects. In conclusion, this large-scale proteogenomic integration allowed us to determine the degree to which mutations are translated and identify gaps in sequence coverage, thereby benchmarking current technology and progress toward whole cancer proteome and transcriptome analysis.


Improvements in mass spectrometry (MS)-based peptide
sequencing provide a new opportunity to determine whether polymorphisms, mutations, and splice variants identified in cancer cells are translated. Herein, we apply a proteogenomic data integration tool (QUILTS) to illustrate protein variant discovery using whole genome, whole transcriptome, and global proteome datasets generated from a pair of luminal and basal-like breast-cancer-patient-derived xenografts (PDX). The sensitivity of proteogenomic analysis for singe nucleotide variant (SNV) expression and novel splice junction (NSJ) detection was probed using multiple MS/MS sample process replicates defined here as an independent tandem MS experiment using identical sample material. Despite analysis of over 30 sample process replicates, only about 10% of SNVs (somatic and germline) detected by both DNA and RNA sequencing were observed as peptides. An even smaller proportion of peptides corresponding to NSJ observed by RNA sequencing were detected (<0.1%). Peptides mapping to DNA-detected SNVs without a detectable mRNA transcript were also observed, suggesting that transcriptome coverage was incomplete (ϳ80%). In contrast to germline variants, somatic variants were less likely to be detected at the peptide level in the basal-like tumor than in the luminal tumor, raising the possibility of differential translation or protein degradation effects. In conclusion, this large-scale proteogenomic integration allowed us to determine the degree to which mutations are translated and identify gaps in sequence coverage, thereby benchmarking current technology and progress toward whole cancer proteome and transcriptome analysis. Massively parallel sequencing (MPS) 1 of cancer genomes has demonstrated enormous complexity, and it is often unclear which somatic mutations drive tumor biology and which are nonfunctional passenger mutations that passively accumulate. RNA sequencing is frequently used to determine which nucleotide variants are transcribed and therefore have the potential for biological function. However, many mutations detected at the DNA level are not observed at the mRNA level, and their observation is dependent upon expression of the stability of the mRNA (1). Mutation detection at the peptide level clearly increases the confidence that any given variant is a potential biological driver, and by assessing peptide levels across all of an individual's polymorphisms, an independent assessment of transcriptome coverage can be obtained.
Integrated proteogenomic methods that combine MPS analysis and proteomics are of particular importance for identifying novel peptides resulting from somatic mutations or inherited polymorphisms. The identification of peptide sequences by mass spectrometry (MS) relies heavily on the quality of the protein sequence database. Use of databases with missing peptide sequences will fail to identify the corresponding peptides within the proteomic data; however, addressing this by including large numbers of irrelevant sequences in the search will decrease sensitivity. Therefore, it is essential that data acquired through MPS are used to create tumor-specific databases, incorporating the possibility of variant proteins arising through somatic mutation, inherited polymorphisms, alternatively spliced isoforms, and novel expression. The goal of this study was to analyze the flow of information though the central dogma of biology in an unbiased and comprehensive way to profoundly understand the aberrant information flux that underlies all cancer biology (2)(3)(4)(5).
Integrated proteogenomic methodologies have been successfully implemented in several model systems, including mouse and human cell lines (2, 4 -7), and in other model systems for gene annotation (3, 8 -10). Sample-specific DNA databases in MS identification pipelines are a developing tool, but we lack a clear understanding of the proteomic depth and sensitivity required to obtain a truly comprehensive mutant or germline variant peptide identification. To address these issues, we used two patient-derived xenograft (PDX) breast cancer models that provided enough material for extensive genomic, transcriptomic, and proteomic analyses to determine the capabilities and limitations of MS/MS for novel protein isoforms identification in cancer cells.

EXPERIMENTAL PROCEDURES
Patient-Derived Xenograft (PDX) Tumors-Patient-derived xenograft (PDX) tumors from established Basal (WHIM2) and Luminal-B (WHIM16) breast cancer intrinsic subtypes (11) (12) were obtained from the Washington University Human and Mouse-Linked Evaluation of Tumors Core. The tumors were raised subcutaneously in 8-weekold NOD.Cg-Prkdc scid Il2rg tm1Wjl /SzJ mice (Jackson Labs, Bar Harbor, ME) as previously described (13) (14) and in compliance with NIH regulations and institutional guidelines and approved by the institutional review board at Washington University. These tumors have significantly different gene expression and proteomic signatures (14) that are related to their intrinsic biology and endocrine signaling. Tumors were harvested and processed to a frozen powder with minimal ischemia time as previously described (15,16). Tumors from each animal were harvested by surgical excision at ϳ1.5 cm 3 with minimal ischemia time by immediate immersion in a liquid nitrogen bath. The tumor tissues were then placed in precooled tubes on dry ice and stored at Ϫ80°C. A tissue "pool" of cryopulverized tumors was prepared in order to generate sufficient material that could be reliably shared and analyzed between multiple laboratories. Briefly, tumor pieces were transferred into precooled Covaris Tissue-Tube 1 Extra (TT01xt) bags (Covaris #520007) and processed in a Covaris CP02 Cryoprep device using different impact settings according to the total tumor tissue weight: Ͻ250 mg ϭ 3; 250 -350 mg ϭ 4; 350 -440 mg ϭ 5; 440 -550 mg ϭ 6. Tissue powder was transferred to an aluminum weighing dish (VWR #1131-436) on dry ice, and the tissue was thoroughly mixed with a metal spatula precooled in liquid nitrogen. The tissue powder was then partitioned (ϳ 100 mg aliquots) into precooled cryovials (Corning #430487). (Note: Cryopulverized tissue will melt if transferred to a plastic weighing boat.) All procedures were carried out on dry ice to maintain tissue in a powdered, frozen state.
Whole Genome Analysis and RNA Sequencing-Methodologies and data repositories have been published (14). Variant calling using the Whole Genome Sequencing (WGS) data were performed by an in-house developed pipeline combining several variant calling algorithms (VarScan (17), Genome Analysis Toolkit (GATK) (18), and Pindel (19)) to achieve better accuracy and sensitivity. For the RNA-Seq analysis, raw FASTQ files were trimmed by 1bp from both ends. All trimmed RNA-Seq reads were aligned to the human reference genome version hg19 using software TopHat version 2.0.3 with -g 1, -bowtie1 (version 0.12.7.0), -M, -x 1, -n 2, -no-coverage-search, and -fusion-search settings to generate Binary Sequence Alignment/Map (BAM) files, junction files, and fusion files. Then software TopHat-Fusion (20) version 2.0.3 and ChimeraScan fusion detection (17,18) (default parameters) were used to generate final potential fusion genes. BedTools and in-house developed software were then used to generate the base pair counts for each exon based on Ensembl annotation file version GRCh37.68 using mapped reads in BAM format.
Global Proteomics-Tumor pieces were cryopulverized in precooled Covaris Tissue-Tube 1 Extra (TT01xt) bags and processed in a Covaris CP02 Cryoprep device, with impact setting derived from the weight of the tumor (Ͻ250 mg ϭ 3; 250 -350 mg ϭ 4; 350 -440 mg ϭ 5; 440 -550 mg ϭ 6). The resulting powder was then transferred to a weighing dish, mixed using a precooled metal spatula, and partitioned into ϳ100 mg aliquots. All procedures were completed on dry ice to maintain sample freezing. Proteins were reduced, alkylated, and subjected to trypsin digestion. Peptides were separated using an off-line high pH (7.5 or 10) reversed-phase column and analyzed by Thermo Fisher Q-Exactive and Orbitrap Velos instruments. The iTRAQ analysis was completed by three separate sites, and the label-free analysis was performed at two separate sites. The data are freely available through the Clinical Proteomic Tumor Analysis Consortium (CPTAC) Data Portal (https://cptacdcc.georgetown.edu/cptac/study/ showDetails?accNumϭS010).
Tumor-Specific Database Construction-QUILTS is a publicly available software (quilts.fenyolab.org) that can take in up to four inputs for sample-specific database creation: (i) a Browser Extensible Data (BED) file containing RNA-Seq predicted junctions; Variant Call Format (VCF) files containing (ii) somatic variants and (iii) germline variants; and (iv) a fusion file containing all predicted fusion genes. The output of QUILTS is a protein sequence FASTA file, using either Ensembl or RefSeq as a reference for the hg19 proteome and genome. Example VCF, BED (http://genome.ucsc.edu/FAQ/FAQformat.html), and fusion input files are available alongside the QUILTS webserver for mock database creation and file formatting.
Step 1. Creation of Variant Peptide Database-QUILTS parses out details of variant location and nucleotide change from the variant VCF file for subsequent incorporation into genomic sequences. In cases where both somatic and germline variants are present, tumor-specific variants are obtained by filtering out all germline variants from the somatic variant calls. Based on intron/exon boundaries from Ensembl (version 70) or RefSeq (version 20130727), sequences of annotated coding regions are extracted, and variant changes were incorporated into these regions based on genomic location. The modified sequences are then translated to proteins in a single reading frame and stored as a FASTA file (Fig. 1A). Stop codon removal and insertion due to single amino acid changes were accounted for and highlighted within file output. In this study, even low-quality variants were included in the variant database to allow for validation by mass spectrometric analysis.
Step 2. Creation of Junction Protein Database-The RNA-Seqderived junction BED file contains predicted boundaries of adjoining exons (chromosome, intron start, intron end) for each sample. QUILTS filters out splice junctions matching previously annotated boundaries (Ensembl or RefSeq), leaving only novel junction which are split into three categories, unannotated alterative splicing (two known exons, Fig. 1B), completely novel junctions (no known exons, Figs. 1F and 1G), and partially novel junctions (one known exon, Figs. 1C-1E). The required frame translation for in silico protein synthesis is indication in Fig. 2 for each scenario. Variant peptides are incorporated into the sequences of the alternatively spliced proteins as described in Step 1 (Fig. 1A). In this study, junctions with at least one supporting RNA read were included in database creation.
Step 3. Creation of Fusion Protein Database-QUILTS translates predicted fusion gene output by a six-frame translation (Fig. 1H). Protein coding regions with more than six consecutive amino acids are included in the fusion protein database.
Experimental Design and Statistical Rational-Proteowizard (release 2.2.3246 (2012-1-30)) was used to convert raw MS data files to mzML format for downstream data analysis. Peptide spectral matching was then done using the tumor-specific QUILTS databases with the database search engine X! Tandem CYCLONE (2013.02.01.1) using precursor tolerance of 20 ppm and fragment mass tolerances of 20 ppm for high-resolution data and 0.4 Da for low-resolution data, complete modifications of carbamidomethylated cysteines and iTRAQ 4-plex on peptide N termini and lysines, potential modifications of oxidation of methionine, and deamidation of aspartic and glutamic acid, allowing for one missed cleavage. The databases contained a total of 1,078,922 and 1,225,632 trypic peptides for basal and luminal, respectively. Both databases included a list of external contaminants from common Repository of Adventitious Proteins (cRAP) (maintained by the global proteome machine organization) in addition to Ensembl sequences for Mus musculus. The search was done against a concatenated database containing mouse and human Ensembl 37.70, cRAP, basal and luminal variant peptides, and NSJ peptides. Searches were completed independently for each center (A-E, Table I) using a simple False Discovery Rate (FDR) with separate target-decoy searches that were created through reversal of all protein sequences and searched in combination (21). The FDR was calculated as described by Kä ll, et al. (22) and was controlled for at a q-value of 0.01 (1%) at the peptide level. The FDR was calculated separately for variant, junction, and references peptide (23). Information detailing identified variant and junction peptides are included  Table 4). The dataset was also searched against the same databases, allowing up to one amino acid substitution per tryptic peptide, and any spectrum that in this search matched better against a variant peptide not detected in the WGS data was removed from the subsequent analysis. Also, peptides with length shorter than eight amino acids were excluded. The results were further filtered on the percentage of the intensity in the fragment mass spectra that were explained by fragmentation of the assigned peptide sequence and only peptides with PSM where larger than 50% of the fragment ion intensity matched to the sequence (all peaks with intensity larger than 1% of the largest fragment peak) were included. Peptides were also required not to have continuous gaps of larger than two amino acids, i.e. peptides with three amino acids in a row without evidence of fragmentation between them were included, but peptides with four or more amino without fragment ions between them were excluded. In addition, if the total number of gaps throughout a peptide exceeded six, the peptide was excluded. The rationale for the filters based on gap size was the observation that, commonly when extensive fragmentation is observed for one end of a peptide but not for the other end, the identification is incorrect. In these cases, manual inspection often reveals for high-quality spectra that the search engine has assigned the incorrect sequence, and the correct sequence is missing from the database that was searched.
To assess the quality of the variant peptide identification, the distributions of the fraction of MS/MS intensity explained by the identified peptide, the peptide mass, the e-values, and total fragment ion intensity were compared with the distributions for all identified peptides in the reference protein sequence database (Supplemental Figs. 1A-1D, 1G-1J, and 1M-1P). Also, the distribution of the variant (Supplemental Figs. 1E, 1K, and 1Q) and reference (Supplemental Fig.  1F, 1L, and 1R) peptides in the four-dimensional space spanning MS/MS intensity explained by the identified peptide, peptide mass, e-value, and total MS/MS intensity were compared. Because no significant difference was observed, we conclude that the FDR for the variant peptides can be accurately estimated using the overall peptide FDR.
Also, for the NSJ peptides the quality was assessed by comparing the distributions of the fraction of MS/MS intensity explained by the identified peptide, the peptide mass, the e-values, and total fragment ion intensity to the distributions for all identified peptides in the reference protein sequence database (Supplemental Figs. 2A-2D, 2F-2I, 2K-2N). Also, the distribution of the NSJ (Supplemental Figs. 2E, 2J, and 2O) and reference (Supplemental Figs. 1F, 1L, and 1R) peptides in the four-dimensional space spanning MS/MS intensity explained by the identified peptide, peptide mass, e-value, and total MS/MS intensity were compared. Again no significant difference was observed between the NSJ peptides and the reference peptides, indicating that the overall quality of NSJ peptides is close to that of the reference peptides.
In addition, no significant difference was observed for germline and somatic variant peptides (Supplemental Figs. 3A-3R) or for variant peptides with or without mRNA evidence (Supplemental Figs. 4A-4R).
All the individual peptide spectrum matches are shown in Supplemental Fig. 5. The spectra are annotated with the assigned b-and y-ions and neutral losses. The evidence for fragmentation between pairs of amino acids is also shown by annotating the peptide sequence with bars with lengths proportional to the corresponding fragment ion intensity.
For variant and novel peptide identification, a total of 18 iTRAQ and 30 label-free sample process replicates (independent tandem MS experiments using identical sample material) were used to identify sample-specific peptides. Peptides with at least one peptide spectral match were considered as a positive identification. This is a reasonable assumption because of the stringent filter that includes the requirement of no gaps in the peptide sequence coverage larger than two consecutive amino acid pairs and a maximum of six gaps over the entire peptide. Following identification, peptides matching the reference human, mouse or cRAP proteomes were removed, leaving those peptides that map only to predicted variant or novel junction peptides. All variant and NSJ peptides were compared with the RefSeq Human ϩ Mouse protein sequence database (downloaded December 1, 2011) to filter out any known peptides that were missed by the original filtering step. Additionally, all novel peptides were compared against the GenBank nonredundant translation database and the neXtProt protein database (24), and all matches were included in the associated tables (Supplemental Tables 1 and 2). Peptides with nonredundant matches were included in all analysis except in the creation of highly confident variant and novel junction peptide tables, for which they were removed. For each of the 48 process replicates, the number of unique variant or novel peptides identified was used for titration analysis. Robust regression was used to determine the correlation between total peptides identified and novel/variant peptides in each run. Variants were searched against the dbsnp database (build 142) and the Catalogue of Somatic Mutations in Cancer (COS-MIC) to determine if the genomic variant had been previously identified (Supplemental Table 1C).
As per minimum guidelines for proteogenomic studies, suggested by Nesvizhskii (23), the custom protein sequence databases described here are available at http://openslice.fenyolab.org/data/ compref/compref.zipand annotated spectrum for all novel splice junction, and single nucleotide variant peptides are provided (Supplemental Fig. 5).

RESULTS
PDX tumor lines were generated from primary breast tumors as previously described (25). Two PDX lines (one derived from a luminal tumor, WHIM16, and the other from a basal-like tumor, WHIM2) were included in this study, of which the whole genome sequences (WGS) and RNA-Seq analysis have already been published (13,14). In the WGS analysis, 40x-50x coverage of 75 bp paired end reads (1.5 ϫ 10 9 -2 ϫ 10 9 ) was obtained for germline, tumor, and xenograft. RNA-Seq data of 4 ϫ 10 8 , 75 bp paired end reads were obtained for each xenograft. Global proteome expression was acquired by 18 independent iTRAQ-labeled MS/MS sample process replicates (each analysis contained the two luminal and two basal samples, both of which were represented by two different reporter ions in the iTRAQ 4-plex) and 15 independent labelfree MS/MS sample replicate pairs, containing each PDX sample. These experiments were completed across five centers, three using iTRAQ chemical labeling and two using labelfree methods (Table I, Supplemental Fig. 6). The number of total peptides, variant peptides, and novel junction peptides identified by each center varied based on MS instrumentation, fraction number, and gradient length (Table I). A total of 184,182 unique human peptides and 9,597 proteins were identified across all datasets, representing the state-of-the-art proteomic coverage available with current techniques. Paired tumor-specific protein databases and corresponding deep proteomics tandem MS data were used to first identify novel protein isoforms involved in breast cancer progression and second to determine the depth of proteomic analysis required to capture comprehensive variant peptide identification.
Variant identification through whole genome sequencing permits the representation of nonsynonymous substitutions in our protein sequence database, increasing the number of tumor-specific peptides that can be identified by tandem MS (2,5,9). Most MS identification methods rely on reference databases, such as Ensembl, UniProt, or RefSeq, for peptide matching. Unfortunately, these databases do not allow novel peptide identification due to variations in the somatic or germline genome. In order to identify tumor-specific variant peptides, we used the proteogenomic integration tool Quantitative Integrated Library of Translated SNPs/Splicing (QUILTS), developed specifically for cancer proteome analysis (http:// quilts.fenyolab.org). QUILTS uses protein coding variant calls (germline and somatic) and RNA-Seq-based junction predictions to build a customized, tumor-specific database containing peptide sequences that contain single nucleotide variants and bridging sequences from alternative splicing against the background of a reference human proteome database.
Variant Peptide Identification-Following its construction, each sample-specific database was used to identify tumor and sample-specific peptides attributable to both germline and somatic genomic variants. A total of 31,792 basal (26,421 germline and 5,371 somatic) and 28,635 luminal (16,662 germline and 11,973 somatic) variant peptides were predicted by WGS. Of these, we expect only a portion to be both verifiable by MS and also expressed at the mRNA level. The portion of variants identifiable by MS can be predicted based on peptide characteristics known to provide efficient MS identification, including low hydrophobicity and peptide length within 6 -30 amino acids. Further, peptides that map to more than one proteomic location are difficult to attribute to a specific gene (26). Therefore, we considered only proteotypic tumor-specific changes with appropriate amino acid lengths as verifiable variant peptides. We found that ϳ3.5% of theoretical variant peptides mapped to more than one gene for basal and luminal tumors, which were then removed from subsequent analysis. An additional 32% of all possible variant peptides were found to be outside of the peptide length limitations for both tumor types. The proportion of variants that are expressed at the mRNA level, however, cannot be easily predicted and relies on variant calling from whole transcriptome sequencing. Of the approximate 65% of peptides meeting the requirements for MS/MS, a minority had evidence for expression at the mRNA level (30.5% germline basal; 30.7% germline luminal; 10% somatic basal; 19% somatic luminal) ( Figs. 2A and 2B).
A total of 772 unique variant peptides were identified across all 48 sample process replicates, 119 of which were found by both labeled and label-free analysis, representing a total of 667 genomic SNVs (Supplemental Table 1). Peptide spectral matching of iTRAQ and label-free global proteomics data using tumor-specific databases identified a total of 610 basal and 496 luminal variant peptides (including both somatic and germline variants). Each variant peptide contains at least one amino acid change resulting from a genomic SNV, with eight of the 772 containing two amino acid substitutions, two peptides with three substitutions, and one peptide with five substitutions (Supplemental Table 1). An overwhelming majority of variants identified were represented in dbsnp (96.8%) and neXtProt (76.8%) databases, leaving only 1.8% of identified protein variants being described by neither database. Interestingly, ϳ20% of the variants identified by proteomics analysis lacked mRNA evidence based on RNA-Seq variant calling (Fig. 2B). This can be explained, at least partially, by the limitations associated with variant calling from RNA-Seq data due to the inherent complexity of the transcriptome (27).
Of the 610 variant peptides in the basal tumor, 605 were due to germline variants and only five were due to somatic variants (Fig. 2C). Two of the five basal somatic variants were identified by RNA-seq, one in the transcription factor GTF3C3 (ALGYMEGAAESYGK) and one in the nucleoprotein AHNAK2 (SFGVLAPGK) (Supplemental Table 1A). In the luminal tumor, 140 somatic and 356 germline variant peptides were identified, and more than 70% of the luminal somatic variants were found at the mRNA level. The difference in somatic variant expression between the basal and the luminal is considerable, implicating differential translation or increased protein degradation effects in the basal tumor. While ϳ1/3 of the variant peptides were only identified by only one spectral match across all MS/MS runs, 20% had ten or more associated   Table 1). Additionally, comparison of the proteomic variants with predicted SNVs in 105 of the breast The Cancer Genome Atlas (TCGA) tumors chosen for analysis by CPTAC found that 89% of the same genomic variants were present at the DNA level in at least one of the human breast tumors (Supplemental Table 1).
We identified a subset of highly confident somatic variant peptides that had more than ten peptide spectral matches identified in at least one process replicate, associated RNA expression, and evidence for expression in the associated protein (100ϩ PSM along the protein) (Table II). Fifteen variants were identified as tumor specific (somatic only) with high confidence, including variants in Keratin19 (ENSP00000408759: A60G), the mannose-6 phosphate receptor-binding protein Perilipin-3 (ENSP00000465596: V275A), and the interferon regulatory factor binding protein IRF2BP2 (ENSP00000355569: V275A) (Table II).
In order to determine the number of sample process replicates that are required to identify the maximum number of translated variants, we completed a titration analysis using 18 global iTRAQ and 28 label-free low-resolution MS/MS experiments. We demonstrated that additional MS analysis would likely result in further variant peptide identification (Figs. 3A and 3B). The number of total peptides identified in these replicates varied between 19,720 to 125,912 for iTRAQ labeled and 10,506 to 33,091 for labelfree MS/MS, and the identification of variant peptides correlated to the total number of total peptides identified in each replicate (iTRAQ: r 2 ϭ 0.665; label free: r 2 ϭ 0.745). Despite this correlation, even process replicates with the highest peptide identifications identified less than 40% of the total unique variant peptides observed. (Supplemental Figs. 7A and 7B).
Novel Junction Peptide Identification-In addition to amino acid changes resulting from SNVs, alternative splicing and novel expression have been shown to effect tumor growth and disease progression (28). Intron/exon boundaries determined by RNA-Seq analysis were used to identify unannotated alternative splicing events (matching to two known exons Fig. 1B), partially novel splicing (matching to only one known exon Figs. 1C-1E), and completely novel splicing (matching to no known exons Figs. 1F and 1G) occurring in basal and luminal PDX tumors. The number of MS-verifiable peptides, being both the appropriate length and gene specific, was identified to determine the full proteomic potential due to novel splice junction (NSJ) peptides. Two percent of unannotated alternative splicing peptides, 1.3% of partially novel splicing peptides, and less than 0.01% of completely novel peptides were removed from the analysis due to their mapping to other known genes. An additional ϳ24% of novel junction peptides were found to be outside of the MS peptide length limitations (Fig. 4A). Based on this analysis, we predicted a total of 22,187 and 20,442 unannotated alternative splicing peptides, 217,715 and 180,100 partially novel junction peptides, and 238,891 and 164,273 completely novel peptides to be verifiable by MS for luminal and basal, respectively (Fig. 4A). Using these custom databases for spectral matching, we were able to identify less than 0.05% of the novel junction peptides for both basal and luminal tumors (Fig. 4B).
We included all RNA-Seq junction predictions as input for protein database creation, without taking into account the number of reads supporting each junction. We used this liberal inclusion to create the most complete database with the available MPS data, containing all possible proteomic changes, with the idea that the proteomic data can be used to filter through transcriptional false positives. It is likely, however, that junctions confidently identified by proteomics anal- Variant peptides identified by MS which were found to be somatic, with 10 or more peptide spectral matches (PSMs), variant expression at the RNA level, and evidence for protein expression in the associated gene (100ϩ PSMs across the gene). Includes associated gene name, chromosome, genomic start and end, genomic variant location, Ensembl protein identifier, original and variant amino acids, and the method it was identified in (iTRAQ, Label Free Low Resolution (LF-LR) or Label Free High Resolution (LF-HR)). All highly confident variants were found to be somatic in the luminal tumor only. ysis will have more supporting reads. We therefore considered the number of junction reads supporting these peptides postprocessing and found that novel junctions with MS identifications had significantly more RNA sequencing reads (median reads basal ϭ 11, luminal ϭ 6) compared with all junctions in the protein database (median reads basal ϭ 2, p ϭ 1.8e-4; luminal ϭ 2, p ϭ 2.898e-5). Upstream filtering of junction data containing five or more supporting reads drastically reduced our predicted novel junction peptide number to 76,053 basal and 91,609 luminal peptides, and ϳ50% of novel junction peptides identified by MS/MS met this criteria (80 of the 129 basal and 68 of the 147 luminal peptides) (Fig.  4B). The ratio of peptides identified to peptides predicted was similar between each of the three junction classes (Supplemental Fig. 8).
A total of 86 unique novel splice junction (NSJ) peptides were identified in by least one sample process replicate. Of the 66 NSJ peptides identified in basal tumors, eight supported unannotated alternative splicing, 32 indicated splicing of a known exon with a novel coding region, and 26 demonstrated the splicing of two novel exons. Similarly, seven peptides supporting splicing of two known exons, 34 with one known exon, and 26 completely novel exon splicing events were identified in luminal tumors. Of these, ϳ50% of the NSJ peptides were identified in both basal and luminal tumors (Supplemental Table 2). Approximately half of novel junction peptides were supported by only one peptide spectral match by only one process replicate (Fig. 4C).
We identified 16 novel junction peptides we considered to be highly confident, having at least five peptide spectral FIG. 3. Titration analysis of SNV and novel junction peptide discovery. Titration to determine the number of tandem MS process replicates needed to identify all variant peptides for iTRAQ (A) and label free (B) and novel junction peptides for iTRAQ (C) and label free (D) MS/MS. iTRAQ center A corresponds to process replicates 2, 3, 8, 10, 11, 13, and 14; center B replicates 4, 6, 9, 15, 16 and 18; and center C replicates 1, 5, 7, 12, and 17. Only label-free analysis from center E was used in this analysis (B, D). Basal and luminal tumor peptide identifications were pooled for titration analysis. Total list of SNV and novel junction peptides are included in Supplemental Tables 1-2. matches, more than five supporting RNA read, and evidence for protein expression in the associated gene (Ͼ50 PSMs) (Table III). In a few cases, NSJ peptide appeared to be present in both the basal and the luminal tumor, i.e. iTRAQ reporter ions were observed for both samples and RNA-Seq evidence existed for both samples. In these cases, the sample is listed as both basal and luminal, but sometimes the identification from label-free analysis was able to determine the sample of origin. The peptide with the most evidence (highest PSM count) was found for a novel coding region near BHMT2 in luminal tumors (Supplemental Table 2).
We completed similar titration analysis shown for the variant peptides to find the number of unique novel peptides cumulatively identified with each process replicate. As seen with the variant data, spectra from all replicates were required to attain the complete list of junction peptides in both iTRAQ and label-free analysis (Figs. 3C and 3D). The novel peptide identification was similarly correlated with total peptides identified than that seen with the variant peptide identification, with an R-squared value of 0.561 for iTRAQ and 0.385 for label-free tandem MS. Our results establish that the extent of novel and variant peptide identification is highly dependent on the depth of proteomic analysis, and advancements in MS sensitivity will result in more comprehensive variant peptide identification. To better understand this continued rise in SNV and NSJ peptides with increasing process replicate runs, we plotted a titration for reference peptide identification for iTRAQ and label free MS/MS (Supplemental Fig. 9). This analysis found a similar increase in reference peptide identification with each successive iTRAQ process replicate, though the label-free curve leveled out after ϳ12 replicate MS/MS runs. This indicates that increasing the number of process repli- cates run, particularly using different platforms and methodologies, can identify a wider range of novel peptides.
Although splicing and variants were considered simultaneously in our tumor-specific databases, no peptides containing both a novel junction and SNV were identified from our analysis. We also used QUILTS to incorporate novel peptides from a six-frame translation of 38 gene fusion calls, 13 in basal and 25 in luminal. A total of 954 luminal fusion peptides and 483 basal fusion peptides were included in the protein sequence search database, but no positive identifications were found using our proteomics analysis. We expect this is, in part, a probabilistic issue, with relatively few potential fusion proteins represented in the protein sequence database. Although no fusion proteins were identified, of the 40 fusion gene junctions found within a known gene boundary, 26 had some evidence of protein expression (PSMϾ0 along the protein) (Supplemental Table 3). DISCUSSION The field of proteogenomics has seen considerable growth in the last five years, requiring a focus on the integration of genomics and proteomics through peptide mapping and the use of sequencing data to attain a comprehensive view of the proteome. Tools that utilize MPS data to better annotate and identify proteins are essential for broad protein identification. Ideally, integration of sequencing and proteomic analysis will vastly improve our understanding of complex biological systems, in that more comprehensive and diverse data from the same system can provide a clearer view of its biological landscape. Additionally, integrating multiple datasets can supplement the limitations associated with each method. For example, proteomics can be used to better annotate the protein coding regions of the genome (3, 8 -10); DNA sequencing can predict protein coding variants in MS/MS (4); and RNA-Seq transcriptomics can supplement MS proteomic coverage (5,6,29).
This study demonstrates the current advantages and limitations of proteogenomic integration for interpreting cancer biology, specifically in the detection of novel protein isoforms. We determined the capacity of current MS proteomic methods for variant and novel peptide identification in a sample that has undergone both extensive proteomic (48 global MS/MS sample process replicates) and DNA/RNA MPS analysis. Though the ability to detect more than 700 peptides that confirm the expression of single nucleotide variations and 86 novel splicing events is notable, no one MS replicate was able to recover more than 40% of the full list of observable peptides, regardless of the instrumentation or methodology used. Further, titration curves for both variant and junction peptide analysis suggest that increased experimental process replicates would continue to increase the number of captured novel peptides (Fig. 3).
Although the sensitivity of MS-based proteomics has improved dramatically, the limited dynamic range of the analysis is obviously still an issue. For example, variant peptides of established breast cancer genes such as PIK3CA or TP53 were not detected in this study. Since peptide size and physiochemical properties have large effects on ionization and fragmentation (30), i.e. because some peptides are well suited for MS analysis while others are not, some issues with low protein coverage are currently unavoidable using tandem MS approaches. In addition, the overall dynamic range of the analysis is several orders of magnitude lower than the range of protein concentrations, even when extensive separation of peptides is performed. This is due to a low MS dynamic range, being only four orders of magnitude for a given measurement, and a known bottleneck in instrument speed (31). Although greater fractionation may partially resolve the issue, this approach is limited by sample quantity. Advancements in MS protein identification sensitivity must be made, either through improvements in sample preparation, instrumentation sensi- Novel junction peptides identified by MS which were found to have more than 5 PSMs across iTRAQ and label free analysis, 5 or more RNA reads, and evidence for protein expression in the associated gene (50ϩ PSMs across the gene). Includes the number of known exons involved in splicing, peptide spectral matches, sample in which the peptide was identified, gene or the closest gene on the same strand, junction location, and the method it was identified in (iTRAQ, Label Free Low Resolution (LF-LR), or Label Free High Resolution (LF-HR)). tivity and speed, or identification algorithms, if we are to ultimately achieve complete proteomic landscape.
The low recovery between predicted novel junction peptides and those identified in the proteomics data is particularly striking (Fig. 4B), and similar levels of discovery have been reported in other mammalian systems (5). This supports the notion that there is a substantial level of noise in the transcriptome, i.e. many transcripts never result in stable proteins. Thus, proteomics becomes an essential technology in validating novel protein isoforms and splice sites. This result also points to the high quality of the human genomic annotation, with minimal novel exon discoveries despite the great depth in sequencing and proteomics. Further, the overlap in novel junction peptides identified in the luminal and basal tumors (Supplemental Table 2) indicates that these peptides may correlate to either "normal" exon structure or, instead, may be attributed to a more cancer-specific splicing activity.
This study also demonstrates the current advantages and limitations in DNA/RNA MPS analysis. A portion of the peptides spanning novel junctions were only supported by a one or a few RNA-Seq but had strong MS peptide evidence. Similarly, many variants had only low quality score from WGS or did not have supporting RNA-Seq evidence but had strong MS evidence. This makes proteomics the method of choice for determining which variants will be incorporated into stable proteins or cause protein misfolding and degradation and in the search for drivers of cancer.
The paucity of translated somatic variants in the basal-like sample in comparison to the luminal tumor should be further investigated in larger datasets. Translational controls and degradation mechanisms may in play to restrict the repertoire of mutant genes that are translated, and these maybe another source of inter-tumor heterogeneity. CONCLUSIONS Proteomics is essential in validating genomic changes, including SNVs and novel splicing to assess the degree to which these genomic alterations are translated and therefore biologically active. Although peptide coverage is restricted by current technologies, with only 10% coverage of variant peptide sequence even with multiple fractions and process replicates, our ability to validate genomic variation in cancer using proteomics is still substantial. This low variant peptide coverage may occur for several reasons. Tryptic digestion is clearly one limitation, and the use of additional enzymes and fractionation approaches are under investigation. The low rates of peptide detection may not just be a sensitivity issue but reflect biological effects. For example, the very low detection rates for novel splicing events may reflect the fact that many are not efficiently translated or quickly degraded. Similarly, lack of peptide expression for some SNV may be due to somatic-mutation-driven translational effects or protein instability/degradation. The application of proteogenomic integration methods to larger datasets and improvements in peptide identification and MS/MS sensitivity will help clarify these issues in the future.