A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics

Shotgun proteomics data analysis usually relies on database search. However, commonly used protein sequence databases do not contain information on protein variants and thus prevent variant peptides and proteins from being identified. Including known coding variations in protein sequence databases could help alleviate this problem. Based on our recently published human Cancer Proteome Variation Database, we have created a protein sequence database that comprehensively annotates thousands of cancer-related coding variants collected in the Cancer Proteome Variation Database as well as noncancer-specific ones from the Single Nucleotide Polymorphism Database (dbSNP). Using this database, we then developed a data analysis workflow for variant peptide identification in shotgun proteomics. The high risk of false positive variant identifications was addressed by a modified false discovery rate estimation method. Analysis of colorectal cancer cell lines SW480, RKO, and HCT-116 revealed a total of 81 peptides that contain either noncancer-specific or cancer-related variations. Twenty-three out of 26 variants randomly selected from the 81 were confirmed by genomic sequencing. We further applied the workflow to data sets from three individual colorectal tumor specimens. A total of 204 distinct variant peptides were detected, and five carried known cancer-related mutations. Each individual showed a specific pattern of cancer-related mutations, suggesting potential use of this type of information for personalized medicine. Compatibility of the workflow has been tested with four popular database search engines: Sequest, Mascot, X!Tandem, and MyriMatch. In summary, we have developed a workflow that effectively uses existing genomic data to enable variant peptide detection in proteomics.

DNA sequence variation is associated with diseases and differential drug response. As a paradigmatic example, cancers are diseases of clonal proliferations caused by mutations in oncogenes and tumor suppressor genes (1). After several decades of searching through traditional biology approaches, many mutant genes have been causally implicated in oncogenesis (2). Facilitated by new genomic techniques such as SNP (single nucleotide polymorphism) arrays and deep sequencing, the identification of cancer genes has made enormous progress over the past several years (3-7). The genomic abnormalities of cancer are expressed through aberrant proteins and proteomes and their altered functions. Although proteins reflecting the genomic changes in cancer have the potential to become clinically meaningful biomarkers, their discovery and validation have proven to be challenging. As a result, few biomarker candidates have translated into clinical use.
Over the past decade, mass spectrometry (MS)-based shotgun proteomics has emerged as a high-throughput, unbiased method for the identification of proteins in complex samples (8,9). Its application to tumor specimens holds great potential in identifying mutant proteins in human cancers. However, because shotgun proteomics data analysis usually relies on database search and because commonly employed protein sequence databases do not contain protein variation information, the application of shotgun proteomics to the detection of protein sequence variants remains a big challenge.
Several research groups have made valuable efforts to enable the identification of variant peptides based on the exhaustive search of all possible sequence variants. A modified version of Sequest provides automated search of human hemoglobin gene variants by dynamically generating all possible single-nucleotide variations and then constructing a database that translates these sequences to peptides (10). Roth et al. (11) developed a human protein database tailored for the "top-down" MS approach by combinatorial consideration of protein variability in a search. Similarly, the error-tolerant search in Mascot (12) and the refinement search in X!Tandem (13) allow exhaustive testing of all amino acid substitutions that can arise from single-base nucleotide substitutions in each protein. Because of the greatly expanded search space, it is difficult to apply a meaningful measure of statistical significance to the variant identifications, and the results require careful interpretation (12).
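To make the size of the exhaustive search space concrete, the following minimal sketch enumerates the amino acid substitutions reachable from a residue by a single-nucleotide change in any of its codons. It assumes the standard genetic code; the function name `single_base_substitutions` is ours and is not part of Mascot, X!Tandem, or any tool cited here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_base_substitutions(residue):
    """Amino acids reachable from `residue` by changing one nucleotide
    in any codon that encodes it (synonymous and stop changes excluded)."""
    reachable = set()
    for codon, aa in CODON_TABLE.items():
        if aa != residue:
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                if new_aa not in (residue, "*"):
                    reachable.add(new_aa)
    return reachable


def candidate_variant_count(peptide):
    """Number of single-substitution variants an exhaustive search would test."""
    return sum(len(single_base_substitutions(aa)) for aa in peptide)
```

Every residue of every peptide contributes several candidate substitutions, which illustrates why an exhaustive search inflates the search space far more than a database restricted to known variants.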
An effective approach to limit the search space of protein variants is to consider only those derived from known coding SNPs. A SNP annotation method was presented by Bunger et al. in which MS/MS spectra were searched against reference protein databases and a separate SNP database created from peptides from the National Center for Biotechnology Information (NCBI) dbSNP database (14). Schandorff et al. established the MSIPI protein sequence database by elongating the original International Protein Index sequences with coding SNPs from dbSNP, sequence conflicts, and N-terminal peptides (15). More recently, a web-based platform, SysPIMP, was created for identifying human disease-related mutant sequences based on the X!Tandem search of shotgun proteomics data (16). SysPIMP collects human disease-related mutant sequences from the Online Mendelian Inheritance in Man (17), Protein Mutant Database (18), and SwissProt database (19).
Despite these exciting developments, the problem of applying shotgun proteomics to the identification of protein variants in human cancers has not been addressed adequately. First, mutations, especially cancer-specific ones, are not specifically considered in existing approaches. NCBI's dbSNP database provides a general catalog of genome variation to address large-scale sampling designs required by association studies. It has been an invaluable resource for applying genetic approaches to understanding the etiology of different cancers (20). However, cancer somatic mutations are collected in the Catalogue of Somatic Mutations in Cancer (http://www.sanger.ac.uk/genetics/CGP/cosmic) (21) and other cancer-specific databases (22) rather than dbSNP. As a result, most cancer-specific mutations have been omitted from previous studies. Recently, we developed a human Cancer Proteome Variation database (CanProVar, http://bioinfo.vanderbilt.edu/canprovar/) (23) that comprehensively integrates proteome variation data from a variety of cancer-specific variation data sources including HPI (24,25), COSMIC, OMIM, and large-scale mutational profiling studies on cancer genes and cancer genomes (6,7). Confirmed coding variations in NCBI's dbSNP are also included in CanProVar. This cancer-centric proteome variation repository provides an opportunity to create a protein sequence database that can facilitate protein variant detection in shotgun proteomics analysis of human cancer samples.
Second, although limiting protein variants to known coding SNPs and mutations could effectively reduce the search space as compared with the exhaustive test of all possible amino acid substitutions, this method still significantly increases the number of entries in a protein sequence database, which in turn increases the risk of false positive identifications. Many previous reports failed to address this critical problem (14). In the study by Bunger et al. (14), a peptide is assigned as an "alternative allele" SNP if the search score for its match against dbSNP is at least 15% higher than the score for the corresponding reference hit. The threshold of 15% was chosen based on manual examination to provide the best balance between false positives and false negatives (14). Although it proved successful in that study, selection of the score threshold requires manual examination by experienced researchers and cannot be generalized and automated. Other problems introduced by adding variations to sequence databases include (1) efficient storage of variation information in the database, (2) compatibility of the database with different search engines, and (3) interpretability of reports that include both variant and wild-type peptides.
In this paper, we present an integrated workflow to address the above problems. First, we created a variation-containing protein sequence database based on the CanProVar database. Next, we developed a workflow for identifying both wild-type and variant peptides simultaneously from shotgun proteomics data. We used data sets from colorectal cancer cell lines and human patient samples to demonstrate our workflow. Identified variants were validated through genomic sequencing. Moreover, we tested the compatibility of the workflow with popular search engines including MyriMatch (26), Sequest (27), Mascot (28), and X!Tandem (13). A postprocessing tool was also developed to generate easily interpretable reports based on the output from different search engines. Finally, we benchmarked our workflow against the exhaustive search-based methods.

EXPERIMENTAL PROCEDURES
Proteomics Data Sets-The human proteomics data sets from colorectal adenocarcinoma cell lines (RKO, SW480, and HCT-116) and three colorectal tumor specimens were generated in the Ayers Institute at Vanderbilt. The cell lines were obtained from American Type Culture Collection (ATCC, Manassas, VA) and grown and harvested within 6 months of date of purchase, or grown from frozen stocks that had been made within 6 months of original purchase. They were grown in medium supplemented with 10% fetal bovine serum, penicillin, and streptomycin at 37°C with 5% CO2. SW480 was grown in RPMI 1640 medium, whereas HCT-116 and RKO were grown in McCoy's 5A medium. Cells were grown to 80% confluency, the growth medium was aspirated, and cells were washed once in 1× phosphate-buffered saline and collected in 1× phosphate-buffered saline. Cells were centrifuged at 300 × g for 5 min and the supernatant was discarded. Cell pellets were stored at −80°C until cell lysis could be carried out. Biological replicates were harvested ~1 week apart from the identical cell culture. These replicates were processed separately and independently through the complete analysis procedure. Colorectal tumor specimens were obtained from the Vanderbilt colorectal cancer repository under an IRB-approved protocol that included informed consent from the patients. We obtained three Stage III sigmoid carcinoma specimens based on availability of the biological material; each was confirmed by a certified pathologist (Dr. M.K. Washington) to contain more than 70% tumor cells. A total of 60 μm thickness for each of the frozen specimens was sectioned and collected into microcentrifuge tubes.
Mass spectrometry methods have been described in detail (29,30). In summary, proteins from cell line or tissue samples were reduced, alkylated with iodoacetamide, and digested with trypsin. The resulting peptides were separated on isoelectric focusing strips that were cut into 15 (for cell lines) or 20 (for human tissues) separate fractions. Each of these fractions was analyzed by a second separation on a liquid chromatography column, followed by MS/MS analysis on an LTQ-Orbitrap. Binary spectral data present in the raw files were converted to the mzML format using the msConvert tool in the ProteoWizard library (v2.0.1757, 01/27/2010) (31).
Search Parameters-We tested our workflow against four popular database search engines: MyriMatch (26) (v1.5.6), Sequest (27) (TurboSEQUEST v27), Mascot (28) (v2.2.04), and X!Tandem (13) (X!TANDEM TORNADO v2008.02.01.3). MyriMatch was used as the primary search engine in this study. All cysteines were assumed to be carbamidomethylated, and methionines were allowed to be oxidized. A precursor error of up to 0.007 m/z was permitted, whereas fragment ions were required to fall within 0.5 m/z of their expected locations. Ambiguous identifications that mapped to three or more peptide sequences with equal scores were excluded. One missed cleavage was permitted and no nonspecific cleavage was allowed. The configurations for all search engines are provided in supplemental File S1.
Genomic Sequence Verification-Genomic DNA from cell lines RKO, SW480, and HCT-116 was isolated using a DNeasy® kit (Qiagen). After identification of putative variant peptides by shotgun proteomics, the corresponding exons encoding the protein sequences were amplified using a HotStarTaq® Master Mix Kit (Qiagen). The following polymerase chain reaction (PCR) conditions were used: 96°C × 15 min, followed by 40 cycles of 95°C × 30 s, 60°C × 30 s, 72°C × 60 s, and a final extension of 72°C × 10 min. A list of all the primers used for the PCR amplifications is provided in supplemental File S2. Excess primers and nucleotides were digested using ExoSAP (USB). Sequencing reactions were performed by using Applied Biosystems Version 3.1 Big Dye Terminator chemistry and then analyzed on an Applied Biosystems 3730XL Sequencer. All sequence chromatograms were read in both forward (F) and reverse (R) directions.

Setup of the Workflow-As illustrated in Fig. 1, our workflow for identifying wild-type and variant peptides based on shotgun proteomics data includes three steps: database creation, peptide identification, and post-processing.
The variation-containing protein sequence database was created based on the Ensembl protein database (human, v53) and the CanProVar database (23). Missense variations, nonsense variations, and single amino acid deletions and insertions were included in the database. Following the naming convention in dbSNP, each cancer-related variation in CanProVar was given an identifier prefixed with "cs." For each single amino acid alteration, the sequence covering the enclosing tryptic peptide and the two flanking tryptic peptides was taken as an independent entry in the FASTA format. Peptide entries with fewer than four residues were excluded because they cannot be confidently identified in shotgun proteomics. Adding the flanking peptides allows for the identification of peptides with missed enzyme cleavage (15). This database construction approach shares the same space-saving advantage as appending sequence variants to the original protein sequence, which was adopted in the study of Schandorff et al. (15). We chose to keep these peptides as independent entries because related variation information can be easily recorded in the sequence header, which includes the corresponding protein ID, the start and end positions of the peptide in the protein, and the identifier of the variation in dbSNP or CanProVar. These new peptide entries resulted in an increase of about 3.4% in the tryptic peptide database size. Our protein sequence database comprised the complete Ensembl protein database (v53, 47,509 entries) and an additional 97,637 peptide entries with variations from 29,873 Ensembl proteins. Among these, 10,254 peptide entries carried cancer-related variations. We named this protein sequence database MS-CanProVar. Reverse sequences were appended as decoy sequences for false discovery rate (FDR) estimation (32). MS-CanProVar can be downloaded at http://bioinfo.vanderbilt.edu/canprovar.
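The entry construction described above can be sketched as follows for a single amino acid substitution. This is an illustration under stated assumptions, not the code used to build MS-CanProVar: the cleavage rule (after K or R, not before P), the choice to digest the mutated rather than the reference sequence, the `>protein|start-end|variation` header layout, and all function names are hypothetical.

```python
def tryptic_peptides(protein):
    """In-silico digest: cleave after K or R unless the next residue is P
    (a common trypsin rule, assumed here)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def variant_entry(protein_id, protein, pos, new_aa, var_id, min_len=4):
    """Build one FASTA entry covering the tryptic peptide that encloses the
    substitution at 1-based position `pos`, plus its two flanking peptides."""
    if not 1 <= pos <= len(protein):
        raise ValueError("position outside protein")
    mutated = protein[:pos - 1] + new_aa + protein[pos:]
    peptides = tryptic_peptides(mutated)
    offset = 0
    for idx, pep in enumerate(peptides):
        if offset < pos <= offset + len(pep):
            break  # idx now points at the enclosing peptide
        offset += len(pep)
    first, last = max(idx - 1, 0), min(idx + 1, len(peptides) - 1)
    start = sum(len(p) for p in peptides[:first]) + 1
    seq = "".join(peptides[first:last + 1])
    if len(seq) < min_len:
        return None  # too short to be identified confidently
    end = start + len(seq) - 1
    return f">{protein_id}|{start}-{end}|{var_id}", seq


def decoy(seq):
    """Reversed sequence used as a decoy entry."""
    return seq[::-1]


# Toy protein assembled from made-up tryptic peptides; the identifiers are placeholders.
toy = "PEPTIDEK" + "MTEYK" + "LVVVGAGGVGK" + "SALTIQR" + "TAILK"
header, seq = variant_entry("TOY1", toy, 21, "D", "cs_demo")
```

Keeping each variant as its own short entry, rather than storing a full mutated protein, is what keeps the growth in database size small while still covering one missed cleavage on either side of the altered peptide.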
After creation of the database, shotgun proteomics data from a cancer sample can be searched against the database using a database search engine (Fig. 1A). The next important step is the confidence evaluation of the peptide identifications, i.e. FDR estimation. It has been suggested that a higher risk of false positives could be associated with variant peptide identifications as compared with that for wild-type ones (14). In order to systematically investigate this problem, we searched the SW480 data set against MS-CanProVar with MyriMatch and used the standard FDR estimation method (32) with no special treatment of variant identifications. Peptides with an FDR < 0.05 were separated into a wild-type group and a variant group, and the score distributions of these two groups were plotted (Fig. 2A). The score distribution for the variant group showed a significant shift toward the low-score end. Similar results were observed in data from other cancer cell lines (data not shown). These results suggest that although the two groups of peptides were identified using the same FDR cutoff, the variant group does have a higher risk of false positive identifications. Follow-up genomic analysis further confirmed this concern. As shown in Table I, among the 11 putative variant peptides randomly chosen with FDR < 0.1, only six were confirmed with genomic sequencing. With a threshold of FDR < 0.05, the confirmation rate was six of nine.
For FDR estimation, all forward sequences are considered as expressed and present. Nevertheless, for a specific sample, only some of the forward sequences are expressed. Moreover, the proportion of expressed sequences among all variant sequences in the database is expected to be significantly lower than that among all wild-type sequences, i.e. variant sequences are expected to have a lower prior probability of being present in a specific sample. However, the FDR estimation is for all matches above the selected score threshold without discriminating wild-type from variant sequences. Therefore, the combined FDR estimation will lead to a higher false negative rate for wild-type peptide identifications and a higher false positive rate for variant peptide identifications. This might not be a big problem for wild-type identifications because variant sequences only comprise a small fraction of the database. Nevertheless, when we consider only variant peptide identifications, the real FDR for the subgroup could be much higher than the combined estimation.
To address this problem, we first estimated the FDRs for wild-type and variant peptides separately. Specifically, only variant peptides and corresponding decoys were considered for the FDR estimation of variant peptide identifications. With this naïve separate FDR estimation, more stringent score cutoffs were set for the variant peptides than for the wild-type ones in most of the experimental runs, and the risk of high false positives for the variant group was reduced according to the score distribution plot (Fig. 2B). However, in some experimental runs, a lower search score cutoff was set for the variant peptides than for the wild-type ones. Indeed, for variant identifications, we found that the search score cutoff corresponding to a preselected FDR level (e.g. 0.05) varied dramatically across the experimental runs, a phenomenon we did not see in the wild-type searches. This may be due to the small number of matches found among the variant peptides and variant decoys. In the target-decoy search strategy for FDR estimation, one can estimate the total number of false positives that meet a specific score threshold by doubling the number of selected decoy matches. This represents the number of observed incorrect decoy matches, combined with the hidden incorrect target matches. When the number of total matches is very low for a given subset of peptides, the estimate of false positives becomes highly variable. As a result, no improvement in the genomic confirmation rate was observed (Table I).
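The instability of the naïve separate estimate can be illustrated with a small sketch of the target-decoy calculation described above. The estimator form (twice the decoy count divided by all matches passing the threshold) is a commonly used convention and is assumed here; the function name and the toy counts are hypothetical.

```python
def decoy_fdr(scores, is_decoy, threshold):
    """Estimate FDR among matches scoring >= threshold.

    False positives are estimated by doubling the number of decoy matches
    (observed decoy hits plus an equal number of hidden incorrect target hits)."""
    kept = [d for s, d in zip(scores, is_decoy) if s >= threshold]
    if not kept:
        return 0.0
    return min(1.0, 2.0 * sum(kept) / len(kept))


# Toy illustration: with only 10 variant matches, a single decoy hit moves the
# estimate from 0.0 to 0.2, whereas the same hit barely changes a set of 1000.
small_without = decoy_fdr([1.0] * 10, [False] * 10, 0.5)
small_with = decoy_fdr([1.0] * 10, [True] + [False] * 9, 0.5)
large_with = decoy_fdr([1.0] * 1000, [True] + [False] * 999, 0.5)
```

Because the variant subset contributes so few matches, the score cutoff that corresponds to a fixed FDR level can jump from run to run, which is the behavior reported for the naïve separate estimation.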
To achieve a more robust estimation of the total number of false positives, we proposed to combine information based on decoys from both variant and wild-type sequences and calculate FDR for variant identifications based on the following formula: Here, R and F_v are the numbers of reverse matches and forward variant matches above the score threshold, respectively. R^- and R_v^- are the numbers of reverse matches and variant reverse matches falling below the score threshold, respectively. The R_v^-/R^- ratio provides an estimate of the proportion of variant sequences in the database.
In this formula, the number of false positives in variant identifications is estimated by the total number of false positives and the proportion of variant sequences in the database. Our assumption is that there is no difference between the decoys from wild-type sequences and variant sequences for FDR estimation. This formula may provide a more accurate estimation of the number of false positives for variant peptide identifications because the estimate is based on data that are less subject to variation.
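The formula itself did not survive in this copy of the text, so the following is only a sketch of one form that is consistent with the symbol definitions given above: the number of variant false positives is taken as the number of reverse matches above the threshold (R) scaled by the variant proportion (R_v^-/R^-), and divided by the forward variant matches (F_v). The exact form, and in particular the `decoy_multiplier` argument (1 to use R directly, 2 to follow the decoy-doubling convention), are assumptions rather than the published equation.

```python
def variant_fdr(R, F_v, R_minus, R_v_minus, decoy_multiplier=1.0):
    """Sketch of a refined FDR for variant identifications.

    R          reverse (decoy) matches above the score threshold
    F_v        forward variant matches above the score threshold
    R_minus    reverse matches below the score threshold
    R_v_minus  variant reverse matches below the score threshold
    decoy_multiplier  assumed scaling of R into a false-positive count
    """
    if F_v == 0 or R_minus == 0:
        return 0.0  # nothing to evaluate, or no basis for the proportion
    variant_fraction = R_v_minus / R_minus  # share of variant sequences
    false_variant = decoy_multiplier * R * variant_fraction
    return min(1.0, false_variant / F_v)


# Hypothetical counts, for illustration only.
example = variant_fdr(R=50, F_v=20, R_minus=1000, R_v_minus=30)
```

The point of this construction is that both R and the ratio R_v^-/R^- are computed from large numbers of matches, so the estimate fluctuates less than one based on the handful of variant decoys that pass the threshold.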
The score distribution plot showed that the new FDR estimation could significantly improve the confidence of variant peptide identifications (Fig. 2C). Genomic sequencing verification for the detected variations also showed that the new FDR estimation method outperformed both combined and naïve separate FDR estimation. As shown in Table I, variant peptide identification based on the new method achieved a confirmation rate of six of seven, an improvement as compared with the rate of six of nine obtained with combined or naïve separate FDR estimation. Moreover, although the seventh mutation, ABCF1 N198D, was not confirmed at the genomic level, this change may actually happen after translation through the deamidation of the asparagine residue. On the basis of these verification results, the new refined FDR estimation approach for variant peptide identifications was employed in our workflow (Fig. 1B). In the last step of the workflow, both wild-type and variant peptides are identified based on the refined separate FDR estimation and an easily interpretable report is generated (Fig. 1C).
Application on Human Cancer Data Sets-With the procedure described above, we performed database search and peptide identification for three data sets from colorectal cancer cell lines RKO, HCT-116, and SW480, respectively. MyriMatch was used as the search engine, and the FDR threshold was set to 0.05 for both wild-type and variant peptides. Thus, 6284, 9145, and 20,023 unique peptides were identified in SW480, RKO, and HCT-116 samples respectively, which were mapped to 1148, 1784, and 2927 indiscernible protein groups using IDPicker (33,34). The number of variant peptides was 20, 27, and 34 for SW480, RKO, and HCT-116 respectively (supplemental Files S3 and S4), corresponding to 0.3%, 0.3%, and 0.2% of all peptides identified in each cell line.
We randomly selected 10 and nine putative variant peptides from the RKO and HCT-116 data sets for genomic sequencing verification, and the confirmation rates were eight of 10 and nine of nine, respectively. Combined with the genomic sequencing results for SW480, the overall confirmation rate for all three cell lines was 88% (23/26). A complete list of the variant peptides and associated information can be found in supplemental File S3. In the HCT-116 data set, we detected a variation G13D in KRAS (Fig. 3). KRAS was one of the first genes identified as a transforming gene (oncogene) capable of driving tumor formation in experimental model systems. The G13D variation is not only a known mutation in the HCT-116 cell line but has also been found in 21% of 335 colorectal tumors in a large-scale mutational profiling study (35).
In addition to the cancer cell lines, we also applied the procedure to three data sets from clinical colorectal tumor specimens. A total of 37,827 unique peptides were identified in these data sets, which were mapped to 5581 indiscernible protein groups using IDPicker. The number of distinct variant peptides detected in these samples was 204 (supplemental Files S5 and S6), corresponding to 0.5% of all identified peptides. As shown in Fig. 4A, 101, 78, and 139 variant peptides were detected in the three patients, respectively. Five peptides carried known cancer-related mutations, and four of them were found in more than one patient (Fig. 4B).
FIG. 2. Search score distributions for the variant (red) and wild-type (green) peptides identified with FDR < 0.05 in the SW480 data set. A, Under regular FDR estimation, an apparent shift to the low-score end was observed in the distribution for variant identifications as compared with that for the wild-type identifications. B, Naïve separate estimation reduced the bias to a certain degree. C, The new refined separate FDR estimation approach proposed in this study further improved the quality of variant peptide identifications.
FIG. 3. Sequence validation of the KRAS G13D identified in the HCT-116 data set. A tandem MS spectrum with m/z 507.304 was identified as peptide LVVVGAGDVGK. The peaks from y4 to y9 ions and b8 to b11 ions indicate a +58 Da mass shift corresponding to the substitution of glycine with aspartic acid. The inset on the top right corner shows genomic sequencing of the PCR product from the region surrounding the mutation with corresponding predicted amino acids. Sequencing revealed a heterozygous point mutation. Consistently, the wild-type peptide LVVVGAGGVGK was also detected in the HCT-116 data set.
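As a quick arithmetic check on the KRAS G13D identification in Fig. 3, the sketch below computes the doubly charged precursor m/z of LVVVGAGDVGK and the mass difference between aspartic acid and glycine from standard monoisotopic residue masses. The helper names are ours, and only the residues needed for this peptide are tabulated.

```python
# Monoisotopic residue masses in Da (only the residues needed here).
MONO = {
    "G": 57.02146,
    "A": 71.03711,
    "V": 99.06841,
    "L": 113.08406,
    "D": 115.02694,
    "K": 128.09496,
}
WATER = 18.01056   # added once per peptide for the terminal H and OH
PROTON = 1.00728


def precursor_mz(peptide, charge):
    """m/z of the protonated peptide at the given charge state."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


variant_mz = precursor_mz("LVVVGAGDVGK", 2)
wild_type_mz = precursor_mz("LVVVGAGGVGK", 2)
gly_to_asp_shift = MONO["D"] - MONO["G"]
```

With these masses the variant peptide comes out at about 507.30 m/z for the 2+ ion, in line with the reported precursor, and the Gly to Asp substitution adds about 58.005 Da to every fragment ion that contains position 13.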
The mutation TP53 P309S was found in two colorectal cancer patients in this study. Mutations in TP53 are the most commonly observed mutations in any cancer-associated gene, with ~50% of all human cancers harboring inactivating mutations in this tumor suppressor gene (36). The majority of TP53 mutations cause increased half-life of a functionally inactive p53 protein, leading to loss of cell cycle control, resistance to programmed cell death (apoptosis), and the capacity of infinite growth (immortality) in cells harboring such mutations. The mutation TP53 P309S has been reported in the SW480 cell line. Inhibiting mutant TP53 (R273H/P309S) expression in SW480 reduces cell proliferation, in vitro and in vivo tumorigenicity, and resistance to anticancer drugs (37,38). The mutation SMARCA4 W764R was also observed in two patients. Although not reported in colorectal cancer previously, this mutation has been reported in lung cancer (39). A variety of other mutations within SMARCA4 (also named BRG1) were found in several cell lines derived from carcinomas of the breast, lung, pancreas, and prostate, and SMARCA4 has been suggested as a drug target for cancer treatment (40). As a subunit of mammalian SWI/SNF chromatin remodeling complexes, SMARCA4 is a critical regulator of TP53 and has been found to be necessary for the proliferation of malignant cells (41).
Test for Compatibility with Multiple Search Engines-To ensure compatibility between our workflow and popular proteomics search engines, we tested the procedure with Sequest, Mascot, and X!Tandem as well as MyriMatch. The output files of these search engines are written in the pepXML format or can be converted into the pepXML format via converters. The variation information for each variant peptide is included in the pepXML files. Currently, there is no specific software available to extract this information. Therefore, we created a tool, CanProVar-Parser, that can be used to estimate FDR, perform identification for both variant and wild-type peptides, and parse peptide information for reporting. Peptide-related information in a report includes protein mapping, variations, spectral count, FDR value, match rank, and spectrum source. The CanProVar-Parser is written in Perl and can be downloaded from http://bioinfo.vanderbilt.edu/canprovar. Applying our procedure to the RKO data set identified 27, 22, 29, and 29 distinct variant peptides using MyriMatch, Sequest, Mascot, and X!Tandem, respectively. Twenty-five variant peptides were detected by two or more search engines (Fig. 5). It is not unusual to get moderately overlapping results from different search engines, and integrating results from multiple search engines has been proposed as a way to improve peptide identification (42-44). The ability to use our procedure with different search engines makes it possible to perform this type of integration.
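A simple way to integrate results across engines is to count how many engines report each variant peptide. The sketch below is a generic illustration and is not part of CanProVar-Parser; the function name and the toy peptide lists are hypothetical and do not come from the RKO results.

```python
from collections import Counter


def detected_by_at_least(results, k=2):
    """Return the peptides reported by at least `k` search engines.

    `results` maps an engine name to an iterable of peptide identifiers."""
    counts = Counter(
        pep for peptides in results.values() for pep in set(peptides)
    )
    return {pep for pep, n in counts.items() if n >= k}


# Made-up identifiers, for illustration only.
toy_results = {
    "MyriMatch": ["pepA", "pepB", "pepC"],
    "Sequest": ["pepA", "pepC"],
    "Mascot": ["pepA", "pepD"],
    "X!Tandem": ["pepB", "pepD", "pepE"],
}
consensus = detected_by_at_least(toy_results, k=2)
```

Requiring agreement between two or more engines trades some sensitivity for confidence, which is one of the integration strategies the cited studies propose.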
Comparison with Exhaustive Search-based Methods-Relying on exhaustive search for all amino acid substitutions that can arise from single base nucleotide substitutions in each protein, the error tolerant search in Mascot (12) and the refinement search in X!Tandem (13) allow the detection of variant peptides without using existing information on genomic sequence variations. To benchmark our method, we performed analysis on the RKO data set using these exhaustive search-based methods and the Ensembl protein database. We controlled the FDR at a 5% level for the wild-type peptide identifications. For variant peptide identifications, we followed the suggestion from Mascot (http://www.matrixscience.com/ help/error_tolerant_help.html): (a) they must have scores of at least the identity threshold for wild-type identifications; and (b) they must have scores in excess of the highest scoring match to the wild-type sequences. Accordingly, the error-tolerant First, we compared the variant peptides identified by exhaustive search using the two search engines and found very limited overlap (Fig. 6). Specifically, 95 and 90% of the identifications were unique to Mascot and X!Tandem, respectively. In contrast, the overlap between the two search engines using our method was much higher, with only 27% nonoverlapping identifications for both search engines. The extremely small percentage of overlap between exhaustive search results from the two search engines raises a concern of potentially high false positive rates (i.e. low specificity). Nevertheless, exhaustive search-based methods identified many more variant peptides than our method even if we only considered common identifications from the two search engines, suggesting a possibly higher sensitivity of these methods.
To gain some insight into the sensitivity of the exhaustive search-based methods, we compared their variant identifications against the eight variant peptides detected by MyriMatch in the RKO cell line and confirmed by genomic sequencing (supplemental File S3). Surprisingly, despite the large numbers of variant peptide identifications, for each search engine, only three out of the eight peptides were identified (Fig. 6). In contrast, with our method, seven out of the eight were identified by both Mascot and X!Tandem. Although a conclusion cannot be made based on the limited number of true positives, these results do not provide evidence for a superior sensitivity of the exhaustive search-based methods.

DISCUSSION
We have created a protein variation-containing database, MS-CanProVar, and a workflow for the simultaneous identification of wild-type and variant peptides based on the database. A novel FDR estimation method was introduced in the workflow to ensure high reliability of the variant identifications.
Hundreds of variant peptides were identified from three colorectal cancer cell lines and three tumor specimens used in this study. Most of the variants were derived from the dbSNP database and are likely to represent polymorphisms. Whether these polymorphisms are associated with cancer will require large-scale association studies. Some known cancer-related mutations have been identified, including those associated with cell proliferation, tumorigenesis, and drug resistance.
A major concern about the use of variation-containing databases for shotgun proteomics data searching is the high risk of false positive identifications (14). In this study, we systematically investigated this risk by comparing the search score distributions of wild-type and variant peptide identifications and proposed a modified FDR estimation method to automatically handle this issue. By contrast, existing studies require manual selection of a more stringent threshold for variant peptide identifications (14). It is also worth mentioning that in our workflow, although FDR estimations were carried out separately for variant and wild-type peptide identifications, the database search was done at the same time, similar to Schandorff et al. (15). In the study by Bunger et al. (14), separate searches were performed for the reference and variant databases. When a variant database is searched separately, a best match to a variant peptide may result from the absence of competition from the wild-type protein that is actually present.
Genomic sequencing was used to provide an objective evaluation of the reliability of the peptide variants identified using our workflow, and confirmation rates of around 90% were achieved. Besides false discoveries generated by the workflow, inconsistency between proteomics identifications and genomic sequencing results can also be explained by mass shifts due to various peptide modifications (14). For example, the alteration ABCF1 N198D detected in the SW480 data set might be due to deamidation, as this alteration was not confirmed by genomic sequencing. Oxidation (+16), formylation (+28), and acetylation (+42) are other common modifications on peptides (14). Discerning whether a mass shift has resulted from a sequence variation or post-translational modification may require sequencing for confirmation. For example, although the mass shift in MYBBP1A Q8E could be explained by the deamidation of the glutamine residue, this alteration was confirmed at the genomic level (Table I).
As pointed out by Schandorff et al. (15), searching against variation-containing protein databases should provide a new dimension to clinical proteomics projects. In cancer care, detection of expressed mutant peptides and proteins in individual patients by proteomics techniques may have an important impact on the development of personalized medicine. Although only five known cancer-related mutations were detected in the tumor specimens from three colorectal cancer patients, each patient showed a specific mutation pattern (Fig. 4B). These mutation patterns provide both germline and somatic mutation information at the proteome level that could potentially facilitate personalized cancer care. Compared with exhaustive search-based methods, limiting protein variants to those derived from known coding SNPs and mutations effectively reduces the search space and should thus lead to more reliable identifications. However, this advantage comes with a major limitation: dependence on known genomic sequence variations. The number of known cancer-related mutations detected in this study was moderate. Although this can be partially explained by the potentially low stability of mutated proteins, a more obvious explanation is the limited database coverage. Cancer-related mutations in the current CanProVar database are distributed highly unevenly across human proteins. Most proteins have very few cancer-related mutations, whereas some well-known cancer genes, such as TP53, CTNNB1, and PIK3CA, have reported mutations at many positions in their protein sequences. More than a hundred different cancer-related mutations have been reported in these proteins. This bias might be explained by the extreme instability of these important cancer genes, but it may also reflect a lack of study of other genes.
Ongoing large-scale cancer genome projects, such as the Cancer Genome Project of the Sanger Institute and The Cancer Genome Atlas project of the National Cancer Institute and the National Human Genome Research Institute, will rapidly expand our knowledge of mutations in human cancers (21,39). We will continuously incorporate results generated from these studies into CanProVar and MS-CanProVar to improve the sensitivity of our analysis workflow.
Although exhaustive search-based methods identified many more variant peptides, evaluation in this study based on limited true positives did not provide evidence for a superior sensitivity of these methods over our workflow. Recently, a sequence tagging-based "de novo" algorithm has been proposed as an attractive alternative for variant peptide identification (45). It will be interesting to perform a thorough comparison of these complementary approaches in order to highlight their distinct values. Moreover, only missense variations, nonsense variations, and single amino acid deletions and insertions were included in MS-CanProVar. Other protein sequence variations such as splice variants and post-translational modifications are also critical in cancer studies and have been detected by shotgun proteomics (46,47). Future work is required to improve our database and workflow for the inclusion of existing knowledge on these variations.
In summary, we have developed a workflow for variant peptide detection in shotgun proteomics studies. The workflow achieves a good balance between reliable variation detection and overall sensitivity of peptide identification. Its compatibility with popular database search engines has been extensively tested, and the reliability of its identifications has been confirmed by genomic sequencing. Applying this workflow to human cancer proteomics studies should provide novel insight into cancer predisposition and potential personalized therapy.