Deep Coverage of the Escherichia coli Proteome Enables the Assessment of False Discovery Rates in Simple Proteogenomic Experiments*

Recent advances in mass spectrometry (MS) have led to increased applications of shotgun proteomics to the refinement of genome annotation. The typical “proteo-genomic” workflows rely on the mapping of peptide MS/MS spectra onto databases derived via six-frame translation of the genome sequence. These databases contain a large proportion of spurious protein sequences which make the statistical confidence of the resulting peptide spectrum matches difficult to assess. Here we performed a comprehensive analysis of the Escherichia coli proteome using LTQ-Orbitrap MS and mapped the corresponding MS/MS spectra onto a six-frame translation of the E. coli genome. We hypothesized that the protein-coding part of the E. coli genome approaches complete annotation and that the majority of six frame-specific (novel) peptide spectrum matches can be considered as false positive identifications. We confirm our hypothesis by showing that the posterior error probability distribution of novel hits is almost identical to that of reversed (decoy) hits; this enables us to estimate the sensitivity, specificity, accuracy, and false discovery rate in a typical bacterial proteo-genomic dataset. We use two complementary computational frameworks for processing and statistical assessment of MS/MS data: MaxQuant and Trans-Proteomic Pipeline. We show that MaxQuant achieves a more sensitive six-frame database search with an acceptable false discovery rate and is therefore well suited for global genome reannotation applications, whereas the Trans-Proteomic Pipeline achieves higher specificity and is well suited for high-confidence validation. The use of a small and well-annotated bacterial genome enables us to address genome coverage achieved in state-of-the-art bacterial proteomics: identified peptide sequences mapped to all expressed E. coli proteins but covered 31.7% of the protein-coding genome sequence. 
Our results show that false discovery rates can be substantially underestimated even in “simple” proteo-genomic experiments obtained by means of high-accuracy MS and point to the necessity of further improvements concerning the coverage of peptide sequences by MS-based methods.

MS-based proteomics has become an indispensable tool for studying in vivo protein expression on a global scale (1). Briefly, in a typical "shotgun" proteomic experiment, the whole proteome of an organism is extracted and digested by a protease (e.g. trypsin). The resulting complex peptide mixtures are usually further fractionated and separated via liquid chromatography (LC) before ionization and analysis in the mass spectrometer. Recent innovations in MS technology (2-4) enable high peptide sequencing rates with high mass accuracy and sensitivity, placing the routine analysis of entire proteomes within reach (5,6).
Modern genome annotation uses computational ab initio approaches to predict coding regions and gene models from raw sequencing data (7,8). As the ultimate evidence of gene expression is the detection of its product, transcriptomic data are commonly used to train gene prediction algorithms (9). Similarly, MS-based proteomics is increasingly used in genome annotation. In a typical proteo-genomics experiment, MS/MS spectra of peptides are searched against databases derived via in silico six-frame translation of the whole genome sequence (10-14). This approach has been applied, alone or in combination with transcriptomic data, in order to refine genome annotation in several organisms, including C. elegans (15), P. pacificus (16), S. cerevisiae (17), S. pombe (18), A. thaliana (19), S. nodorum (20), T. gondii (21), A. gambiae (22), mouse (23), and human (24,25). Bacteria are especially well suited for MS-assisted genome annotation because of their relatively simple genome structures and small genome sizes, which lead to overall better sequence coverage in a typical proteomics experiment (26-33).
The use of six-frame databases in proteo-genomics experiments is challenging because of their large sizes, which increase the search space and reduce the sensitivity of database searches (34). Additionally, these databases contain a high proportion of artificial sequences resulting from frames that are not transcribed (13,35). These spurious protein sequences are difficult to discriminate from the true protein sequences, which makes the statistical confidence of the resulting peptide spectrum matches (PSMs) difficult to calculate.
Here we take advantage of the small size (4.6 Mb), simple architecture, and high annotation level of the Escherichia coli genome and use it as a benchmark model for proteo-genomic data interpretation. We derive a comprehensive dataset of proteins expressed in the exponential growth of Escherichia coli and map the corresponding MS/MS spectra onto a six-frame translation of the E. coli genome. We hypothesize that the protein-coding part of the E. coli genome approaches complete annotation, and we consider six frame-specific (novel) PSMs as wrongly identified. This enables us to estimate the factual false discovery rate in a simple proteo-genomic experiment. We show that the posterior error probability (PEP) distribution of novel peptides is almost identical to that of decoy (reversed) hits, which validates our assumption and points to the accumulation of false positive PSMs within novel peptide identifications. Our dataset comprises 2600 E. coli proteins, approaching the identification of the complete proteome expressed during exponential growth (36), but covers only 31.7% of the protein-coding genome sequence.

EXPERIMENTAL PROCEDURES
Bacterial Cell Culture-Wild-type E. coli strain K12 (isolate BW25113) (37) was inoculated in 5 ml lysogeny broth (Luria/Miller) medium and grown at 37°C under vigorous shaking for 24 h (A600 = 1.9); 1 ml of the stationary culture was then spun down at 260 × g for 10 min in order to remove any remaining Luria/Miller medium. The bacterial cells were washed twice with M9 minimal medium consisting of M9 salts (6.78 g/l Na2HPO4, 3 g/l KH2PO4, 0.5 g/l NaCl, 1 g/l NH4Cl, Sigma-Aldrich) supplemented with 0.5% (w/v) glucose, 33 µM thiamine, 1 mM MgSO4, and 0.1 mM CaCl2. Next, the resultant pellet was resuspended in a final volume of 1 ml M9. Immediately after, 5 µl of this culture were used to inoculate 5 ml of fresh M9 medium containing 0.25 mg/ml of lysine (Sigma-Aldrich). Overnight minimal medium cell cultures were grown at 37°C under vigorous shaking to an A600 = 0.5 and used to inoculate (1:100 dilution) 125 ml of fresh minimal medium containing 0.25 mg/ml lysine. The cell cultures were grown to A600 = 0.5, harvested via centrifugation at 3345 × g for 10 min, washed with phosphate buffered saline, and snap-frozen in liquid nitrogen.
Protein Extraction-The frozen cell pellets were resuspended in 3 to 5 ml lysis buffer (pH 7.5) containing 2 mg/ml lysozyme (Sigma-Aldrich) in 50 mM Tris/HCl buffer, 1 mM EDTA, and 5 mM of each of the following phosphatase inhibitors: glycerol-2-phosphate, sodium fluoride (Sigma-Aldrich, Karlsruhe, Germany), and sodium orthovanadate (Alfa Aesar). Cell wall lysis was performed at 37°C for 15 min, and DNA was comminuted by benzonase (1875 U) (Merck) for an additional 10 min. For the solubilization of membrane proteins, lithium dodecylsulfate (Sigma-Aldrich) was added to a final concentration of 1% (w/v) and samples were incubated at 37°C under vigorous shaking for 15 min. Cell debris was removed via centrifugation at 3345 × g for 5 min and repeated centrifugation of the supernatant at 11,300 × g for 10 min. The crude protein extract was methanol/chloroform precipitated, and the protein precipitates were redissolved in denaturation buffer containing 6 M urea/2 M thiourea in 10 mM Tris buffer. For estimation of the protein concentration, each extract was measured via Bradford assay (Bio-Rad).
SDS-PAGE and In-gel Digestion-In-gel digestion was performed as previously described (16). Briefly, extracted proteins were separated on a NuPage Bis-Tris 4-12% gradient gel (Invitrogen). The gel was stained with Coomassie Blue and subsequently cut into 15 slices. The resulting gel pieces were destained by being washed three times with 10 mM ammonium bicarbonate (ABC) and acetonitrile (ACN) (1:1, v/v). Proteins were then reduced with 10 mM dithiothreitol (DTT) in 20 mM ABC for 45 min at 56°C and alkylated with 55 mM iodoacetamide in 20 mM ABC for 30 min at room temperature in the dark. After being washed two times with 5 mM ABC and one time with ACN, the gel pieces were dehydrated in a vacuum centrifuge. Proteins were digested with either trypsin (Promega, Fitchburg, WI) or Lys-C (Wako, Neuss, Germany) (12.5 ng/µl in 20 mM ABC) at 37°C overnight. The resulting peptides were extracted in three subsequent steps with the following solutions: (i) 3% TFA in 30% ACN, (ii) 0.5% acetic acid in 80% ACN, and (iii) 100% ACN. After evaporation of the ACN in a vacuum centrifuge, peptide fractions were desalted using Stage-Tips (38).
In-solution Digestion-Protein extracts were reduced for 1 h at room temperature with 1 mM DTT and subsequently alkylated with 1 mM iodoacetamide for 1 h at room temperature in the dark. Proteins were pre-digested with Lys-C (1:100 w/w) for 3 h at room temperature. After dilution with 4 volumes of 20 mM ABC, proteins were digested overnight at room temperature with either trypsin (1:100 w/w) or Lys-C (1:100 w/w).
Off-gel Isoelectric Focusing-Peptides derived from the in-solution digestion were separated according to their isoelectric point using the 3100 OffGel fractionator (Agilent, Santa Clara, CA) following the manufacturer's instructions. Peptide mixtures were separated into 12 fractions using 13-cm Immobiline DryStrips with a pH 3-10 gradient (GE Healthcare). Separation was performed at a maximum current of 50 µA until 50 kVh were reached. Peptide fractions were acidified with acidic solution (30% ACN, 5% acetic acid, and 10% TFA in water) and desalted using Stage-Tips.
Strong Anion Exchange Chromatography-Peptides from the in-solution digestion were desalted using solid phase extraction. Strong anion exchange chromatography was performed as described elsewhere (39). Briefly, desalted peptides were loaded at pH 11 onto an anion exchange column containing six layers of Empore/Disk Anion Exchange tip inner diameter (New Objective, Woburn, MA) packed in-house with reversed-phase ReproSil-Pur C18-AQ 3-µm resin (Dr. Maisch GmbH, Ammerbuch-Entringen, Germany). Peptides were injected into the column with solvent A (0.5% acetic acid) at 700 nl/min using a maximum pressure of 280 bar. Peptides were then eluted using an 81-min or a 221-min segmented gradient of 5%-50% solvent B (80% ACN in 0.5% acetic acid) at a flow rate of 200 nl/min. The mass spectrometer was operated in data-dependent mode. Survey full scans for the MS spectra were recorded between 300 and 2000 Thomson at a resolution of 60,000 with a target value of 1E6 charges. The 15 most intense peaks from the survey scans were selected for fragmentation with collision-induced dissociation at a target value of 5000 charges. The fragment spectra were recorded in the linear ion trap. Selected masses were included in a dynamic exclusion list for 90 s.
MS Data Processing-Acquired MS data were preprocessed by MaxQuant (v.1.2.2.9) (40) in order to generate peak lists that could be submitted to a database search. Derived peak lists were submitted to the Andromeda (41) and Mascot v2.2.0 (Matrix Science, London, UK) search engines to query the genome database translated into all six reading frames. The genome sequence of E. coli (42,43) was downloaded from the NCBI homepage (accession number NC_000913.2). The translation into all six reading frames was done from stop codon to stop codon by applying the bacterial and plant plastid code (translation table 11) using the Transeq tool that is part of the Emboss software package (44). We required a minimal length of six amino acids for each resulting putative open reading frame (ORF), which corresponds to the minimal peptide length that we required in the database search. To that database we added decoy sequences using the SequenceReverse.exe tool shipped with MaxQuant software. The resulting database consisted of 263,159 putative ORFs, 248 commonly observed lab contaminants, and 263,407 reversed sequences.
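The database construction described above can be illustrated with a minimal Python sketch. It is not the Transeq or SequenceReverse.exe implementation; the function names are hypothetical, ambiguous codons are translated as X by assumption, and genomic coordinates are not tracked.

```python
from itertools import product

# Amino acids in NCBI order for bases T, C, A, G; translation table 11
# assigns the same amino acids to codons as the standard code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X' (assumption)."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(genome, min_len=6):
    """Stop-to-stop ORFs of at least min_len residues in all six frames."""
    genome = genome.upper()
    orfs = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


def add_reversed_decoys(orfs):
    """Build decoy entries by reversing each ORF sequence."""
    return [(strand, frame, protein[::-1]) for strand, frame, protein in orfs]
```

Splitting each translated frame at stop codons yields the stop-to-stop entries, and the length filter mirrors the six-residue minimum used in the search.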
A database search was performed with the precursor mass tolerance set to 6 and 7 ppm for Andromeda and Mascot database searches, respectively. The fragment ion mass tolerance was set to 0.5 Da for both search engines. Full enzyme specificity for trypsin and Lys-C was required, and up to two missed cleavages were allowed. Oxidation of methionine and protein N-terminal acetylation were defined as variable modifications, and carbamidomethylation of cysteine was defined as a fixed modification.
The resulting lists of PSMs were further processed by MaxQuant and Trans-Proteomic Pipeline (v4.5 RAPTURE rev 0) (45). Andromeda database scores calculated by MaxQuant were converted to PEPs as described in Ref. 41. We calculated q-values by sorting the PSMs by their PEPs in ascending order. For each PSM we calculated the ratio between the number of decoy hits and the number of target PSMs having PEPs below the PEP of the actual PSM. Mascot result (.dat) files were converted to pepXML format and further processed by the PeptideProphet (45) module as part of the Trans-Proteomic Pipeline (TPP). We used the accurate mass binning option, excluded singly charged peptides, and used decoy hits to model the score distribution of false positives for semi-supervised mixture modeling. The false discovery rate (FDR) was controlled by filtering PSMs according to the probability assigned by PeptideProphet. The corresponding probability threshold was calculated by the calctppstat.pl perl script as part of the TPP, and the "Approx. P threshold for FDR" was used to filter the list of PSMs.
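The q-value calculation from PEP-sorted PSMs can be sketched as follows. This is a minimal illustration, not the MaxQuant code: the function name and input layout are hypothetical, the counts include the current PSM, and the running minimum that makes q-values monotone is an assumption taken from the usual q-value definition.

```python
def target_decoy_q_values(psms):
    """psms: iterable of (pep, is_decoy). Returns (pep, is_decoy, q) sorted by PEP."""
    ranked = sorted(psms, key=lambda psm: psm[0])
    decoys = targets = 0
    ratios = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Decoy/target ratio among all PSMs scoring at least as well as this one.
        ratios.append(decoys / targets if targets else 1.0)
    # Assumption: q-value = smallest ratio at this rank or any worse rank, capped at 1.
    q_values = []
    running_min = 1.0
    for ratio in reversed(ratios):
        running_min = min(running_min, ratio)
        q_values.append(running_min)
    q_values.reverse()
    return [(pep, is_decoy, q) for (pep, is_decoy), q in zip(ranked, q_values)]
```

Filtering the returned list at q <= 0.01 corresponds to a 1% FDR threshold under these assumptions.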
Acquired MS data were additionally searched against a recent annotation of the E. coli genome (UniProt reference proteome set; downloaded on January 18, 2012; 4309 protein entries) using MaxQuant v1.2.2.9 operating with the same database search parameters as described above. FDRs on peptide and protein group levels were set at 1%.
Proteo-genomic Workflow-Detected peptide sequences that resulted from searching the six-frame database were matched to the UniProt E. coli proteome database using BLASTP (Blast 2.2.25ϩ) (46,47) to check whether they mapped to annotated proteins. We chose UniProt as the database because it offers a comprehensive and unified resource of protein sequences and all 4309 E. coli proteins are part of the Swiss-Prot section, which is manually annotated and reviewed. Therefore, this database should represent a high-quality annotation of the theoretical E. coli proteome. All peptides that produced a perfect match in the UniProt E. coli database were considered as annotated. In order to retrieve the genomic coordinates of detected peptides, we mapped their sequences to the genome database using TBLASTN. Because the BLAST algorithms are not optimized to find all occurrences of small sequences, we set the maximal E-value to 1E4 and the number of alignments to 20 in order to ensure that the typically short peptide sequences could be found in the genome and proteome databases. To map these peptides unambiguously, we required a full-length alignment and 100% similarity. Multiple occurrences of the same peptide in the genome or proteome were considered separately. All peptides that did not produce a perfect match in the proteome database were defined as initial candidates in the list of novel peptides. To address the ambiguity of leucine and isoleucine, the initial set of novel peptides was checked once again using regular expression matching. Peptide sequences that could not be found in the proteome database because of any isobaric amino acids were removed from the initial set of novel peptides. In a second BLAST iteration, all six-frame ORFs that were detected by one or more novel peptides were matched to the proteome database as well as the non-redundant protein database (NCBI nr). 
In addition, we resubmitted the spectra of novel peptides to query the NCBI nr database using the Mascot search engine in order to check the consistency between PSMs derived from searching the six-frame translation and NCBI nr database. Together with the genome coordinates of the peptides and the annotated proteins, we used this information to classify the novel peptides into different types of annotation conflicts.
The proteo-genomic pipeline and further downstream data analysis were implemented in R v2.13 (48).
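The leucine/isoleucine check by regular expression matching can be sketched as below. The original pipeline was written in R; this Python version is only an illustration with hypothetical function names, and it treats only I and L as indistinguishable.

```python
import re


def isobaric_pattern(peptide):
    """Regular expression in which every I or L matches either residue."""
    return re.compile("".join("[IL]" if aa in "IL" else re.escape(aa) for aa in peptide))


def is_annotated(peptide, proteome):
    """True if the peptide occurs in any protein, ignoring I/L differences."""
    pattern = isobaric_pattern(peptide)
    return any(pattern.search(sequence) for sequence in proteome.values())


def filter_novel(candidates, proteome):
    """Keep only candidates that are absent from the annotated proteome."""
    return [peptide for peptide in candidates if not is_annotated(peptide, proteome)]
```

For example, a candidate VISLR would be removed from the novel list if the proteome contains VLSIR, because the two sequences cannot be distinguished by mass.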
Calculation of Protein Abundances-We implemented the exponentially modified protein abundance index (PAI) (49) to estimate protein abundance. Briefly, the exponentially modified PAI is defined as 10^PAI − 1, with the PAI (50) being the number of observed peptides divided by the number of observable peptides per protein. For the calculation of exponentially modified PAI values for our dataset, we focused on the tryptic part of our dataset, which comprised all detected proteins. The mass range used to define observable peptides was set from 600 Da to 6000 Da.
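A minimal sketch of this calculation is given below. The digestion rule (cleavage after every K or R, no missed cleavages, no modifications) and all function names are assumptions made for illustration; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da; modifications are ignored in this sketch.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence):
    """Cleave after every K or R (assumption: no proline rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def observable_count(sequence, min_mass=600.0, max_mass=6000.0):
    """Number of tryptic peptides whose mass falls in the observable range."""
    return sum(1 for p in tryptic_peptides(sequence) if min_mass <= peptide_mass(p) <= max_mass)


def empai(observed, observable):
    """Exponentially modified PAI: 10^(observed / observable) - 1."""
    if observable <= 0:
        raise ValueError("protein has no observable peptides")
    return 10 ** (observed / observable) - 1
```

A protein with 5 observed out of 10 observable peptides would receive a value of 10^0.5 − 1, or about 2.16.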
Functional Annotation Analysis-Gene Ontology annotation of the E. coli K12 proteome was derived from the Gene Ontology Annotation (UniProt-GOA) database (51), downloaded on February 29, 2012. We applied a two-sided hypergeometric test to see whether specific annotation terms were significantly enriched or depleted among the set of proteins of interest. Derived p values were further adjusted to address multiple hypothesis testing using the method proposed by Benjamini and Hochberg (52). The adjusted p values were ±log10 transformed and visualized by the function heatmap.2 that is part of the R package gplots.

RESULTS
We derived a comprehensive dataset of E. coli proteins by harvesting the cells in the exponential phase of growth, extracting the proteome, and applying three separation methods (strong anion exchange chromatography, off-gel isoelectric focusing, and gel-based LC-MS) in combination with protein digestion using two proteases, trypsin and Lys-C. We analyzed the resulting peptide mixtures via nano-LC-MS on an LTQ Orbitrap Velos mass spectrometer. We measured the precursor (peptide) ion masses at high resolution and mass accuracy in the Orbitrap analyzer while performing peptide fragmentation and fragment ion measurement at low resolution in the linear ion trap analyzer. In total, we acquired 1,941,724 mass spectra in about 6 days of measurement time. The average absolute mass accuracy of the identified PSMs was 0.34 ppm, and 99% of the PSMs were measured within 1.8 ppm, which enabled us to use narrow (up to 7 ppm) precursor mass tolerance windows during database searches. We mapped these spectra onto the six-frame translation of the raw genome sequence in order to assess sensitivity, specificity, accuracy, and the factual FDR in a typical bacterial proteo-genomic experiment (supplemental Fig. S1). Separately, we mapped the spectra to the annotated genome sequence (UniProt reference E. coli proteome database) to assess genome coverage by detected peptide sequences.
Assigning Statistical Confidence to Six-frame Database Search Results-The translation of the E. coli genome sequence from stop codon to stop codon resulted in 263,159 putative ORFs, which were generally short database entries with a median length of 20 amino acids; details about the six-frame protein database used are summarized in supplemental Fig. S2. Most of these ORFs represent spurious sequences, as usually only one reading frame at a given locus is translated; this means that on average, five out of six sequences are artificial database entries. To increase confidence in the interpretation of proteo-genomic data analysis, we used two common workflows for processing and statistical assessment of MS/MS data: MaxQuant (40), based on the Andromeda search engine (41) and target-decoy approach (TDA) for FDR estimation (53,54), and TPP (55), used with the Mascot search engine (Matrix Science) and mixture model approach (MMA) for FDR estimation (45). Searching the acquired MS data against the six-frame database using Mascot and Andromeda search engines and controlling the FDR at 1% yielded markedly different numbers of identified MS/MS spectra and peptide sequences (Table I, supplemental Tables S1 and S2). The application of the Mascot search engine in combination with the MMA identified almost 24% fewer peptide sequences and 48% fewer MS/MS spectra at the same FDR than the Andromeda search engine in combination with the TDA. This was not surprising, as the more conservative character of the MMA in controlling FDR was reported previously (35,56).

False Positive Identifications Accumulate among Novel Peptide Hits-We investigated whether identified peptide sequences were present in the annotated protein-coding portion of the genome. Peptides that could not be assigned to any annotated protein in the UniProt E. coli database were referred to as six-frame-specific or novel peptides. Assuming that the protein-coding part of the E. coli genome approaches complete and correct annotation, these peptides can be considered as false positives and can be used to assess the performance of the applied proteo-genomic search strategies. In order to validate this hypothesis, we processed the MS data without any control of the FDR and classified the resulting peptide sequences according to whether they were annotated ("target," 44,872 hits), reversed hits ("decoy," 35,370 hits), or novel peptides ("novel," 31,075 hits) (Fig. 1A).
The PEP values of peptides that could be assigned to annotated proteins followed a very tight distribution around a median PEP of 3.26e-6, indicating that a high percentage were true positive identifications. Notably, the absolute numbers of decoy and novel hits were very similar, and the corresponding PEP values followed almost the same distribution, with median PEP values of 0.790 (decoy) and 0.787 (novel) (Fig. 1B). The PEP distribution of the novel hits resembled that of PSMs to a search space that contained a small fraction of correct hits, which pointed to the fact that very few true positive hits were expected to be found among the novel hits in E. coli (i.e. the genome is almost completely annotated). We postulated that if there were still many novel genes/peptides to discover, the proportion of correct hits among the novel peptides would be larger and the PEP distribution of novel and decoy hits would be significantly different. To demonstrate this, we considered a hypothetical case of partial genome annotation in which 20% or 50% of E. coli genes were unknown (i.e. not annotated). We then randomly sampled a subset (80% and 50%, respectively) of the proteins contained in the UniProt E. coli database to define the corresponding "partially annotated" E. coli proteome. We then re-classified the detected peptide sequences into "target" and "novel" (as shown in Fig.  1A) according to the newly defined E. coli "annotation" and visualized their PEP distributions. As expected, we obtained significantly different PEP distributions of novel hits and decoy hits (supplemental Fig. S3). Application of the TPP workflow further confirmed these results (supplemental Fig. S4). Taken together, these findings support our initial hypothesis of almost complete genome annotation of the protein-coding  ); however, their number is negligible compared with the overall number of PSMs and will not significantly affect the overall outcome of our analysis. 
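The partial-annotation control described above can be sketched as follows. This is a hedged illustration only: the function names, the peptide-to-protein mapping, and the fixed random seed are hypothetical, and decoy hits are assumed to be handled separately.

```python
import random


def classify_peptides(peptide_to_proteins, annotated):
    """Label a peptide 'target' if any parent protein is annotated, else 'novel'."""
    return {
        peptide: ("target" if proteins & annotated else "novel")
        for peptide, proteins in peptide_to_proteins.items()
    }


def simulate_partial_annotation(all_proteins, peptide_to_proteins, fraction, seed=0):
    """Keep a random fraction of proteins as the 'annotation' and re-classify peptides."""
    rng = random.Random(seed)
    n_kept = int(round(fraction * len(all_proteins)))
    kept = set(rng.sample(sorted(all_proteins), n_kept))
    return classify_peptides(peptide_to_proteins, kept)
```

With fraction set to 0.8 or 0.5, peptides from the withheld proteins are relabeled as novel, so their PEP distribution can be compared with that of the decoy hits.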
We calculated q-values (57) for all peptides that could be assigned to the existing protein database using the novel peptides as decoy hits and correlated the calculated values to "standard" q-values (Fig. 1C). Overall there was a very high correlation (r = 0.99997) of q-values calculated based on decoy peptides (x-axis) and novel peptides (y-axis). The high correlation between the two distributions of q-values shown in Fig. 1C pointed to the underestimation of target-decoy FDRs, as the novel peptides were not considered in the FDR estimation. In total, 313 peptide sequences passing the default constraint of 1% FDR were specifically found in the genomic six-frame translation and are not annotated according to the UniProt E. coli database. Of all peptide sequences, 68.1% were identified by both workflows, whereas only nine peptides (2.8% of the total novel peptides) were identified as novel in both datasets (supplemental Figs. S5A and S5B). The poor overlap of detected novel peptides can be interpreted as follows: (i) There was a small fraction among novel peptides that were likely true positive identifications. Assuming that the overlap between the two workflows represented true positives, this would imply that 2.8% of the 313 novel peptides were not false positive identifications. (ii) The same assumption implies that the majority of novel peptides (97.2%) were randomly distributed in the two datasets, further supporting our initial hypothesis of an almost complete annotation.
We next focused on the nine novel peptides identified by both data processing workflows. The corresponding PEPs of these peptides were noticeably better than those of other novel peptides (supplemental Fig. S6) and therefore had the greatest likelihood of being correctly identified. Manual inspection of the corresponding MS/MS spectra showed good agreement with the inferred amino acid sequence in eight out of nine cases, which we then classified into potential annotation conflicts, or cases where we found evidence for an erroneous gene model contained in the UniProt E. coli database (Table II). The fact that most of the best-scoring novel peptides were known annotation conflicts and therefore true positive hits pointed to the fact that our calculations were conservative in nature (representing the worst-case scenario). Their presence also points to the substantial number of annotation conflicts even in the simplest genomes. The presence of at least one obvious false positive even in this "golden" set of novel peptides indicated increased FDRs in this part of the dataset. Examples of a novel peptide resulting from a known annotation conflict and a novel peptide resulting from false identification are presented in Fig. 2. All nine novel peptide sequences and their annotation details are presented in supplemental Fig. S7 and supplemental Table S3.
Assessment of Proteo-genomic Workflows-The assumption of approaching a complete annotation of the protein-coding part of the E. coli genome enabled the calculation of various features of the applied proteo-genomic pipeline, such as sensitivity, specificity, accuracy, and FDR. Because the true FDR is unknown, we use the term "factual FDR" (FDR_fact) as an estimate of the true FDR, as discussed in Ref. 58. The general strategy for calculating these values is depicted in supplemental Fig. S8A. An experimental outcome (in our case, the result of the proteo-genomic workflow) is compared with a "gold standard," which in this case is the annotated protein-coding part of the E. coli genome. We classified all peptide sequences returned by MaxQuant and TPP into the four possible contingencies (true positive, false positive, false negative, and true negative) of this comparison (supplemental Fig. S8B). Based on the derived contingency tables, we assessed sensitivity, specificity, accuracy, and the factual FDR as a function of the FDR utilized by both approaches (Fig. 3).
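The quantities derived from such a contingency table follow the standard definitions, sketched below. The function name and the example counts in the usage note are hypothetical and are not values from this study.

```python
def workflow_metrics(tp, fp, fn, tn):
    """Standard performance measures from a 2x2 contingency table."""
    def ratio(numerator, denominator):
        # Return 0.0 for an empty denominator instead of raising.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),            # share of true peptides recovered
        "specificity": ratio(tn, tn + fp),            # share of false candidates rejected
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "fdr": ratio(fp, tp + fp),                    # false share among reported peptides
    }
```

For instance, a table with 90 true positives, 10 false positives, 30 false negatives, and 870 true negatives gives a sensitivity of 0.75 and an FDR of 0.10.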
To assess FDR_fact, we used the number of obvious false positive identifications (decoy plus novel peptides) as an estimate for the number of latent false positives among the list of all detected peptide sequences (true positives plus false positives), equivalent to the TDA. Both workflows demonstrated high specificity and accuracy, which are essential for discriminating false positive from true positive identifications. The sensitivity of the TPP-based workflow was consistently lower (on average 42.3% lower) than that of the MaxQuant workflow across different FDR thresholds. Strikingly, the factual FDR as a function of the decoy FDR utilized by MaxQuant increased linearly, with a constant ratio FDR_fact/FDR_decoy of about 3.5 in our particular study, pointing to the FDR underestimation that occurs when using the TDA. Conversely, the FDR_fact did not approach the probability-based FDR used in the TPP workflow, confirming the conservative character of the MMA. We expect that these properties of the MMA and the TDA will be similar in other proteo-genomics datasets of similar size and complexity and even more pronounced in proteo-genomic analyses of larger and more complex genomes.
Fig. 2. A, schematic representation of the erroneous initiation of the fes gene. Annotated proteins are shown in blue, detected peptides are depicted in black, and six-frame ORFs are shown in green. ORFs that were hit by peptides are shown in dark green. The novel peptide (VGSESWWQSK) was located upstream of the predicted protein N terminus. The corresponding six-frame ORF encompassed the complete sequence and employed the same reading frame as the fes gene. B, corresponding MS/MS spectrum of the novel peptide depicted in A annotated with a comprehensive series of b and y fragment ions. C, schematic representation of a dubious novel peptide identified at a 1% FDR by both data processing workflows used in this study. Although an adjacent cluster of peptides was detected that mapped to the tref gene, the novel peptide (LSIRIQPPK) utilized a different reading frame. D, MS/MS spectrum of the corresponding novel peptide shown in C, poorly annotated with b and y fragment ions.
The Expressed Proteome of E. coli in Exponential Growth Phase-The dataset derived in this study represents one of the most comprehensive proteomics datasets of E. coli. In order to assess proteome coverage, we searched the acquired MS spectra against the UniProt proteome database using MaxQuant operating with default parameters. Resubmission of the 1.9 M spectra to the Andromeda search engine identified 42,780 non-redundant peptide sequences (supplemental Table S4) corresponding to 2626 distinct E. coli proteins (supplemental Tables S5 and S6) with an FDR of 1% at the protein level. A detailed summary of all sub-datasets concerning the different fractionation methods and the two enzymes used can be found in supplemental Table S7. Although 2626 proteins represent about 61% of the annotated proteome, a dataset of similar size was reported before (36). In that study, combined proteome and transcriptome (microarray) analysis detected 2602 and 2543 E. coli gene products, respectively, of which 2219 proteins were identified in our dataset (supplemental Fig. S9A). Therefore, we estimate that E. coli grown in batch culture under aerobic conditions does not express more than about 2700 proteins and conclude that our dataset approached full coverage of the E. coli proteome. A Gene Ontology (59) term enrichment/depletion analysis of the expressed proteome revealed an underrepresentation of functions related to motion (e.g. flagellum organization (p = 3.45e-7), motor activity (p = 4.14e-6)) and transposons (e.g. transposition (p = 2.33e-16), transposase activity (p = 7.19e-11)). Therefore, these functions characterize the part of the E. coli proteome that we did not identify in our dataset. Further details about the functional analysis of the expressed proteome can be found in supplemental Figs. S9C and S9D and supplemental Tables S8-S12, respectively.
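An enrichment/depletion test of this kind, followed by Benjamini-Hochberg adjustment, can be sketched as below. The original analysis was performed in R; this is a Python illustration with hypothetical function names, and the two-sided p value is taken as twice the smaller tail probability, which is one common convention and an assumption here.

```python
from math import comb


def hypergeom_pmf(k, population, annotated, sample):
    """P(X = k) annotated proteins in a sample drawn without replacement."""
    return comb(annotated, k) * comb(population - annotated, sample - k) / comb(population, sample)


def two_sided_hypergeom(k, population, annotated, sample):
    """Assumption: two-sided p = 2 * min(depletion tail, enrichment tail), capped at 1."""
    low = max(0, sample - (population - annotated))
    high = min(annotated, sample)
    depletion = sum(hypergeom_pmf(i, population, annotated, sample) for i in range(low, k + 1))
    enrichment = sum(hypergeom_pmf(i, population, annotated, sample) for i in range(k, high + 1))
    return min(1.0, 2 * min(depletion, enrichment))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Here `population` would be the number of annotated proteins, `sample` the number of identified proteins, `annotated` the number of proteins carrying a given term, and `k` the number of identified proteins carrying it.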
Coverage of the E. coli Genome with Identified Peptides-This comprehensive dataset enabled us to address general features of bacterial proteomics experiments, especially in the context of the coverage of the genome sequence by detected peptides. We first defined the protein-coding part of the genome by mapping the 4309 proteins present in the UniProt E. coli database onto the chromosome (4.6 Mb) (Fig. 4A). This analysis revealed that 86.8% (4.0 Mb) of the genome is annotated in the protein database and therefore protein coding. We next used sequences of all 2626 proteins identified in our dataset to estimate the size of the expressed part of the genome, which corresponded to 65.4% (2.6 Mb) of protein-coding genome regions. Finally, mapping of the detected peptide sequences onto the chromosome captured 1.27 Mb of the raw genome sequence, matching 31.7% of the protein-coding part of the genome (Fig. 4B). The number of MS/MS events with which each nucleotide was represented ranged from 1 to 1344 with an average coverage of 20 MS/MS and median number of 7 MS/MS events per nucleotide (Fig. 4C).
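The coverage figures above amount to measuring the union of genomic intervals. A minimal sketch follows, assuming half-open (start, end) coordinates and peptide intervals that lie within coding regions; the function names are hypothetical.

```python
def covered_length(intervals):
    """Number of bases covered by the union of half-open (start, end) intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval extends the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def coverage_fraction(peptide_intervals, coding_intervals):
    """Share of the protein-coding sequence covered by mapped peptides."""
    return covered_length(peptide_intervals) / covered_length(coding_intervals)
```

Merging overlapping intervals first ensures that nucleotides hit by several peptides are counted once; with 1.27 Mb covered out of 4.0 Mb of coding sequence, the fraction is about 0.317.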

DISCUSSION
The performance of different search strategies in proteogenomic applications was explored in a number of previous studies. For example, the application of TDA to searches of protein databases derived via six-frame translation has been assessed recently (35), and previous reports point to important general considerations for the application of this approach in database searches (60,61). There is a global consensus that the increased size of the databases obtained via six-frame translation decreases the sensitivity and specificity of database searches and that the spurious protein sequences present in such databases make the statistical confidence of the resulting PSMs difficult to assess (13,62). To circumvent this issue, we hypothesized that the annotation of the protein-coding part of the E. coli genome approaches completeness. If correct, this would enable us to consider novel PSMs as false positive identifications and to assess general features of a typical bacterial proteo-genomic dataset, such as sensitivity, specificity, accuracy, and factual false discovery rate. Our hypothesis was confirmed by almost identical distributions of the PEP values of the novel and decoy hits, as well as by the low number of detected novel peptides. We note that several of the detected novel peptides were true positive hits, but because of their extremely low number we expect their effect on the reported values to be minimal.
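The reasoning above, in which novel PSMs are treated as false positives, can be sketched as a simple confusion-matrix calculation. This is an illustration under stated assumptions and not the exact procedure of this study. A PSM is accepted if its PEP is at or below a cutoff, accepted annotated PSMs count as true positives, accepted novel PSMs as false positives, rejected novel PSMs as true negatives, and rejected annotated PSMs as false negatives. The `Psm` class, the `confusion_metrics` function, and the toy values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Psm:
    pep: float       # posterior error probability of the match
    is_novel: bool   # True if the peptide maps only to six-frame-specific sequence


def _ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def confusion_metrics(psms, pep_cutoff):
    """Factual FDR, sensitivity, specificity and accuracy at a PEP cutoff.

    Assumption: every novel PSM is a false identification and every
    annotated PSM is a true one (a simplification for illustration).
    """
    tp = fp = tn = fn = 0
    for p in psms:
        accepted = p.pep <= pep_cutoff
        if accepted and not p.is_novel:
            tp += 1
        elif accepted and p.is_novel:
            fp += 1
        elif p.is_novel:
            tn += 1
        else:
            fn += 1
    return {
        "factual_fdr": _ratio(fp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Toy data: four annotated and four novel PSMs with made-up PEP values.
toy_psms = (
    [Psm(pep, False) for pep in (0.001, 0.005, 0.02, 0.2)]
    + [Psm(pep, True) for pep in (0.008, 0.3, 0.6, 0.9)]
)
print(confusion_metrics(toy_psms, pep_cutoff=0.01))
```

Because a small number of novel hits may be genuine, as noted above, a factual FDR computed this way would slightly overestimate the true error rate.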
We used two complementary MS/MS data processing frameworks: MaxQuant implementing the target-decoy approach, and Trans-Proteomic Pipeline/PeptideProphet using the mixture model approach for FDR assessment. Although both achieved deep proteome coverage, the MaxQuant-based workflow identified a significantly greater number of peptides, whereas the TPP-based workflow had a significantly lower factual FDR. However, TPP also led to decreased sensitivity, resulting in a smaller number of identified spectra, which can significantly affect the coverage of a genome sequence by detected peptides. In our view, the MaxQuant workflow led to a better tradeoff between maximal peptide identification rates (sensitivity) desired in proteo-genomic studies and FDR and is better suited for global genome reannotation studies; in contrast, the MMA and semi-supervised model are better suited for applications that require high specificity (e.g. the detection of splice variants or single nucleotide variation (SNVs)).
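For readers unfamiliar with the target-decoy approach mentioned above, the following generic sketch shows how a score threshold can be chosen for a desired FDR. It is not MaxQuant's implementation. It assumes that higher scores are better, that the decoy database is the same size as the target database, and that the FDR at a threshold is estimated as the number of accepted decoy hits divided by the number of accepted target hits. The function name `tda_threshold` and the toy scores are hypothetical.

```python
def tda_threshold(psms, max_fdr):
    """Find the most permissive score cutoff with estimated FDR <= max_fdr.

    psms: list of (score, is_decoy) pairs; higher score means a better match.
    Returns (score_cutoff, n_targets, n_decoys), or None if no cutoff qualifies.
    Score ties are not treated specially in this sketch.
    """
    ranked = sorted(psms, key=lambda item: item[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among hits accepted down to this score.
        if targets and decoys / targets <= max_fdr:
            best = (score, targets, decoys)
    return best


# Toy example with made-up scores; True marks a decoy (reversed) hit.
toy = [(100, False), (90, False), (80, False), (70, True),
       (60, False), (50, False), (40, True)]
print(tda_threshold(toy, max_fdr=0.25))
```

The loop keeps the lowest-scoring position at which the estimate still satisfies the limit, so a temporary excursion above the limit higher in the ranked list does not end the search.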
Somewhat surprisingly, our data point to a substantial underestimation of the FDR even in "simple" proteo-genomic experiments utilizing high-accuracy mass spectrometry. Although we cannot perform the same assessment for proteogenomics experiments in higher organisms (because of the lower quality of their annotation), we expect that these issues will be even more pronounced due to the size and complexity of their genomes (e.g. effects of alternative splicing). Therefore, special care should be taken to control the search space and calculate the FDR as accurately as possible. Several strategies for decreasing the search space in proteo-genomic databases have been proposed (63); we argue that the use of high mass accuracy is one of the most effective ways to achieve this. In this context, the use of "high-high" acquisition methods (ones in which the survey and MS/MS scans are acquired at high (ppm to sub-ppm) accuracy) (2, 3) will further improve the confidence of detected PSMs and become indispensable in future proteo-genomic experiments. These acquisition strategies still come with a prolonged duty cycle (i.e. lower number of acquired MS/MS events) on most of the currently used MS platforms, but recent advances in several high-accuracy analyzers (64) will soon enable the routine acquisition of high-accuracy MS data at all levels. However, novel peptides detected in such experiments should still undergo thorough investigation before being treated as true positive identifications, regardless of the acquisition method or proteo-genomic workflow used.
The comprehensive proteome dataset derived for the purpose of this study enabled us to assess another important aspect of a proteo-genomics experiment: coverage of the genome sequence by identified peptide sequences. The field of proteomics is approaching the stage at which all gene products expressed under specific conditions can be identified and quantified, and this applies especially to organisms with small and relatively simple genomes, such as bacteria and yeast (5). In addition to the detection of a gene product, genome re-annotation also requires high coverage of the genome sequence. In our study, we achieved comprehensive detection of the expressed E. coli proteome that was in agreement with previous studies (36); however, the identified peptide sequences covered 48.5% of the estimated expressed, 31.7% of all protein-coding, and only 27.5% of the total genome sequence. Interestingly, in the part of the E. coli genome covered by peptide sequences, each nucleotide was detected by an average of 20 MS/MS scans (median: 7 MS/MS scans), which corresponds to 20-fold base coverage in genomic terminology. As next-generation sequencing studies routinely achieve up to 50-fold base coverage of 99.9% of the genome sequence (65, 66), our results demonstrate the limitation of using MS-based proteomics for the sole purpose of genome annotation. Although MS technology is constantly improving, it is hard to see how the genome sequence coverage by detected peptides could reach the level achieved by next-generation sequencing technology. In addition, proteomics can address only the protein-coding part of the genome, which will be a major problem in large genomes such as the human genome, in which only about 1% of the total genome sequence is protein coding.
Therefore, we believe that the major impact of proteogenomics will be not in genome re-annotation, but in the analysis of features that are beyond the reach of genomics, such as posttranslational modifications of proteins in the context of individualized protein databases derived via next-generation sequencing. However, the routine application of proteomics in these areas will require further substantial improvements aimed at increasing the sequencing speed/coverage (MS level) and specificity/sensitivity (bioinformatic workflows).