GAPP: A Proteogenomic Software for Genome Annotation and Global Profiling of Post-translational Modifications in Prokaryotes*

Although the number of sequenced prokaryotic genomes is growing rapidly, experimentally verified annotation of prokaryotic genome remains patchy and challenging. To facilitate genome annotation efforts for prokaryotes, we developed an open source software called GAPP for genome annotation and global profiling of post-translational modifications (PTMs) in prokaryotes. With a single command, it provides a standard workflow to validate and refine predicted genetic models and discover diverse PTM events. We demonstrated the utility of GAPP using proteomic data from Helicobacter pylori, one of the major human pathogens that is responsible for many gastric diseases. Our results confirmed 84.9% of the existing predicted H. pylori proteins, identified 20 novel protein coding genes, and corrected four existing gene models with regard to translation initiation sites. In particular, GAPP revealed a large repertoire of PTMs using the same proteomic data and provided a rich resource that can be used to examine the functions of reversible modifications in this human pathogen. This software is a powerful tool for genome annotation and global discovery of PTMs and is applicable to any sequenced prokaryotic organism; we expect that it will become an integral part of ongoing genome annotation efforts for prokaryotes. GAPP is freely available at https://sourceforge.net/projects/gappproteogenomic/.

Thousands of prokaryotic genomes have been sequenced with the rapid advance of massively parallel sequencing technologies. Protein-coding gene annotation is a prerequisite for understanding the functional elements of the genome. Previously, conventional genome annotation was mainly dependent on computational ab initio algorithms to predict coding regions and gene models from raw sequencing data (1,2). Computational ab initio algorithms, which are hard to avoid errors such as missing genes and incorrect gene boundaries, are limited in their prediction accuracy. Recently, the use of information from RNA-seq or expressed sequence tag libraries has dramatically improved the genome annotation confidence (3). However, abundance of transcripts does not necessarily reflect cellular protein levels, and the correlation between protein and mRNA levels is generally slight (4 -6). The existence and function of many gene products remain to be confirmed. Thus, MS has emerged as an invaluable highthroughput method for accurate genome annotation, a concept called "proteogenomics" (7)(8)(9).
Proteogenomics is defined as the use of proteomic data, often derived from MS, to improve and refine genome annotation (10,11). Pioneering work by Yates et al. (12) and Jaffe et al. (13) opened new avenues to high-throughput gene annotation. In recent years, proteogenomics has emerged as a promising and indispensable approach to genome annotation (10,11). It has been applied for genome annotation including identification of novel genes, correction and validation of predicted genes in various organisms (14 -21). Proteogenomics has become particularly important for prokaryotes where genomes are sequenced on a daily basis and in silico gene annotations are prone to errors (22,23). It has been successfully applied to re-annotate bacterial and archaeal genomes and has provided invaluable information for accurate genome annotation of these prokaryotes (18, 24 -36). Besides improving genome annotations, proteogenomic studies can provide valuable information about the presence of post-translational modifications (PTMs) 1 on the proteome-wide level (36 -38). PTMs are a vital cellular control mechanism, which regulates every protein property such as folding, conformation, activity, and functions (39). Almost all the proteins undergo certain amount of PTMs, and this reversible process occurs in most cell compartments to determine the functions of modified protein (40,41). More than 600 different types of PTMs have been reported by the RESID database, and many more are still being identified (42). Identification of PTMs in a protein has improved dramatically, mainly because of rapid advancement of MS technology (43,44). However, proteomewide identification of PTMs remains a highly challenging task and is a much needed analytical tool (45). This situation is exemplified by very few software tools being developed to comprehensively annotate PTMs events at the proteomic level (40).
Over the last few years, computational proteomics has become a dramatically growing field, and a handful of tools have been developed to execute complete proteogenomic analyses (11,46,47). Two excellent reviews have described a comprehensive overview of the various problems commonly encountered and their current solutions for this growing research area (8,11). A number of automated software, including several visualization and database-building tools, have been developed for integration of MS-based proteomic data into genome databases (48 -59). However, there is still a lack of automated software for proteogenomic analyses that incorporate both genome annotation and proteome-wide PTM analysis (11). To fill this gap, we developed a one-stop open source software termed GAPP, which provides a complete proteogenomic pipeline for carrying out both genome annotation and large-scale PTM analysis on a proteome-wide level against prokaryotes. It is an open source and publicly available on Sourceforge (https://sourceforge.net/projects/ gappproteogenomic/). A single command is sufficient to convert data formats, create and search databases using one or more search engines, analyze and integrate results statistically, calculate false discovery rates (FDRs), group and annotate proteins, and discover and annotate PTM events globally. Thus, GAPP enables users to improve genome annotation and identity PTM events from the same experimental proteomic data sets concurrently. To test the effectiveness and versatility of GAPP for annotating prokaryotic genomes, we analyzed proteomic data from the Helicobacter pylori strain 26695 (H. pylori) (14), one of the major human pathogens that are responsible for many gastric diseases such as duodenal ulcers and gastric cancer. Previously, Mü ller et al. identified 71% of the predicted proteome using a typical proteogenomic approach (14). An increasing number of proteomic studies have also been performed to understand the mechanisms involved in the stress responses (60) as well as pathogenic processes (61,62). However, a protein database with high quality is still not achieved for this major human pathogen. In this study, we used GAPP to perform an in-depth proteogenomic analysis of H. pylori using publicly available proteomic data (14). We search spectral data against the genome-translated database of H. pylori and selected peptide identifications at Ͻ1% FDR to reannotate the H. pylori genome. We identified a large number of known and missing genes and present a holistic view of PTM events of H. pylori.

EXPERIMENTAL PROCEDURES
Data and Databases-All MS data for proteogenomic analysis on H. pylori were obtained from PRIDE proteomics data repository (63) with the data set identifier PXD000054. A total of six different data sets that included two biological replicates were used (14). The genome and all annotated protein sequences were obtained from the NCBI ftp site (NCBI reference sequences NC_000915.1; 11.07.2012). In total, 1,102,122 MS/MS spectra were selected for further analysis.
Experimental Design and Statistical Rationale-Development of GAPP-We developed a software called GAPP, an automated standalone pipeline for in-depth proteogenomic analysis and PTM discovery in prokaryotes. Three pregenerated search algorithms X!Tandem, MSAmanda, and MSGF have been integrated into the GAPP tool for proteogenomic analysis. The unrestricted search algorithm MODification via alignment (MODa) is already integrated as a part of the standard distribution for PTM discovery. Moreover, integration with the results from other existing search algorithms is easily assembled into this tool. It is highly configurable, and any identification results from existing frameworks can be used for further process. GAPP tool uses common file formats (such as FASTA, mascot generic format (MGF), GFF, and TXT) for configuration, and most outputs are in the CSV format, making it easy for other programs to read and write the configuration files. All inputs for the GAPP tool are defined in a configuration file. The core software was developed in Java and Perl, and the executables are distributed for Windows platforms. The source code is free for noncommercial use and available at https://sourceforge.net/projects/gappproteogenomic/.
MS Data Analysis-Proteins were identified by database searching using the GAPP tool with several parameters: (1) two missed cleavages were allowed for trypsin and LysC, whereas three were set for AspN; (2) precursor ion mass tolerance was 5 ppm; (3) fragment ion mass tolerance was 0.5 Da; and (4) fixed PTM was carbamidomethylation of cysteine residues. For parameters not common across search algorithms, we used the search algorithm defaults.
The unrestrictive database search algorithms, MODa (MODification via alignment) (64) were integrated into GAPP to find all known and even possibly unknown types of PTMs at once, allowing for arbitrary modification with mass shift up to 250 Da per peptide. The parameters were set as follows: (1) PPMTolerance was 5 ppm; (2) Frag-Tolerance was 0.5 Da; (3) BlindMode was 2, which allows arbitrary modifications per peptide; (4) MinModSize was Ϫ250 and maxMod-Size was ϩ250; (5) two missed cleavages were allowed for Trypsin and LysC, whereas three were set for AspN; and (6) fixed posttranslational modification was carbamidomethylation of cysteine residues. Peptide spectrum matches (PSMs) identified with a 1% FDR threshold were considered using anal_moda.jar for further PTMs analysis. After confirming modification type, the MaxQuant computational proteomics platform (version 1.3.0.5) was used to further search against the protein database for determining the localization of modifications to specific amino acids in the modified peptide sequence as previously described (65). The estimated FDR thresholds for modification site, peptide and protein were specified at maximum 1%.

RESULTS
Overview of GAPP-GAPP is designed as an automated pipeline that contains two modules: (1) Module I (genomicRef) allows MS/MS data to be analyzed against the genome and proteome and (2) Module II (annoMod) is designed to discover PTM events. GAPP is easy to configure and run in one command line. The main input data type accepted is MGF (mascot generic format) format, which can be easily converted by msconvert tool (66) in this pipeline, and the output format includes CSV. The workflow and modules are outlined in Fig.  1. MS raw data files as well as processed data files are publicly available through the repository PeptideAtlas (www. peptideatlas.org) (data ID: PASS00860).
Module I (genomicRef)-All automated MS data analyses including peptide and protein identification, and annotation are performed by Module I. Three publicly available search algorithms, namely X!Tandem (67), MSAmanda (68), and MSGF (69), have been deployed within this step to improve the sensitivity for peptide identifications. All search criteria can be specified by the user using a configuration file. In addition, the results from other existing search algorithms can be integrated into this pipeline. All identified peptides from different search algorithms were then mapped onto the sixframe translated genome database and the existing predicted protein database using BLASTP. Finally, a set of unique orphan peptides that were not represented among the existing predicted proteins, were identified. These novel peptides, mapped exclusively to unique locations on the genome, were designated as genome search-specific peptides (GSSPs) as previously reported (65,70,71). Because all the results from the existing search algorithms were combined to improve the sensitivity for peptide identifications, it is conceivable that a spectrum would be matched to multiple different peptide sequences that identified by different search algorithms. Therefore, these peptides were not selected for further anal-yses. Moreover, the purpose of this pipeline was to annotate genomic loci as completely as possible, a peptide mapped to different genomic loci, namely a shared peptide, was also discarded in analogy to the strategy used in a previous publication (72).
For estimation of identification error rates, GAPP used the target-decoy search strategy to calculate FDR for all employed algorithms' results (73)(74)(75). Only peptides that pass the FDR threshold (1%) from multiple search algorithms can be merged for identifying credible annotated peptides. Because of nonuniform distribution of false positives between annotated and novel peptides, the actual FDR of novel peptides could be much higher than the combined FDR estimation (1%). To further improve the identification accuracy of novel peptides, GAPP calculated the FDR using a more stringent filtering strategy, namely separated FDR (76,77). The FDR for novel peptides was estimated according to the following equation: where D ϩ is the number of identified decoy peptides with scores above the score threshold, T n ϩ is the number of identified novel peptides in the target database above the score threshold, D n is the number of identified decoy novel peptides, and D is the total number of identified decoy peptides. D n /D is an approximation for the fraction of novel sequences in the search space. On the basis of stringent FDR control (FDR n Ͻ 1%), the list of highly reliable GSSPs was generated by GAPP.
Once peptide hits generated by multiple search algorithms are directly mapped back to their original unique genomic locations, open reading frame (ORF) predictions will be performed by investigating the identified peptide sequence, peptide score, peptide length, and possible presence of start codons. Moreover, the proteins identified with only a single unique peptide and shared peptide were then not considered for further analysis. Finally, the ORFs will be categorized: (1) mapping to an intergenic region, (2) mapping to a different reading frame region with an annotated gene region, (3) mapping to an opposite strand to an existing annotated gene region, and (4) partially overlapping an annotated gene's N terminus. In addition, the GFF file (General Feature Format) containing all novel peptides will also be provided for further visualization in genome browsers (such as the UCSC Genome Browser and the Integrative Genomics Viewer). Additional functionality for novel protein sequence alignment and gene ontology (GO) functional annotations are added to the pipeline. The identified novel genes were first mapped onto the Uniref database (ftp://ftp.uniprot.org/pub/databases/uniprot/ uniref/uniref50/uniref50.fasta.gz) via blastp and the E-value for each gene with a cutoff score (E-value Ͻ0.00001) was selected. GAPP then assigned the most suitable functional labels to these novel genes by mapping to existent annotation associations according to the GO function database (http:// archive.geneontology.org/latest-lite/go_20160326-seqdb. fasta).
Module II (annoMod)-To identify unexpected PTM types at once on the proteome-wide level, we used unrestrictive or blind approaches to analyze mass spectra. MODa (MODification via alignment) algorithms were integrated into GAPP to search high-throughput MS data against proteome database, allowing for arbitrary modification with mass shift up to 250 Da per peptide. Moreover, to assign the locations of modifications to specific amino acids, the specific PTM searches are performed by MaxQuant to further validate the PTM events.
Identification of Novel Protein-Coding Regions in H. pylori-To demonstrate the utility and scalability of GAPP, we benchmarked GAPP on a proteogenomic spectral data set from a publicly available, high-throughput MS/MS data set on H. pylori. All 1,102,122 spectra from six different data sets were searched against the H. pylori genome database. After eliminating the 507 peptides mapped to multiple genomic locations, a total of 511,560 peptide-spectrum matches corresponding to 35,487 unique peptides were identified with 1% FDR threshold. The whole peptides identified from the pro-teogenomic analysis of H. pylori are listed in supplemental  Table S1.
Notably, about 84.9% (1248) of the existing predicted H. pylori proteins were identified with at least two unique peptides, indicating the high coverage of protein identification. More importantly, we identified a set of unique orphan peptides (302) that were designated as GSSPs and mapped to unique locations on H. pylori genome. Out of these peptides, 171 peptides either mapped to regions of the genome where no gene was annotated or did not match the gene model to which they were mapped. In total, we identified 20 novel protein-coding genes and modified four existing gene models in H. pylori, all of which were assigned by at least two unique GSSPs (supplemental Table S2 and supplemental Data S1). All annotated spectra for GSSPs were submitted to MS-Viewer (78) (http://msviewer.ucsf.edu/prospector/cgi-bin/ msform.cgi?formϭmsviewer) with the search key: lijxwmvfb3 (X!Tandem), 925qqmgeso (MSGF ϩ ) and e6bair7m6k (MS-Amanda). In contrast to the report by Stephan et al. (14), we observed more novel protein-coding genes using GAPP on the H. pylori genome. A graphical representation of all the GSSPs, novel protein-coding genes, and corrected proteins is presented in Fig. 2A. The length of 16 novel proteins was Ͻ300 amino acids, and the average length of these proteins was 145, suggesting that most novel protein-coding genes are encoded by short ORFs (supplemental Fig. S1).
Among of these novel protein-coding genes, 17 were intergenic and 3 were on different frame. Because the presence of orthologous genes in other species will give a powerful evidence for the novel protein-coding gene models, 11 novel proteins had been previously annotated in other species. Additionally, further functional annotation was performed by assigning the identified novel protein-coding genes to biological process, molecular function, and cellular component categories. As shown in Fig. 2B, we found that 8 (40%) novel proteins were completely annotated according to their biological processes and molecular functions, implying the existence and importance of the identified novel proteins in H. pylori. Fig. 3A shows an example of a novel protein-coding protein identified with 15 peptides mapping to an intergenic region of the genome. Moreover, one of novel protein-coding genes mapped onto different reading-frame regions is present in supplemental Fig. S2. The GSSPs mapped upstream of annotated gene models on the same frame and strand indicate the changes in annotated gene models. Fig. 3B depicts an example of a correction to a gene model via extension of the N terminus. Two peptides map upstream to the currently annotated gene model for locus HP1302, which is also annotated in H. pylori HPAG1 as 30S ribosomal protein. Details of all novel genes and corrected gene models identified by GAPP are listed in supplemental Table S3.
Global PTM Discovery in H. pylori-The proteogenomic study can not only be employed to reannotated genomes but also be used to obtain PTM events (79,80). Recently, we reported a proteogenomic analysis and global profiling of PTMs in model cyanobacterium Synechococcus sp. PCC 7002, providing a holistic view of PTM events in this model cyanobacteria (65). In the present study, the MODa algorithm was first integrated into the proteogenomic pipeline for analyzing the mass spectra and identifying PTMs at once (81). As expected, we identified 20,057 unique peptides from 1308 proteins by MODa. Among these, 56,220 spectra of 7543 unique peptides and 1083 proteins contained PTMs with sizes accepted by the MODa search regardless of the modification classification in Unimod database (www.unimod.org). Details of all the PTMs are provided in supplemental Table S4.
However, the unrestrictive database search algorithms are used to identify unexpected modifications and allocate posttranslationally modified peptide to MS/MS spectra, but it is often difficult to assign the accurate localization of modifications to specific sites. To this end, the MaxQuant computational proteomics platform was used to search for the specific PTM and determine the localization probability of modifications in peptides (82). We chose 22 common modification types in prokaryotic organisms to further search against H. pylori protein database using MaxQuant. In all, 1854 unique peptides encompassing 22 different PTMs from 805 H. pylori proteins were identified with high confidence (FDR Ͻ1%). To improve the search accuracy and confidence, peptide identifications of which the score for localization is Ͻ 40 were eliminated from the search results. Finally, 3415 modification sites on 1854 unique peptides from 805 proteins were identified (Fig. 4A). A detailed list of the identified peptides with each PTM is present in supplemental Data S2.
As shown in Fig. 4B, we further classified the entire set of predicted, MS-detected novel, and modified proteins into a wide range of functional classes by searching against the NCBI COG (Clusters of Orthologous Genes) database. Notably, we found that certain functional classes were statistically overrepresented (p Ͻ 0.05) in certain sets of modified proteins. For instance, the identified proteins with nitrosylation, geranylation, and diphthamide PTMs were overrepresented in coenzyme transport and metabolism, defense mechanisms, and inorganic ion transport, suggesting that these PTM events may be extensively involved in diverse functions of H. pylori.
Proteogenomic Analysis of Bradyrhizobium japonicum USDA110 -To test GAPP's generalizability and applicability pylori. An overview of peptide data against the genome was generated using Circos software. Diverse data features analyzed in this study were mapped onto H. pylori genome. The concentric circles from the periphery to the center represent several factors: (i) H. pylori chromosomes, (ii) proteins encoded by the genome, (iii) GC content of the H. pylori genome, (iv) novel peptides identified in this study, (v) novel proteins discovered in this study, 1: intergenic genes; 2: different frame with annotated genes discovered in this study, (vi) revised gene models (N-terminal changes in annotated genes). B, Histogram representations of identified novel proteins' distribution according to their biological processes, molecular functions, and cellular localization.
in annotating prokaryotic genomes, it was also applied on Bradyrhizobium japonicum USDA110 data (Pride Accession 10099 -10104 and 10114 -10116), a model organism to study agriculturally important rhizobium-legume symbiosis. We confirmed 37% (3110) of known genes, discovered 51 high reliable previously unannotated protein coding genes, and corrected annotation for 10 genes that show rapid proteogenomic analyses of another prokaryotic genome (supplemental Data S3). Among of these novel protein-coding genes, 28 were intergenic, 5 were on different frame, and 18 were on the strand opposite to an existing gene annotation. More importantly, GAPP was used to perform the global profiling of post-translational modifications (PTMs) and revealed a large repertoire of PTM events using the same proteomic data. Our results provided a rich resource that can be used to examine the functions of reversible modifications in B. japonicum. Details of all identified PTM events and modified proteins are provided in supplemental Data S3 and available at www.peptideatlas.org/PASS/PASS00860. DISCUSSION GAPP is a one-stop proteogenomic software for prokaryotic genome refinement and global identification of PTM events from the same experimental proteomic data sets. This software allows concurrent querying of proteomic and genomic databases to refine genome and identify diverse PTM events. This includes MS data and database preprocessing, database searches, FDR calculations, statistical result integration, biological interpretation, and global PTM discovery. In addition, GAPP is a simple and efficient proteogenomic software, which requires minimal operator intervention.
GAPP integrated MS data and database preprocessing as well as multiple database search algorithms into the pipeline; this is different from other existing tools (72,83). All database search algorithms in this software were designed to run in parallel on a single machine or on a computer cluster to increase the search efficiency. Two advantages of GAPP are the automation of steps and the simplified command line API. To simplify installation, GAPP has been packaged into the software with all functions installed and configured, enabling plug-and-play use. Therefore, GAPP is accessible to a broad class of users including laboratories with limited bioinformatic skills and could be applied as a standard part of the genome annotation projects.
In proteogenomic analysis, multiple algorithmic search approaches can increase sensitivity and specificity in peptide identification more than any single algorithm (84). The targetdecoy strategy (TDS) is the most widely used strategy to estimate FDR (85) and many effective methods have been developed to improve the performance and reliability of TDS (86 -88). However, the error may be inflated in novel PSM identifications. To accurately identify novel peptides, it is necessary to filter the search results to control the FDR. Recently, more stringent filtering strategies, such as posterror probability (89) or separate FDRs for annotated and novel peptides (46,90) were employed in proteomic analysis. In this work, we used a more stringent strategy (novel FDR) to increase the accuracy of identified novel peptides (76,77). However, because of the lower expression levels or worse scores of the novel peptides in proteogenomic study, the FDR calculation is difficult to explicitly estimate to exclude false peptides (90). To further confirm the identified novel genes, GAPP applied a postprocessing of identification strategy, including the novel protein sequence alignment and GO functional annotation. This was done to validate identified novel genes, which is independent of the discovery of novel peptides and indirectly verifies the existence of the identified novel genes. Moreover, if the strand-specific transcriptome information is available, it will provide additional evidence for the identification of novel genes (65,91).
At present, identification of large numbers of PTMs from complex proteome mixtures remains a critical challenge because the database search space expands exponentially as the number of PTMs increases (11,36,92). The search space inflates dramatically with each additional variable modification added in the "restricted search" mode, leading to low interpretation rates of MS/MS spectra and high false-positive rates (93,94). Recently, the unrestricted database search algorithms, which do not require prior knowledge of the modifications present in the sample, are used in high-throughput proteome analysis (90). GAPP integrated the unrestrictive PTMs search strategy (MODa) into the pipeline (64) and identified a large number of novel modifications in H. pylori. However, MODa could not provide the accurate localization of modifications to specific sites in the identified peptide sequences. To address this problem, the restricted search mode was used to search for the specific PTM and determine the localization probability of modifications in peptides (82). As each strategy has its own strengths and weaknesses, the combination of both unrestricted and restricted algorithms could improve the overall accuracy of PTM identification.
In this study, the identification of many novel proteins and PTM events have expanded the proteomic landscape of H. pylori. These findings also reveal the capability and versatility of GAPP for in-depth annotation of prokaryotic genomes. We anticipate that GAPP could be applied to any sequenced prokaryotic organism to yield similar sets of novel discoveries and believe that it represents a useful contribution to the software toolset for proteogenomics. ʈ These authors contributed equally to this work.