Analysis of the Secretome and Identification of Novel Constituents from Culture Filtrate of Bacillus Calmette-Guérin Using High-resolution Mass Spectrometry

Tuberculosis (TB) is an infectious bacterial disease that causes morbidity and mortality, especially in developing countries. Although its efficacy against TB has displayed a high degree of variability (0%–80%) in different trials, Mycobacterium bovis bacillus Calmette-Guérin (BCG) has been recognized as an important weapon for preventing TB worldwide for over 80 years. Because secreted proteins often play vital roles in the interaction between bacteria and host cells, the secretome of mycobacteria is considered to be an attractive reservoir of potential candidate antigens for the development of novel vaccines and diagnostic reagents. In this study, we performed a proteomic analysis of BCG culture filtrate proteins using SDS-PAGE and high-resolution Fourier transform mass spectrometry. In total, 239 proteins (1555 unique peptides) were identified, including 185 secreted proteins or lipoproteins. Furthermore, 17 novel protein products not annotated in the BCG database were detected and validated by means of RT-PCR at the transcriptional level. Additionally, the translational start sites of 52 proteins were confirmed, and 22 proteins were validated through extension of the translational start sites based on N-terminus-derived peptides. There are 103 secreted proteins that have not been reported in previous studies on the mycobacterial secretome and are unique to our study. The physicochemical characteristics of the secreted proteins were determined. Major components from the culture supernatant, including low-molecular-weight antigens, lipoproteins, Pro-Glu and Pro-Pro-Glu family proteins, and Mce family proteins, are discussed; some components represent potential predominant antigens in the humoral and cellular immune responses.

Tuberculosis (TB) is one of the greatest killers worldwide, especially in developing countries. About one-third of the world's population has been infected by TB bacteria, and those infected have a 10% lifetime risk of falling ill with TB (1). In 2011, approximately 8.7 million people fell ill and 1.4 million died from TB. It is notable that TB and human immunodeficiency virus (HIV) can form a lethal combination, each speeding the other's progress. Additionally, drug-resistant TB is growing and is present in virtually all of the countries surveyed (2). Although TB is a treatable and curable disease, co-infection with HIV and the emergence of drug-resistant strains have made treatment a heavy economic burden, and there are severe adverse drug reactions in patients. Currently, an important weapon in the fight against TB is Mycobacterium bovis bacillus Calmette-Guérin (BCG) (3). The BCG vaccine has existed for over 80 years and has a documented protective effect against meningitis and disseminated TB in children (4). However, it does not prevent primary infection and, more importantly, does not prevent the reactivation of a latent pulmonary infection. Furthermore, the efficacy of BCG against TB has displayed a high degree of variability (0%-80%) in different trials (5). Therefore, the impact of BCG vaccination on the transmission of TB is limited, and new vaccination strategies as alternatives or complements to BCG are urgently needed, particularly against primary infection and latent pulmonary infection.
Bacterial secreted proteins, which are specifically released into the surrounding extracellular milieu, constitute a large and biologically important subset of proteins that are involved in cellular communication, adhesion, and migration (6). In Gram-positive bacteria, secreted proteins can be anchored to the cytoplasmic membrane, associated with the cell wall, released into the extracellular milieu, or injected into a host cell (7). A significant number of mycobacterial proteins have been shown to be secreted or exported during growth; these proteins are central to pathogenesis, and some of them have been shown to be key T-cell antigens mediating protective immunity against TB (8). The secretome of mycobacteria is considered to be an attractive reservoir of potential candidate antigens for the development of new vaccines and diagnostic reagents. However, secretome analysis is quite challenging, and bacterial secretomes have often been under-studied. This scenario could be attributed to technical limitations such as the presence of low-abundance proteins or contamination by cytoplasmic or other normally nonsecreted proteins released following cell lysis and death (9). Several attempts have been made to define the secretome of M. tuberculosis using two-dimensional gel electrophoresis or liquid chromatography (LC) coupled with different types of MS analysis. For example, Mattow et al. utilized two-dimensional gel electrophoresis coupled with MALDI-MS and capillary LC-electrospray ionization-MS/MS to identify 137 proteins from culture supernatant, only 42 of which had previously been described as secreted proteins (8). Okkels et al. applied a narrow-range pI gradient two-dimensional gel electrophoresis separation combined with MALDI-MS and electrospray ionization MS/MS to characterize eight ESAT-6 spots, among which four species of full-length ESAT-6 were identified (10). Malen et al. used two-dimensional gel electrophoresis coupled with a MALDI-TOF MS/MS approach to identify 257 mycobacterial proteins, including 159 secreted proteins (11).
Although some secretome studies on M. tuberculosis have been performed, a comprehensive analysis of the BCG secretome has not been undertaken. Therefore, our knowledge of the important secreted immune constituents of BCG and their functions against TB remains limited. In 2003, Florio et al. identified 12 proteins in the culture filtrate (CF) of BCG in the pI range of 6-11 using two-dimensional gel electrophoresis and MALDI-TOF, and only three of them had not been described previously (12). Using a similar method, in 2010, Rodriguez-Alvarez et al. compared the secretomes of wild-type M. bovis and a PstS1-recombinant BCG vaccine substrain (rBCG38) and identified six conserved hypothetical proteins that are differentially expressed (13). Recently, Berredo-Pinho et al. reported the proteomic profile of culture filtrate proteins (CFPs) from M. bovis BCG Moreau and identified 101 different proteins, of which 53 were thought to be secreted proteins (14). Although proteome-wide studies performed on CFPs of BCG are still limited because of their frequently low concentrations, recent developments in Fourier transform MS with high resolution and accuracy at both the MS and MS/MS levels might substantially promote comprehensive investigations of secreted protein profiling (15). Moreover, to reduce the complexity of the extracted CFPs, a one-dimensional gel electrophoresis separation outperformed previous strategies with respect to the number of identified proteins, reproducibility, and throughput (9). In the present study, BCG CFPs were separated by means of one-dimensional gel electrophoresis, and 12 gel slices were cut. The resulting peptides were separated via reversed-phase LC and analyzed using a high-resolution LTQ Orbitrap Velos to improve the identification coverage and reliability. In total, 239 CFPs were identified with high confidence, including 185 potential secreted proteins or lipoproteins. 
We also obtained 17 novel protein products that were not annotated in the BCG database and for which we performed RT-PCR at the transcriptional level to support the existence of these proteins. We found 103 secreted proteins that have not been reported in previous studies on the mycobacterial secretome and that are unique to our study. Additionally, 52 existing annotated proteins were confirmed with correctly assigned translational start sites (TSSs), and 22 proteins were validated by extension with TSSs based on N-terminal peptides. This study is a secretomic repertoire of BCG, and some of the potential prominent antigens implicated in protective immune responses will most likely contribute to the design of future vaccination and diagnostic strategies against TB.

EXPERIMENTAL PROCEDURES
Strains and Sample Preparation-M. bovis var BCG NCTC 5692 was grown in 5 l of liquid Sauton medium, and cells were collected in the mid-exponential range (A600 of 0.4-0.5) after 14 days of incubation with gentle agitation. The culture supernatant and cells were separated via filtration through first a 0.45-µm-pore-size membrane and then a 0.22-µm-pore-size membrane (Millipore, Bedford, MA). The CFP samples were prepared as described elsewhere, with some modifications (8). Briefly, after the addition of a protease inhibitor mixture (Roche, Germany), the resulting CF was treated with 0.015% (w/v) sodium deoxycholate under shaking, incubated for 10 min at room temperature, and subsequently subjected to a trichloroacetic acid (10%, v/v) precipitation procedure. The resulting solution was incubated overnight at 4°C and then centrifuged at 4000 × g for 15 min to collect the precipitates. After the precipitates had been washed twice with ice-cold acetone and allowed to air dry, their protein content was quantitated via the bicinchoninic acid protein assay. The protein pellets were then suspended in SDS-PAGE loading buffer and dissolved for 2 h. The samples were boiled for 10 min and subsequently centrifuged at maximum speed for 30 min, and the resulting supernatant was subjected to SDS-PAGE.
In-gel Digestion-Approximately 10 µg of protein from the CF was loaded onto a 12% SDS-PAGE gel (1.0 mm thick, width/length of 8.6/6.8 cm) and stained using colloidal Coomassie Blue stain. After the excess stain had been removed, each lane was cut into 12 bands and subjected to an in-gel tryptic digestion protocol as described elsewhere (16). Briefly, sliced bands were washed twice with 50% acetonitrile in 50 mM ammonium bicarbonate (NH4HCO3) for 15 min at room temperature and then dehydrated with 100% acetonitrile. The in-gel reduction was performed using 10 mM dithiothreitol at 37°C for 45 min followed by alkylation using 55 mM iodoacetamide at room temperature for 30 min in the dark, and an in-gel digestion was conducted using modified trypsin (trypsin/protein ratio of 1/10 (w/w), Promega, Madison, WI) at 37°C for 16 to 20 h. All of the tryptic peptides extracted from the gel slices were desalted using ZipTip C18 (Millipore, Bedford, MA) and were solubilized in 0.1% formic acid for subsequent LC-MS/MS analysis.
In-house Database Construction-For proteomic discovery, we constructed two in-house databases. The first was the six reading frame translation database of the BCG genome (downloaded from NCBI). This construction was performed by translating the entire genome in all six reading frame options, three forward and three on the reversed DNA strand (17). Briefly, the codons TAA, TAG, and TGA were selected as the stop codons in a certain frame, and putative open reading frames (ORFs) were generated by translating sequences from the first nucleotide to a stop codon. The next putative ORF was started at the next nucleotide following the previous stop codon. This procedure was performed on both DNA strands of the chromosome in all three reading frames. Entries containing fewer than 15 aa or redundant sequences from repetitive genomic information were deleted for simplification. In total, we obtained a set of 111,825 possible entries. All of the entries here were named BCGRF000001 to BCGRF111825 with the frame tag. Moreover, the annotations with the same frames were replaced with original names in the BCG genome. For example, entry BCGRF000001 was renamed BCG0001, which was annotated in the BCG genome data set with the same frame as BCGRF000001. Additionally, sequences for common contaminants (338 unique entries) from two collections (248 from the Max Planck Institute of Biochemistry, 112 from the Global Proteome Machine Organization Common Repository of Adventitious Protein) were appended to the end of the target database FASTA file (supplemental Text S1). In total, the final database had 112,163 entries.
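The six-frame construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the frame labels ("+1" to "-3"), and the handling of ambiguous bases are hypothetical choices, and the renaming of entries and the appending of contaminant sequences are omitted. It translates each strand in three frames with the standard genetic code, splits every translation at the stop codons TAA, TAG, and TGA, and keeps non-redundant segments of at least 15 aa.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks the stop codons TAA, TAG, and TGA.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_entries(genome, min_len=15):
    """Return (frame, peptide) pairs for stop-to-stop segments in six frames.

    Segments shorter than min_len residues and duplicated sequences are dropped.
    """
    genome = genome.upper()
    seen = set()
    entries = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frame = f"{strand}{offset + 1}"
            for segment in translate(seq[offset:]).split("*"):
                if len(segment) >= min_len and segment not in seen:
                    seen.add(segment)
                    entries.append((frame, segment))
    return entries
```

Because each segment runs from one stop codon to the next rather than from a predicted start codon, peptides that lie upstream of an annotated start site or in an unannotated region remain searchable, which is what makes this kind of database useful for finding novel protein products.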
The second database was a specialized N-terminal extension database that was constructed as described elsewhere (18). All of the customized entries were merged into the extension database, except for those entries for which the start codons were the same as in the previous annotation. In total, from the annotated sequences listed in the BCG genome, 1805 alternative start site entries were collected in the extension database.
MS Analysis and Database Search for Protein Identification-Digested peptide mixtures were separated using a nanoAcquity ultraperformance LC system (Waters, Milford, MA) equipped with a C18 reversed-phase microcapillary trapping column (nanoAcquity Symmetry C18, 5 µm, 180 µm × 20 mm) and an analytical column (nanoAcquity BEH 300 C18, 1.7 µm, 100 µm × 100 mm). The outlet of the analytical column was coupled directly to a high-resolution LTQ Orbitrap Velos mass spectrometer (Thermo Fisher Scientific, Germany) using a nano-electrospray ion source. Peptides were eluted through the analytical column with a constant flow at 0.4 µl/min using a 160-min gradient with aqueous solvents A (0.1% HCOOH) and B (0.1% HCOOH, 80% CH3CN). During the elution step, the percentage of solvent B increased in a linear fashion from 5% to 35% at 5-95 min, followed by an increase to 85% at 95-130 min, a column wash at 85% at 130-145 min, and re-equilibration at 1% B at 146-160 min. The eluted peptides were introduced into the mass spectrometer using a PicoTip Emitter (SilicaTip 360 µm outer diameter × 20 µm inner diameter, 10 ± 1 µm) (New Objective, Woburn, MA) and were electrosprayed with a distally applied spray voltage of 2.0 kV. Full scan MS spectra with an m/z range of 380 to 2000 were acquired in profile mode with a resolution of 60,000 in the Orbitrap. The most intense precursor ions (up to 20, multiply charged (2+ or 3+)) from the full scan were selected for fragmentation by collision-induced dissociation and were detected in the Orbitrap with a resolution of 7500. The dynamic exclusion list for MS/MS was restricted to 5000 entries, with a maximum retention period of 60 s and a relative mass window of 10 ppm. A normalized collision energy of 35% was used for the MS/MS, and the data were acquired in centroid mode. Additionally, an activation Q-value of 0.25 and an activation time of 10 ms were applied for the MS/MS. 
Lock mass calibration using a background ion from the air (m/z 445.12003) was applied. In total, we performed 36 reversed-phase LC-MS/MS runs (each lane was cut into 12 bands) with three repeats.
The raw data were processed using Proteome Discoverer software (version 1.3; Thermo Scientific, Germany) with the search algorithm MASCOT (Matrix Science, London, UK), and the MS/MS spectra were searched against three customized databases: the BCG protein database, the N-terminal extension database, and the six reading frame translation database. Enzyme specificity was set to trypsin/P, and a maximum of two missed cleavages was allowed. Cysteine carbamidomethylation was used as a static modification, and methionine oxidation and N-terminal acetylation were used as dynamic modifications. The initial maximal allowed mass tolerance was set at 5 ppm for precursor masses and at 0.8 Da for fragment ion masses. The reverse database search option was enabled, and a maximum target-decoy-based false discovery rate of 1.0% for peptide and protein identification was allowed. At least two unique peptides were required for protein identification. All of the raw mass spectra files have been deposited in the publicly accessible database PeptideAtlas and are available with dataset identifier PASS00133. The complete set of peak list files (mgf file format) converted from the raw files can also be accessed freely with dataset identifier PASS00213.
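The two filters described above, a target-decoy false discovery rate limit and a minimum of two unique peptides per protein, can be illustrated with a minimal Python sketch. It is not the calculation performed inside MASCOT or Proteome Discoverer; the function names and the input layout (score and decoy flag per peptide-spectrum match, protein and peptide per accepted hit) are assumptions made for illustration, and the FDR is estimated simply as the ratio of decoy to target hits above a score cutoff.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose decoy/target ratio is within max_fdr.

    psms is an iterable of (score, is_decoy) pairs. Returns None when no
    cutoff satisfies the limit. Tied scores are not treated specially.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


def proteins_with_min_peptides(peptide_hits, min_unique=2):
    """Return sorted protein IDs supported by at least min_unique peptides.

    peptide_hits is an iterable of (protein_id, peptide_sequence) pairs;
    repeated observations of the same peptide count once.
    """
    by_protein = {}
    for protein, peptide in peptide_hits:
        by_protein.setdefault(protein, set()).add(peptide)
    return sorted(p for p, peps in by_protein.items() if len(peps) >= min_unique)
```

Requiring two distinct peptides per protein is a conservative choice: it discards single-peptide identifications even when the individual match is strong, in exchange for fewer false protein assignments.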
Bioinformatics Tools for the Prediction of Secreted Proteins-SignalP 4.0 and TatP 1.0 software were used for the prediction of classical amino-terminal secretion signal peptides and Tat-dependent signal peptides, respectively. Non-classically secreted proteins were predicted using SecretomeP 2.0. Protein transmembrane helices were predicted using TMHMM 2.0. All of this software is publicly available from the Centre for Biological Sequence Analysis at the Technical University of Denmark. The theoretical molecular mass and pI value were obtained from the Proteome Discoverer software calculation. Lipoproteins were predicted using a Hidden Markov Model method PRED-LIPO for Gram-positive bacteria. The subcellular localization of the identified proteins was predicted using the PSORTb v4.0 program. Gene prediction programs used for prokaryotes were FgeneSB and GeneMark. Functional classifications were determined according to the Pasteur Institute functional classification tree. Homologous proteins were searched using the Blastp program.
RT-PCR Validation-RT-PCR was performed to provide transcriptional level evidence for genes corresponding to novel proteins identified in this study using a previously described protocol (18). Briefly, the total RNAs extracted from BCG cells using the SV Total RNA Isolation System kit (Promega, Madison, WI) were treated with RQ1 RNase-free DNase to remove any contaminating genomic DNA, and this was followed by heat inactivation of the endonuclease. cDNA synthesis was performed from 1 µg of the total RNA using the SuperScript III Reverse Transcriptase (Invitrogen) according to the manufacturer's protocol. PCR was performed using 1 µl of the resulting cDNA as a starting material according to standard procedures. PCR reactions that were conducted with isolated RNAs as the templates were used as negative controls to indicate the elimination of genomic DNA contamination, and a reaction with human β-actin cDNA as a template was used as a positive control with an amplified product of 353 bp. The sizes of the amplified products were determined by an E-Gel Electrophoresis System using a 2% E-Gel pre-cast agarose gel and a 1 kb Plus DNA Ladder (Invitrogen). The gene-specific primers used in this study were designed using Primer Premier 5.0 software and are listed in supplemental Table S1.

In Silico Characterization of Classical Secreted Proteins and Lipoproteins in the BCG Genome-Proteins that are secreted through the general secretory (Sec) pathway normally have classical amino-terminal secretion signal peptide sequences (7). In order to generate a conceptual list of classically secreted proteins, we screened the BCG database for proteins that possess classical signal peptides using the software SignalP 4.0. The current SignalP 4.0 version offers two types of networks: the SignalP-TM and SignalP-noTM methods. To obtain a more accurate prediction, we used SignalP-TM to predict those proteins that might include transmembrane (TM) regions and SignalP-noTM to predict those without TM regions. Accordingly, we first predicted proteins with TM regions using the program TMHMM 2.0 and obtained 634 proteins with one or multiple TM regions. These proteins were analyzed with the SignalP-TM predictor, and 39 of them were predicted to contain signal peptides. Those without TM regions were analyzed with SignalP-noTM, and 204 were predicted to possess signal peptides. In total, 243 proteins in the BCG protein database were predicted to contain amino-terminal signal peptide sequences (supplemental Table S2); these proteins are considered to be classical secreted proteins.
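The routing step described above, in which each protein is sent to exactly one SignalP network depending on the TMHMM result, can be sketched as follows. The three callables are hypothetical stand-ins for parsed TMHMM 2.0 and SignalP 4.0 output; the sketch does not call those programs and only shows how the partial counts (39 and 204 in this study) add up to the total of 243.

```python
def count_signal_peptides(proteins, has_tm, signalp_tm, signalp_notm):
    """Count signal peptide predictions after routing proteins by TM status.

    has_tm, signalp_tm, and signalp_notm are callables returning True or
    False for a protein ID (placeholders for parsed predictor output).
    """
    with_tm = [p for p in proteins if has_tm(p)]
    without_tm = [p for p in proteins if not has_tm(p)]
    hits_tm = [p for p in with_tm if signalp_tm(p)]
    hits_notm = [p for p in without_tm if signalp_notm(p)]
    return {
        "tm_proteins": len(with_tm),
        "sp_from_tm_network": len(hits_tm),
        "sp_from_notm_network": len(hits_notm),
        "sp_total": len(hits_tm) + len(hits_notm),
    }
```

Since every protein is evaluated by only one of the two networks, the two hit lists cannot overlap and the total is a plain sum.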
Lipoproteins are a functionally diverse class of secreted bacterial proteins that account for 1% to 3% of the proteins encoded in bacterial genomes (19). The signal peptides of these proteins direct their export and post-translational lipid modification (20). For Gram-positive bacteria, lipoproteins are usually predicted using the program PRED-LIPO, which is based on a hidden Markov model and outperforms the well-known LipoP method (21). Although this prediction method has a high specificity and yields very few false positives, it can miss true lipoproteins; we therefore also manually examined the proteins not predicted by PRED-LIPO with a Blastp analysis using orthologous lipoproteins from other species. We obtained 66 lipoproteins via the PRED-LIPO prediction and an additional 40 potential lipoproteins via manual Blastp analysis. In total, 106 potential lipoproteins were predicted in the BCG database (supplemental Table S3A).
Analysis of the CFPs Identified Using SDS-PAGE and High-resolution Fourier Transform Mass Spectrometry-To achieve the best identification of the BCG extracellularly secreted proteins under an in vitro culture condition, the strain was cultivated in liquid Sauton medium to limit contamination from medium-derived protein. Cells were harvested at the mid-exponential phase, when bacterial lysis is minimal, although not absent. Electrophoresis analysis showed that most CFPs had molecular weights ranging from 10 to 60 kDa. Several intensely Coomassie-stained bands were observed: one very intense band was present at approximately 50 kDa, and three others were present at approximately 60, 35, and 25 kDa, respectively. After a search with Proteome Discoverer software, the protein identifications were filtered with an IonScore of no less than 40 and a cumulative false discovery rate of less than 1% at the peptide level. Furthermore, we set the criterion that each protein detected was required to match at least two unique peptide sequences. In total, we obtained 1555 unique peptide sequences, representing 239 proteins (supplemental Table S4). Among these proteins, 128 (approximately 54%) were presumed to be secreted proteins with classical amino-terminal secretion signal peptides using the program SignalP, which indicated that they were targeted for secretion via the Sec pathway (Fig. 1). Additionally, 13 proteins were recognized by the TatP 1.0 algorithm as harboring Tat signal peptides (Fig. 1). The consensus sequence recognized by this algorithm is RR.[FGAVML][LITMVF]. It contains two invariant arginines in the first two positions and any amino acid in the third position (indicated by the dot), in addition to the variable amino acids indicated in the brackets. Both the Sec and Tat signal peptides are composed of three distinct regions: the N-, H-, and C-regions, which are cleaved by SPase I (12). Interestingly, BCG_2087c (BlaC) contained a complete Tat motif in its signal peptide sequence but was not recognized via the SignalP-noTM method. Another secreted antigen, FbpA, also contained a Tat motif with a cleavage site most likely between positions 43 and 44, but it was not recognized via either the SignalP-TM or the SignalP-noTM method. Notably, subcellular localization prediction using PSORTb v4.0 showed that these two proteins were localized to the extracellular compartment. We deduced that the proteins were actually secreted antigens that were missed by the SignalP prediction. Moreover, according to the program PRED-LIPO and manual Blastp analysis, 73 lipoproteins were unambiguously identified in this study (Fig. 1, supplemental Table S3B). Interestingly, 55 of these lipoproteins were also considered to be secreted via the Sec pathway because of their classical amino-terminal signal peptides.
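The Tat consensus sequence quoted above translates directly into a regular expression. The short Python sketch below scans a protein sequence for that motif only; it does not reproduce the TatP 1.0 prediction, and the function name and the example sequences are hypothetical (they are not BlaC or FbpA sequences).

```python
import re

# Two invariant arginines, any residue, then one residue from each bracketed set.
# The lookahead makes overlapping occurrences visible as well.
TAT_MOTIF = re.compile(r"(?=(RR.[FGAVML][LITMVF]))")


def find_tat_motifs(sequence):
    """Return (zero-based start, matched residues) for every Tat motif hit."""
    return [(m.start(), m.group(1)) for m in TAT_MOTIF.finditer(sequence.upper())]


# Made-up N-terminal fragment used purely as an illustration.
print(find_tat_motifs("MTRRSFLAGAAALA"))
```

A motif hit on its own is only suggestive: the position of the motif within the signal peptide and the presence of a plausible cleavage site still have to be assessed, as was done for the two proteins discussed above.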
Secreted proteins without signal peptides are known as leaderless secreted proteins and constitute a significant fraction of the secretome. It was reported that these proteins appear to have cytoplasmic functional roles as well as extracellular roles (22). As an alternative strategy, the non-classical secreted proteins could be identified via the SecretomeP method, which identifies proteins based on their specific biological and chemical properties or characteristics regardless of whether the protein carries a cleaved N-terminal signal peptide (23). This method has been trained on secreted proteins that were experimentally identified but not predicted by other algorithms and might complement the highly popular method for scanning classical secreted proteins, SignalP (24). Using SecretomeP, 103 proteins were predicted to be leaderless secreted proteins. Interestingly, 58 of them were also regarded as secreted proteins with classical signal peptides using the program SignalP. Therefore, excluding classical secreted proteins, 45 proteins were determined with high confidence to be leaderless secreted proteins lacking classical secretion signal peptides, and they were probably produced as a result of Sec-independent secretion mechanisms (Fig. 1).
In total, 185 proteins were predicted to be secreted by at least one of the four programs employed (Fig. 1). On average, more than six peptides were used to identify each CFP, and the amino acid sequence coverage was approximately 35.3%. For secreted proteins, five or six peptides were used to identify each one, and the amino acid sequence coverage averaged 31.2%.
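The way the four prediction sets combine can be expressed with simple set operations. The sketch below is illustrative only: the function name is hypothetical and the accessions are toy placeholders rather than proteins from this study. It shows that the secreted set is the union of all four predictions, whereas the leaderless set is what remains of the SecretomeP predictions after the SignalP-positive proteins are removed (103 - 58 = 45 in the counts reported above).

```python
def summarize_secretion(signalp, tatp, lipoproteins, secretomep):
    """Combine four sets of protein IDs into secreted and leaderless sets."""
    secreted = signalp | tatp | lipoproteins | secretomep
    leaderless = secretomep - signalp
    return {"secreted": secreted, "leaderless": leaderless}


# Toy accessions (placeholders, not real BCG identifiers).
summary = summarize_secretion(
    signalp={"p1", "p2", "p3"},
    tatp={"p3", "p4"},
    lipoproteins={"p2", "p5"},
    secretomep={"p1", "p6"},
)
print(len(summary["secreted"]), sorted(summary["leaderless"]))
```

Because the four predictions overlap, the size of the union (185 here) is smaller than the sum of the individual counts.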
Isoelectric Point and Molecular Weight Distributions of the CFPs-In this study, the identified proteins covered wide ranges of pI values and molecular weights. The pI values ranged from 3.91 (PE-PGRS family protein PE_PGRS43b, BCG_2509c) to 12.19 (50S ribosomal protein L32 RpmF, BCG_1034); a detailed pI distribution is displayed in Fig. 2A. The majority of the proteins clustered between pI 4 and 7, which is in agreement with previous studies performed on CFPs (12). Most of the secreted proteins had pI values between 5 and 7. Interestingly, all the CFPs identified between pI 8 and 9 were secreted proteins.
The lowest molecular weight among the proteins was 9.41 kDa (50S ribosomal protein L32 RpmF, BCG_1034), and the PPE family protein PPE6 (BCG_0345c), with a molecular weight of 194.08 kDa, represented the largest secreted protein. The distribution of the molecular weights of CFPs is depicted in Fig. 2B, and the majority were found in the range between 10 and 60 kDa, which represented approximately 90% (215 out of 239) of all of the identifications. The molecular weight distribution of the secreted proteins was similar to that of the CFPs overall. Only one protein (bifunctional penicillin-binding protein 1A/1B PonA1, BCG_0081) fell in the range from 70 kDa to 80 kDa. Interestingly, the 50S ribosomal protein L32 RpmF had the lowest molecular weight, whereas its pI value was the highest. For the 185 secreted proteins, the average molecular weight was 34.5 kDa, and the average pI value was 6.5. For the nonsecreted proteins, the average molecular weight and pI value were 32.7 kDa and 5.8, respectively. For the 73 lipoproteins, the average molecular weight was 32.9 kDa, and the average theoretical pI value was 6.2. There were no apparent differences in the molecular weight and pI value distributions between the secreted and nonsecreted proteins.
Subcellular Localization of the CFPs-All of the CFPs identified in this study were subjected to the PSORTb v4.0 program in order to predict their subcellular localizations (Fig. 2C). The results showed that 48 proteins localized to the cytoplasmic membrane; 46 of them were secreted proteins, and 20 were lipoproteins. Because bacterial lipoproteins are a functionally diverse class of membrane-anchored or -associated proteins, it was not surprising that so many lipoproteins were identified in the cytoplasmic membrane compartment. Interestingly, several proteins were identified with multiple transmembrane helices; for example, BCG_3669c and BCG_0326 were predicted to have four and three transmembrane helices, respectively. Additionally, 72 and 26 CFPs were predicted to localize to cytoplasmic and extracellular compartments, respectively. Approximately half of the cytoplasmic proteins were predicted to be secreted proteins. All of the extracellular CFPs contained classical signal peptides except for two non-classical secretory proteins. Intriguingly, two of them (BlaC and FbpA) also contained complete Tat motifs in their signal peptide sequences. Six of the extracellular proteins were also lipoproteins. In addition, another two lipoproteins were found in the cell wall. It has been reported that some lipoproteins could be alternatively processed by signal peptidase I or II, and this mechanism could explain their localization in the extracellular environment or in the cell wall (12). No subcellular localization could be predicted for 76 proteins. The current version of the PSORTb v4.0 program has known limitations (for example, it cannot detect the lipoprotein motifs of some lipoproteins, and it cannot handle proteins that are located at multiple sites), so its predictions should be interpreted with caution. In fact, the localization information for 41 of the lipoproteins is unknown. Further study to determine the localization of these proteins should be pursued.
Comparisons with Other Studies on the Mycobacterial Secretome-To investigate the secreted proteins in CFs, a number of mycobacterial secretome studies have been undertaken (8, 11-14, 25-34). Table I summarizes the major studies on the mycobacterial secretome performed to date. These studies focused on CFs of different mycobacterial substrains, including BCG variants. Combining all of the data from the studies published to date, there are 397 proteins that have been reported in different mycobacterial CFs. Among them, 148 proteins were considered to be secreted proteins, including classical and leaderless secreted proteins or lipoproteins. We compared the proteins identified in our study with those identified in previous studies and found that 82 of the secreted proteins were previously reported. Therefore, there are 103 secreted proteins that have not been reported previously and are unique to our study. It should be noted that Malen et al. obtained a total of 257 proteins using a combination of two-dimensional gel electrophoresis MALDI-TOF-MS and LC-MS/MS in a single study, but only 144 were identified by at least two peptides (12). They reported that 159 of them had predicted N-terminal signal peptides. However, when requiring at least two unique peptides per protein, only dozens of secreted proteins were identified in their study. In our study, all identifications were filtered with high confidence and required at least two unique peptides per protein. We presume that the higher identification rate here is most likely a result of the use of accurate high-resolution Fourier transform MS settings at both the MS and MS/MS levels, whereas many of the earlier studies used MS at lower resolutions and accuracies. 
Furthermore, CFPs were pre-fractionated using a one-dimensional gel electrophoresis method followed by in-gel trypsin digestion, which decreases the complexity of the secreted proteins and has no bias against low-abundance secreted proteins.
Translational Start Site Assignments-In genomic annotation, the majority of TSSs are assigned by using bioinformatic methods or homology-based comparative genomic approaches; an accurate TSS is critical for the analysis of both the protein function and the transcriptional regulation (35). Because most TSSs are conceptually translated from predicted transcripts and no straightforward experimental methodologies can easily determine a TSS, it is difficult to correctly assign a TSS in a given gene. The TSSs predicted by different bioinformatic methods often differ significantly (36). For example, although the M. tuberculosis H37Rv genome sequence has been available for more than ten years, approximately 50% of the gene annotations in the most used datasets from two independent institutions (the Sanger Institute and the Institute for Genomic Research, TIGR) have different TSSs (37). The N-terminal sequencing of proteins has been helpful in verifying the predicted TSSs to a certain extent, but this method is usually not applicable if the N-termini of proteins are blocked by modifications (38). Additionally, it is not a high-throughput method for a large quantity of proteins, and it is also time consuming and costly. Here, we utilized an MS-based proteomic strategy for assigning TSSs that was more universal and higher in throughput than N-terminal sequencing. In this method, protein N-terminal peptides can be recognized by the non-tryptic nature of the N-terminus of the peptide. Such semi-tryptic peptides (i.e. N-terminal peptides with an initiator methionine residue or an initiator methionine cleaved) were detected by searching the protein database (18). We used these criteria to assign correct TSSs of CFPs and confirmed 52 existing annotations with predicted TSSs based on N-terminal peptides. Among them, 15 proteins were confirmed with N-terminal peptides with initiator methionine residues, and 42 were confirmed with the initiator methionine cleaved. 
Interestingly, five of them were confirmed with both the initiator methionine residues and the initiator methionine cleaved (supplemental Table S5).
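The N-terminal assignment rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function names `classify_n_terminal_peptide` and `count_confirmed` and the toy sequence are hypothetical, and the sketch assumes that each peptide is located by exact substring matching against the annotated protein sequence.

```python
from typing import Set


def classify_n_terminal_peptide(protein: str, peptide: str) -> str:
    """Classify a peptide relative to the annotated start of `protein`.

    "met_retained": the peptide starts at residue 1 with the initiator Met.
    "met_cleaved":  the peptide starts at residue 2, right after the initiator Met.
    "other":        anything else (internal peptide or no match).
    """
    start = protein.find(peptide)  # index of first exact match, -1 if absent
    if start == 0 and peptide.startswith("M"):
        return "met_retained"
    if start == 1 and protein.startswith("M"):
        return "met_cleaved"
    return "other"


def count_confirmed(met_retained: Set, met_cleaved: Set) -> int:
    """Number of proteins confirmed by either form (set union, no double counting)."""
    return len(met_retained | met_cleaved)


# Toy example with a hypothetical annotated protein sequence.
toy_protein = "MSTAKLVR"
for pep in ("MSTAK", "STAK", "LVR"):
    print(pep, classify_n_terminal_peptide(toy_protein, pep))
```

Counting by set union reproduces the totals reported in the text: 15 proteins with the methionine retained plus 42 with it cleaved, minus the 5 seen in both forms, gives 52 confirmed annotations.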
It is of interest that some wrongly assigned TSSs can be corrected. Here, all of the MS-derived peptides were screened against the customized N-terminal extension database, and a minimal IonScore of 40 for an individual peptide was required. As a result, 98 N-terminal peptides were mapped to the database. After manual validation, 33 unique peptides that mapped upstream of the currently annotated TSSs of their corresponding proteins were obtained. These peptide hits indicated that the 5′ ends of the corresponding genes should be extended. In total, the TSS extensions of 22 proteins were validated, of which 7 were supported by at least two unique peptides (Table II). Fig. 3 depicts an example of a gene model that has an extension of the N-terminus. Four unique peptides mapped upstream of the original gene product BCG_1741c (Fig. 3A), a catechol-O-methyltransferase. Additionally, 13 unique peptides mapped to BCG_1741c when we searched against the BCG protein database. The extended gene sequence was searched using the gene prediction programs FgeneSB and GeneMark, and the result indicated an alternative gene model. Moreover, a Blastp search against the non-redundant protein database showed that the N-terminal-extended protein, and not the original protein, shared a higher similarity with its homolog in M. tuberculosis H37Rv. Therefore, according to our proteomic results, the length of BCG_1741c should be extended to 249 aa instead of 196 aa (Fig. 3B). Furthermore, based on N-terminal peptides that had an initiator methionine, the accurate location of the extended TSSs of four CFPs could be confirmed. One such example is illustrated in Fig. 3C, where a peptide with a non-tryptic N-terminus, A.MPATSVANNSGSMVALATIEACPALPSR.L, mapped upstream of the original TSS of the annotated protein BCG_1317. Another seven unique peptides also mapped to this protein when we searched against the BCG protein database. The extended gene sequence was also supported by FgeneSB and GeneMark predictions and Blastp searching.
It should be noted that although the TSS extensions of 15 proteins were each validated on the basis of only one extended peptide, most of these proteins were identified with several unique peptides when searched against the BCG protein database (supplemental Table S4). Consequently, the MS-based proteomic method used here could confirm proteins with predicted TSSs, identify proteins with different TSSs, and validate their extended TSSs, which could be used as evidence for extending the original length of the gene models. This strategy represents an effective and promising means for the experimental identification of TSSs that could be applied to other fractions of the BCG proteome, such as cytoplasmic and membrane proteins. We presume that the different TSSs identified here most likely result from imperfect annotation of the BCG genome by current bioinformatic methods. The results presented here suggest that some predicted ORF lengths in the genomic annotation probably require re-characterization.
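As a rough illustration of the filtering logic in this subsection, the sketch below keeps upstream peptide hits that pass the IonScore cutoff and groups them by protein. The `Hit` record, the `upstream` flag, the locus names, and the toy peptides are hypothetical; only the threshold of 40 and the flanked peptide notation (previous residue, peptide, next residue, separated by dots) are taken from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set

MIN_ION_SCORE = 40  # minimal IonScore per peptide, as stated in the text


class Hit(NamedTuple):
    protein: str      # hypothetical locus name
    flanked: str      # peptide in "prev.PEPTIDE.next" notation
    ion_score: float
    upstream: bool    # True if the peptide maps upstream of the annotated TSS


def has_nontryptic_n_terminus(flanked: str) -> bool:
    """True if the residue preceding the peptide is not K or R.

    Trypsin cleaves C-terminal to K/R, so any other preceding residue means
    the peptide N-terminus was not generated by trypsin.
    """
    prev_residue, _peptide, _next_residue = flanked.split(".")
    return prev_residue not in ("K", "R")


def extension_support(hits: Iterable[Hit]) -> Dict[str, Set[str]]:
    """Group unique upstream peptides that pass the score cutoff by protein."""
    support = defaultdict(set)
    for hit in hits:
        if hit.upstream and hit.ion_score >= MIN_ION_SCORE:
            support[hit.protein].add(hit.flanked.split(".")[1])
    return dict(support)


# Toy hits (hypothetical sequences and scores).
hits = [
    Hit("locus_A", "R.AAGTLVK.D", 62.0, True),
    Hit("locus_A", "K.DLLSER.G", 45.5, True),
    Hit("locus_A", "K.DLLSER.G", 51.0, True),  # same peptide, counted once
    Hit("locus_A", "R.GGSPK.T", 30.0, True),   # below the IonScore cutoff
    Hit("locus_B", "A.MPSTVR.L", 48.0, True),
    Hit("locus_B", "R.LQEAK.N", 70.0, False),  # inside the annotated protein
]
for protein, peptides in extension_support(hits).items():
    print(protein, len(peptides), sorted(peptides))
```

For the BCG_1741c example, extending the protein from 196 aa to 249 aa corresponds to 249 - 196 = 53 additional N-terminal residues.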

Discovery of Novel Protein Coding Genes-It is most intriguing that some novel peptides or proteins could be identified in the CF. However, the incompleteness of the current protein databases acts as a limiting factor when seeking novelty with MS/MS data. Here, we constructed an in-house database of BCG that included all possible "gene encoding products" (17). This database would, therefore, contain all possible ORFs, both those previously predicted and those that were not predicted. Peptides identified via MS were considered to be existing gene products from the genome (39). For example, Jungblut et al. identified six proteins that were not predicted by the genome annotation of M. tuberculosis using a two-dimensional gel electrophoresis-MS approach (40). We used this proteomic strategy to provide an independent and complementary means of novel constituent identification that is based on experimental evidence rather than theoretical prediction from genomic sequences.
In the present study, after we excluded peptides that mapped to currently annotated proteins (from the BCG database), the results from the customized six-reading-frame database search were used to provide a list of novel unique peptides. In total, 61 peptides mapped to regions of "proteins" in the six-frame database for which no entries were present in the annotated BCG database. To improve the confidence of novel "proteins," we required at least two unique peptides with a minimal IonScore of 40 per ORF. After manual filtering and validation, we could predict the presence of 17 novel protein coding genes. Table III lists the 17 novel constituents along with 37 supporting unique peptides and genomic coordinates. Using Blastp searches against the non-redundant protein database, we checked the conservation of these ORFs across related organisms. Among these novel proteins, eight have orthologs in other mycobacteria, and four have orthologs in other organisms. Significantly, the other five were completely novel constituents that had no homology with proteins from any organism. Interestingly, a bioinformatics analysis indicated that two of the novel proteins (BCGRF042474 and BCGRF059986) were potential classical secreted proteins, and five (BCGRF002933, BCGRF005243, BCGRF016070, BCGRF047639, and BCGRF051382) were leaderless secreted proteins. Fig. 4 depicts the identification of a novel ORF-encoding "protein," BCGRF059986, in the BCG genome along with the corresponding unique peptides. In detail, two peptides, R.GDLASGTLLVTGVSPRPDAGGQQYVTIAGIITGPTVNEYAVYQR.M and R.MAVDVDQWPTVGQILPVVYSPK.N, mapped to the novel "protein." The ORF sequence was supported by FgeneSB and GeneMark predictions. Furthermore, a Blastp search against the non-redundant protein database showed that the "protein" shared a high homology with the hypothetical protein MRA_3169 in M. tuberculosis H37Ra (Fig. 4A). The length of the novel "protein," BCGRF059986, should be 106 aa (Fig. 4B).
Furthermore, as an extra validation step, we designed primers for an RT-PCR experiment to verify transcription of the mRNA of the novel gene; the result suggested that the novel ORF inferred by our method was reliable (Fig. 4C). We also confirmed the transcription of the remaining 16 novel discoveries using RT-PCR (Fig. 4D). PCR fragments of the expected sizes were observed, indicating that the novel genes were transcribed. Therefore, our proteomic results confirm that these are genuine gene models that were missed in the genome annotation.
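The six-frame search strategy can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes stop-to-stop ORFs, the standard codon assignments (which the bacterial genetic code shares with the standard code apart from alternative start codons), and exact substring matching of peptides. The function names, the `min_len` parameter, and the toy sequences are hypothetical; the IonScore cutoff of 40, the exclusion of peptides that map to annotated proteins, and the two-unique-peptide requirement come from the text.

```python
from itertools import product
from typing import Dict, Iterable, List, Set

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codon -> amino acid, built in the conventional TCAG ordering.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; unknown codons give 'X', stops give '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str, min_len: int = 30) -> List[str]:
    """Return stop-to-stop translations of at least `min_len` residues."""
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append(segment)
    return orfs


def supported_novel_orfs(
    orfs: Iterable[str],
    peptide_scores: Dict[str, float],
    annotated: Iterable[str],
    min_score: float = 40,
    min_peptides: int = 2,
) -> Dict[str, Set[str]]:
    """Keep ORFs matched by enough confident peptides absent from `annotated`."""
    annotated = list(annotated)
    confident = {
        pep for pep, score in peptide_scores.items()
        if score >= min_score and not any(pep in prot for prot in annotated)
    }
    result = {}
    for orf in orfs:
        hits = {pep for pep in confident if pep in orf}
        if len(hits) >= min_peptides:
            result[orf] = hits
    return result


# Toy usage with a hypothetical 12-nt sequence.
print(six_frame_orfs("ATGGCCAAATAA", min_len=3))
```

Splitting each frame at stop codons, rather than requiring a start codon, keeps candidate regions upstream of any predicted start; this is a design choice of the sketch and may differ from how the in-house database was built.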
The BCG genome sequence has been available for more than 5 years and has been re-characterized previously (41). It was surprising that many novel constituents were detected in CF, especially because most of them were already annotated in other mycobacteria but were missing from the primary genome sequence of BCG. Interestingly, the lengths of six novel proteins with confirmed TSSs were relatively short (an average length of 98 aa). It is likely that they were missed in the genome annotation of the reference strain because of their small size. Based on the results of our study, we suggest that using MS-based proteomic data to identify novel proteins in CF might prove to be an essential complement to computational methods for annotating genomes, especially newly sequenced genomes.
Functional Distribution and Analysis of the CFPs
Functional Distribution of the CFPs-The annotated proteins in the BCG database have been classified into 12 distinct functional categories. The 239 proteins identified were distributed across nine of these categories (Fig. 2D). Most of them were involved in the cell wall and cell processes (functional category 3, 43.5%) and intermediary metabolism and respiration (functional category 7, 17.2%). Relatively few of them were involved in the PE/PPE family (functional category 6, 4.2%) and lipid metabolism (functional category 1, 3.3%). Only two transcriptional regulatory proteins (BCG_3091 and BCG_0702c) were detected in the CFPs. Interestingly, almost all identifications classified in the cell wall and cell processes functional category (102 out of 104) were secreted proteins (Fig. 2D).
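The reported proportions can be checked with simple arithmetic. The snippet below recomputes them from counts stated in the article (239 identified proteins; 104 in the cell wall and cell processes category, of which 102 were secreted; 10 PE/PPE family proteins; 185 secreted proteins or lipoproteins). The helper name `percent` is hypothetical.

```python
def percent(part: int, whole: int, digits: int = 1) -> float:
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


TOTAL_PROTEINS = 239

print(percent(104, TOTAL_PROTEINS))  # cell wall and cell processes -> 43.5
print(percent(10, TOTAL_PROTEINS))   # PE/PPE family -> 4.2
print(percent(102, 104))             # secreted share within category 3 -> 98.1
print(percent(185, TOTAL_PROTEINS))  # secreted proteins or lipoproteins -> 77.4
```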
In protein database annotation, proteins for which there are no proteomic data are annotated as "hypothetical" or "conserved hypothetical" (if there is supporting evidence of homology in other species) (42). Although these proteins are conserved across related organisms, they are uncharacterized because of dubious functionality based on homology searching (43). The detection of hypothetical proteins with proteomic data allows us to remove the "hypothetical" tag that is associated with many current annotations in databases; that is, we can confirm the existence of hypothetical proteins by using a proteomic approach. Here, we identified 44 proteins that were annotated as hypotheticals (functional categories 10 and 16). Interestingly, 29 of them were predicted as secreted proteins. Additionally, five were conserved hypotheticals with orthologs in Mycobacterium. The identification of hypothetical proteins through the use of proteomic data showed their existence in CF, and their functions are worth studying further.
FIG. 4. Identification of novel gene models based on peptide mapping to the genomic region. A, two unique peptides (red lines) with minimal IonScores of 40 mapped to the genomic region corresponding to a novel protein, BCGRF059986. The presence of this novel gene model was also supported by the FgeneSB and GeneMark programs (yellow box). This novel protein was found to be similar to a hypothetical protein MT3222 in M. tuberculosis H37Ra (purple box). B, protein sequence of a novel gene product. The identified region is in red. C, validation of the novel gene model BCGRF059986 via an RT-PCR approach. The amplified RT-PCR product confirmed the expression of new mRNAs for the novel gene. The size of the product was determined by means of an E-Gel Electrophoresis System using a 2% E-Gel pre-cast agarose gel. DNA Ladder, 1 kb Plus DNA marker (Invitrogen). For BCGRF059986, PCR reaction was performed using the novel gene cDNA as a template. For the negative control, PCR reaction was performed with RNAs as the template. No product displayed in this lane indicated that the RNAs were free of any contaminating genomic DNA. For β-actin cDNA, a positive control, PCR reaction was performed using human β-actin cDNA as a template, and the amplified 353-bp product was visualized. D, the transcription of the remaining 16 novel gene models was also confirmed via RT-PCR. PCR fragments of the expected sizes were observed, indicating that the novel genes were transcribed. PCR reactions performed with RNAs as the templates were used as negative controls.
Major Components from the Culture Supernatant-Proteins released from growing mycobacteria into the extracellular medium are usually believed to be responsible for the high efficacy of BCG, and recognition of these molecules could lead to early immunological detection of the infected macrophages and control of TB (44). The crude CFPs, therefore, have been extensively characterized and are considered to be an attractive source of candidate antigens for a new vaccine and diagnostic reagents. In this study, four low-molecular-weight antigens were detected: CFP2, CFP6, CFP10A, and CFP17. These secreted antigens are thought to play important roles in the development of protective immune responses. CFP2 corresponds to MTB12 in M. tuberculosis, which was reported to constitute a major component of CF and have potential value as a subunit vaccine component to protect against infection by M. tuberculosis (45). CFP6 elicited high proliferative responses in healthy contacts and patients recovering from TB and also induced the release of a significantly high amount of IFN-γ (46). Additionally, the 9.5-kDa antigen CFP10A had been the focus of a TB vaccine because of its capability to induce strong cellular immune responses in the host (47). CFP17 can induce both a high IFN-γ release and a strong delayed-type hypersensitivity response (44). These proteins might have a promising future for the prevention and diagnosis of TB.
The antigen 85 (Ag85) complex, which comprises three proteins, Ag85A, Ag85B, and Ag85C, represents a promising candidate as a novel drug target and pathogenesis factor in mycobacteria (48). In this study, we detected all three of the secreted antigens and one related protein, FbpD, in the CF. The Ag85 antigens, which are ~35 kDa and have a pI of 6.5 each, participate in cell wall biosynthesis and interact with the host macrophage as fibronectin-binding proteins (48). Furthermore, they are also involved in the response to isoniazid treatment. FbpD, also known as the secreted MPT51/MPB51 antigen, can induce a high level of antigen-specific CD8+ T-cell response (49). It is interesting that this immunogenic protein was previously reported in the CF and also within the cell (50).
MPB53, MPB70, and MPB83 are among the most studied mycobacterial antigens. DNA sequence analysis shows that the gene mpb53 is localized close to mpb70 and mpb83 (51). MPB53, an 18-kDa protein detected in the CF of tuberculosis mycobacteria (including clinical isolates) but not nontuberculous mycobacteria, can induce strong, tuberculosis-specific antibody responses and could be a major protective antigen (52). MPB70 and MPB83, encoded as precursor proteins for export through the Sec pathway, can elicit strong T-cell responses and have been extensively explored for the sensitive and specific diagnosis of TB (53). Additionally, MPT63, a 16-kDa immune-protective extracellular protein, could be designated as MPB63 in BCG, which would be similar to other major secretory proteins from BCG, such as MPB53, MPB70, and MPB83. Because MPT63 is a mycobacteria-specific antigen (a Blastp search showed that MPT63 has homologs only in mycobacteria-related species) and is implicated in the virulence of mycobacteria, it has been considered as an attractive drug target and diagnostic reagent against TB (54).
Furthermore, three conserved secreted proteins (TB18.6, TB22.2, and TB39.8) were also identified in this study. Although their exact functions remain to be elucidated, they appear to be major T-cell antigens during infection with pathogenic mycobacteria (55).
Identification of Lipoproteins-Lipoproteins are synthesized as precursors in the cytoplasm and are then translocated across the cytoplasmic membrane by either the Sec or the Tat translocation system (19). In this study, we identified 73 lipoproteins in the CF, of which 55 contained classical secretion signal peptides. The majority were involved in the cell wall and cell processes category (functional category 3). Some lipoproteins are potent agonists of Toll-like receptor 2, which can initiate responses by antigen-presenting cells that influence both innate and adaptive immunity (56). For example, the Toll-like receptor 2 agonists LpqH and LprG participate in the regulation of adaptive immunity by inducing cytokine secretion in innate immune cells or regulating the activation of memory T lymphocytes (57). LprA is a cell-wall-associated lipoprotein that also can induce cytokine responses and regulate APC function (58). Three phosphate-binding transporter lipoproteins (PstS1, PstS2, and PstS3) that are members of a family of periplasmic proteins that act as high-affinity receptors for active transport systems in mycobacteria were unambiguously identified in CF. They play roles in the regulation of mycobacterial growth or metabolism and could be valuable candidates for rapid and specific diagnoses (59). In addition, MPB83, an important antigen described above, is a glycosylated lipoprotein processed by signal peptidase II (60). Interestingly, DsbF, which is a disulfide-bond-forming protein, can ensure the correct folding and disulfide bond formation of secreted proteins (61). ProX from the ATP-binding cassette transport system can bind the compatible solutes glycine betaine and proline betaine with high affinity and specificity, thereby serving as protein stabilizers (62). It was reported that SubI, a sulfate-binding lipoprotein of the ATP-binding cassette transport system, is involved in sulfur metabolism for mycobacterial growth (63).
PE and PPE Family Proteins-The PE and PPE family proteins are exemplified by the presence of Pro-Glu (PE) and Pro-Pro-Glu (PPE) motifs near the conserved N-terminus regions (64). In this study, ten PE/PPE family proteins were identified in CF. Interestingly, PE_PGRS19 was predicted to have a classical secretion signal peptide, and another seven PE/PPE proteins were predicted to be non-classically secreted proteins. It has been shown that extensive amounts of PE and PPE proteins are secreted through the ESX-5 system, which plays crucial roles in mycobacterial virulence (65). For example, as an important ESX-5 substrate, PE_PGRS30 is involved in phagosomal maturation arrest and replication in macrophages (66). Although currently these proteins are the subject of very few biochemical and structure-function investigations, they have been implicated in mycobacterial antigenic variation, which can induce strong immune responses in the host and have roles in mycobacterial virulence and pathogenesis (67).
Mammalian Cell Entry Family Proteins-Mammalian cell entry (Mce) family proteins are crucial for the virulence of mycobacteria and represent components of transport systems that interact with host cells (68). Structural analysis has indicated that some Mce proteins are similar to colicins or β-barrel porins, which form channels through lipid bilayers (69). In this study, six Mce family proteins were detected in the CF; all contained N-terminal signal or anchor sequences. Although the exact functions of Mce proteins have not been fully determined, it has been demonstrated that some proteins exert their functions by promoting a change in the plasma membrane of cells and allowing an invasion of pathogens into cells (70). For example, Mce1C, mapped by eight unique peptides, was thought to be involved in host cell invasion during the initial phase of mycobacteria infection, functioning in a way similar to that of Mce1A (71). Additionally, Mce4F, which was predicted to be a steroid transporter, was proposed to have roles involving cholesterol and its metabolism in the pathogenesis of M. tuberculosis (72). Most interestingly, mce family genes are absent from the human genome. Therefore, Mce proteins might represent ideal candidate drug targets for better TB therapeutics (68).
Moreover, a number of other non-classical secreted proteins were also detected in the CF, including GroES, GlnA1, and the ribosomal proteins RpsL, RplT, and RpsU. Some of these have essential functions in mycobacteria. For example, GroES is necessary for the correct folding of a variety of proteins. Certain ribosomal proteins can serve as potent immunogens and have been applied in the delayed-type hypersensitivity skin response (73).
CONCLUSIONS
In the present study, we obtained the BCG CFP repertoire, and a total of 239 proteins were identified with high confidence through the use of one-dimensional gel electrophoresis and high-resolution tandem mass spectrometry analysis. Of these, 185 were considered to be secreted proteins or lipoproteins, which suggests that the CF was especially enriched in secreted proteins. The 103 secreted proteins that have not been reported previously provide further insight into the BCG secretion proteome and might be involved in the immunity of mycobacteria. Furthermore, we identified 17 novel protein products that were not annotated in the BCG database and validated their existence at the transcriptional level via RT-PCR. Additionally, the TSS extensions of 22 proteins were validated based on N-terminal peptides. These data represent the largest number of mycobacterial secreted proteins reported in a single study, and some of these proteins might be potential candidates for vaccination and therapeutics.