A Pneumococcal Protein Array as a Platform to Discover Serodiagnostic Antigens Against Infection

Pneumonia is one of the most common and severe diseases associated with Streptococcus pneumoniae infections in children and adults. Etiological diagnosis of pneumococcal pneumonia in children is generally challenging because of limitations of diagnostic tests and interference from nasopharyngeal colonizing strains. Serological assays have recently gained interest as a way to overcome some problems found with current diagnostic tests in pediatric pneumococcal pneumonia. To provide insight into this field, we have developed a protein array to screen the antibody response to many antigens simultaneously. Proteins were selected by experimental identification from a collection of 24 highly prevalent pediatric clinical isolates in Spain, using a proteomics approach consisting of “shaving” the cell surface with proteases and further LC/MS/MS analysis. Ninety-five proteins were recombinantly produced and printed on an array. We probed it with a collection of sera from children with pneumococcal pneumonia. From the set of the most seroprevalent antigens, we obtained a clear discriminant response for a group of three proteins (PblB, PulA, and PrtA) in children under 4 years old. We validated the results by ELISA, and an immunostrip assay showed that they can be translated into easy-to-use, affordable tests. Thus, the protein array developed here offers a tool for broad use in serodiagnostics.

Streptococcus pneumoniae, also known as the pneumococcus, is a Gram-positive pathogen recognized as a major cause of pneumonia worldwide (1). It resides as a commensal in the nasopharynx of healthy carriers, but in susceptible individuals this bacterium can spread to other body locations and cause disease. The main risk groups are the elderly, immunocompromised people, and infants. In fact, ~800,000 children die each year because of pneumococcal disease, and >90% of these deaths occur in developing countries (2,3). In addition, a high number of pneumococcal infection cases are diagnosed in developed countries; these can be associated with high morbidity in children, and in adults they are an important factor that influences quality of life and produces significant mortality (4,5). There are licensed polysaccharide-based vaccines to prevent pneumococcal infections, but their efficacy is limited (3). Therefore, pneumococcal pneumonia remains an important health problem, and once it has occurred, early diagnosis with accurate methods is essential to provide patients with prompt and appropriate therapy and hence to improve outcome (1).
Although the major burden of pneumococcal infections is caused by pneumonia, the ability to identify S. pneumoniae as a causative agent in lung infections in children is quite limited. Blood cultures are often negative (6,7). The BinaxNOW test, which measures teichoic acid, is less specific in children than in adults, because healthy carriage in infants can produce false positive results (8). The amplification and quantification of pneumococcal genes (namely spn9802, ply or pcpA) by PCR has also been used, but with lower sensitivity than culture in blood samples in adults and an inability to discriminate between carriage and disease in nasopharyngeal and sputum samples (9-11).
The detection of antibody serological markers by immunoassay is widely used for early diagnosis, epidemiological surveillance, or evaluation of vaccine immunogenicity against many pathogens, including the pneumococcus (6,12,13). Serological diagnosis of pneumococcal disease based on a single antigen is often challenging, because of the interference of natural antibodies elicited by previous colonization events. Therefore, to better discriminate between diseased and healthy people, a combination of antigens would be desirable. In this regard, proteomics offers an excellent platform to develop the more sensitive and specific immunoassays needed for the aforementioned purposes.
It is generally assumed that surface proteins are those with the highest chance to raise an effective immune response against pathogen infection, as they are sufficiently exposed and accessible to both T and B cells (14,15). Protein arrays are powerful tools to interrogate the pattern of host humoral responses to infections (16); they allow the study of many antigens simultaneously with a small amount of sample (17) to select a set of antigens with optimal sensitivity and specificity (18,19). In this work, we have selected a set of 95 pneumococcal surface proteins by experimental identification, using the proteomic approach of "shaving" of live cells with proteases and further liquid chromatography-tandem MS (LC/MS/MS) analysis (20). After producing the selected proteins as recombinant fragments, we have developed the first pneumococcal surface protein chip and probed it with a collection of sera from infected and control children, in order to find proteins that differentiate between pneumococcal and nonpneumococcal infection/health status. Three proteins were proven to discriminate with optimal sensitivity, specificity, and accuracy between nonpneumococcal disease and disease status for <4-year-old children. As a proof-of-concept, we have developed an immunostrip assay with these proteins, obtaining the same sensitivity, specificity, and accuracy, thus demonstrating the power of high-throughput technologies for discovering diagnostic biomarkers of infection and their possible translation to an easy-to-use clinical tool.

EXPERIMENTAL PROCEDURES
Ethics Statement for Human Sera Sampling and Use-This research was performed according to the principles expressed in the Declaration of Helsinki. All human sera (SDS1) were obtained from patients admitted to Hospital Universitario Infantil Virgen del Rocío (HUIVR) in Seville, Spain. All human sera were collected from children <14 years old. Sera were drawn either from patients with a diagnosis of pneumococcal pneumonia (the "patient" group), based on clinical features, radiological imaging, and isolation of the microorganism from a sterile site (blood or pleural fluid), or from healthy children or patients affected by other pathologies different from pneumococcal pneumonia (the "control" group). All sera from patients with pneumococcal pneumonia were obtained within 10 days of hospital admission. Two different sera sets were collected: 71 sera for the protein array test set, and 24 sera for the validation set. Written informed consent was obtained from parents or legal guardians of participating children and the Hospital Universitario Virgen del Rocío Ethic Committee approved the study (code no. 010470, certificate no. 14/2010), for sera to be used within the project in which this work was designed.
Bacterial Strains, Growth and "Shaving" of Live Cells-Twenty-four pneumococcal isolates from human patients (Table S1) corresponding to empyema cases were kept, grown, and "shaved" for surface protein identification as already described (21,22). Briefly, 100 ml of each strain were grown in a chemically defined medium (CDM) (23) supplemented with 20 μg/ml ethanolamine. Bacterial pellets were washed twice with PBS and resuspended in 1 ml of PBS containing 30% sucrose (pH 7.4), and digested with 5 μg trypsin (Promega, Madison, WI) for 30 min at 37°C. The resulting digestion mixtures were redigested with 2 μg trypsin overnight at 37°C. Samples were cleaned using Oasis HLB extraction cartridges (Waters, Milford, MA).
Molecular Genotyping-Multilocus sequence typing (MLST) was performed using standard methodology (24). Briefly, internal fragments of seven housekeeping genes (aroE, gdh, gki, recP, spi, xpt, and ddl) were amplified by polymerase chain reaction and sequenced on each strand. Conventional primers were used, whose sequences are available at the MLST database (http://www.mlst.net). Alleles were assigned by comparing the sequence at each locus to all known alleles at that locus, and the combination of seven alleles determined the sequence type (ST). Allele and ST designations were made using the MLST website, hosted at Imperial College London, and funded by the Wellcome Trust.
LC/MS/MS Analysis and Protein Identification by Database Searching-All analyses were performed as described (21,22), using a Surveyor HPLC System in tandem with an LTQ-Orbitrap mass spectrometer (Thermo Fisher Scientific, San Jose, CA) equipped with nanoelectrospray ionization interface (nESI). The separation column was 150 mm × 0.150 mm ProteoPep2 C18 (New Objective, Woburn, MA) at a postsplit flow rate of 1 μl/min. For trapping of the digest a 5 mm × 0.3 mm precolumn Zorbax 300 SB-C18 (Agilent Technologies, Santa Clara, CA) was used. One fourth of the total sample volume, i.e. 5 μl, was trapped at a flow rate of 10 μl/min for 10 min and 5% acetonitrile (ACN)/0.1% formic acid. After that, the trapping column was switched on-line with the separation column and the gradient was started. Peptides were eluted with a 60-min gradient of 5-40% of ACN/0.1% formic acid solution at a 250 nl/min flow rate. All separations were performed using a gradient of 5-40% solvent B for 60 min. MS data (Full Scan) were acquired in the positive ion mode over the 400-1500 m/z range. MS/MS data were acquired in dependent scan mode, selecting automatically the five most intense ions for fragmentation, with dynamic exclusion set to on. In all cases, a nESI spray voltage of 1.9 kV was used.
Tandem mass spectra were extracted using Thermo Proteome-Discoverer 1.0 (Thermo Fisher Scientific). Charge state deconvolution and deisotoping were not performed. All MS/MS samples were analyzed using Sequest (Thermo Fisher Scientific, version v.27), applying the following search parameters: peptide tolerance, 10 ppm; tolerance for fragment ions, 0.8 Da; b- and y-ion series; oxidation of methionine and deamidation of asparagine and glutamine were considered as variable modifications; maximum trypsin missed cleavage sites, 3. The raw data were searched against an in-house joint database containing 30,673 protein sequences from all the 17 fully sequenced and annotated S. pneumoniae strains available at the UniProtKB site at the moment of the database construction (UniProt taxonomic IDs 189423, 488221, 574093, 561276, 516950, 373153, 487214, 488222, 488223, 171101, 487213, 525381, 760887, 512566, 170187, 1069625, and 760888, all of them in their versions of May 5, 2014). Peptide identifications were accepted if they exceeded the filter parameter Xcorr score versus charge state with SequestNode Probability Score (+1 = 1.5, +2 = 2.0, +3 = 2.25, +4 = 2.5). With these search and filter parameters, no false-positive hits were obtained. Proteins were accepted if they were identified from two or more peptides. Strain R6 was used as reference for providing the accession numbers of the identified proteins; whenever a protein belonging to another strain was found, homology with a corresponding protein of strain R6 was given by using protein-BLAST. If homology with R6 was not observed, then the protein accession numbers of other strains were used. Primary predictions of subcellular localization were assigned by using the web-based algorithm LocateP (http://www.cmbi.ru.nl/locatep-db/cgi-bin/locatepdb.py) (25).
Production of Recombinant Proteins-Recombinant proteins were produced as double fusion fragments containing an N-terminal GST fragment and a C-terminal His-tag using the pSpark® I vector (Canvax Biotech, Córdoba, Spain), and expressed in Escherichia coli BL21, as described (21) and according to manufacturers' instructions. Briefly, recombinant products were purified either by Ni2+-agarose affinity chromatography from the E. coli intracellular fraction, dialyzed against PBS and used for protein array printing after measuring the protein concentration by the Bradford assay (26). All the SprXXXX proteins were expressed from the R6 strain. The proteins annotated as SP_XXXX were produced from the TIGR4 strain. The pblB gene (annotated as sph_0062 in Hungary 19A-6 strain) was cloned from the isolate #418; the gene SP70585_2286, from the 70585 strain, was also cloned from the isolate #418; and the gene SPJ_1852, from the 670-6B strain, was cloned from the isolate #49H. All the primers were designed using the genomes to which the annotated genes belonged.
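The acceptance rules described above (a charge-dependent Xcorr cutoff for peptides, then at least two peptides per protein) can be illustrated with a minimal Python sketch. It is not the Proteome Discoverer/Sequest implementation: the record layout `(protein_id, peptide, charge, xcorr)` and the function name are hypothetical, and treating the cutoffs as inclusive and counting distinct peptide sequences are assumptions.

```python
from collections import defaultdict

# Minimum Xcorr per precursor charge state, as quoted in the text.
XCORR_MIN = {1: 1.5, 2: 2.0, 3: 2.25, 4: 2.5}


def accepted_proteins(psms, min_peptides=2):
    """Return the set of proteins supported by enough accepted peptides.

    `psms` is an iterable of (protein_id, peptide, charge, xcorr) tuples;
    this record layout is hypothetical and only serves the illustration.
    """
    peptides = defaultdict(set)
    for protein, peptide, charge, xcorr in psms:
        threshold = XCORR_MIN.get(charge)
        # Charge states without a stated cutoff are rejected.
        # Assumption: a score equal to the cutoff is accepted.
        if threshold is not None and xcorr >= threshold:
            peptides[protein].add(peptide)
    # Assumption: "two or more peptides" means distinct peptide sequences.
    return {p for p, seqs in peptides.items() if len(seqs) >= min_peptides}
```

Storing peptides in a set means that the same sequence observed at two charge states counts once.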
Protein Microarray Fabrication and Probing-Affinity-purified recombinant proteins were printed on glass slides in quintuplicate (6 ng/spot) as detailed in Fig. 1A with split pins (4 × 4 pin tool) using a robotic array spotter (Genomic Solutions, BioRobotics MicroGrid II 610, Huntingdon, UK). Proteins were distributed into 384-well plates at 2 wells per sample and 30 μl per well. Each component was prepared at 250 μg/ml in printing buffer (150 mM phosphate, pH 8.5, 0.01% sarkosyl) and printed onto Nexterion Slide H 3-D glass slides. As negative controls, we used 12 commercially available irrelevant (i.e. nonrelated to pneumococcus) proteins from different biological sources (supplemental Table S2). The pins were dwelled into the sample wells and blotted 15 times before printing. The humidity level in the arraying chamber was maintained at 55-60% during printing. Each of the components was printed five times in a grid of 140 μm diameter spots with 175-μm pitch. Eight complete arrays were printed on each slide. Printed slides were placed in a slide humidity chamber overnight at 75% relative humidity and stored at -20°C until use. Probing with human sera was carried out in duplicate for each serum sample. Slides were blocked with 25 mM ethanolamine in 100 mM sodium borate buffer (final pH 8.5) and washed three times for 1 min in PBST and once for 1 min in H2O. Then the slides were dried by centrifugation (350 × g for 15 min). After that, they were assembled on 16-well slide holders (Nexterion Slide H MPX 16, Schott, Louisville, KY) and 45 μl of a dilution of different sera from the test set (1:200 in PBST) were incubated for 1 h protected from light at room temperature. The different samples were washed twice with 100 μl of PBST for 2 min and then incubated with anti-human IgG-Cy3 (1:1000) or anti-human IgM-Cy5 (1:200), covered tightly with a seal strip, and incubated for 1 h at room temperature. The slides were removed from holders, washed twice for 10 min in PBST, then once in PBS for 10 min and finally centrifuged at 350 × g for 15 min. To process the array data, the slides were scanned with a Genepix 4000B microarray scanner (Molecular Devices Corporation, Union City, CA) at photomultiplier voltage settings so that no saturated pixels were obtained. Image analysis was carried out with Genepix Pro 4.1 analysis software (Molecular Devices Corporation). Spots were defined as circular features with maximum diameter of 140 μm. Local background subtraction was performed and corrected median feature intensity was used for initial data processing.
ELISA-To validate the immunoreactivity results obtained by the protein microarray, the significantly discriminant proteins were tested by ELISA using the validation set sera (SDS1). The proteins were coupled to the plate individually or in combination at 1 μg/position. The sera were used at a 1:100 dilution. As secondary antibody, anti-human IgG coupled to peroxidase was used at a 1:1000 dilution. Reaction was developed and stopped according to manufacturer's instructions and the plate was read at 450 nm.
Immunostrip Printing and Probing-Antibody-based detection of proteins in immunoreactive strips was performed using 1 μg of pneumococcus recombinant proteins Spr0247, Spr0561, Sph_0062, and a 1:1:1 mixture of them. As negative controls, 1 μg of Lys9 (His-tag recombinant protein of S. cerevisiae expressed in E. coli BL21 with pET/100 TOPO cloning system, Invitrogen, Madrid, Spain, according to the manufacturer's instructions) and 1 μg of commercially available trypsin (Promega) were used. As positive controls, 5 μg of pneumococcal serotype 8 strain total protein extract and 0.5 μg of commercial anti-human IgG produced in goat (Invitrogen) were used. Proteins and extracts were transferred to a nitrocellulose membrane and air dried. Nonspecific sites were blocked by incubation with 5% nonfat milk in T-TBS for 45 min. After two washes with T-TBS, a second 1 h incubation of the membrane with the validation set sera (SDS1), diluted 1:200 in 3% nonfat milk in T-TBS, was carried out. As secondary antibody, rabbit anti-human IgG conjugated to horseradish peroxidase (Sigma, St. Louis, MO), diluted 1:10,000 in TBS, was used. After 1 h incubation, membranes were washed three times with TBS and developed with ECL Plus Western blotting Detection System (GE Healthcare, Barcelona, Spain, according to the manufacturer's instructions). Densitometric analysis was performed using ImageJ v1.48 software.
Data and Statistical Analysis-For analysis of antibody binding to recombinant fragments on the microarray, local background subtraction from 10 surrounding spots was performed and corrected median fluorescence intensity was used for initial data processing. Then, the mean background signal of negative controls was subtracted from each raw spot value after sera hybridization. Negative controls represented hybridizations of nonpneumococcal proteins and buffer spots with sera and secondary antibodies. Both in nonpneumococcal proteins and buffer positions, no reaction with human sera was observed. After background subtraction, negative or zero values were assigned a net value of 0. Then, outlier values for each spot were removed. The two different hybridizations for each serum were averaged to report the signal mean intensity (SMI) values, and the mean and standard deviation (S.D.) were obtained from the five printed spots per protein in each of the patient and control groups. Finally, data normalization by background was carried out using Microsoft Excel as described (27). Absolute SMI values <500 were not considered for further statistical analysis, to avoid measures close to the detection limit.
The sera were stratified according to children's age in different groups, as described in detail in the "Results" section: group 1 (G1) comprising sera of children <4 years old; group 2 (G2) comprising children >4 years old; group 3 (G3) comprising controls; and group 4 (G4) consisting of patients sera. All the sera together were named the ALL group.
Normalized data were run in the MeV v4.9.0 software. The Wilcoxon-Mann-Whitney test was applied for experiments involving pairwise comparisons between pneumococcal-infected and control groups. The Benjamini-Hochberg (BH) correction was used to control the false discovery rate. Protein targets were considered as immunogenic candidates if antibody levels were significantly different between pneumococcal-infected and control groups with at least a 1.5-fold difference in their SMI values (BH-adjusted p values < 0.05). Hierarchical clustering was used to group sera samples and antigens into subsets, such that those within each cluster (subset) are more closely related to one another than samples assigned to other clusters. Clustering is based on the degree of similarity between the SMI for each individual. MeV v4.9.0 was used to perform the clustering analysis. Receiver operating characteristic (ROC) curve analysis was performed with MedCalc v12.7.8. Sensitivity, specificity and Area Under the Curve (AUC) were determined from the resulting ROC analysis. Extension of ROC curve analysis to combinations of antigens was performed as described (28).
For sample size calculation, a power analysis was carried out using an on-line calculator for microarray experiments (http://bioinformatics.mdanderson.org/MicroarraySampleSize/). For 95 protein antigens and a desired fold difference of 1.5 between controls and patients, the minimum sample size is 34 sera assuming 1 false positive (α = 0.0105), or 23 sera assuming five false positives (α = 0.0526).
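The core of this processing chain (negative-control subtraction, clipping at zero, averaging the two hybridizations, and discarding SMI values below 500) might be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Excel workflow: the function name and argument layout are hypothetical, replicate spots are averaged within each hybridization before the two hybridizations are averaged, and outlier removal and background normalization are omitted because the text does not specify their rules.

```python
from statistics import mean


def signal_mean_intensity(hyb1, hyb2, negative_control_mean, floor=500.0):
    """Return the SMI of one protein in one serum, or None if below `floor`.

    hyb1, hyb2: background-corrected median intensities of the replicate
        spots in each of the two hybridizations (hypothetical layout).
    negative_control_mean: mean signal of the negative-control spots.
    """
    def net(spots):
        # Subtract the negative-control signal; negative results become 0.
        return mean(max(s - negative_control_mean, 0.0) for s in spots)

    # Assumption: spots are averaged per hybridization, then across the two.
    smi = mean([net(hyb1), net(hyb2)])
    # Values under the floor are dropped from further statistical analysis.
    return smi if smi >= floor else None
```

Returning `None` rather than 0 keeps excluded measurements distinguishable from true zero signals in later steps.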
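As an illustration of the candidate-selection rule, the following Python sketch applies the Benjamini-Hochberg adjustment to a list of per-antigen p values (assumed to come from the Wilcoxon-Mann-Whitney test, which is not re-implemented here) and then filters by fold difference. It is not the MeV implementation; the function names are hypothetical, and reading "1.5-fold difference" as a patients/controls SMI ratio of at least 1.5 is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def immunogenic_candidates(names, p_values, patient_smi, control_smi,
                           min_ratio=1.5, alpha=0.05):
    """Select antigens with adjusted p < alpha and SMI ratio >= min_ratio.

    Assumption: the fold difference is the patients/controls ratio of the
    group SMI values, which are taken here as already computed.
    """
    adjusted = benjamini_hochberg(p_values)
    return [name
            for name, q, pat, ctl in zip(names, adjusted, patient_smi, control_smi)
            if q < alpha and ctl > 0 and pat / ctl >= min_ratio]
```

For example, with raw p values 0.01, 0.04, 0.03, and 0.20, only the first remains below 0.05 after adjustment (0.04), so only that antigen can pass the filter.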
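The two α values quoted above are consistent with dividing the tolerated number of false positives by the number of antigens tested. The short Python check below makes that arithmetic explicit; this reading is an inference from the numbers, not a description of the on-line calculator, and the function name is hypothetical.

```python
def per_test_alpha(expected_false_positives, n_antigens=95):
    """Per-antigen significance level for a tolerated number of false positives.

    Assumption: alpha = tolerated false positives / number of antigens.
    """
    return expected_false_positives / n_antigens


print(round(per_test_alpha(1), 4))  # one tolerated false positive
print(round(per_test_alpha(5), 4))  # five tolerated false positives
```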

RESULTS
Selection of Pneumococcal Proteins for Protein Array Design-The overall aim of this work was to construct a protein array to profile the antibody patterns of sera from children with pneumococcal pneumonia and to assess its utility as a diagnostic tool. We based the design of our pneumococcal protein array on antigens experimentally identified on the surface of a collection of pediatric clinical isolates, as surface proteins are those with the highest probabilities to raise an effective humoral immune response. We applied a successful proteomic approach extensively used by our research group to identify in a fast and reliable way the most surface-exposed proteins, consisting of "shaving" live cells with proteases and further LC/MS/MS analysis.
We analyzed 24 clinical isolates collected from children with invasive pneumococcal disease (IPD), which corresponded to 12 different serotypes. As genetic diversity of pneumococcal surface proteins depends on noncapsular genomic background, we genotyped all the isolates by MLST. We found 20 different clonal sequence types (ST), including several major global clones recognized by the PMEN (http://www.sph.emory.edu/PMEN) as shown in supplemental Table S1. Next, we "shaved" the bacterial cells with trypsin and analyzed the resulting peptides. The number of proteins identified, including cytoplasmic proteins derived from experimental limitations of the strategy (29), ranged between 203 and 687, with yields of predicted surface proteins ranging between ca. 20 and 40%, as already described for adult clinical isolates (supplemental Table S3 and SDS2) (22). To include the best potential antigens in the array, we selected those proteins identified experimentally on the surface of the clinical isolates. The first round of selection was made on proteins being present on a high proportion of isolates (≥50%). Table I shows the list of surface proteins found in ≥50% of clinical isolates. It comprised 17 cell wall proteins with the LPXTG anchoring motif, five membrane proteins with one TMD, eight proteins with more than one TMD, four secreted proteins, and three lipoproteins. The cell wall proteins ZmpB and IgA were identified as many different proteins in a diverse number of sequenced strains included in the search database, as they are highly variable in their N-term. However, they were included in the list as the R6 strain-annotated proteins Spr0581 (ZmpB) and Spr1042 (IgA). In addition, we included in the list the protein PblB, which we annotated as of unknown subcellular localization. For this protein, there is a discrepancy about its localization by different prediction algorithms. LocateP predicts a cytosolic location. However, PsortB assigns it into the "cell wall" category, although not unambiguously. This protein, found in 20 out of the 24 analyzed isolates, has been demonstrated in Streptococcus mitis to be surface-attached (30,31).
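The first selection round (proteins found in at least 50% of the isolates) amounts to a simple prevalence count. A minimal Python sketch follows; the input layout (a mapping from isolate name to the set of protein identifiers detected by "shaving") and the function name are hypothetical, chosen only for illustration.

```python
from collections import Counter


def prevalent_proteins(identifications, min_fraction=0.5):
    """Return proteins detected in at least `min_fraction` of the isolates.

    identifications: dict mapping an isolate name to the collection of
        protein identifiers found on its surface (hypothetical layout).
    """
    n_isolates = len(identifications)
    # Count each protein once per isolate, even if listed repeatedly.
    counts = Counter(protein
                     for proteins in identifications.values()
                     for protein in set(proteins))
    return sorted(p for p, c in counts.items() if c / n_isolates >= min_fraction)
```

With 24 isolates, a protein found in 12 of them sits exactly at the threshold and is kept, and one found in 20 (as reported for PblB) passes comfortably.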
All the proteins for the array, except two, were identified experimentally in variable numbers of clinical isolates. Table II shows the 95 antigens that were selected for production as recombinant polypeptides to be further printed on the array, classified into subcellular localization compartments, according mainly to LocateP, and to their GO annotation (biological process category). In addition to the 37 proteins previously referred to as surpassing the threshold of being identified in ≥50% of the isolates, the remaining proteins were found in a variable number (between 3 and 11) of the analyzed isolates, except two proteins that were not identified but were selected because of their reported immunogenic and/or protective capacity: the cell wall protein SP_1772, and the membrane protein SP_2093. We also selected nine predicted cytoplasmic proteins, identified experimentally, for three reasons: (1) lack of a signal peptide, but with a clearly recognized extracellular localization and function (LytA, Ply); (2) lack of a signal peptide and main intracellular function, but often reported in pneumococcus and many other microbes to be extracellularly located and even displaying immunogenic/protective activity (possible "moonlighting proteins": Eno, GAPDH); and (3) controls to demonstrate the serodominancy and discriminatory capacity of extracellular antigens.
The 95 selected proteins were studied for production of recombinant polypeptides in E. coli, to obtain purified fragments to be printed on the array. As a general criterion, we selected the regions in which we found a high concentration of identified peptides by our "shaving" approach, normally coinciding with the most exposed domains. Then, we removed the signal peptides and transmembrane domains. Specifically, for lipoproteins we selected the region nearest to the C-term; for cell wall proteins, the region nearest to the N-term (except for ZmpB and IgA, as they have the LPXTG anchoring motif close to the N-term; for them, we selected the C-term region); for membrane proteins, any predicted extracellular domain in which we identified peptides experimentally; and for secreted and predicted cytoplasmic proteins, any region in which we identified peptides experimentally. The fragments cloned for each protein are shown in SDS3. For the vast majority of proteins obtained by Ni2+-agarose affinity purification from the E. coli soluble fraction, purity levels were >95%, as estimated by densitometry analysis of SDS-PAGE gels (supplemental Fig. S1).
Protein Microarray Fabrication and Sera Antibody Profile-After selection of the set of pneumococcal proteins as potential immunoreactive antigens, we built a protein microarray to test its viability for detection of antibody profiles as a way to investigate humoral immune response in children with pneumococcal pneumonia, and to evaluate its potential as a serological tool to diagnose pneumococcal pneumonia in individuals or to discriminate among groups. Proteins were immobilized in quintuplicate on the slides, arranged by categories according to their predicted subcellular localization and nature of surface-attaching motifs (cytoplasmic, transmembrane, lipid-anchored, cell-wall or secreted proteins; Fig. 1A). Next, we used a collection of sera from children to assess the immunologic responses to the printed proteins. The collection comprised 38 sera from patients diagnosed with pneumococcal pneumonia that was mostly (95%) complicated with pleural empyema. We used as controls 33 sera from healthy children (n = 21) or patients (n = 12) that were admitted to the hospital because of diverse nonpneumococcal diseases and had a mean age similar to that of pneumococcal pneumonia cases (SDS1). The array was reproducible in terms of both protein seroreactivity and sera profiling (supplemental Fig. S2). The sera were stratified according to children's age in two different groups: group 1 (G1) comprising sera of children <4 years old (21 controls, C1, and 22 infected patients, P1), and group 2 (G2) comprising sera of children >4 years old (12 control infants, C2, and 16 infected patients, P2). To compare the influence of immune system maturation degree, we also considered two other groups to study the effect of age stratification within either control or pneumococcus-infected children: group 3 (G3) comprising controls (C1 + C2) and group 4 (G4) consisting of patients sera (P1 + P2). When considering all the sera together, we referred to this as the ALL group (n = 71; 33 controls aged 49.8 ± 40.3 months, and 38 pneumococcal-infected patients aged 48.2 ± 30.6 months). Then, we probed all the sera on the microarray and measured both primary (IgM) and secondary (IgG) humoral responses (Fig. 1B and 1C), whose overview can be visualized as a heatmap (Fig. 1D and supplemental Fig. S3).
a Protein categories were established according to LocateP subcellular predictions: lipoproteins were those predicted as lipid-anchored proteins; cell wall proteins, as those having an LPXTG motif; secretory proteins, as those possessing an SPI-type signal peptide; membrane proteins with one transmembrane domain (TMD), as those possessing either a C- or an N-terminally anchored transmembrane region; membrane proteins with >1 TMD, those predicted as multi-transmembrane proteins; "surface proteins" means the sum of the previous categories; and cytoplasmic proteins, those without any exporting or sorting signal, and predicted as intracellular proteins.
Fig. 1. ... Table II. Proteins were printed in quintuplicate and grouped in sectors which represented different subcellular localizations. Two positions of buffers (B) were printed in the right-above corners of each sector. B, Representative image of a chip, divided in the different sectors, after incubation with a human serum, followed by Cy3-labeled anti-human IgG; C, Representative image of a chip, divided in the different sectors, after incubation with a human serum, followed by Cy5-labeled anti-human IgM. D, IgG serological profile of all the studied sera of the test set (n = 71; 33 controls aged 49.8 ± 40.3 months, and 38 pneumococcal-infected patients aged 48.2 ± 30.6 months) displayed as a heatmap of seroreactivity. The antigens are listed in rows and the sera grouped in columns (C1, controls <4 years old; C2, controls >4 years old; P1, patients <4 years old; P2, patients >4 years old). The reaction intensity is visualized according to a color scale, with green being the weakest, red being the strongest and black in between.
Identification of Antigens Related to Pneumococcal Disease-We used the normalized serological profiles of child patients and controls to search for serodiagnostic antigens that can discriminate between patients and controls, or show age/stage-specific evolution. First, we looked at the most seroprevalent protein antigens, according to the SMI values of IgG and IgM levels. SDS4 shows the 20 most immunodominant ones for both IgG and IgM responses, stratified in the above mentioned groups. Considering all the 71 analyzed sera (the ALL group), the SMI values of IgG levels ranged between 13,324, corresponding to seroreactivity against PspC in patient sera, and 816, corresponding to NisP in control sera. As expected, for IgM levels the SMI values were lower. In the <4 years-old group (G1, 43 sera), many of the most seroprevalent antigens discriminated between patients and controls (considering patients/controls ratios ≥1) for IgG responses, while most of the responses for the G2 group showed no differences (patients/controls ratios ≈1). When comparing age groups, we found a general increase in SMI IgG levels in controls (G3), but not in infected children (G4). The control sera group (G3) was then rearranged in subgroups of 12-month intervals and we studied the kinetics of IgG levels against six proteins showing the highest differences between G1 and G2 groups: as shown in Fig. 2, the IgG levels decreased slightly from birth until 1-2 years old, remained relatively constant until 5-6 years old and increased clearly in the oldest subgroup. The same trend was observed for the rest of the anti-protein IgGs in the same periods of lifespan (supplemental Fig. S4). Regarding the IgM levels, very little or almost no discrimination was obtained in any of the groups. Therefore, only IgG responses were used in subsequent analyses in the search for serodiagnostic protein biomarkers.
Then, we identified the protein candidates related to pneumococcal pneumonia as those showing significantly different IgG levels between controls and pneumococcus-infected children groups, with at least 1.5-fold differences in their SMI values (adjusted p values < 0.05, FDR = 0.05) and absolute SMI values above 500, to avoid measures close to detection limits (Table III). In the ALL group, 10 proteins met these requirements, with PblB showing the highest patients/controls ratio. Interestingly, the number of discriminant proteins increased to 24 in the G1 group, i.e. that of children <4 years old. Of these, one third, i.e. eight proteins, had the LPXTG cell wall-anchoring motif. These proteins were among those showing the highest patients/controls ratios (e.g. 5.52 for PrtA, 2.80 for PulA, 2.48 for NanA). Again, PblB showed a high ratio (3.93), being the most significant one. Several predicted secreted proteins were also found to differentiate between patients and controls, PspC being the most discriminant one (3.19). Two cytosolic proteins were also found (Eno and SpxB), as well as two other proteins without classical signal/exporting peptides, for which there is extensive literature showing that they are exported: LytA and Ply.
When we considered the IgM response, there was not any single protein that significantly discriminated between patients and controls.
Defining Serodiagnostic Antigen Biomarkers-To define a reliable set of proteins as serodiagnostic biomarkers, we tried to improve the sensitivity, specificity, and accuracy of a serodiagnostic test based on our protein array. To this aim, we carried out a receiver operating characteristic (ROC) curve analysis to study the power of different sets of proteins to discriminate between patient and control sera. ROC curves were generated for each of the protein candidates of the different groups (Table III and SDS4), and the area under the ROC curve (AUC) for each individual antigen is listed in supplemental Table S4 in decreasing order. In the ALL group, the three most discriminatory proteins were PrtA, PblB, and PulA, with AUC values of 0.802, 0.768, and 0.762, respectively. The same three proteins also showed the highest discriminatory power in the G1 group, but with higher significance levels and a slightly different order (PblB, AUC = 0.981; PulA, AUC = 0.981; and PrtA, AUC = 0.966). As stated previously, the best discriminatory power was obtained in the <4 years-old group, rather than when considering younger and older children together.
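For readers less familiar with the AUC, the following Python sketch shows one standard way to compute it for a single antigen: as the probability that a randomly chosen patient serum gives a higher signal than a randomly chosen control serum, with ties counted as one half. This is a generic illustration, not the software used in this study, and the signal values are hypothetical.

```python
def roc_auc(patient_values, control_values):
    """AUC as P(patient signal > control signal); ties count as 0.5."""
    wins = 0.0
    for p in patient_values:
        for c in control_values:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patient_values) * len(control_values))


# Hypothetical normalized signals for one antigen.
patients = [5200.0, 4100.0, 3900.0, 2500.0]
controls = [1200.0, 2600.0, 900.0, 1500.0]
print(roc_auc(patients, controls))
```

An AUC of 1.0 means every patient serum ranks above every control serum, whereas 0.5 corresponds to no discrimination.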
We then extended the ROC curve analysis to combinations of antigens (Fig. 3). As inputs to the classifier, we used the highest-ranking AUC antigens in combinations of 2, 3, 4, …, n proteins. In the ALL group, there was no significant improvement when combining several antigens: considering PrtA and PblB together (the two with the highest AUC values), there was a slight increase in sensitivity, but a decrease in specificity and accuracy (Table IV). However, a clear improvement was obtained in the G1 group: the best parameters were obtained with the combination of the first three antigens (PblB, PulA, and PrtA), as the classifier predicted 100% sensitivity, 95.8% specificity, and 97.9% accuracy to discriminate pneumococcal pneumonia patients from the control group, values equal to or higher than those obtained with PblB alone or with PblB + PulA.

Table footnotes: a The sera were stratified according to children's age in two different groups: group 1 (G1) comprising sera of children <4 years old (21 controls, C1, and 22 infected patients, P1), and group 2 (G2) comprising sera of children >4 years old (12 control infants, C2, and 16 infected patients, P2). To assess the influence of the degree of immune system maturation, we also considered two further groups to study the effect of age stratification within either control or pneumococcus-infected children: group 3 (G3) comprising controls (C1 + C2) and group 4 (G4) consisting of patient sera (P1 + P2). b C: control sera; P: patient sera.

Validation of Array Data and Immunostrip Probing-The serodiagnostic capacity of the protein array was validated by a colorimetric ELISA assay using an independent set of 24 sera from children <4 years old (12 control infants, C, 31.0 ± 12.4 months; and 12 pneumococcal pneumonia patients, P, 27.6 ± 12.3 months, SDS1) (Fig. 4). With this test, we confirmed the results of the protein arrays, i.e. the combination of more than one antigen resulted in a clear discrimination between infected patients and controls in children <4 years old according to the ROC curves (Fig. 4A). The AUC analysis showed higher values of sensitivity, specificity, and accuracy (100%, 100%, and 100%, respectively) for the combination than when the antigens were considered individually (Fig. 4B and supplemental Table S5).
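The sensitivity, specificity, and accuracy values reported above derive from a two-by-two classification table at a chosen cut-off. The Python sketch below shows that arithmetic for a validation set of 12 patients and 12 controls; the scores and the cut-off are hypothetical, and the way antigen signals are combined into a single score is not reproduced here.

```python
def diagnostic_metrics(patient_scores, control_scores, cutoff):
    """Classify a serum as positive when its score is >= cutoff."""
    tp = sum(1 for s in patient_scores if s >= cutoff)  # true positives
    fn = len(patient_scores) - tp                       # false negatives
    tn = sum(1 for s in control_scores if s < cutoff)   # true negatives
    fp = len(control_scores) - tn                       # false positives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical combined scores for 12 patients and 12 controls.
patient_scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.75, 0.8, 0.9, 0.65, 0.7, 0.88, 0.92]
control_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.2, 0.1, 0.35, 0.3, 0.12, 0.62]
print(diagnostic_metrics(patient_scores, control_scores, cutoff=0.6))
```

With these toy values all 12 patients are detected, but one control scores above the cut-off, so specificity drops to 11/12 while sensitivity stays at 100%.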
Finally, to check whether the discriminatory power of protein sets discovered with protein arrays can be transferred to affordable, easy-to-use serological tests, we developed an immunostrip assay using the three proteins with the highest discriminatory capacity (i.e. PblB, PulA, and PrtA) and a mixture of them. The immunostrips were probed with the 24 sera of the independent validation set. As shown in Fig. 5, there was in general a higher IgG response in sera from infected patients than in controls, both against the individual proteins and against the 1:1:1 mixture, whereas the reaction against positive controls was the same. The spot intensities were analyzed, and the AUC analysis showed the highest values of sensitivity, specificity, and accuracy (100% for the three parameters) for the test to discriminate between patients and controls.

DISCUSSION
Protein microarrays are an ideal means to explore, in a high-throughput way, the humoral responses to many pathological conditions, including infectious diseases (32). In fact, there is extensive literature reporting the use of these platforms to measure the antibody profiles in collections of sera from patients affected by bacterial (19,33–39), fungal (40), and parasitic infections (41–45). They can be used as a diagnostic tool, for validation and/or discovery of vaccine candidates, and for epidemiological surveillance programs (46). Our work reports for the first time the development of a protein array of pneumococcal antigens, based on their experimental identification on the surface of relevant clinical isolates, and its use as a diagnostic tool to discriminate between pneumococcal-infected children and controls.

FIG. 5. Immunostrip test using the three best protein biomarkers discovered with the protein array. On the upper panel, the dot blot assay using the three individual antigens (PrtA, PulA, PblB) and their 1:1:1 combination is shown. As negative controls, the irrelevant yeast protein Lys9 (NC-1) and commercial trypsin (NC-2) were used. As positive controls, pneumococcal serotype 8 strain total protein extract (PC-1) and commercial anti-human IgG produced in goat (PC-2) were used. The 24 sera of the independent validation set (12 controls and 12 pneumococcal-infected children) were used at a 1:200 dilution to probe the nitrocellulose membranes. On the lower panel, the receiver operating characteristic (ROC) curve and interactive dot diagram of the combination of antigens are represented to show the diagnostic capacity of the assay (C: controls; P: patients).

Surface proteins play many important roles in the interaction between cells and their environment (47). They have many functions and are targets for drug and vaccine development, as well as candidates for diagnostic biomarkers, as they have the highest probability of being recognized by the immune system (14,15,34). Proteomics offers excellent approaches and strategies to identify antigenic surface proteins in microorganisms, especially those most exposed and abundant. A successful approach to identify in a fast and reliable way the "surfome," i.e. the set of surface proteins, consists of "shaving" live cells with proteases, followed by LC/MS/MS analysis (20,48,49). When applied to a collection of strains or clinical isolates of a given species, it provides an overview of the presence, frequency, and abundance of the identified proteins throughout the studied population, which allows good candidates to be chosen for vaccine or diagnostic purposes.
In the present study, we selected the proteins for the protein array design using the "shaving" approach, after analysis of 24 clinical isolates from pneumococcus-infected children. Some of the isolates belonged to the most prevalent and virulent serotypes circulating in Spain in recent years (1, 3, 7F, 14, and 19A) (50). The identifications included numerous cytoplasmic proteins, which may be because of unavoidable lysis (21,29). However, both the numbers of total identified proteins and the percentages of predicted surface ones were very similar to those described for adult clinical isolates (22). Moreover, many of the proteins identified in ≥50% of pediatric clinical isolates in this study were also found in ≥50% of adult clinical isolates. Although we designed this protein array platform as a clinical tool for children, it might also be used for adults.
Almost all the proteins selected for the array design were identified experimentally. Although Table I reflects only those found in ≥50% of clinical isolates, many others were identified in a high number (i.e. 6-11) of isolates. Many of these were also identified in a high proportion of adult clinical isolates (22). Our results show the enrichment of LPXTG cell wall proteins when compared with their predicted figures from the genome, as already demonstrated in previous works for this and other pathogens (20,21,48,49,51–54). We decided to include also some cytoplasmic proteins, as for some of them surface localization and immunogenicity/protective activity have been described, like Eno (55,56) or GAPDH (57). In addition, LytA and Ply are predicted as cytoplasmic because they lack a signal peptide, but their final destination is extracellular (58). We only included three proteins not identified experimentally, either from in silico selection or described in the literature as immunogenic/protective: SP_1772 (PsrP), a cell wall protein with proven immunogenic/protective capacity (59); SP_2093, a transmembrane protein without any described immunogenic/protective activity that was selected to test its seroreactivity in the absence of experimental evidence; and HipO, a cytoplasmic protein that was selected as an internal negative control.
In our view, the most interesting result derived from the proteomics approach is the discovery of the broad surface expression of PblB (20 out of the 24 isolates). This is an unusual surface protein, as it neither has a signal peptide/cell wall sorting motif, nor is it strongly similar to any known bacterial adhesin. Still, it resembles a phage-encoded tail fiber protein (30). In S. mitis, it has been demonstrated that this protein is surface-exposed and acts as a platelet-binding adhesin (30,31). In pneumococcus, there is no evidence on the function of pblB-like genes encoded by bacteriophages. It has been described that up to 76% of pneumococcal clinical isolates contain prophages (60,61). This is the first work that shows experimentally the surface location of this protein in pneumococcus. Moreover, the use of a compilation database containing all the pneumococcal protein sequences available so far made it possible to identify the protein in our study. Using only reference strains (R6, TIGR4) would have missed this information, as these strains lack PblB.
For recombinant fragment production, we selected the regions of the proteins experimentally found in the "shaving" approach. If possible, we selected domains highly conserved across the sequenced proteins available in databases. This was not always possible, as some proteins are highly variable, e.g. ZmpB and IgA. Each of these two proteins was annotated in our study as a single entry, but different sequences with <95% similarity are available, with high variability in the C-term. However, we produced the C-term of both proteins from the R6 strain genome, as that region is the most exposed in both cases. As for PblB, different forms are also available in the databases. As this is a large protein, we produced two fragments of it: one containing the C-term region, which is more conserved (PblB_a), and the other containing a region close to the N-term, which is more variable (PblB_b).
There is a renewed interest in serological diagnosis of pneumococcal infection using surface protein-based tests. Some epidemiological studies have been conducted, based on reduced sets of proteins showing unclear patterns of antibody response when applied to heterogeneous cohorts of patients (6,13,62). Serological diagnosis of pneumococcal disease is challenging because it should discriminate between infected patients and nonsymptomatic carriers (63), and because of difficulties in obtaining acute and convalescent sera from pediatric patients (1). Thus, to investigate the humoral responses to IPD in pediatric empyema cases, we developed an array to test 95 antigens. We used sera from patients with the most severe spectrum of pneumococcal pneumonia: bacteremic pneumonia and/or pleural empyema. This latter severe complication of pneumonia has increased worldwide over the last decade and is generally difficult to diagnose etiologically by blood culture because of high rates of antibiotic treatment prior to sample collection (64,65).
Among the most IgG-reacting seroprevalent antigens, we found many proteins from all the subcellular compartments that have been already described to be immunogenic: the secreted proteins PspC, PcsB, and LytC, lipoproteins, transmembrane proteins, cell wall proteins like PrtA, classical cytosolic proteins like Eno, and cytosolic proteins described to be extracellular like LytA and Ply. The C-term fragment of PblB (PblB_a) was also found to be highly immunoreactive. The general increase in IgG levels as children get older may be explained by colonization, as the differences are observed in the controls (G3 group) but not in the pneumococcal-infected patients (G4 group). There was also an IgM response, but it was in general terms lower than that of IgG, and did not discriminate between patients and controls. Therefore, we based our study on the IgG responses.
As shown, age stratification of the groups was useful to highlight discriminant responses. The clearest ones were those of the G1 group, i.e. children <4 years old. For that group, we obtained the highest number of discriminant antigens, with the highest patient/control ratios of IgG levels. There was a clear enrichment in LPXTG cell wall proteins: out of the 24 discriminant proteins in the G1 group, eight belonged to this category. This is in agreement with the knowledge that in Gram-positives, the cell wall proteins are generally highly exposed, abundant, highly immunogenic, and even protective. Among cell wall proteins, ZmpB and IgA were hardly immunoreactive, which may be because of their variability, which would hinder recognition of the arrayed variants by antibodies raised against nonhomologous counterparts. The classical cytosolic proteins Eno and SpxB were also discriminant, as well as the nonclassical extracellular proteins LytA and Ply. Among predicted secreted proteins, the one with the highest discriminant capacity was PspC. We also found that LytC clearly discriminated. We have previously described that this protein is carried by extracellular vesicles released by pneumococcus, and that LytC is highly immunoreactive and even immunogenic (66). However, one of the most seroprevalent antigens, PcsB, did not discriminate at all. Interestingly, we found a high discriminant capacity for the poorly characterized pneumococcal PblB, which was seroprevalent but not among the most seroprevalent antigens.
This study shows that a combination of the highly discriminant proteins discovered in a multiantigen platform for serodiagnostic testing can be useful to discriminate between infected and control individuals. In our case, the combination of PblB, PulA, and PrtA showed such discriminatory power with very high accuracy, sensitivity, and specificity, as demonstrated by the AUC analysis. The results were also validated by ELISA, and the immunostrip assay showed that the results may be translated into easy-to-use, affordable tests.
This study has several limitations. It can be argued that the work considered a relatively limited number of sera, but as stated above, legal, ethical, logistic, and technical issues often limit sera sampling in pediatric patients. However, the sample size was above the minimum, according to the power analysis. Although sampling of pneumococcal pneumonia patients was generally performed within the first few days of hospital admission, once microbiological diagnosis of pneumococcal disease was established, it was not performed at fixed time points since disease onset because of the variable duration of symptoms prior to hospital admission. Nevertheless, our protein array is an excellent launching platform to be applied in different programs and sera populations.

CONCLUSIONS
We have developed a protein array for use in the study of humoral responses to pneumococcal infection, based on the selection of experimentally identified proteins. The platform has proven its capacity to measure antibody levels in children's sera and to discriminate between pneumococcal-infected and non-pneumococcal-infected infants. PblB has been shown experimentally for the first time to be expressed on the surface of a large collection of pneumococcal isolates, and to be immunoreactive and even discriminant. This platform is an excellent diagnostic tool that can be adapted to different population studies. Moreover, it may also be useful in epidemiological surveillance programs and even for vaccine candidate discovery.
Acknowledgments-We thank the Proteomics Facility, SCAI, University of Córdoba, which is Node 6 of ProteoRed, ISCIII, for mass spectrometry analysis. Protein array fabrication was performed at the Geomics Unit, SCAI, University of Córdoba. We are especially indebted to Dr. Mercedes Cousinou for technical set up and development of the array, and to Dr. Mario Durán-Prado for protein array design and analysis support.
DATA AVAILABILITY: The proteomics data have been deposited into the ProteomeXchange Consortium (67) (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository (68) with the data set identifier PXD001740.
* This research was funded by Project Grants FIS-P12/01259 (Spanish Ministry of Economy and Competitiveness), P09-CTS-4616 from Consejería de Innovación, Ciencia y Empresa (Junta de Andalucía), PI-0207-2010 from Consejería de Salud (Junta de Andalucía) to MJRO, and by FEDER funds from the EU. IJM was recipient of a Ph.D. fellowship of the PIF Program from Junta de Andalucía. We are also indebted to members of the AGR-164 group, University of Córdoba, for lab support.