Predicting Antidisease Immunity Using Proteome Arrays and Sera from Children Naturally Exposed to Malaria*

Malaria remains one of the most prevalent and lethal human infectious diseases worldwide. A comprehensive characterization of antibody responses to blood stage malaria is essential to support the development of future vaccines, sero-diagnostic tests, and sero-surveillance methods. We constructed a proteome array containing 4441 recombinant proteins expressed by the blood stages of the two most common human malaria parasites, P. falciparum (Pf) and P. vivax (Pv), and used this array to screen sera of Papua New Guinea children infected with Pf, Pv, or both (Pf/Pv) that were either symptomatic (febrile), or asymptomatic but had parasitemia detectable via microscopy or PCR. We hypothesized that asymptomatic children would develop antigen-specific antibody profiles associated with antidisease immunity, as compared with symptomatic children. The sera from these children recognized hundreds of the arrayed recombinant Pf and Pv proteins. In general, responses in asymptomatic children were highest in those with high parasitemia, suggesting that antibody levels are associated with parasite burden. In contrast, symptomatic children carried fewer antibodies than asymptomatic children with infections detectable by microscopy, particularly in Pv and Pf/Pv groups, suggesting that antibody production may be impaired during symptomatic infections. We used machine-learning algorithms to investigate the relationship between antibody responses and symptoms, and we identified antibody responses to sets of Plasmodium proteins that could predict clinical status of the donors. Several of these antibody responses were identified by multiple comparisons, including those against members of the serine enriched repeat antigen family and merozoite protein 4. Interestingly, both P. falciparum serine enriched repeat antigen-5 and merozoite protein 4 have been previously investigated for use in vaccines. 
This machine learning approach, never previously applied to proteome arrays, can be used to generate a list of potential seroprotective and/or diagnostic antigen candidates that can be further evaluated in longitudinal studies.

Of the five species of malaria parasites that infect humans, Plasmodium falciparum (Pf) and P. vivax (Pv) are the most common. Interventions aimed at reducing transmission and improving diagnosis and treatment have led to a dramatic reduction in morbidity and mortality (1,2). For example, Pf fatalities have declined from an estimated one million to 655,000 annually (2). Although Pv is now recognized as the most widespread species worldwide and a significant cause of severe disease, this parasite, which can relapse months to years after the initial blood stage infection, is still largely ignored (3,4). Furthermore, mixed-species infections, most commonly of Pf and Pv, are more frequent than previously thought. Although blood smears suggest that <2% of cases are mixed-species infections, PCR-based diagnoses suggest that 55-65% of infections in Thailand, Papua New Guinea (PNG), and other countries in south-east Asia (5-7) are mixed-species infections.
Natural immunity can be subdivided into antidisease immunity and antiparasitic immunity. Antidisease immunity (defined as the absence of symptoms) develops quickly, sometimes requiring only one or two infections in high transmission areas (8-11). However, individuals living in high transmission areas develop non-sterile antiparasite immunity, resulting in low-level parasitemia and asymptomatic infections. This immunity is acquired much more slowly than antidisease immunity, may require repeated infections depending on the transmission rate, and is rarely sterilizing (12). Parasite densities in individuals that have acquired antiparasite immunity are on average 10^4- to 10^6-fold lower than those in non-immune individuals (13).
Blood stage parasites activate innate responses, which in turn lead to significant levels of humoral and cellular adaptive immunity (reviewed in (14)). Antiparasitic immunity appears to be mediated primarily by antibody responses against blood stage antigens (15,16). Specific antimalarial antibodies can block invasion of host erythrocytes in vitro by both Pf (17) and Pv merozoites (18-21). Additionally, certain antibody isotypes, in particular IgG3, can induce antibody-dependent cellular inhibition (ADCI) of parasite invasion and development in erythrocytes, which is strongly associated with protection against malaria parasites (13,22). Moreover, passive transfer of Pf antimalarial antibodies to infected patients can result in parasite clearance (15,16). Evidence from field studies suggests that the slow acquisition of antibodies to genetically variant circulating strains over several years is associated with antidisease immunity to Pf (23), but to a lesser extent to Pv (24). Cell-mediated immune responses also play a role in protection, particularly early in the immune response. A strong pro-inflammatory response mediated primarily by interferon-gamma (IFN-γ) and tumor necrosis factor-α (TNF-α) contributes to the initial killing and clearance of parasite-infected red blood cells (25).
Identifying antibody targets that are associated with infection, disease, or immunity will support the development of vaccines, diagnostics, and tools for sero-surveillance. By comparing the humoral response profiles of defined populations possessing varying degrees of antidisease and/or antiparasite immunity, it may be possible to identify combinations of responses that are associated with protection against clinical disease and/or parasitemia. These responses could guide selection of antigen(s) for blood stage vaccines. Here, we applied Plasmodium genome sequence, proteomics, bioinformatics, and proteome array fabrication technologies to construct a Pf/Pv blood stage proteome array. The Plasmodium genome encodes over 5000 proteins (5538 and 5435 in Pf and Pv, respectively (26)), nearly half of which have been identified via proteomics at the different stages of the malaria parasite life cycle (27,28). A total of 4441 recombinant proteins, representing 1922 Pf and 1936 Pv native proteins previously reported or predicted to be expressed by the blood stages of these parasites, were included on the Pf/Pv proteome arrays, which were then used to analyze antibody responses to both Plasmodium species in naturally exposed individuals with clinically characterized infections.
The resulting data were interrogated using machine-learning algorithms to identify antibody responses that were associated with disease status. We identified sets of antigen-specific antibody responses that can be used to distinguish between asymptomatic donors with parasitemias detectable by light microscopy or PCR and symptomatic donors, some of which were identified by multiple comparisons. This study is a proof-of-concept of the power of applying machine learning algorithms to biomarker discovery, and paves the way for future, more robust studies to identify novel malaria vaccine targets.

EXPERIMENTAL PROCEDURES
Serological Samples-Serum samples from children aged 6 months to 10 years in the Madang area on the north coast of Papua New Guinea (PNG) were used for this study. The study sites have a tropical, humid climate year-round, with a rainy season from Nov. to May; malaria is hyperendemic with limited seasonality (29). Both P. falciparum (Pf) and P. vivax (Pv) are common, and the disease burden is mainly in children under 10 years of age (29,30). To establish parasitemia, two thick and two thin smears per participant were evaluated under 100× oil immersion and parasitemia was calculated from the number of asexual parasites per 200 leukocytes, assuming a mean leukocyte count of 8000/μl. Two independent reads were performed, and a third if the first two reads disagreed. This was later confirmed by PCR: whole blood genomic DNA containing parasite DNA was prepared for species-specific PCR/LDR-based diagnosis of Plasmodium infection (31) to confirm parasitemia and species identification. Among samples that were positive by light microscopy, there were very few discrepancies between microscopy and PCR regarding the identified species; in these cases the species identification was changed to reflect the species-specific PCR result. Plasma was collected and stored for future studies.
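The conversion from a smear count to a parasite density follows directly from the stated assumption of 8000 leukocytes per μl. The short Python sketch below is illustrative only; the function name and the example count are hypothetical and are not taken from the study.

```python
def parasites_per_microliter(parasites_counted, leukocytes_counted=200,
                             leukocytes_per_ul=8000):
    """Convert a smear count to parasites per microliter of whole blood.

    Defaults follow the text: parasites are counted against 200 leukocytes
    and a mean leukocyte count of 8000/ul is assumed.
    """
    if leukocytes_counted <= 0:
        raise ValueError("leukocytes_counted must be positive")
    return parasites_counted * leukocytes_per_ul / leukocytes_counted


# Hypothetical example: 25 asexual parasites seen while counting 200 leukocytes.
print(parasites_per_microliter(25))  # 1000.0
```

With these defaults each parasite counted corresponds to 40 parasites per μl of blood.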
Samples from symptomatic (clinical) cases (n = 108) were collected as part of a morbidity surveillance study of children aged 0-5 at three health centers near Madang (Yagaum, Mugil, and Alexishafen) in 2005-2006 (32,33), and were defined as those individuals presenting at local health clinics with parasitemia over 1000 Pf asexual stage parasites per μl of whole blood, or 250 Pv asexual stage parasites per μl of whole blood, and fever within the last 24 h but without any severe malaria symptoms (32). Samples from asymptomatic children used in this study (n = 116) were collected as part of a larger (n = 1275) cross-sectional community survey in 2005 that included individuals of all ages (34,35) presenting with no fever or history of fever in the last 48 h, but with parasitemias detectable by light microscopy (LM) or PCR. For this study, we selected a subset of sera from children aged 0-10 with PCR-confirmed, asymptomatic infection that were either positive (LM, n = 55) or negative (PCR, n = 61) by LM. Because of the rapid acquisition of antidisease immunity, in particular to P. vivax (36), a full age-matching to the symptomatic cases was not possible in the study population. However, in order to minimize the difference in age between the groups, samples from the youngest children with asymptomatic infection were preferentially selected. These groups were further subdivided into nine final groups according to the parasite species detected (Pf, Pv, or Pf/Pv infections) (Table I). Eighteen US malaria-naïve control donors were used to determine the malaria-specificity of the responses.
Study Approval-All samples were collected in previous studies. Written informed consent for use of samples in future studies was obtained from a parent or guardian of all subjects. All studies were approved by the PNG Institute of Medical Research (IMR) Internal Review Board (IMR IRB 0901) and the PNG Medical Advisory Committee (MRAC 09.03).
Construction of Pf and Pv Blood Stage Proteome Array-We designed a proteome array containing 2208 Pf and 2233 Pv recombinant proteins, representing 1922 Pf and 1936 Pv native proteins, respectively. Blood stage proteins predicted to be secreted or presented on the surface of the parasite were selected. Pf genes had evidence for blood stage expression by microarray, proteomics, or expressed sequence tags (ESTs), or were predicted to be in the blood stage secretome (37,38). Pv genes had evidence for blood stage expression by microarray (39,40). These included both unique Pv proteins and Pv orthologs of Pf proteins. Proteins predicted to be on the surface of the parasite were defined as genes with signal peptides and/or transmembrane domain(s) according to PlasmoDB (26). Putative cytoplasmic proteins, lacking both signal peptides and transmembrane domains, were also selected (pI < 9.0 for Pf genes and pI < 7.6 for Pv genes, respectively). Genes encoding antigenically variant proteins, such as the Pf gene families PfEMP1s, rifins, stevors, surfins, the Pv vir genes, as well as pseudogenes were excluded. Both single and multi-exon genes were included, but exons of multi-exon genes were cloned into separate plasmids. Furthermore, large genes or exons were further divided into overlapping segments in order to limit amplicon length to between 300 and 3000 nt. For example, a 5000-nt exon would be amplified as two 3000-nt segments that overlapped by 1000 nt. Amplicons were labeled with the exon number and the total number of exons, such as "1o2" for exon 1 of a 2-exon gene. Genes that were further divided into segments were labeled s1, s2, etc.
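The segmentation of long exons can be illustrated with a small Python sketch. The exact tiling rule used for the array is not stated in the text, so the scheme below is an assumption: every segment has the maximum length, start positions are spread evenly, and a hypothetical `min_overlap` parameter sets the smallest overlap that is aimed for. The 300-nt lower bound on amplicon length is not handled. The sketch reproduces the 5000-nt example given above.

```python
import math


def tile_segments(length, max_len=3000, min_overlap=100):
    """Return half-open (start, end) intervals covering a sequence of `length` nt.

    Assumed rule (not from the source): use the fewest full-length segments
    that can cover the sequence while overlapping by about `min_overlap` nt
    or more, and spread their start positions evenly.
    """
    if length <= max_len:
        return [(0, length)]
    step = max_len - min_overlap
    n_segments = math.ceil((length - max_len) / step) + 1
    # Evenly spaced starts so that the last segment ends exactly at `length`.
    starts = [round(i * (length - max_len) / (n_segments - 1))
              for i in range(n_segments)]
    return [(start, start + max_len) for start in starts]


print(tile_segments(5000))  # [(0, 3000), (2000, 5000)]
```

For the 5000-nt case the two segments share positions 2000 to 3000, that is, a 1000-nt overlap.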
The array was fabricated as previously described (41). Briefly, coding sequences were PCR-amplified from Pv (Sal I strain [MRA-552, MR4]) or Pf (clone 3D7, [MRA-102, MR4]) genomic DNAs and cloned into the PXT7 plasmid with a T7 transcription terminator, and tagged with 5′ polyhistidine (HIS) and 3′ hemagglutinin (HA) epitopes. Recombinant proteins were expressed using E. coli cell-free in vitro transcription and translation reactions (RTS 100 HY kits from 5 PRIME, Gaithersburg, MD) according to the manufacturer's instructions. Proteome arrays were printed as previously described, with each recombinant protein spotted once in each array (41). Each array contained 256 negative control spots made with an in vitro transcription translation master mix without plasmid DNA. Once printed, recombinant protein expression was verified using antipolyhistidine (clone His-1, Sigma) and antihemagglutinin (clone 3F10, Roche) monoclonal antibodies, as previously described (41). All signal intensities were corrected for spot-specific background. A recombinant protein was deemed to be present on the slide if its mean fluorescence intensity exceeded the mean of the "no DNA" control spots plus two standard deviations.
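The presence criterion amounts to a simple cutoff on the tag signal. A minimal Python sketch is given below; the function names and the control intensities are hypothetical, and the use of the sample standard deviation (rather than the population value) is an assumption.

```python
from statistics import mean, stdev


def expression_cutoff(control_signals, n_sd=2.0):
    """Mean of the "no DNA" control spots plus n_sd standard deviations.

    Assumption: the sample standard deviation is used.
    """
    return mean(control_signals) + n_sd * stdev(control_signals)


def is_expressed(spot_signal, control_signals, n_sd=2.0):
    """True if a spot's tag signal exceeds the control-derived cutoff."""
    return spot_signal > expression_cutoff(control_signals, n_sd)


# Hypothetical background-corrected intensities for five control spots.
controls = [100, 120, 80, 110, 90]
print(is_expressed(140, controls), is_expressed(125, controls))  # True False
```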
Overall, 1767 Pf recombinant proteins (80.02%) and 1470 Pv recombinant proteins (65.83%) were expressed with both the His and HA tags, confirming that the majority of recombinant proteins were fully expressed. These represented a total of 1558 Pf native proteins and 1328 Pv native proteins. Furthermore, 2026 Pf recombinant proteins (91.76%) and 1949 Pv recombinant proteins (87.28%) were expressed with at least one tag (representing 1781 and 1725 native proteins respectively), suggesting that these recombinant proteins were at least partially expressed.
Probing the Proteome Array with Human Sera-Sera were from symptomatic and asymptomatic children diagnosed with Pf, Pv, or both (i.e. Pf/Pv infection) by LM or PCR, as previously described (41). Sera were diluted to 1/100 in 1x Blocking Buffer containing 1 mg/ml E. coli lysate and incubated at room temperature for 30 min with constant mixing to remove E. coli-specific antibodies. Anti-EBNA1 or anti-human IgG were used as controls for the primary antibodies, and serially diluted anti-human IgG was used as a control for the secondary antibody. After rehydration for 30 min, the arrays were probed in triplicate with the pre-treated sera overnight at 4°C with constant agitation, then washed and incubated with Cy3-labeled antihuman immunoglobulin. The slides were washed seven times, air-dried, and analyzed using a Perkin Elmer ScanArray Express HT microarray scanner. Intensities were quantified using QuantArray software and corrected for spot-specific background. Statistical analyses were performed as described previously (42). Briefly, the triplicate data points were averaged, and the data were calibrated and transformed using the variance stabilizing normalization (vsn) package (43) in the R statistical environment (www.r-project.org).
Computational Methods-To identify antibody responses that discriminated between the different groups, we applied a series of filters to the dataset. First, we removed antibody responses with signal intensities that were not significantly greater in malaria-exposed children than in the 256 empty wells. To do this, we fit the signals arising from the 256 empty wells to a normal distribution. Next, we assigned the response to each recombinant protein a p value by testing it against the right tail of this distribution. These p values were subjected to a standard FDR correction and those signal intensities with a resulting q-value > 0.01 (44) were set to zero. Any recombinant protein that did not elicit significant signal intensity by this analysis was removed. Next, we filtered out responses that were not significantly different in exposed children and the 18 malaria-naïve US donors. For each recombinant protein, we used Welch's two-tailed t test to test the hypothesis that the signal intensities in 224 infected children were significantly different from those of the 18 US controls. The resulting p values were converted to q-values using Benjamini and Hochberg false discovery rate (FDR) correction (n = 4441) (44). Those recombinant proteins with a q-value > 0.01 were removed. Next, we performed clustering analysis and replaced clusters of proteins that elicited similar responses in all exposed donors with a representative reactivity profile for each cluster. In brief, we clustered proteins that had positive signals in more than 75% of children into groups, and we replaced the reactivities to individual recombinant proteins with a representative signal obtained by logically "or"ing the reactivity ratios.
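The empty-well filter can be sketched in a few lines of Python using only the standard library. This is an illustration of the general procedure (normal fit to the control signals, right-tail p values, Benjamini and Hochberg correction), not the code used in the study; all function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def right_tail_p_values(signals, control_signals):
    """P(control >= signal) under a normal distribution fitted to the controls."""
    null = NormalDist(mean(control_signals), stdev(control_signals))
    return [1.0 - null.cdf(x) for x in signals]


def benjamini_hochberg(p_values):
    """Convert p values to q values with the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, keeping q values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def zero_nonsignificant(signals, q_values, alpha=0.01):
    """Set signals whose q value exceeds alpha to zero, as described in the text."""
    return [s if q <= alpha else 0.0 for s, q in zip(signals, q_values)]
```

In use, the p values for all responses would be computed against the 256 empty-well signals, converted to q values, and passed to `zero_nonsignificant` with the 0.01 cutoff.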
We further reduced the remaining antibody responses using an age filter to select for antibody responses that were associated with asymptomatic infections regardless of age group. Briefly, as asymptomatic children tended to be older than symptomatic children, and as the breadth and intensity of reactivities are associated with increasing pathogen exposure, age was significantly correlated with mean reactivity before filtering and clustering (for the entire data set, Pearson's rho = 0.15, p value = 0.026, n = 224) and this was exacerbated by the filtering (rho = 0.18, p value = 0.006, n = 224). We calculated the mutual information (45) between the reactivity to each recombinant protein and the individual's age and the disease status of the individual. Those recombinant proteins that shared more information with the disease status than with age (i.e. those that showed a stronger association with the individual's phenotype than with the individual's age; resulting rho = 0.10, p value = 0.120, n = 224) were preserved, whereas the others were filtered out. In other words, if a recombinant protein was more predictive of age than of disease status, it was removed. Because this filtering required the algorithm to use the individual disease status, it was performed inside the cross-validation loop as necessary. Next, we used machine-learning approaches to identify biomarkers that were associated with disease status by comparing each of the six different asymptomatic groups separately to the corresponding symptomatic control. We first used cross-validation to estimate whether a classifier was effective at predicting a given clinical category. We constructed decision trees using the evolutionary evTree algorithm (46), and performed eightfold cross-validation to measure the agreement and the Pearson's correlation between the predicted and actual categories. These metrics were reported alongside the final decision tree as a general measure of classifier accuracy.
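The logic of the age filter can be illustrated with a small Python sketch. The text does not specify how mutual information was estimated, so the version below assumes that reactivity, age, and disease status have already been discretized (for example, reactive or not, and age bins); the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in count_xy.items())


def keep_feature(reactive, status, age_bin):
    """Keep a response only if it shares more information with status than with age."""
    return mutual_information(reactive, status) > mutual_information(reactive, age_bin)


# Toy example: reactivity tracks clinical status perfectly and is unrelated to age.
reactive = [1, 1, 0, 0]
status = ["A", "A", "S", "S"]
age_bin = ["old", "young", "old", "young"]
print(keep_feature(reactive, status, age_bin))  # True
```

Because the comparison uses the disease labels, a filter of this kind would have to be recomputed on the training portion of each cross-validation split, as noted above.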
Finally, to select which recombinant proteins would be most informative as biomarkers associated with disease status when working in concert with other biomarkers, we used the mProbes (47) feature ranking algorithm with Random Forest (48) feature selection. mProbes estimates a false discovery rate by repeatedly running feature selection after adding noise features, generated by assigning the signal intensities for a real protein to random patients, to the data set. The reported false discovery rate is the fraction of times that a noise feature was ranked as being more informative than the signal intensity to the recombinant protein in question. It should be noted that this latter analysis using the mProbes algorithm was not performed for the Pf/Pv.LM versus Pf/Pv.S comparison because the cross-validation analysis revealed it did not possess sufficient information to predict donor status.
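The noise-feature idea can be sketched in Python as follows. This is not the mProbes implementation and it does not use Random Forest: to stay self-contained it substitutes a simple univariate score (the absolute difference between class means) for the forest-based importance, so the real score does not change between runs. All names and the toy data are hypothetical.

```python
import random
from statistics import mean


def mean_gap(values, labels):
    """Stand-in importance score: absolute difference between the class means."""
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    return abs(mean(pos) - mean(neg))


def probe_fdr(features, labels, n_runs=200, seed=0):
    """Fraction of runs in which some noise feature scores at least as high.

    features: dict mapping a feature name to a list of per-donor signals.
    labels:   list of booleans (True for one clinical group).
    """
    rng = random.Random(seed)
    real = {name: mean_gap(vals, labels) for name, vals in features.items()}
    beaten = dict.fromkeys(features, 0)
    for _ in range(n_runs):
        best_probe = 0.0
        for vals in features.values():
            shuffled = vals[:]
            rng.shuffle(shuffled)  # assign a real protein's signals to random donors
            best_probe = max(best_probe, mean_gap(shuffled, labels))
        for name in features:
            if best_probe >= real[name]:
                beaten[name] += 1
    return {name: beaten[name] / n_runs for name in features}


labels = [True] * 6 + [False] * 6
features = {
    "informative": [10, 11, 12, 10, 11, 12, 1, 2, 3, 1, 2, 3],
    "noise": [5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6],
}
print(probe_fdr(features, labels))
```

In this toy data set the informative feature is rarely outscored by a shuffled probe, whereas the uninformative one is outscored in every run.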

RESULTS
Children Rapidly Acquire Immunity Against Development of Fever-In order to define the serological responses to P. falciparum (Pf) and P. vivax (Pv) infection in relation to host immunity, we obtained samples from 224 children in PNG, where both parasite species are endemic (29,30). Symptomatic cases (n = 108) correspond to donors <5 years of age presenting at local health clinics with positive parasitemia, defined as having over 1000 Pf and/or 250 Pv asexual blood stage parasites per μl of whole blood, and fever within the last 24 h, but without any severe malaria symptoms (32). Asymptomatic cases (n = 116) were selected from a wider cross-sectional field study that included all age groups, and were defined as children <10 years of age with no fever or history of fever in the last 48 h, but with Pf and/or Pv parasitemia detectable by PCR or light microscopy (LM) (33). Although symptomatic and asymptomatic samples were collected in different studies, and were not matched for age or parasitemia, this data set allowed us to query the natural immune response to both Pf and Pv simultaneously.
Nine groups were defined based on the combination of clinical status (i.e. symptomatic (S), asymptomatic but with parasitemia detectable by light microscopy (LM), or asymptomatic but with parasitemia detectable only by PCR (PCR)), and the parasite species detected (i.e. Pf, Pv, or both, indicated as Pf/Pv) (Table I). The parasitemias of all samples from LM and S children are shown in Fig. 1A; parasitemia levels were similar in all LM asymptomatic children, regardless of the parasite species detected (Kruskal-Wallis, p value = 0.24), and were lower than in symptomatic cases, although this difference was less pronounced in Pv infections. Mean parasitemias in symptomatic Pf infections were sevenfold higher than in symptomatic Pv infections (Kruskal-Wallis, p value <0.0001). These high Pf parasitemia levels are due in part to Pf's ability to invade both mature erythrocytes and reticulocytes, whereas Pv preferentially invades reticulocytes (49) and generally produces lower parasitemias than Pf. Moreover, symptomatic children who were infected only with Pv carried higher Pv parasite loads than symptomatic children with Pf/Pv infections (Fig. 1A).
Among the symptomatic children (all under the age of 5 years as per the original study design), those infected only with Pv were significantly younger than all other infected children (one-way ANOVA, p value <0.0001), with an average [...] (Table I). This suggests that children acquire anti-Pv disease immunity (i.e. immunity against fever) more rapidly than anti-Pf disease immunity, in agreement with previous findings in PNG (36). However, it should be noted that the age distribution of the asymptomatic samples does not correspond to a random sample of asymptomatic infections in the general population because we selected samples from the youngest children in this group for our study. Despite this, the mean age of the children in the asymptomatic groups was higher than that of children in the symptomatic groups (5.6 and 5.9 years for the LM and the PCR children, respectively, one-way ANOVA, p value <0.0001). Among the asymptomatic children, there was no significant difference in the age range of those infected with different species of Plasmodium (one-way ANOVA, p value = 0.15).

falciparum and P. vivax infections
The characteristics of the 9 clinical groups and the non-exposed donors are indicated, including the Plasmodium species detected, the abbreviation used to designate the group throughout the manuscript and a short description of the clinical group (PCR: asymptomatic donors with parasites detected by PCR; LM: asymptomatic donors with parasites detected by light microscopy; S: symptomatic donors; Pf: donors with detectable P. falciparum infection; Pv: donors with detectable P. vivax infection; Pf/Pv: donors with detectable P. falciparum and P. vivax infection); as well as the mean and range of parasitemia values observed by LM, the result of the PCR test for parasitemia, the presence or absence of fever, the mean and range of the age of donors, the percentage of donors older than 5 years, and the number of donors in each group (N). *, all symptomatic samples are from a study that includes only children younger than 5 years; ^, all malaria-naïve controls are older than 18 years.
In general, younger children were more likely to carry high parasitemias, whereas older children had lower parasitemias (Fig. 1B). Although we selected the youngest children from the asymptomatic study, between 65 and 83.3% of the asymptomatic LM cases, depending on the infecting Plasmodium species, occurred in children that were between 5 and 10 years of age (Table I and Fig. 1B), suggesting that children in this endemic site rapidly acquired antidisease immunity to the development of fever. Finally, we compared the cumulative incidence of symptomatic Pv, Pf, and Pf/Pv cases over time (Fig. 1C). We saw that in this cohort, the majority of symptomatic Pv infections were detected earlier in life than symptomatic Pf or Pf/Pv infections. For example, 85% of the symptomatic Pv cases were detected by age 2, whereas only 40% of the symptomatic Pf and Pf/Pv cases had occurred by that age. Although our dataset does not allow us to evaluate the effect of previous exposure to malaria parasites on the development of immunity, this result suggests that children acquire antidisease immunity to Pv more rapidly than they do to Pf (30,36,50).
Antibody Responses Against Parasite Proteins Depend on Symptomatic Status and the Infecting Parasite Species-Next, we assessed the reactivity of the children's sera to Pf and Pv proteins using a proteome array that contains 4441 recombinant polypeptides (2208 from Pf and 2233 from Pv) representing 1922 Pf and 1936 Pv native proteins, respectively. Proteins with evidence or prediction of expression during blood stages and secretion or surface localization were selected for inclusion in the array. We probed the arrays with sera from the 224 PNG children, as well as sera from 18 US malaria-naïve controls (Table I), and quantified the resulting signals as previously described (41). The strategy used to analyze antibody responses to the Plasmodium recombinant proteins contained in the array is illustrated in Fig. 2. We first selected antibody responses against arrayed proteins (or indicators) that were significantly higher in sera from malaria-exposed children than in US adult controls or in empty wells, using both z and t-tests, with a cut-off of q = 0.01 (Fig. 2A). This step excluded 635 (14.3%) antibody responses that did not differ between exposed children and empty wells, as well as 1586 (35.7%) antibody responses that were similar in malaria-exposed children and malaria-naïve US controls. The 2220 responses that remained were directed against 1026 Pf recombinant proteins (out of 2208 Pf proteins in the array, or 46.5%) and 1194 Pv recombinant proteins (out of 2233 in the array, or 53.5%) derived from 2021 native proteins (928 Pf and 1093 Pv, respectively). Next, we performed clustering analysis, and identified sets of antibody responses that were highly similar among all exposed donors. These 373 antibody responses (8.4%) grouped into 16 clusters, each of which was replaced in the final dataset with a signature profile, resulting in the net removal of 357 antibody responses.
The remaining 1863 antibody responses were then passed through an age filter that removed those that associated more strongly with the age of the donor than with their symptomatic status (n = 1206, Fig. 2A). After applying these filters, 657 antibody responses (directed against 344 Pf and 313 Pv recombinant proteins, respectively, Fig. 2B) were retained for further analysis.
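The counts reported for the successive filters are mutually consistent, which can be checked with a few lines of Python arithmetic. The variable names are illustrative only.

```python
total = 4441                                  # arrayed recombinant polypeptides
after_controls = total - 635 - 1586           # empty-well and US-control filters
after_clustering = after_controls - (373 - 16)  # 373 responses replaced by 16 cluster profiles
after_age = after_clustering - 1206           # age filter

print(after_controls, after_clustering, after_age)  # 2220 1863 657
```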
As shown in Fig. 2B and 2C, the nine groups defined on the basis of the infecting parasite species and symptomatic status exhibited distinct serological profiles. For example, among symptomatic children, those infected only with Pf (Pf.S) recognized significantly more of the arrayed Pf and Pv polypeptides than Pv.S or Pf/Pv.S children (Fig. 2B and 2C and supplemental Table S1). On the other hand, there was no significant difference in the number of polypeptides recognized by sera from Pv.S or Pf/Pv.S children. Moreover, the mean numbers of polypeptides recognized by Pv.S or Pf/Pv.S children were not significantly different from those recognized by Pv.PCR or Pf/Pv.PCR children, but were significantly lower than those recognized by Pv.LM or Pf/Pv.LM children. However, this was not the case for children infected only with Pf, because Pf.S children recognized significantly more polypeptides than Pf.PCR children, but not Pf.LM children.
The number of polypeptides recognized by Pf.PCR, Pv.PCR, and Pf/Pv.PCR children did not differ significantly. Among LM children, those with Pf/Pv recognized significantly more Pf and Pv proteins than those infected only with Pv, but curiously, those infected with Pf recognized significantly more Pv proteins than those infected with Pv. The mean fraction of Pf polypeptides recognized was 13.40% for Pf.LM, 7.28% for Pv.LM, and 24.06% for Pf/Pv.LM children (Fig. 2C). Overall, fewer Pv polypeptides were recognized by sera from LM children (9.22% for Pf.LM, 4.59% for Pv.LM, and 18.17% for Pf/Pv.LM children; Fig. 2C and supplemental Table S1; two-tailed Wilcoxon p value = 0.0313, t test p value = 0.0017, n = 6). Finally, for both Pf and Pf/Pv asymptomatic infections, the mean fraction of positive responses was significantly higher in LM cases than in PCR cases.
Our findings suggest that older, asymptomatic children with infections detectable by LM have broader serological responses than younger, symptomatic children, and that the breadth of these responses may be linked to protection against fever. Furthermore, although the interpretation of these data is limited by the fact that we cannot distinguish between infections detectable only by PCR that may reflect pre-existing antiparasite immunity resulting from previous episodes of parasitemia or resolving infections in which antibodies have waned, our results suggest that asymptomatic infections detectable by LM may associate with an active and effective immunological response against fever.

A Machine-learning Approach Can Classify Children into Different Disease Groups Based on Antibody Responses Determined via Proteome Arrays-Having used the proteome arrays to profile antibody responses against Pf and Pv proteins in PNG children infected with Pf, Pv, or both species, we next asked whether these reactivity profiles could be used to accurately classify children into the different disease groups described in Table I. First, we used the evTree decision tree construction algorithm (46) to assess the dataset and determine whether the observed antibody responses to the arrayed proteins could be used to accurately predict whether any individual donor belonged to a symptomatic or asymptomatic group. To do this, we compared the antibody responses of donors in each of the six asymptomatic groups to those in the corresponding symptomatic groups. We applied the filters described previously to eliminate responses that were similar to those observed in empty wells, as well as those that were not significantly different between PNG children and naïve U.S. donors or were nearly identical for all PNG donors. Then, to estimate whether or not the donors in each dataset could be accurately assigned to the symptomatic and asymptomatic groups, we used the evTree machine-learning algorithm with 8-fold cross-validation (46), as illustrated in Fig. 3A. In brief, we divided the dataset into an 88% training set and a 12% validation set, and applied the age filter described above to remove responses that associated more strongly with the age of the donor than with their clinical status. Next, a decision tree was built from the training dataset and used to classify the validation dataset into the symptomatic and asymptomatic categories. This process was repeated eight times until the status of each donor was predicted exactly once in the validation dataset. Each cycle of cross-validation produced a different tree that [...] cross-validated accuracy (XV Acc.) greater than 65% and significant p values (XV Acc. p). Finally, each undivided dataset was passed through the age filter to determine the final number of significant features and through the evTree algorithm to generate a final decision tree.

Fig. 3. Machine-learning approach for biomarker selection. A, evTree cross-validation strategy used to compare antibody responses in each of the six asymptomatic groups to those in the corresponding symptomatic groups. After dividing each dataset into an 88% training set and a 12% validation set and applying the age filter, the evTree algorithm was trained with 8-fold cross-validation to generate decision trees that predicted whether donors were symptomatic or asymptomatic based on their antibody responses to subsets of arrayed proteins. B, Results of filtering and evTree cross-validation parameters for each of the six pairwise comparisons. The table shows the number of donors in each comparison (#D); the number of responses removed by the empty well (#EW), U.S. naïve donor (#US), and clustering (#CL) filters; the number of responses that remained after these filters were applied (I1); the number of responses removed by the age filter (#Age) and the number of responses that remained after it was applied (I2); the cross-validated accuracy (XV Acc.) and p value (XV Acc. p) of the resulting classifier; as well as the corresponding Matthews correlation coefficient (XV Corr.) and p value (XV Corr. p) for the classifier. C, Overview of the mProbes with Random Forest feature selection strategy used to identify antibody responses that discriminate between symptomatic and asymptomatic donors. After adding noise by randomizing labels for indicators that remain after the age filter was applied (I2), the algorithm identifies features that distinguish between symptomatic and asymptomatic donors. Shown is a representative subset of features selected from the Pf.LM versus Pf.S pairwise comparison.
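The hold-out scheme described in this section, in which every donor is predicted exactly once, can be sketched in Python as follows. The classifier here is a one-feature threshold used purely as a stand-in for an evTree decision tree, the fold assignment is an arbitrary choice, and all names and data are hypothetical. In the study, the age filter would also be recomputed on each training split; that step is omitted here.

```python
import math


def k_fold_indices(n, k=8):
    """Split range(n) into k folds so that every index is held out exactly once."""
    return [list(range(i, n, k)) for i in range(k)]


def matthews_corrcoef(actual, predicted):
    """Matthews correlation coefficient for two lists of booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cross_validate(signals, labels, k=8):
    """Return (accuracy, MCC) from k-fold cross-validation of a threshold rule."""
    predicted = [None] * len(labels)
    for fold in k_fold_indices(len(labels), k):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        pos = [signals[i] for i in train if labels[i]]
        neg = [signals[i] for i in train if not labels[i]]
        pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
        cut = (pos_mean + neg_mean) / 2      # midpoint between training class means
        for i in fold:
            predicted[i] = (signals[i] > cut) == (pos_mean > neg_mean)
    accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
    return accuracy, matthews_corrcoef(labels, predicted)


# Toy data: 16 donors whose signal separates the two groups perfectly.
print(cross_validate([5.0, 1.0] * 8, [True, False] * 8))  # (1.0, 1.0)
```

With eight folds, each training split contains seven eighths (87.5%) of the donors, in line with the approximate 88%/12% split quoted above.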
For all datasets, the evTree algorithm was able to identify antibody responses to 1 to 3 polypeptides that were sufficient to classify donors into the S and A groups (supplemental Table S2). For example, for the Pf.LM versus Pf.S comparison, positive antibody responses against PVX_115450_1o2 and Pf11_0292_2o3 classified donors into the Pf.LM group with an accuracy of 96.6%. However, because there were many more polypeptide responses than donors, the system remained underdetermined, such that many other combinations of antibody responses could be used to predict whether a child belonged to a symptomatic or asymptomatic group with similar accuracy. Therefore, we next used the mProbes algorithm with Random Forest feature selection (47,48) to identify antibody responses against arrayed proteins that distinguished between symptomatic and asymptomatic donors for the five pairwise comparisons that were determined by the cross-validation analysis to contain sufficient information to predict donor status (i.e. all except the Pf/Pv.LM versus Pf/Pv.S comparison, which did not meet this criterion in the evTree analysis because the accuracy and correlation p values were > 0.05). mProbes with Random Forest built thousands of decision trees by repeatedly running feature selection after adding noise features generated by shuffling the labels within the dataset, and reported a false discovery rate that corresponds to the fraction of times that a noise feature is ranked as being more informative than the actual data. This process selected the features (i.e. antibody responses against arrayed polypeptides) that were most informative in distinguishing between symptomatic and asymptomatic donors within each dataset (Fig. 3C). The complete list of the most informative responses selected by the mProbes algorithm for each of the five comparisons is included in supplemental Table S3.
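The noise-probe idea behind this feature selection step can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the mProbes implementation: the ranking score is a hypothetical absolute difference in group means rather than a Random Forest importance, noise probes are made by shuffling each response column across donors (which breaks its link to clinical status), and the toy data and function names are invented. For each real response, the sketch reports the fraction of runs in which the best noise probe scored at least as well.

```python
import random


def mean(values):
    return sum(values) / len(values)


def score(column, labels):
    """Hypothetical importance: absolute difference in mean response between groups."""
    asymptomatic = [v for v, y in zip(column, labels) if y == 1]
    symptomatic = [v for v, y in zip(column, labels) if y == 0]
    return abs(mean(asymptomatic) - mean(symptomatic))


def probe_fdr(columns, labels, n_runs=200, seed=0):
    """Fraction of runs in which the best shuffled probe matches or beats each real feature."""
    rng = random.Random(seed)
    real = {name: score(col, labels) for name, col in columns.items()}
    beaten = {name: 0 for name in columns}
    for _ in range(n_runs):
        best_noise = 0.0
        for col in columns.values():
            probe = col[:]      # copy so the real data are untouched
            rng.shuffle(probe)  # noise probe: same values, random donor assignment
            best_noise = max(best_noise, score(probe, labels))
        for name in columns:
            if best_noise >= real[name]:
                beaten[name] += 1
    return {name: beaten[name] / n_runs for name in columns}


# Toy data (invented): 10 asymptomatic (1) and 10 symptomatic (0) donors.
labels = [1] * 10 + [0] * 10
columns = {
    # Response present in every asymptomatic donor and absent in every symptomatic donor.
    "informative": [2.0] * 10 + [0.0] * 10,
    # Response with identical values in both groups, so it carries no information.
    "uninformative": [0.1 * (i % 10) for i in range(20)],
}
fdr = probe_fdr(columns, labels)
```

In this toy setting the informative response is almost never matched by a shuffled probe, whereas the uninformative response is matched in every run, so a threshold on the reported fraction separates the two.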
We noticed that several polypeptides were selected in more than one of the pairwise comparisons, including several instances in which sera from Pv donors recognized Pf proteins, and vice versa (supplemental Table S4). To further investigate this observation, we grouped the polypeptides targeted by the antibody responses based on orthology between Pf and Pv proteins, as defined by Ortho_MCL (26). These data are included in supplemental Table S5. Ortholog groups that were selected at least three times by different pairwise comparisons are shown in Table II, ranked by the number of times they were identified across multiple comparisons. The top two candidates correspond to the papain family of SERA proteins, and to MSP4. All other proteins in the table are annotated as hypothetical with the exception of RAD23.
DISCUSSION
We developed a P. falciparum (Pf)-P. vivax (Pv) proteome array containing 4441 proteins to characterize antibody responses to both parasite species in children under 10 years old living in endemic areas in PNG. Sera were obtained from symptomatic children who had attended local health clinics, and from asymptomatic but infected children identified in cross-sectional surveys. We found that asymptomatic children with high parasite loads detectable by light microscopy had antibody responses to more antigens than asymptomatic children with very low parasite loads detectable only by PCR, suggesting that antibody responses in the asymptomatic children were associated with parasite load. Conversely, children with symptoms exhibited very few humoral responses, suggesting a possible dysregulation of the antibody response during acute infection. Using a machine-learning approach, we identified humoral responses against subsets of parasite proteins that can be used to predict whether a child is symptomatic or asymptomatic.
The approaches we describe may be applied to a well-characterized study population to identify potential antigens for vaccine research or serological surveillance.
The proteome array technology has been used to identify novel antigens in several infectious agents, including Francisella tularensis (51), Burkholderia pseudomallei (52), Pf (53)(54)(55)(56), and Pv (41). The array described here detects antibody responses to both Pf and Pv proteins simultaneously, enabling the analysis of sera from malaria endemic areas where both parasites are transmitted. Combining whole proteome arrays with modern analytical tools is a very effective strategy to discover novel antigens for vaccines or diagnostics. Despite the advantages offered by proteome array technology, it is important to note that folding, multimerization, and post-translational modifications such as phosphorylation or glycosylation of arrayed proteins synthesized via in vitro transcription-translation will differ from the native proteins. However, the ability to screen thousands of potential antigens simultaneously is a substantial advantage over conventional approaches.
The samples used in this work were collected in two different studies: one hospital-based and the other field-based. As a result, all children in the symptomatic group were younger than 5 years of age, whereas those in the asymptomatic group were older. In fact, although we selected the youngest children among donors in the asymptomatic group, we were unable to age-match the symptomatic and asymptomatic cases because there were few children younger than 6 years of age in the asymptomatic cohort. Unsurprisingly, symptomatic cases also carried higher parasitemia than asymptomatic cases. Furthermore, although we utilized a powerful algorithm to analyze antibody responses in the sera from 224 children to >4000 parasite proteins, this study relied on a limited number of banked samples collected in previous studies, and we did not know the donors' sex, malaria history, or other factors that impact infection. Future studies with larger sample sizes will be required to increase statistical power and confirm that the responses we identified here with banked serum samples are associated with asymptomatic infections. These caveats, along with the cross-sectional nature of this study, do not allow for the detection of immune correlates of protection. However, this dataset allowed us to describe humoral responses to both Pf and Pv simultaneously, and to develop an analysis pipeline for use in future studies.
It has been previously shown in PNG (57)(58)(59) and other locations (23,60) that antibody titers to Pf proteins are higher in children with LM-detectable parasitemia than in children with LM-undetectable or no parasitemia. We confirmed this finding in Pf and Pf/Pv infections, for which the percentage of positive responses was significantly higher in the LM children than in the PCR children. The high antibody titers in asymptomatic children with high parasitemia (LM children) compared with asymptomatic children with very low parasitemia (PCR children) suggest that in asymptomatic donors, the breadth of the antibody response correlates with antigen load, potentially because of the critical mass of antigen required to elicit a response. This may also indicate poor boosting of the memory response, as the very few parasites detectable by PCR in these children may not have been sufficient to boost antibody responses. Alternatively, these children may have developed cellular immunity through past infections that is able to control parasitemia independently of antibodies (61).
Long-term IgG production is maintained by short-lived plasma cells derived from a memory B cell population, or by long-lived plasma cells, which secrete antibodies for as long as several months after immunization (62,63). Studies in mice have shown that younger individuals develop fewer and shorter-lived plasma cells than adults (64,65). If this result is applicable to humans, it is possible that young children with malaria may require constant antigen stimulation in order to produce antibodies until they are able to develop long-lived plasma cells. This view is supported by several studies showing that antibody responses against Pf antigens are more persistent in adults than in children (66-68). Furthermore, a more recent study showed that antibody titers declined more slowly with time in both older children and asymptomatic children than they did in younger children, suggesting roles for both antigen persistence and immunological memory in determining the longevity of the antibody response (69). Young symptomatic children may therefore have low antibody levels because they have not been sufficiently exposed to the parasite and cannot sustain an antibody response. Alternatively, acute malaria has also been associated with a decreased response to tetanus toxoids, meningococcal polysaccharide, Hib conjugate, and whole cell vaccines for typhoid fever (70-72). Acute malaria itself may therefore dysregulate the immune response and prevent generation of antibody responses of sufficient magnitude for protection. However, it may also be the case that younger children rapidly acquire antidisease immunity independently of antibodies, for example through a strong cellular response (61), and then slowly acquire antiparasite immunity as they grow older.
In this study we attempted to identify antibody responses that could be used to predict whether a child was asymptomatic, as these antibody responses could protect against development of clinical disease. For this, we turned to machine-learning algorithms to identify sets of antibody responses that could discriminate between symptomatic (febrile) and asymptomatic (afebrile) cases detectable by LM or PCR for subsets of children infected with P. falciparum, P. vivax, or both. The machine-learning community has long grappled with how best to determine which features, in this case antibody responses, are most informative. Typically these methods focus on selecting a subset of features (e.g. biomarkers) that maximizes predictive accuracy, such as support vector machine (SVM) with backward feature selection (73) or partial area under the curve (AUC) (74) algorithms. However, recent research using simulated data implies that for biomarker discovery the best feature-ranking methods also estimate false discovery rates, which are not calculated by SVM and AUC algorithms. For example, the mProbes algorithm with Random Forest (47,48) seemed especially well suited to problems such as the one discussed here, where the accumulated evidence suggests that immunity to malaria only develops after exposure to numerous different antigens, as mProbes with Random Forest can detect groups of antibody responses that work well in concert. To our knowledge, this is the first time that mProbes has been used in the context of biomarker discovery.
Using this approach, we compared each asymptomatic group to the corresponding symptomatic group, and identified responses to subsets of Pf and Pv proteins that distinguished between asymptomatic and symptomatic donors within each pairwise comparison. In general, the absence of a response was indicative of a symptomatic infection, as symptomatic donors had on average very few responses. Interestingly, several proteins were identified by multiple pairwise comparisons. Moreover, when we classified these shared responses based on Pf and Pv protein orthology, we observed overlapping responses, for example, instances where sera from P. falciparum-exposed donors recognized P. vivax proteins and vice versa. This suggests that some Pf and Pv orthologs may share linear or conformational epitopes that elicit cross-reactive antibodies able to protect against symptomatic Pf, Pv, or Pf/Pv infections. Naturally acquired cross-reactive antibody responses to MSP5 (75) and CLAG9 (76) have been observed in other endemic settings. Alternatively, it is possible that these apparent cross-reactive responses do not stem from shared epitopes between P. vivax and P. falciparum orthologous proteins, but instead result from previous infections with the other Plasmodium species. Importantly, because this study did not track malaria infection history, our data cannot distinguish between these two possibilities.
Two of the orthology groups identified by multiple analyses contain proteins that have been previously studied as vaccine candidates. MSP4 is a 40 kDa GPI-anchored membrane protein expressed on the merozoite surface that appears to be essential because it is refractory to genetic deletion (77,78). Unlike most other known merozoite proteins, MSP4 is taken into the invaded erythrocyte without proteolytic processing and is detectable for several hours postinvasion (79). In a recent cross-sectional study in malaria-exposed individuals in the Brazilian Amazon (80), plasma from asymptomatic individuals reacted more strongly to recombinant MSP4 protein than plasma from symptomatic cases, but anti-MSP4 antibodies could not be independently associated with asymptomatic status. However, polyclonal antisera raised in rabbits inhibited the growth of asexual P. falciparum parasites in vitro, and in rodent models, MSP4 recombinant protein plus adjuvant (81)(82)(83) and MSP4 DNA vaccines (84,85) provided partial protection against blood stage challenge.
The serine-repeat antigens (SERAs) (86) are soluble parasitophorous vacuolar proteins that are co-expressed in late trophozoite and schizont stages, released upon schizont rupture, and appear to facilitate merozoite egress from rupturing schizonts (87). P. falciparum possesses nine SERA proteins, all but one of which are encoded in an 8-gene cluster on chromosome 2; the ninth gene, encoding SERA9, is on chromosome 9. All SERAs have been classified as cysteine-like proteases because in several paralogs the catalytic active site Cys residue has been substituted by Ser. Six SERA proteins, three each from P. falciparum and P. vivax, were preferentially recognized by the PNG sera (Table II). PfSERA5, the best-characterized SERA, is abundantly expressed and is refractory to deletion in asexual stages (86). Because antibodies to SERA5 have growth-inhibitory activity in vitro, including those isolated from sera of individuals naturally exposed to P. falciparum (88), SERA5 has been studied extensively as a potential antigen for blood stage vaccines. Although this study did not find that asymptomatic status was associated with antibody responses against PfSERA5, it did identify two putative P. vivax orthologs, PVX_003830 and PVX_003800, to which antibody responses discriminated between the Pf/Pv.PCR and Pf/Pv.S groups, and the Pv.LM and Pv.S groups, respectively (Table II). Responses to other Pf or Pv SERAs were also associated with asymptomatic status, suggesting that antibodies to multiple SERA family proteins may provide protection against symptomatic malaria, perhaps by interfering with release of infectious merozoites from mature schizonts.
The other proteins that were identified by more than three analyses are all annotated as "hypothetical proteins." Our data showing that antibody responses against them associate with asymptomatic infections in naturally exposed donors suggest that they should be studied in greater detail.
Longitudinal studies in endemic areas are required to validate our findings from this cross-sectional study and to better assess the role of these antibody responses in immunity to malaria. Extending this study to distinct geographical locations might also allow protective antibody responses that are observed in multiple endemic sites to be identified. Furthermore, the use of this technology on longitudinal samples from areas of declining malaria incidence could provide a unique opportunity to identify markers of recent exposure that could be used as surveillance markers to support malaria elimination programs.
In conclusion, we have combined high-throughput laboratory methods with machine-learning analytical tools, providing a proof-of-concept for a new approach to identify globally relevant novel antigens for vaccine research. This approach could be extended to screen the entire Pf and Pv proteomes, and could accelerate the search for new vaccine candidate antigens, diagnostic antigens, and serological surveillance markers for malaria.
Acknowledgments-We thank the volunteers and PNG IMR field teams that collected the samples, without whom this study would not have been possible. Pascal Michon and Harin Karunajeewa assisted with the collection of samples from clinical cases, and Nicolas Senn coordinated the collection of the asymptomatic samples. We also thank Phil Felgner for developing the prototype of the proteome array that made this study possible, and Gowthaman Ramasamy for bioinformatics support in selecting the proteins used on the array.
* This work was supported by NIH/NIAID SBIR award 5R43AI75692.
ʈʈ These authors contributed equally to this paper. a Co-senior authors. Conflict of interest: The authors have declared that no conflict of interest exists.