Proteome-wide antigenic profiling in Ugandan cohorts identifies associations between age, exposure intensity, and responses to repeat-containing antigens in Plasmodium falciparum

Protection against Plasmodium falciparum, which is primarily antibody-mediated, requires recurrent exposure to develop. The study of both naturally acquired limited immunity and vaccine-induced protection against malaria remains critical for ongoing eradication efforts. Towards this goal, we deployed a customized P. falciparum PhIP-seq T7 phage display library containing 238,068 tiled 62-amino acid peptides, covering all known coding regions, including antigenic variants, to systematically profile antibody targets in 198 Ugandan children and adults from high and moderate transmission settings. Repeat elements (short amino acid sequences repeated within a protein) were significantly enriched in antibody targets. While breadth of responses to repeat-containing peptides was twofold higher in children living in the high versus moderate exposure setting, no such differences were observed for peptides without repeats, suggesting that antibody responses to repeat-containing regions may be more exposure dependent and/or less durable in children than responses to regions without repeats. Additionally, short motifs associated with seroreactivity were extensively shared among hundreds of antigens, potentially representing cross-reactive epitopes. PfEMP1 shared motifs with the greatest number of other antigens, partly driven by the diversity of PfEMP1 sequences. These data suggest that the large number of repeat elements and potential cross-reactive epitopes found within antigenic regions of P. falciparum could contribute to the inefficient nature of malaria immunity.


The presence of biochemically similar epitopes can lead to cross-reactivity with antibodies and B-cell receptors (BCR). While non-identical repeat elements may represent such potential cross-reactive epitopes within a protein, similar epitopes may also be present across different proteins. How the quality of the humoral response may be impacted by the presence of cross-reactive epitopes remains largely unexplored, although a study with viral variant antigens points to a frustrated affinity maturation process due to conflicting selection forces from variant epitopes (Wang et al., 2015). A handful of cross-reactive epitopes have been reported in P. falciparum (Wåhlin et al., 1992) and have been proposed to negatively impact the affinity

A prominent feature that stood out following high-resolution characterization of seroreactive regions was the presence of repeat elements, where identical or similar motifs were repeated in tandem or with gaps within a given protein (Fig 3). Previous studies focused on individual or targeted subsets of antigens in P.

Next, we investigated if seroreactive regions within seroreactive proteins were enriched for repeat elements. Because the Falciparome is composed of overlapping peptides tiled across each gene, the contribution of individual peptide sequences within each seroreactive protein can be further classified into those that are seroreactive vs. those from the same protein that are non-seroreactive. This enables a comparison of repeat elements among seroreactive and non-seroreactive peptides within each protein sequence.

To accomplish this, a k-mer approach was used to characterize repeat elements (Figure 4b, Methods). Briefly, the frequency of all biochemically similar k-mers of sizes 6-9 aa (approximately the size of a linear B-cell epitope) was calculated for each protein. Then, each peptide in the protein was assigned a repeat index based on the maximum intra-protein frequency of any repeat element it encompassed. To minimize redundant representation, multiple peptides from a given protein deriving their repeat indices from the same repeat element were collapsed such that a repeat element was represented only once for each protein (Figure 4b). In this manner, the set of all 5171 non-VSA seroreactive peptides was collapsed based on their repeat elements to a set of 3091 non-redundant seroreactive peptides. The non-seroreactive peptides within each seroreactive protein were also collapsed similarly.
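The repeat index described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, biochemical similarity is approximated by mapping each residue to a representative of the conservative substitution classes listed in the Methods (AG, DE, RHK, ST, NQ, LVI, YFW), and the exclusion of single-residue polymeric stretches mentioned in the Methods is omitted for brevity.

```python
# Conservative substitution classes, as listed in the Methods.
CLASSES = ["AG", "DE", "RHK", "ST", "NQ", "LVI", "YFW"]
# Map every residue to the first letter of its class (assumption: this is
# one simple way to treat biochemically similar k-mers as identical).
REDUCE = {aa: cls[0] for cls in CLASSES for aa in cls}


def reduce_seq(seq):
    """Rewrite a sequence in the reduced alphabet; unlisted residues are kept."""
    return "".join(REDUCE.get(aa, aa) for aa in seq)


def kmer_frequencies(protein, k=7):
    """Intra-protein frequency (non-overlapping occurrences) of every k-mer."""
    reduced = reduce_seq(protein)
    kmers = {reduced[i:i + k] for i in range(len(reduced) - k + 1)}
    # str.count counts non-overlapping occurrences, scanning left to right.
    return {kmer: reduced.count(kmer) for kmer in kmers}


def repeat_index(peptide, freqs, k=7):
    """Return (repeat element, repeat index) for a peptide of the protein.

    The repeat index is the highest intra-protein frequency among all
    k-mers contained in the peptide.
    """
    reduced = reduce_seq(peptide)
    best_kmer, best_freq = None, 0
    for i in range(len(reduced) - k + 1):
        kmer = reduced[i:i + k]
        freq = freqs.get(kmer, 0)
        if freq > best_freq:
            best_kmer, best_freq = kmer, freq
    return best_kmer, best_freq


# Toy protein (hypothetical): three similar, non-identical 7-aa units.
protein = "MSTW" + "DEKLNVP" + "EDRIQVP" + "DEKLNVP" + "CCMWGGPY"
freqs = kmer_frequencies(protein, k=7)
element, index = repeat_index("DEKLNVPEDRIQVP", freqs, k=7)
```

In this toy case the two non-identical units reduce to the same element, so a peptide spanning them receives a repeat index of 3, whereas a peptide from the non-repetitive tail receives an index of 1.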

Overall, seroreactive peptides yielded significantly higher repeat indices than non-seroreactive peptides from seroreactive proteins, and this trend was more pronounced as a function of seropositivity (Figure 4c). The median repeat index for non-seroreactive peptides was 1, while the median index for >10% and >40% seropositivity was 3 and 13 respectively, for a k-mer of size 7 (KS test p-value < 0.05 between successive distributions). These results suggest that seroreactive peptides are dominated by repeat elements and those with higher seropositivity also have progressively higher repeat indices. Examination of individual proteins, including well characterized repeat-containing antigens such as FIRA, LSA1, LSA3, MESA and GLURP, illustrates the relationship between seropositivity and repeat index (Figure 4d). This relationship was consistently observed, regardless of k-mer size from 6 to 9 aa, and was insensitive to the level of degeneracy or biochemical similarity used for determining repeat matches (Fig S4a). However, the presence of a repeat element within any given peptide does not necessarily imply that the peptide will be seroreactive. Taken together, these data indicate that seroreactive proteins tend to be repeat-containing proteins, and within these proteins, the individual seroreactive peptides tend to be those that contain the repeats. Furthermore, seroreactive regions that are shared widely among individuals tend to feature higher numbers of repeat elements.

Seropositivity is more dependent on exposure for peptides containing repeat elements than those without repeat elements

To investigate whether the breadth of seroreactive repeat-containing peptides differed depending on exposure setting and age, seroreactive peptides were first binned into two categories: those with repeats, and those without. Specifically, seroreactive peptides with a 7-mer repeat index >= 3 were binned together as "repeat-containing peptides" and those with a 7-mer repeat index <= 2 were binned as "non-repeat peptides". For the set of non-repeat peptides, breadth (number of non-repeat peptides enriched per person) was significantly higher in adults than children in both exposure settings (percent increase in median breadth in adults over 4-6 year-old children: moderate setting, 28%; high setting, 20%) (Fig 5A). However, within each set of age groups, there was no significant difference in breadth between the two exposure settings.

For repeat-containing seroreactive peptides, breadth was calculated as follows. Each repeat-containing seroreactive peptide was defined by the 7-mer (repeat element) that was used to calculate its repeat index as described above. To avoid redundant counting, all repeat-containing peptides from a given protein defined by the same repeat element were collapsed and counted only once. Similar to non-repeat peptides, breadth of these peptides was higher in adults than children, reaching a similar level in both exposure settings (percent increase in median breadth in adults over 4-6 year-old children: moderate setting, 193%; high setting, 56%). In contrast to non-repeat peptides, however, there was an exposure dependence in the responses to repeat-containing peptides with age, such that children living in the high versus moderate exposure setting had twice the breadth of repeat-containing peptides, reaching the same level in adulthood in both settings (Figure 5B). These results were consistently observed with different thresholds for categorizing repeat-containing peptides (repeat index >= 4 or 5) (Fig S5A).
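The breadth calculation can be sketched as follows. This is an illustrative reading of the description above rather than the analysis code used in the study; the record fields (`protein`, `repeat_element`, `repeat_index`) and function names are hypothetical, and the example values are made up.

```python
from statistics import median


def breadth(enriched_peptides, repeat_threshold=3):
    """Return (non-repeat breadth, repeat breadth) for one person.

    Non-repeat peptides (7-mer repeat index below the threshold) are counted
    individually. Repeat-containing peptides are collapsed so that each
    (protein, repeat element) pair is counted only once.
    """
    non_repeat = 0
    repeat_elements = set()
    for pep in enriched_peptides:
        if pep["repeat_index"] >= repeat_threshold:
            repeat_elements.add((pep["protein"], pep["repeat_element"]))
        else:
            non_repeat += 1
    return non_repeat, len(repeat_elements)


def percent_increase_in_median(adults, children):
    """Percent increase in median breadth of adults over children."""
    child_median = median(children)
    return 100.0 * (median(adults) - child_median) / child_median


# Made-up enriched peptides for a single person.
person = [
    {"protein": "A", "repeat_element": "X", "repeat_index": 5},
    {"protein": "A", "repeat_element": "X", "repeat_index": 5},  # collapsed
    {"protein": "A", "repeat_element": "Y", "repeat_index": 3},
    {"protein": "B", "repeat_element": "X", "repeat_index": 4},
    {"protein": "B", "repeat_element": None, "repeat_index": 1},
    {"protein": "C", "repeat_element": "Z", "repeat_index": 2},
]
non_repeat_breadth, repeat_breadth = breadth(person)
```

Raising `repeat_threshold` to 4 or 5 corresponds to the alternative thresholds mentioned for Fig S5A.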

Investigation of individual repeat elements recapitulated this trend and showed higher seropositivity in the high exposure setting compared to moderate exposure in children, but not adults (Fig S5B). There were a small number of notable exceptions, including repeat elements from PHISTc (PF3D7_0801000), LSA3 (PF3D7_0220000), and FIRA (PF3D7_0501400), none of which showed a transmission setting-dependent response in children (Supplementary Table S4). Overall, the above data show that antibody responses to repeat-containing peptides may be more efficiently acquired and/or maintained in children living in settings of high vs. moderate exposure, but plateau at the same level in adulthood.

Extensive sharing of motifs observed between seroreactive proteins, particularly the PfEMP1 family

While repeat elements within individual proteins were explored in the previous section, similar or identical motifs may also be shared among different proteins. If these motifs are a part of an epitope, then antibodies and B-cell receptors (BCR) specific to a motif can potentially cross-react with the motif variants in different proteins, depending on accessibility and other factors. Identifying such shared motifs serves as the first step in exploring potential cross-reactivity between individual seroreactive proteins, and to identify them, a systematic investigation was performed.

First, enriched k-mers (6-9 amino acids) were identified by collecting those present in a significantly (FDR-adjusted p-value < 0.001) higher number of seroreactive peptides (9927) than a random sampling (1000 iterations) of 9927 peptides from the whole library. From this collection, enriched k-mers that were shared by different seroreactive proteins were identified as "inter-protein motifs" (Fig 6a). Using a k-mer size of 7,

The design of the programmable phage display library used here features 62-amino acid peptides tiled with a 25-amino acid step size, yielding an overlap of 37 amino acids for sequential fragments, and 12 amino acids for every second fragment (Fig S6b). The design provides for localization of seroreactive sequences to the region of overlap when considering adjacent fragments. For all except the first and last two peptides in each protein (85% of peptides in the library), the seroreactive region can theoretically be narrowed down to a 12-13 aa segment within the peptide. Given that B cell linear epitopes are typically 5-12 amino acids in length (Buus et al., 2012), the 12-13 aa mapping provides a near-epitope resolution.
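The tiling arithmetic can be checked with a short sketch. The helper names below are hypothetical, and the example protein length (187 aa) is chosen only so that the 62-aa windows fit exactly; how the library treats a partial window at the C-terminus is not specified here and is not modelled.

```python
def tile_starts(protein_len, length=62, step=25):
    """Start positions of full-length tiles along a protein."""
    return list(range(0, protein_len - length + 1, step))


def overlap(start_a, start_b, length=62):
    """Number of residues shared by two tiles of equal length."""
    return max(0, length - abs(start_a - start_b))


def resolution_segments(protein_len, length=62, step=25):
    """Split the protein into segments covered by a constant set of tiles.

    Tile starts and ends are the only positions where the set of covering
    tiles changes, so consecutive boundaries delimit the smallest regions
    that the pattern of reactive and non-reactive tiles can distinguish.
    """
    starts = tile_starts(protein_len, length, step)
    bounds = sorted(set(starts) | {s + length for s in starts})
    return list(zip(bounds, bounds[1:]))


starts = tile_starts(187)
segment_lengths = [end - begin for begin, end in resolution_segments(187)]
```

Adjacent tiles share 62 - 25 = 37 residues and every second tile shares 62 - 50 = 12. Away from the protein ends, the segment lengths alternate between 12 and 13 residues, which matches the 12-13 aa resolution stated above.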

To test the notion that the inter-protein motifs within each peptide are actually the elements associated with the observed seroreactivity, we leveraged the tiled peptide library design by comparing inter-protein motif carrying peptides with overlapping and adjacent peptides (Fig S6c). The maximum seropositivity among peptides containing an inter-protein motif was on average 54-fold higher than the maximum seropositivity among overlapping peptides not containing the motif (using a pseudo-seropositivity of 0.1% for peptides with 0% seropositivity to facilitate fold change calculation), suggesting a strong association between seroreactivity and the inter-protein motif itself, not just the whole peptide within which it resides (comparison of median seropositivity yielded a similar result). Furthermore, a similar result was observed when the same analysis was done with all the significantly enriched k-mers (Fig S6d).
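The fold-change comparison with a pseudo-seropositivity can be written as a small sketch. The function name and the example values are hypothetical; seropositivity is expressed in percent, and the 0.1% floor is applied to the maxima, which is one plausible reading of the description above.

```python
def motif_fold_change(motif_peptides, neighbor_peptides, pseudo=0.1):
    """Fold change in maximum seropositivity (percent).

    motif_peptides: seropositivity of peptides containing the motif.
    neighbor_peptides: seropositivity of overlapping peptides without it.
    A value of 0% is replaced by the pseudo-seropositivity so that the
    ratio is defined.
    """
    top_motif = max(motif_peptides) or pseudo
    top_neighbor = max(neighbor_peptides) or pseudo
    return top_motif / top_neighbor


# Made-up values: motif-carrying peptides vs. overlapping neighbours.
fc_with_signal = motif_fold_change([20.0, 8.0], [5.0, 0.0])
fc_zero_neighbors = motif_fold_change([40.0, 12.0], [0.0, 0.0])
```

With entirely non-reactive neighbours the denominator becomes 0.1%, so a motif peptide at 40% seropositivity yields a fold change of about 400.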

On average, each inter-protein motif was shared by 3 seroreactive proteins. Each of the 509 seroreactive proteins shared inter-protein motifs with 6 other proteins on average (median = 3) (Supplementary file 1, Fig S6e). Visualized as a network (Fig 6b), the PfEMP1 family of proteins formed a central hub to which a large number of other seroreactive proteins were connected. The PfEMP1 family of proteins possessed at least 90 shared inter-protein motifs, and this family shared those motifs with the greatest number of other seroreactive proteins (57) compared to all other proteins in this analysis. Approximately 5 times as many proteins shared connections with PfEMP1 as would be expected by chance (PfEMP1 shared motifs with 12-16 other proteins using a set of 9927 peptides consisting of PfEMP1 seroreactive peptides + random non-PfEMP1 peptides). Seroreactive proteins sharing motifs with PfEMP1 included many of the proteins with the highest measured seropositivity, such as RIFINs, SURFINs, FIRA, and PHISTc. This extent of sharing was driven, in part, by the number of PfEMP1 sequences included in the analysis. This was apparent when the same analysis, performed with a reduced diversity of PfEMP1 sequences in the seroreactive peptide set (using PfEMP1 peptides from only PF3D7 and PFIT genomes instead of 7 genomes), resulted in PfEMP1 sharing motifs with 32 seroreactive proteins instead of 57. This suggests that the extent of sharing for PfEMP1 observed in this study may only be a small fraction of that occurring in the extensive natural diversity of PfEMP1 variants in circulating parasites.
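The motif-sharing network can be represented with a simple data structure, sketched below. The motif strings and protein assignments are invented for illustration, the function names are hypothetical, and the sketch assumes that family members (for example all PfEMP1 variants) have already been grouped under a single node name.

```python
from collections import defaultdict
from itertools import combinations


def sharing_network(motif_to_proteins):
    """Build edges between proteins that share at least one motif.

    Returns a dict mapping a sorted protein pair to the set of motifs
    shared by that pair.
    """
    edges = defaultdict(set)
    for motif, proteins in motif_to_proteins.items():
        for a, b in combinations(sorted(set(proteins)), 2):
            edges[(a, b)].add(motif)
    return dict(edges)


def partner_counts(edges):
    """Number of distinct proteins each protein shares motifs with."""
    partners = defaultdict(set)
    for a, b in edges:
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(others) for protein, others in partners.items()}


# Invented example: three motifs shared among five nodes.
toy_motifs = {
    "motif_1": ["PfEMP1", "RIFIN", "FIRA"],
    "motif_2": ["PfEMP1", "RIFIN"],
    "motif_3": ["GARP", "MESA"],
}
edges = sharing_network(toy_motifs)
degrees = partner_counts(edges)
```

The partner count per node corresponds to the quantity reported above (for example, 57 for PfEMP1), and the size of each edge's motif set gives the number of motifs shared between two proteins.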

Outside the main network driven by PfEMP1, 495 seroreactive proteins were also found to be highly connected to each other through motif sharing (Fig S6f). A large proportion of proteins with high seropositivity were connected (80% and 58% of proteins with >30% and 10-30% seropositivity respectively). This included proteins like GARP, LSA3, Pf332, Pf11-1, and MESA (Fig 6b, Fig S6g). As observed for the full set of inter-protein motifs, motifs shared by the subset of proteins with >30% seropositivity also consisted predominantly of charged glutamate, lysine, asparagine and aspartate residues (Fig S6g). Since the analysis used here to identify inter-protein motifs allowed only up to two conservative substitutions in 7-mer motifs, the similarity of motifs in the network in Fig S6g suggests that with a less stringent threshold of identifying motifs, these proteins would be even more highly connected. Moreover, 80% of proteins in this network had reported expression in the asexual blood stage of the lifecycle of P.

These results indicate that the inter-protein motifs are strongly associated with seroreactivity and are extensively shared across seroreactive proteins, including among regions highly targeted by the antibodies. Furthermore, PfEMP1 shares motifs with the greatest number of other seroreactive proteins, partly driven by the sequence diversity of PfEMP1 variants.

Using a customized programmable phage display (PhIP-seq) library, we have evaluated the proteome-wide antigenic landscape of the malaria parasite P. falciparum, using the sera of 198 individuals living in two distinct malaria endemic areas. This approach readily identified previously known antigens, including proteins that are targets of protective antibodies, as well as novel antigens. In our study, we characterized features of P. falciparum antigens that could potentially contribute to the inefficient acquisition and maintenance of immunity to malaria. Repeat elements were found to be commonly targeted by antibodies, and had patterns of seropositivity that were more dependent on exposure than non-repeat peptides. Furthermore, extensive sharing of motifs associated with seroreactivity was observed among hundreds of parasite proteins, indicating potential for extensive cross-reactivity among antigens in P. falciparum. These data suggest that repeat elements, a common feature of the P. falciparum proteome, and shared motifs between antigenic proteins could have important roles in shaping the nature and development of the immune response to malaria.

To map the antigenic landscape, PhIP-seq for P. falciparum offers several attractive advantages. The library described here contains >99.5% of the proteome, including variants for several antigenic families,

The result is a cost-effective and scalable system, allowing for the processing of hundreds of samples in

The near-epitope resolution provided by this platform allowed a systematic investigation of targets of antibodies. Targets with high seropositivity were observed to be significantly enriched for repeat elements. In some previous reports, the elevated antigenic potential of repeat elements has been noted (Davies et al.,

A key finding of this study is the exposure-setting dependent difference in seroreactivity to repeat-containing peptides, with the breadth of seroreactivity increasing more quickly with age in the high versus moderate exposure setting. We note that the samples analyzed in this study differed between the two cohorts not just by exposure, but also with respect to time since most recent infection, reflecting the differing epidemiology of infection in these settings. In the moderate exposure setting, the median number of days since last infection was 100, whereas over 65% of the samples from the high exposure setting were taken during periods of active infection. The difference in seroreactivity to repeat-containing peptides observed here between the settings could therefore emerge from two related mechanisms. In the first, the difference could be driven by a requirement for a minimum level of cumulative exposure to the target repeats to generate a robust response. In the second, the antibody response to repeats may be inherently less durable, leading to rapid waning in the absence of frequent exposure. Future longitudinal studies may be required to distinguish between these two possibilities. There were a few exceptions, including repeats from FIRA, PHISTc and LSA3, that did not show an exposure-setting dependent difference in seropositivity, suggesting that factors beyond the repeated nature of the epitope could influence the nature of the response. Whether either or both potential mechanisms contribute, the predominance of repeat-containing peptides in antibody targets, along with the remarkable abundance of these peptides in the P. falciparum proteome, suggests a possible strategy evolved by the parasite for the purpose of diverting the humoral response towards short-lived or exposure-dependent responses.

The hypothesis of less durable antibody responses to repeat antigens in P. falciparum can be reconciled with a model in which repeating epitopes favor extrafollicular B cell responses, which are typically short-lived (Cockburn & Seder, 2018; Schofield, 1991). This is based on the potential of repeat epitopes in an

falciparum follow this pattern, an expected outcome would be defective formation of LLPCs and memory B cells. On the other hand, the finding that adults from both exposure settings ultimately developed a similar breadth of response to repeat regions could argue for the hypothesis that greater cumulative exposure is required to develop responses to these regions, but the difference between adults and children could also be driven by age-intrinsic factors (Baird et al., 1991, 1993).

Another major finding of this study is the extensive presence of inter-protein motifs among seroreactive proteins. Since a strong association with seroreactivity was observed for these motifs, they may represent cross-reactive epitopes. Whether these inter-protein motifs are cross-reactive in vivo is unclear and may depend on expression timing and accessibility to the immune system, among other factors. Analogously, seroreactive repeat elements with non-identical repeating units could represent cross-reactive epitopes within proteins. Extensive presence of potential cross-reactive epitopes in P. falciparum antigens may play an important role in influencing the quality of the immune response to malaria. While it could be advantageous for the host if multiple parasite proteins could be targeted by antibodies through cross-reactivity, simultaneous presence of cross-reactive epitopes could alternatively frustrate the affinity maturation process due to conflicting selection forces, as was observed for variant HIV antigens (Wang et

The atlas of seroreactive repeat elements and inter-protein motifs from this study will be useful for future investigations in understanding their impact on the quality of immune response to malaria.

The PfEMP1 family shared inter-protein motifs with the greatest number of other antigens in this study. This was driven in part by the wide diversity of PfEMP1 variants, indicating that as one becomes naturally

While phage display of small peptides yields high resolution discrimination of linear epitopes, this approach may not capture antibodies binding to conformational epitopes. Therefore, such epitopes are likely to be missed by this assay, although polyclonal responses are frequently a mixture of linear, partially linear, and conformational epitopes. Reassuringly, we observed a large-scale enrichment of P. falciparum peptide sequences in exposed individuals when compared with control sera from the US. This suggests that the humoral immune system of exposed individuals acquires an extensive and diverse set of P. falciparum targets, including thousands of linear sequences. The bias towards linear epitopes may have increased the relative detectability of repeat regions by this assay since they often form intrinsically disordered regions. However, that would not account for the observed differences between exposure settings for children and adults. Another limitation of our study is that it did not provide quantitative measures of absolute antibody reactivity to individual peptides per person. Therefore, enrichment counts for peptides were only used in a semi-quantitative way to determine seropositivity. Lastly, given the breadth and sensitivity of the PhIP-seq technique, 86 control sera were used to remove non-specific enrichments. We imposed a stringent filter to

With the rapid success of mRNA vaccines for SARS-CoV-2 (Chaudhary et al., 2021), an optimistic future for malaria vaccines is possible as well. Findings from this study could have important implications for malaria vaccine design. Results from our study suggest that in natural infections in children, repeat regions in P. falciparum could lead to an exposure-dependent and/or short-lived antibody response to a higher degree than for non-repeat regions. While we recognize that vaccine-induced immunity is distinct from naturally acquired immunity, this potential limitation should be considered when evaluating repeat-containing antigens as vaccine targets. Further, given that highly immunogenic regions in natural immunity to malaria are predominantly repeats and there is widespread presence of potential cross-reactive epitopes across many proteins, whole-parasite vaccines may also be susceptible to similar limitations. If the findings from this study translate to vaccine-induced immune responses, non-repetitive, unique antigenic regions may be more effective targets.

Ethical Approval

The study protocol was reviewed and approved by the Makerere University School of Medicine Research

Table 2.

The final set of protein sequences (n = 8,980) was then merged and short sequences (<30 aa long) were removed prior to collapsing at 100% sequence identity (n = 8,534). Next, all sequences were split into 62-amino acid peptide fragments with a 25-amino acid step size. Fragments with homopolymer runs of >= 8 exact amino acid matches in a row were removed, X amino acids were replaced with alanine and Z amino acids (glutamic acid or glutamine) with Q (glutamine), and finally, LZW compression was used to identify and remove low-complexity sequences with a compression ratio less than 0.4. Lastly, sequence headers were renamed to remove spaces and the resulting peptide fragments were converted to nucleotide sequences. Adapter sequences were appended, with a library-specific linker on the 5' end (GTGGTTGGTGCTGTAGGAGCA) and a 3' linker sequence coding for two stop codons and a 17mer (-TGATAA-GCATATGCCATGGCCTC). This file was then iteratively scanned for restriction enzyme sites (EcoRI, HindIII), which were eliminated by replacement with synonymous codons to facilitate cloning. The final set of nucleotide sequences was collapsed at 100% nucleotide sequence identity (n = 238,068) and then ordered from Agilent Technologies.
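The amino acid level filtering steps above can be sketched in Python. This is not the pipeline used to build the library: the function names are hypothetical, partial windows at the C-terminus are simply dropped (their handling is not described here), and the compression ratio is assumed to be the number of LZW output codes divided by the peptide length, which is only one of several possible definitions. Codon conversion, adapter addition, and restriction site removal are not covered.

```python
import re


def tile_peptides(seq, length=62, step=25):
    """Full-length windows only; the C-terminal remainder is ignored."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


def has_homopolymer(peptide, run=8):
    """True if any residue occurs `run` or more times in a row."""
    return re.search(r"(.)\1{%d,}" % (run - 1), peptide) is not None


def substitute_ambiguous(peptide):
    """Replace X with alanine (A) and Z with glutamine (Q)."""
    return peptide.replace("X", "A").replace("Z", "Q")


def lzw_ratio(peptide):
    """Assumed definition: LZW output codes per input residue."""
    table = {ch: i for i, ch in enumerate(sorted(set(peptide)))}
    phrase, n_codes = "", 0
    for ch in peptide:
        if phrase + ch in table:
            phrase += ch          # keep extending the current phrase
        else:
            n_codes += 1          # emit the code for the current phrase
            table[phrase + ch] = len(table)
            phrase = ch
    if phrase:
        n_codes += 1              # emit the final phrase
    return n_codes / len(peptide)


def process_protein(seq, min_ratio=0.4):
    """Tile, drop homopolymer fragments, fix X/Z, drop low complexity."""
    kept = []
    for peptide in tile_peptides(seq):
        if has_homopolymer(peptide):
            continue
        peptide = substitute_ambiguous(peptide)
        if lzw_ratio(peptide) >= min_ratio:
            kept.append(peptide)
    return kept


diverse = "ACDEFGHIKLMNPQRSTVWY" * 3 + "XZ"   # 62 residues, one tile
kept = process_protein(diverse)
```

Under the assumed ratio, highly repetitive fragments compress into few codes and fall below 0.4, while diverse fragments stay well above it.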

For each aligned sequence, the CIGAR string was examined, and all alignments where the CIGAR string did not indicate a perfect match were removed. The final set of peptides was tabulated to generate counts for each peptide in each individual sample. Samples with fewer than 250,000 aligned reads were dropped from further analysis and any resulting samples with only one technical replicate were also dropped (2 of the 200 Ugandan samples were dropped). To keep the analysis restricted to P. falciparum peptides and limit the influence from non-P. falciparum peptides, reads mapping to all vaccine, viral and experimental control peptides were excluded from analysis. The remaining peptide counts were normalized for read depth and multiplied by 500,000, resulting in reads/500,000 total reads (RP5K) for each peptide. The null distribution for each peptide was modelled using read counts from a set of 86 plasma samples from the US (New York Blood Center) using a normal distribution, with the assumption that most of these individuals were likely unexposed to malaria. To avoid inflation by division, if the standard deviation of read counts of any peptide in the US samples was <1, it was set to 1. Z-score enrichments ((x - mean US)/std. dev US) were then calculated for each peptide in each sample using the US distribution, and a Z-score >= 3 in both technical replicates (or more than 75% of the replicates if there were more than 2 technical replicates) of a sample was used to identify enriched peptides within a given sample. To call malaria-specific peptide enrichments
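The normalization and Z-score steps described in the preceding paragraph can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up numbers, not the analysis code of the study; in particular, the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def rp5k(counts):
    """Normalize raw peptide counts to reads per 500,000 total reads."""
    total = sum(counts.values())
    return {pep: c * 500_000 / total for pep, c in counts.items()}


def null_model(control_samples):
    """Per-peptide (mean, std. dev) from normalized control samples.

    Standard deviations below 1 are set to 1 to avoid inflated Z-scores.
    """
    model = {}
    for pep in control_samples[0]:
        values = [sample.get(pep, 0.0) for sample in control_samples]
        model[pep] = (mean(values), max(stdev(values), 1.0))
    return model


def enriched_peptides(replicates, model, z_cut=3.0):
    """Peptides with Z >= z_cut in both replicates, or in more than 75%
    of the replicates when more than two are available."""
    n = len(replicates)
    enriched = set()
    for pep, (mu, sd) in model.items():
        hits = sum((rep.get(pep, 0.0) - mu) / sd >= z_cut for rep in replicates)
        passed = hits == n if n == 2 else hits > 0.75 * n
        if passed:
            enriched.add(pep)
    return enriched


# Made-up normalized values for two peptides in three control samples.
controls = [{"p1": 10, "p2": 0}, {"p1": 12, "p2": 0}, {"p1": 14, "p2": 0}]
model = null_model(controls)
# Two technical replicates of one hypothetical sample.
replicates = [{"p1": 20, "p2": 2}, {"p1": 18, "p2": 5}]
calls = enriched_peptides(replicates, model)
```

Here p1 is called enriched (Z = 4 and 3), while p2 is not because one replicate falls below the cutoff even after the standard deviation is floored at 1.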

seroreactive set, the same number of proteins as the seroreactive protein set was randomly sampled from the total non-seroreactive protein set 1000 times and the distribution of cumulative frequencies between the seroreactive and non-seroreactive sets were compared using a 2-sample KS-test in each iteration.

Repeat index calculation: To systematically compare the distribution of repeats between seroreactive and non-seroreactive peptides within seroreactive proteins, the following approach was adopted. Firstly, for each protein, repeats and their frequencies within that protein were calculated using a k-mer approach. K-mers were fixed-length sequences (6/7/8/9-aa) with any number of conservative substitutions (AG, DE, RHK, ST, NQ, LVI, YFW) and did not include polymeric stretches of single amino acids from N/Q/D/E/R/H/K. For each protein sequence, all possible k-mers in the protein and their frequencies (number of non-overlapping occurrences) in the protein (intra-protein repeat frequency) were calculated. Then for each peptide in the protein, all k-mers in the peptide sequence were taken and the k-mer with the highest intra-protein repeat frequency was identified. This frequency was assigned as the repeat index for the peptide. Once all peptides across all seroreactive proteins were assigned a repeat index, they were subsequently classified according to seropositivity. In each seropositivity group, since peptides from the same protein could have the same highest intra-protein repeat k-mer, to avoid redundancy of representation, peptides sharing the same highest

were combined with random peptides (n=6926) to a total of 9927 peptides. This was treated as the 'seroreactive' set and a similar analysis was performed to identify significantly enriched motifs in this set.

The data associated with this study can be accessed in the Dryad repository with the

We thank all study participants and their families. We thank the New York

[Table fragment: Other variants, P. reichenowi PfEMP1 (PFREICH); Anopheles - CE5 (5), SG6 (5); Anopheles salivary proteins, 53 proteins from 19 Anopheles species as described in Fig 1 of]

[Figure 4b flowchart: Does the peptide span intra-protein repeat elements? Characterize repeat elements in each protein. For each protein, identify intra-protein repeat elements of length n (6/7/8/9-mer), allowing biochemically similar residues at each position. Intra-protein frequency of repeat elements calculated.]

Table 2. The filtered sequences were then processed into peptides using the peptide processing pipeline and quality checks were performed as described in NT sequence verification.

NT SEQUENCE VERIFICATION
Do both files (AA and NT) have the same number of sequences?
Are there linker sequences at the 5' and 3' ends of all sequences?
Are all sequences the same length?
Are there any long runs (>8 nt) of a single nucleotide?
Do the translated sequences match those in the original peptide file?
Are all restriction sites (HindIII, EcoRI) removed?

[Figure legend fragment: Number of inter-protein motifs shared between the two proteins]