An Integrated Genomic and Immunoinformatic Approach to H. pylori Vaccine Design

Background One useful application of pattern matching algorithms is identification of major histocom-patability complex (MHC) ligands and T-cell epitopes. Peptides that bind to MHC molecules and interact with T cell receptors to stimulate the immune system are critical antigens for protection against infectious pathogens. We describe a genomes-to-vaccine approach to H. pylori vaccine design that takes advantage of immunoinformatics algorithms to rapidly identify T-cell epitope sequences from large genomic datasets. Results To design a globally relevant vaccine, we used computational methods to identify a core genome comprised of 676 open reading frames (ORFs) from amongst seven genetically and phenotypically diverse H. pylori strains from around the world. Of the 1,241,153 9-mer sequences encoded by these ORFs, 106,791 were identical amongst all seven genomes and 23,654 scored in the top 5% of predicted HLA ligands for at least one of eight archetypal Class II HLA alleles when evaluated by EpiMatrix. To maximize the number of epitopes that can be assessed experimentally, we used a computational algorithm to increase epitope density in 20-25 amino acid stretches by assembling potentially immunogenic 9-mers to be identically positioned as they are in the native protein antigen. 1,805 immunogenic consensus sequences (ICS) were generated. 79% of selected ICS epitopes bound to a panel of 6 HLA Class II haplotypes, representing >90% of the global human population. care-fully determine a core set of genes. Application of immunoinformatics tools to this gene set accurately predicted epitopes with promising properties for T cell-based vaccine development.


Background
A genomes-to-vaccine strategy for rational vaccine design rests on the premises that (i) a minimal set of immunogens capable of inducing a robust and sustained immune response to a pathogen can be discovered using immunoinformatics, and (ii) administration of these immunogens, in a suitable delivery vehicle together with adjuvant, will result in protection from disease. Our approach is to identify the minimal, essential information needed to achieve this goal. That data is encoded, in part, by T -cell epitopes, short peptide sequences displayed by antigen presenting cells to T cells, critical mediators of adaptive immunity. Four major steps comprise our genomes-to-vaccine strategy, which can be thought of as a funnelling process ( Figure 1): (i) Genomes are mined using computational tools to identify genes that encode proteins with promising vaccine antigen properties such as secretion, upregulated expression, immunogenicity and virulence; (ii) immunoinformatics tools are then used to map protein sequences for short, linear putative Tcell epitopes; (iii) sequences are synthesized as peptides and evaluated for human leukocyte antigen (HLA) binding and antigenicity in survivors of infection or vaccinees and (iv) prototype epitopebased vaccines are evaluated for immunogenicity and efficacy in humanized mice. We adopted this genomes-to-vaccine strategy to design a vaccine against Helicobacter pylori (H. pylori), a motile, gram-negative spiral bacterium that colonizes gastric mucosa. Approximately one-half of the world"s population is infected with H. pylori, and the infection persists for life unless treated with combination antimicrobials. Natural immune response does not eliminate infection nor does it confer protective immunity against re-infection after antimicrobial therapy. Chronic infection may lead to development of chronic gastritis, peptic ulceration, and even gastric adenocarcinoma and lymphoma; hence the need for a protective vaccine [1].
Here, we set out to characterize the human T cell response to H. pylori infection using an informaticsdriven epitope mapping approach. H. pylori immunopathogenesis has been thoroughly characterized, providing important insights into how this bacterial pathogen interfaces with the host immune system and evades host defenses, allowing it to persist in the gastric environment [2]. How H. pylori manages to trigger epitope-specific immune responses at the molecular level is not yet well-understood and requires further investigation in order to develop prophylactic and therapeutic vaccines. The availability of sequenced H. pylori genomes and immunoinformatics tools capable of their analysis enables experimental analysis of sequences that directly stimulate and inhibit multi-functional human T cell responses. An example of the use of this genomic sequence data would be to predict how T cells will respond to stimulation by H. pylori sequences that have homologues in the human genome and the human gut microbiome in order to understand the impact of H. pylori infection on autologous and heterologous immunity. This information is especially valuable to rational vaccine design because vaccines should not contain cross-reactive epitopes that can break self-tolerance and lead to autoimmune disease. In preliminary studies, we computationally identified T-cell epitopes from the H. pylori J99 and 26695 genomes that stimulated long-lived immune responses and cleared infection in a p27 knockout mouse model of H. pylori infection [3,4]. Here, in the first phase of a more expansive and translational study, we computationally screened seven H. pylori genomes to identify T-cell epitopes that may serve as human vaccine immunogens. This involved multiple steps in a funnelling process that progressively narrowed down the search universe to yield a set of sequences for experimental evaluation: (i) Open reading frame (ORF) amino acid sequences were compared across genomes to identify conserved proteins. (ii) 9-mer sequences completely conserved across all genomes in this protein subset were identified using the Conservatrix algorithm. (iii) Among these sequences, potential HLA binders were predicted using the EpiMatrix epitope mapping algorithm. (iv) Immunogenic consensus sequences (ICS) were constructed using EpiAssembler. (v) Sequences bearing homology to human sequences were triaged. (vi) Finally, ICS were selected for experimental validation and (vii) assayed for binding to multiple HLA alleles.

H. pylori genomes
We assembled the seven H. pylori genomes available in the public domain as of September 2009 in order to represent, as widely as possible, the breadth of genetic diversity of H. pylori strains available at the time of our computational analysis. Because different bacterial genotypes with a broad range of chronic inflammatory sequelae predominate in different human populations, vaccination with immunogens common to these strains may provide effective protection worldwide.
The 26695 strain of H. pylori was derived from a gastritis patient in the United Kingdom. Prior to sequencing, it underwent repeated subculturing. The 26695 genome was the first to be sequenced and has since been the most thoroughly H. pylori genome characterized [5]. Therefore, 26695 served as the reference genome for comparison to the other six in our study. H. pylori strain J99 was isolated from a patient with duodenal ulcer disease in the USA in 1994 [6]. It underwent little subculturing prior to sequencing. The B128 strain was isolated from a patient with gastric ulcer disease, while 98-10, which is most closely related to H. pylori strains of East Asian Genomes are mined using computational and experimental tools to identify genes that encode proteins with promising vaccine antigen properties such as secretion, upregulated expression and virulence. In silico. Immunoinformatics tools are then used to map protein sequences for short, linear putative T cell epitopes. In vitro. Candidates are synthesized as peptides and evaluated for MHC binding and antigenicity. In vivo. Prototype epitope-based vaccines are evaluated for immunogenicity and protection in humanized mice. Used with kind permission from Medicine and Health Rhode Island [25]. origin, was isolated from a gastric cancer patient in Japan [7]. The HPAG1 strain was isolated from a Swedish patient with chronic atrophic gastritis, an inflammatory condition of the gastric mucosa, which is a precursor to lesion development and gastric adenocarcinoma [8]. Shi470 was cultured from the gastric antrum of an Amerindian resident of a remote Amazonian village in Shimaa, Peru [9]. This strain is more closely related to strains from East Asia than other geographic regions, and is thought to represent the strains of Native Americans prior to European conquest. It represents the first completely sequenced strain that is not from an ethnic European. The G27 strain is a laboratory strain extensively used in H. pylori research, originally isolated from the stomach of an Italian patient [10]. Thus, the genomes selected for this analysis represent geographically diverse isolates, of both laboratory and clinical origin, and are representative of the varied clinical outcome of H. pylori infection.

Core genome determination
We set out to discover vaccine immunogens common to the seven H. pylori strains in order to design a vaccine that will broadly cover H. pylori strains worldwide. To do this, we first identified the H. pylori core genome in a two-phased process involving comparative amino acid frequency followed by sequence similarity analyses. For every ORF in the reference strain (26695), ORFs in each of the other strains were screened for relatedness by amino acid composition. While composition is not as precise a measure of relatedness as sequence identity, the information it offers provides a rapid first-pass screen. We used the SΔn algorithm with a cut-off score of 30, below which a query ORF was considered to be related to the reference strain ORF [11]. Related ORFs were then analyzed for sequence similarity by sequence alignment of the first 200 amino acids of the reference and query ORFs using the GOTOH82 algorithm and BLOSUM50 matrix [12,13]. Analysis beyond the first 200 amino acids is not only computationally time intensive but also inefficient because it does not provide additional match-confirmation that could or could not be found before the 200 amino acid cut-off. A query ORF with >80% sequence identity as compared with the reference ORF was considered a match. Matches for each reference ORF were then counted across genomes. Every ORF which had a match of >80% sequence identity in each of the 6 query genomes -representing conservation in all 7 genomes -was designated a member of the "core genome" and therefore selected for downstream analysis. We found 676 ORFs conserved across all seven genomes, representing 43% of the reference genome and corresponding to a total of 4,732 ORFs out of 10,921 ORFs in all seven genomes ( Figure 2). With reference to the 26695 strain, 917, 1061, 1138, 1206 and 1282 ORFs are found, respectively, in at least five, four, three, two or one of the other genomes. Of the 1576 total ORFs in the 26695 genome, 294 (19%) are strain-specific ( Figure 3).
We note that the magnitude of the H. pylori core genome in the present study differs significantly from previously published studies. Using molecular biology methods or computational analyses, including permutations of BLAST or application of BLAST combined with spatial analyses (e.g. synteny analysis), the number of ORFs comprising the core genome was determined to be approximately 1200 in previous studies [7,14,15,16]. As the total number of ORFs is close to double the 676 reported here, we undertook a comparison of our dataset with select datasets representative from other studies to understand the discrepancy. The first determination of an H. pylori core genome was made by Salama et al. in 2000, which identified 1281 sequences common to 15 strains using a whole genome microarray [14]. Five years later Gressmann et al. undertook a much larger sample size, testing 56 "representative strains", identified as a representative sample of an 800 strain unpublished data collection [15]. This study used microarray hybridization to find 1111 core genome sequences in a "virtual genome" representing 98% coverage Using comparative amino acid frequency and sequence similarity analyses, seven H. pylori genomes comprising 10,921 ORFs are reduced to a core subset of 4,732 ORFs. Of 1,241,153 peptides parsed, 106,791 were identical amongst all seven genomes, of which 23,654 are predicted to bind to at least one Class II HLA allele. These served as input into EpiAssembler, constructing 1,807 ICS, two of which were found to be homologous to the human genome by BLAST analysis. We attribute the large difference in size between the two datasets to factors involving the phased method of triaging unrelated sequences in the present study by amino acid relatedness first and then sequence similarity, as well as the effect of the geographical distribution of strains upon composition of the core genome. First, a limited analysis of the ORFs present in McClain et al. but not identified in this study showed that the amino acid relatedness screen triaged genes that differed significantly in sequence at the N-or C-terminus across genomes as a result of gene "truncations." The differences accounted for sufficiently lowered amino acid relatedness over entire ORF sequences thereby raising SΔn scores over the cut-off. Thus, the first screen for ORF similarity accounts for discarded sequences. Furthermore, our higher stringency requirement of >80% sequence identity in comparison with the BLAST score ratio of 0.4 employed by McClain et al., which represents a 30% sequence identity over 30% of sequence length, may have resulted in a greater number of sequences of core sequences to be included in McClain et al. Additionally, geographical diversity among H. pylori strains may account for differences in the core genome datasets. Of the 1237 core genes identified in McClain et al., a subset of alleles was highly divergent in the East Asian strain 98-10, encoding proteins that exhibited <90% amino acid sequence identity when compared to the corresponding protein in the 4 other strains analyzed; similar results were shown for the J99 strain which is most closely related to H. pylori isolates from West Africa. Thus, the McClain et al. dataset contained some highly divergent strain sequences with <90% sequence identity present in 98-10 and J99 strains, many of which would not have met the 80% cut-off in this study. Moreover, our core dataset included sequences present in two strains of East Asian origin, 98-10 and Shi470, the latter of which is a Peruvian strain, but which is more closely related to strains from East Asia. The high sequence divergence inherent to East Asian H. pylori strains could also have accounted for the smaller size of our core. Therefore, choice of strain and variation in sequence composition of strains selected for analysis affect the resulting size of core genome determined.

Conserved 9-mer search
9-mer sequences parsed out of the 676 core genome ORFs from the reference strain were searched for identically parsed 9-mers in the matching ORFs of query strains using the Conservatrix algorithm [17]. A 9-amino acid frame was used because it is the length of a peptide that fits into the HLA binding pocket. We found that out of the 1,241,153 9-mers parsed, 106,791 were identical amongst all seven genomes ( Figure 2). We next examined the potential immunogenicity of these sequences using computational methods depicted below. Using H. pylori 26695 as a reference strain, the genomes of seven H. pylori strains were compared to discover a core set of ORFs for identification of vaccine immunogen candidates. This Venn diagram illustrates the number of ORFs in 26695 that are common to at least one other genome. ORFs common to all genomes compose an H. pylori core genome.

Epitope mapping
Each of the 106,791 9-mers was scored for predicted binding affinity to a panel of 8 Class II HLA alleles using EpiMatrix, a matrix-based algorithm for mapping T-cell epitopes [18]. The algorithm was previously benchmarked against similar prediction tools, including SMM-align, IEDB ARB, TEPI-TOPE, MHCPRED among others, and shown to have a sensitivity rating, on average, for HLA Class II predictions of 77%, which is 5-17% greater than the others [19]. A total of 23,654 9-mer sequences had a z-score ≥1.64 for at least one Class II HLA allele ( Figure 2). The sequences were ranked according to the cumulative EpiMatrix score for all 8 alleles to serve as a starting point for the construction of immunogenic consensus sequences.

ICS construction
Immunogenic consensus sequences were built by EpiAssembler, an algorithm that maximizes epitope density in a 20-25 amino acid stretch by assembling potentially immunogenic 9-mers identical to their placement in the native protein antigen [17]. The basis for this approach to vaccine immunogen design lies in the observation that immunogenicity is not randomly distributed throughout protein sequences but instead tends to cluster. Designing vaccine immunogens with increased epitope density improves the possibility for epitope presentation to T cells in the context of more than one HLA allele, thereby broadly covering an HLA diverse human population. EpiAssembler produced 1,807 ICS from the input sequences ( Figure 2). The number of 9mer epitopes per ICS ranges from 4 to 11 with an average of 6.52 ± 1.23 (standard deviation). Because a single 9-mer epitope may bind more than one HLA allele, the number of predicted 9-mer epitopes across all 8 alleles was counted for each ICS. The number of hits per ICS ranges from 12 to 55 with an average of 22.72 ± 6.49 (standard deviation).

Human cross-reactivity
To avoid potential cross-reactivity with human sequences that may stimulate autoimmunity, epitopes that are homologous to components of the human genome were triaged while the remaining "foreign" epitopes (i.e. those lacking homology to human) were considered safe to include in vaccine formulations. As a standard practice, any peptide that shares greater than 70% identity (or more than 7 identities per 9-mer frame) with sequences contained in the human proteome is eliminated from consideration. Among the 1,807 ICS, two were significantly homologous to human sequences and were therefore excluded from the set of potential vaccine immunogens ( Figure 2). Cross reactivity with human gut microbiota is an equally important consideration in selection of vaccine immunogens. A major goal of the Human Microbiome Project, an NIH roadmap initiative, is to sequence and publish the genomes of gut microbiota. A BlastiMer analysis of the ICS sequences will be performed as these genome sequences become available.

ICS selection
ICS to be studied by experimental methods in order to validate computational predictions must have physicochemical properties compatible with peptide synthesis and experimental conditions. Hydrophobicity is an important parameter to consider because hydrophobic peptides can be difficult to synthesize, purify and solvate in aqueous buffers. To address this concern, the average hydropathy score for all the amino acids in an ICS was calculated [20]. Of the 1,805 ICS, 28 had hydrophobicity scores >2 and were removed from further consideration. The remaining 1,777 ICS clusters were then ranked according to EpiMatrix ICS score, as calculated by summing the individual EpiMatrix scores for all 9mer epitopes scoring at least 1.64 for all 8 Class II HLA alleles. The top 120 clusters that were selected are comprised of sequences that originate from 101 distinct ORFs with EpiMatrix scores ranging from 51.59 to 104.18. With regard to immunogenic sequences previously identified, none of the ICS clusters contain T-cell epitopes deposited in the Immune Epitope Database (http://www.immuneepitope.org/) and are therefore novel potential vaccine candidates. According to the clusters of orthologous groups (COGs) classification system, these ORFs consist of 53% metabolic, 17% cell process and signaling, 10% information storage and processing and 9% poorly characterized proteins (Figure 4) [21; http:// www.ncbi.nlm.nih.gov/sutils/coxik.cgi?gi=128]. In comparison with the distribution of ORFs for the entire 26695 genome, metabolic proteins are strikingly over-represented in the ICS dataset, while the other protein groups are closer in proportion. It is possible that H. pylori metabolic proteins, as a group, are more divergent than the others. Indeed, metabolic proteins of Salmonella enterica subspecies typhi and typhimurium have the greatest tendency for divergence, while their information processing proteins are least likely to evolve [22]. Thus, greater sampling of sequence space during evolution to acquire new metabolic functions may come at the detriment of host immune evasion with an increase in potential T cell immunogenicity. We further investigated whether the greater frequency of epitopes originating in metabolic proteins was pre-determined by core genome ORF selection and found that metabolic proteins are under-represented in the core genome in comparison with the whole genome ( Figure 4). Thus, over-representation of metabolic proteins among ICS clusters suggests that they may be a special class of potential vaccine immunogens. Because machine generated ICS are not optimally designed for peptide synthesis, the top 120 sequences were reviewed and "hand-edited" before attempting synthesis. This involved (i) dividing ICS where two distinct regions of immunogenic density were observed and (ii) trimming sequences to center around high scoring 9-mers for >3 HLA alleles. Hand-edited sequences that raised hydrophobicity scores over the cut-off described above were triaged. In total, 109 ICS with EpiMatrix scores ranging from 12.29 to 88.93 were submitted for peptide synthesis and 78 were successfully produced.
HLA binding ICS peptides were assayed in vitro for their capacity to bind multiple HLA types, including DRB1*0101, DRB1*0301, DRB1*0401, DRB1*0701, DRB1*1101 and DRB1*1501. Of the 468 ICS peptide-HLA binding interactions assayed, 54% displayed strong binding (estimated IC 50 <10 M), 24% showed moderate or weak binding (10 M<estimated IC 50 <100 M) and 21% displayed no binding (estimated IC 50 >100 M) ( Table 2). All peptides bound to at least one of the HLA alleles for which they were predicted to bind, 97% bound to two alleles, 87% bound to three, 79% bound to four, 65% bound to five and 44% bound to all. These data support the use of this approach for the high-volume genomic screening for vaccine candidates. Therefore, we will proceed to the next step in the genomes-to-vaccine development process with these highly conserved, highly promiscuous candidate epitopes. We analyzed the proportion of HLA binding peptides that were correctly predicted by immunoinformatic methods to assess the concordance of computational predictions and experimental results. We classified each binding reaction categorically as either a true positive, false positive, or true negative with positive predictions defined as epitopes scoring ≥1.64 on the EpiMatrix Z-scale and binding HLA at IC 50 <100µM. All but three ICS peptides were predicted to bind all HLA alleles assayed. As the three ICS below the EpiMatrix cut-off for a positive binding prediction failed to bind HLA in vitro, there were no false negative results in these assays. Overall, the proportion of true positive predictions is 79% ( Figure 5). With respect to each allele assayed, the values are 81% for DRB1*0101, 63% for DRB1*0301, 62% for DRB1*0401, 88% for DRB1*0701, 94% for DRB1*1101 and 87% for DRB1*1501. Categorical evaluations of each peptide"s EpiMatrix prediction association to in vitro HLA-binding were collected into a 2x2 contingency table. By chi-squared test, the association between immunoinformatic predictions (EpiMatrix Z-score ≥1.64) and HLA-binding results (IC 50 <100µM) is highly significant (p = 0.0007). Table 1 -Top 120 Immunogenic Consensus Sequences SEQUENCE refers to the amino acid sequence of the given ICS. SOURCE PROTEIN refers to the protein description from which each ICS is derived. 26695 ACCESSION NUMBER refers to the GenBank accession number source of the initial 9-mer "seed" for each ICS. EPX CLUSTER SCORE refers to the overall sum of significant scores aggregated and normalized.

Table 1 -Top 120 Immunogenic Consensus Sequences (continued)
SEQUENCE refers to the amino acid sequence of the given ICS. SOURCE PROTEIN refers to the protein description from which each ICS is derived. 26695 ACCESSION NUMBER refers to the GenBank accession number source of the initial 9-mer "seed" for each ICS. EPX CLUSTER SCORE refers to the overall sum of significant scores aggregated and normalized.

Table 2 -Validation of HLA Binding Prediction in HLA Binding Assay
PEPTIDE ID refers to a four-digit identifier for each ICS peptide. 26695 ACCESSION NUMBER refers to the GenBank accession number source of the initial 9-mer "seed" for each ICS peptide. SEQUENCE refers to the amino acid sequence of the given ICS peptide. DRB1 ALLELES refers to the alleles tested for ICS peptide binding; estimated binding affinities are IC 50 <1 µM (dark gray), 1 µM < IC 50 < 10 µM (medium gray), 10 µM < IC 50 < 100 µM (light gray), IC 50 >100 µM (white).
A number of possible explanations may account for the lack of accord between positive predictions and actual binding, including peptide folding, peptide aggregation under assay conditions, or the predictive accuracy of immunoinformatic algorithms. In a large, retrospective comparison of the EpiMatrix with epitope mapping algorithms in the public domain, EpiMatrix was >75% accurate across all the HLA Class II alleles studied here, which is as accurate as or more accurate than other epitope prediction tools [18]. Therefore, it is likely that a significant part of the discrepancy between predictions and HLA binding is due to peptide design and physical properties.

Conclusions
Using a genomes-to-vaccine approach described here, we were able to systematically narrow down over 1.2 million 9-mer sequences from seven bacterial genomes to an experimentally feasible number of potential vaccine candidates that demonstrate the capacity to bind HLA. In so doing, we derived a conservative estimate of the H. pylori core genome in comparison with previously reported core genomes. Furthermore, we observed that metabolic proteins might comprise a special group of H. pylori vaccine immunogens; the same may be the case in other microbial species but requires further investigation. In future studies, we will further validate the ICS peptides by evaluating their antigenicity through activation of T cells cytokine production in peripheral blood and gastric biopsy specimens obtained from H. pylori-infected patients. In addition, ICS peptide immunogenicity will be evaluated in vivo using humanized mice that express HLA DR1, DR3 and DR4. Broadly antigenic and immunogenic ICS will then be concatenated to generate a multiepitope sequence for production of DNA and protein vaccines that will be assessed for immunogenicity and protection against H. pylori infection.

Immunoinformatics
Seven H. pylori genomes were downloaded from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) in September 2009, including Helicobacter pylori 26695 (Accession NC000915; 1576 ORFs), Helicobacter pylori J99 (Accession NC000921; 1489 ORFs), Helicobacter pylori HPAG1 (Accession NC008086; 1536 ORFs), Helicobacter pylori G27 (Accession NC011333; 1493 ORFs), Helicobacter pylori Shi470 (Accession NC010698; 1569 ORFs), Helicobacter pylori B128 (Accession NZABSY00000000; 1731 ORFs) and Helicobacter pylori 98-10 (Accession NZABSX00000000; 1527 ORFs). Sequences were downloaded in GenPept format, where the accession number and corresponding amino acid sequence were exported and then uploaded to an in-house database. Conservatrix was used to parse input sequences into 9-mer strings and to search the resulting dataset for matching segments. A frequency table showing each unique segment in the dataset and the number of times that the sequence occurs was produced. The EpiMatrix algorithm was used to compare the amino acid sequence of each given 9mer peptide to the coefficients contained in the matrix and produced a raw score. In order to compare potential epitopes across multiple HLA alleles, Epi-Matrix raw scores were converted to a normalized "Z" scale. Peptides scoring ≥1.64 on the EpiMatrix "Z" scale (typically the top 5% of any given sample), are likely to be MHC ligands and are considered "hits". Class II epitopes were identified for 8 archetypal alleles that cover >90% of the human All ICS are predicted to bind at least one HLA allele tested in assay but not all are predicted to bind all six alleles. True positives represent sequences predicted to be HLA ligands (EpiMatrix Z-score ≥1.64) and bind in assay (IC 50 <100 M). False positives are sequences predicted to bind but do not in assay. True negatives are sequences not predicted to bind and do not bind in assay. No sequences that bound HLA in assay were predicted not to bind, thus there are no false negatives. With respect to each individual HLA allele, true positive predictions are 81% for DRB1*0101, 63% for DRB1*0301, 62% for DRB1*0401, 88% for DRB1*0701, 94% for DRB1*1101, and 87% for DRB1*1501. Overall, binding predictions were confirmed in 79% of cases.