Structure-based group A streptococcal vaccine design: Helical wheel homology predicts antibody cross-reactivity among streptococcal M protein–derived peptides

Group A streptococcus (Strep A) surface M protein, an α-helical coiled-coil dimer, is a vaccine target and a major determinant of streptococcal virulence. The sequence-variable N-terminal region of the M protein defines the M type and also contains epitopes that promote opsonophagocytic killing of streptococci. Recent studies have reported considerable cross-reactivity among different M types, suggesting the prospect of identifying cross-protective epitopes that would constitute a broadly protective multivalent vaccine against Strep A isolates. Here, we have used a combination of immunological assays, structural biology, and cheminformatics to construct a recombinant M protein–based vaccine that included six Strep A M peptides that were predicted to elicit antisera that would cross-react with an additional 15 nonvaccine M types of Strep A. Rabbit antisera against this recombinant vaccine cross-reacted with 10 of the 15 nonvaccine M peptides. Two of the five nonvaccine M peptides that did not cross-react shared high sequence identity (≥50%) with the vaccine peptides, implying that high sequence identity alone was insufficient for cross-reactivity among the M peptides. Additional structural analyses revealed that the sequence identity at corresponding polar helical-wheel heptad sites between vaccine and nonvaccine peptides accurately distinguishes cross-reactive from non–cross-reactive peptides. On the basis of these observations, we developed a scoring algorithm based on the sequence identity at polar heptad sites. When applied to all epidemiologically important M types, this algorithm should enable the selection of a minimal number of M peptide–based vaccine candidates that elicit broadly protective immunity against Strep A.


Streptococcus pyogenes, or group A streptococcus (Strep A), causes noninvasive infections such as pharyngitis, impetigo, and cellulitis, as well as invasive and life-threatening infections such as bacteremia, streptococcal toxic shock syndrome, and necrotizing fasciitis (1). Poststreptococcal sequelae that may follow Strep A infections include acute rheumatic fever, rheumatic heart disease, and glomerulonephritis (2). Rheumatic heart disease and invasive Strep A infections are estimated to cause over 500,000 deaths/year, with the vast majority of the disease burden in low- and middle-income countries where the mortality rate is also disproportionately high (3).
To reduce the health burden and the economic impact of these infections, efforts to develop an effective vaccine against Strep A have been ongoing (4). There are a number of potential vaccine candidates (4-8), and in particular, there has been considerable progress in the preclinical and clinical development of multivalent M protein-based vaccines (9-13). These vaccines have been formulated using recombinant fusion proteins containing multiple peptides from the variable N terminus of epidemiologically prevalent M types of Strep A (10,11). One obstacle to this approach has been the perception that protection against Strep A infections is type-specific and that the diversity of M types (>200) may prevent the development of broadly efficacious M protein-based vaccines. However, there is recent evidence that N-terminal M peptides evoke antibodies that cross-react with heterologous M types of Strep A (11), and natural infection with Strep A elicits cross-opsonic antibodies (14). This is supported by the observation that the majority of M proteins are members of structurally and functionally related clusters, indicating that immunity may be a combination of cluster-specific and type-specific antibody responses (15).
We previously showed that the relatedness of M peptides within a single M cluster could be exploited by using a computational structure-based approach to select five N-terminal M peptides that elicited antibodies that reacted with all 17 M peptides in the cluster and opsonized 15 of 17 M types of Strep A (16). In the present study we examined 117 M proteins, responsible for 92% of Strep A infections globally (15,17). The N-terminal hypervariable region (HVR) of the M protein (residues ~1-50) contains epitopes that elicit antibodies with the greatest bactericidal activity and is least likely to elicit antibodies that cross-react with host tissues (4, 9-11, 18). Therefore, we confine our analysis to residues 1-50 and have redefined the sequence-based M clusters (15) to include seven N-terminal clusters (NTCs) containing all 117 M peptides. In this initial study, we focused on a cluster of 21 peptides (NTC6) that shared significant sequence similarity. Using a structure-based computational approach, we designed an NTC6 vaccine that elicited antibodies in rabbits that cross-reacted with 10 of 15 nonvaccine peptides. A posteriori analysis of the nonreactive versus cross-reactive peptides revealed that sequence identity within the polar heptad sites of the predicted α-helical domains within the N-terminal region is a strong predictor of cross-reactivity. The application of this new approach to the structure-based design of multivalent vaccines may result in more broadly cross-reactive and efficacious M protein vaccines.

Sequence-based clustering
A total of 117 M types were divided into M peptide NTCs (Fig. 1) by constructing a phylogenetic sequence-based tree of the N-terminal 50 amino acids of the mature proteins that define the HVR (Geneious, version 9.1.6). The seven N-terminal clusters were designated based on the calculated common branches. The overall phylogenetic relationships of the N-terminal peptides bear some resemblance to the previous description of M clusters based on the whole M sequences (15). In this study we limited the analysis to the NTC6 cluster, which contains 21 different M types that collectively accounted for 33% of all Strep A isolates from children with pharyngitis in North America (19), many of which are prevalent globally (17).

Subclusters of immunologically similar M peptides
A functional matrix of antibody binding and cross-reactivity among the NTC6 peptides, which describes the inhibition of antibody binding to 12 NTC6 M peptides by all 21 peptides in the cluster, was developed by performing ELISA inhibition experiments. The relational matrix of experimentally obtained antibody binding between NTC6 peptides (Table S1) was subclustered using k-means into seven immunologically related peptide groups (Fig. 2). To resolve the optimal number of clusters, k was varied from 2 to 9, and the maximal average silhouette coefficient was obtained for k = 7 (Fig. S1). The silhouette coefficient is a measure of the quality of a clustering; in other words, it indicates how closely related the objects in a cluster are and how well-separated the cluster is from other clusters (20). Clusters with high silhouette coefficients are well-separated and were considered to contain M peptides more likely to cross-react than clusters with low silhouette coefficients. For example, from Fig. 2, M84 and M89 belonging to FC2 (s = 0.46) would be predicted to cross-react with greater probability than M1, M9, and M227 belonging to FC1 (s = 0.19).
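The silhouette coefficient used above can be illustrated with a minimal sketch. It assumes a precomputed symmetric distance matrix and a list of cluster labels; the function names and the four-point toy data are hypothetical and are not taken from the study, which applied k-means to the ELISA-inhibition matrix.

```python
from statistics import mean


def silhouette_values(dist, labels):
    """Per-object silhouette s(i) = (b - a) / max(a, b).

    dist   : square symmetric matrix (list of lists) of pairwise distances
    labels : cluster label for each object
    Objects in singleton clusters are assigned s = 0 by convention.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = mean(dist[i][j] for j in own)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean(dist[i][j] for j in members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def average_silhouette(dist, labels):
    """Average silhouette over all objects; higher means better separation."""
    return mean(silhouette_values(dist, labels))


# Toy data (hypothetical): four points on a line forming two tight pairs.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = average_silhouette(dist, [0, 0, 1, 1])  # pairs grouped together
bad = average_silhouette(dist, [0, 1, 0, 1])   # pairs split apart
```

Scanning k from 2 to 9, as described in the text, would amount to repeating the clustering for each k and keeping the k with the largest average silhouette.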

Clusters of structurally and immunologically similar M peptides
The structures of the 21 M peptides were calculated using the de novo computational framework, PEP-FOLD3 (21) (Fig. S2A). To identify structural features that are most relevant to correctly predicting antibody binding, we used the 44 structure-based protein descriptors from the Molecular Operating Environment program (22) as independent variables to describe the PEP-FOLD3 models and the columns of the relational matrix as the dependent variables in a multiple regression analysis with feature ranking. The feature ranking resulted in a subset of 20 top-ranked descriptors from the supervised regression approach that were then used in k-means clustering of the NTC6 M peptides to obtain experimentally informed structure-based clusters (Fig. 3). An overlay of the models that have been grouped together for each cluster shows that models belonging to the same cluster share greater structural similarity than models belonging to different clusters (Fig. S2B).
There was considerable overlap between the experimentally informed clusters and the antibody-binding function-based clusters (Rand index = 0.77). FC3, FC4, FC6, and FC7 in Fig. 2 corresponded to experimentally informed clusters 1, 5, 4, and 7, respectively, in Fig. 3. This demonstrates that the top 20 experimentally informed structure-based descriptors can adequately group together different M peptides with similar antibody-binding function.
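The Rand index quoted above measures the fraction of object pairs on which two clusterings agree. The following is a minimal sketch of that definition; the function name and the toy labelings are hypothetical, and how the two partitions were paired up in the study is not restated here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both clusterings.

    A pair counts as an agreement when both clusterings put the two objects
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    if len(labels_a) < 2:
        raise ValueError("need at least two objects")

    agreements = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agreements += same_a == same_b
        total += 1
    return agreements / total


# Toy labelings (hypothetical): identical partitions with renamed clusters
# agree on every pair, so the index is 1.0 regardless of label names.
identical = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

A value of 1 means the partitions are identical up to relabeling; lower values indicate more pairs on which the two clusterings disagree.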
Next, the experimentally informed clusters (Fig. 3) were considered together with the functional data (Fig. 2) to select a minimal number of peptides predicted to elicit broad cross-reactivity against the remaining 15 M peptides in NTC6. The s coefficient was used as the main criterion in this selection. However, in Figs. 2 and 3, for some of the clusters the values of s do not indicate strong clustering of data. Therefore, additional information was taken into account in selecting the final vaccine candidates. In the case of the experimentally informed clusters, if three or more PEP-FOLD3 models of different M types each shared a cluster, they were considered to be more likely to cross-react than those with fewer than three models in the same cluster. As an example, in Fig. 3 we observed that cluster 6 contained only one model each of M114, M112, M73, M50, and M49, and these M types were considered unlikely to be immunologically related. In contrast, M1, M238, and M239 had all five models placed in cluster 4 and were considered likely to cross-react. Based on these considerations and also taking into account epidemiological prevalence, six M types were predicted to elicit cross-reactive antibodies against the remaining 15 M peptides in NTC6 (Table 1). The vaccine construct is shown in Fig. 4A.

Hexavalent NTC6 vaccine evoked antibodies against vaccine and nonvaccine peptides
In three rabbits immunized with the hexavalent NTC6 vaccine, immune sera contained significant levels of antibodies against all vaccine peptides as well as significant levels of antibodies against 10 of the 15 nonvaccine peptides (Fig. 4B). All preimmune sera resulted in antibody titers of 100 against all 21 peptides. For this analysis, the nonvaccine peptides were considered to be cross-reactive when two of the three immune sera displayed antibody titers of at least 800, which is an 8-fold increase over preimmune antibody levels.
Figure 3 legend (fragment): The tick size on the bottom axis represents one 3D PEP-FOLD3 model (five were generated for each sequence). For example, cluster 5 contains two models of M112, five models of M102, and three models of M77. s refers to the silhouette coefficient and is reported for each cluster.
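The cross-reactivity criterion stated above (titers of at least 800 in two of the three immune sera, against a preimmune titer of 100) can be written as a short check. This is a sketch only; the function names and the example titers are hypothetical and are not data from the study.

```python
from statistics import geometric_mean  # available from Python 3.8


def is_cross_reactive(immune_titers, threshold=800, min_sera=2):
    """True when at least `min_sera` immune sera reach the titer threshold.

    The defaults mirror the criterion in the text: 800 is an 8-fold
    increase over the preimmune titer of 100.
    """
    return sum(t >= threshold for t in immune_titers) >= min_sera


def geometric_mean_titer(immune_titers):
    """Geometric mean titer across the sera of one peptide."""
    return geometric_mean(immune_titers)


# Hypothetical titers for two nonvaccine peptides (three rabbits each).
peptide_x = [800, 1600, 400]   # two of three sera reach 800
peptide_y = [400, 400, 1600]   # only one serum reaches 800
```

Here `is_cross_reactive(peptide_x)` holds while `is_cross_reactive(peptide_y)` does not, even though both peptides have one serum with a titer of 1600.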

Cross-reactivity and sequence identity were correlated but not without exceptions
Given the experimental results in Fig. 4B, it was of interest to conduct an a posteriori analysis of factors that distinguish antibody cross-reactivity with nonvaccine peptides. To understand the extent of correlation between sequence identity and the antibody cross-reactivity, we performed a pairwise sequence alignment between the 21 M peptides and calculated their mutual sequence identities using the EMBOSS Needle program (23) (Table 2). The maximum pairwise sequence identities between nonvaccine peptides and vaccine peptides extracted from Table 2 are listed in Table 3.
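As an illustration of how a pairwise identity and the per-peptide maximum can be derived from alignments, the sketch below takes two already-aligned, gapped sequences. It assumes identity is counted as identical columns divided by the alignment length, gaps included; the function names and the toy sequences are hypothetical, and the alignment step itself (performed with Needle in the study) is not reproduced.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, so gap
    columns lower the identity.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


def max_identity_to_vaccine(alignments):
    """Largest identity of one nonvaccine peptide to any vaccine peptide.

    alignments: dict mapping a vaccine type name to the pair
                (aligned nonvaccine sequence, aligned vaccine sequence)
    Returns (vaccine type, percent identity).
    """
    scored = {name: percent_identity(a, b) for name, (a, b) in alignments.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy alignment (hypothetical sequences): 6 identical columns out of 8.
example = percent_identity("NGDGNPRE", "NGDG-PKE")
```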
Five nonvaccine M types (M84, M124, M114, M175, and M232) have pairwise sequence identities greater than 60% with at least one vaccine M peptide and exhibited cross-reactivity with the antibodies elicited by the vaccine. The nonvaccine M types with low pairwise sequence identity (<40%) with the vaccine peptides were M15 and M50, and these M peptides displayed marginal cross-reactivity with antibodies elicited by the vaccine. Thus, there is a moderate positive correlation between cross-reactivity and the pairwise sequence identity between vaccine and nonvaccine peptides, as indicated by the Spearman correlation (ρ = 0.56, p = 0.05) (Fig. 5). However, when nonvaccine peptides shared between 40 and 60% sequence identity with any of the vaccine peptides, sequence identity could not be used to infer whether the antibodies raised by the vaccine peptide would cross-react with the nonvaccine peptide. For instance, M238, M112, and M183 share 45-60% sequence identities with vaccine peptides and were only marginally cross-reactive with the vaccine antisera. On the other hand, nonvaccine peptides M9, M49, M102, M227, and M239 share a similar degree of sequence identity with at least one of the vaccine M peptides and did cross-react with the NTC6 vaccine antisera. Thus, although significant sequence identity is a useful consideration for antibody cross-reactivity, it alone cannot reliably predict cross-reactivity.
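The Spearman correlation reported above is the Pearson correlation computed on ranks. A minimal sketch follows, using average ranks for ties; the function names and the toy numbers are hypothetical, and the p value calculation is not included.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data (hypothetical): a monotonic but nonlinear relation still gives +1.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
```

Because only ranks enter the calculation, the coefficient captures any monotonic relation between identity and titer, not just a linear one.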

Coiled-coil heptad repeat sequence identity
Although overall sequence identity was correlated with cross-reactivity, one might expect the three-dimensional structure of the antigen in vivo to also be of importance. The monomer structures predicted by PEP-FOLD3 have extensive helical content, and in many cases the peptide was predicted to fold on itself in a manner resembling coiled-coil structures. The α-helical coiled-coil structure is also evidenced by available crystal structures (24, 25) of M1, M2, M22, M28, and M49. Therefore, in a further analysis, we assume that this coiled-coil extends into the N terminus. Heptad repeats form coiled coils, and the positions in the heptad repeat are labeled a-g. The core-forming positions of the coiled-coil (a and d) are usually occupied by hydrophobic residues, whereas the remaining, solvent-exposed positions (b, c, e, f, and g) are dominated by hydrophilic residues.
The probability of shared epitopes between vaccine and nonvaccine M types increases with increased sequence identity. Therefore, instead of simply comparing overall sequence identity, we considered only the region within the N-terminal 1-50 residues that is predicted to be significantly coiled-coil (MARCOIL (26) assigned coiled-coil probability ≥ 2%) and calculated the sequence identity between each of the corresponding heptad sites of vaccine and nonvaccine M types. The heptad repeat projected on a helical wheel generated using DRAWCOIL 1.0 (27) for M types that share high overall sequence identity with vaccine type M1 (>40%) is shown in Fig. 6.

Empirical scoring scheme for predicting cross-reactivity
We developed an empirical scoring scheme that penalizes low sequence identity and rewards high pairwise sequence identity at the corresponding heptad sites between the vaccine and nonvaccine M types. Pairwise alignments of the vaccine and nonvaccine peptides at the heptad positions were performed. In calculating the empirical pair score between vaccine and nonvaccine M types, we left out the hydrophobic positions (a and d) of the heptad. This was done in part because little sequence variation is found at the hydrophobic core sites of the heptad (a and d), which are restricted in terms of the types of residue that can occupy those positions, favoring hydrophobic residues such as leucine, isoleucine, valine, and alanine. Heptad positions a and d are therefore conserved among many M types and do not serve as good discriminants between cross-reactive and non-cross-reactive types. Most of the NTC6 M types are leucine zippers, i.e. they contain an abundance of leucines at the d site. Another reason for not taking a and d into account is that on the surface of the bacterium, and possibly in the vaccine, these residues are buried and are presumably least accessible for antibody contact and recognition.
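The idea of scoring identity only at the polar heptad positions can be sketched as follows. The per-position identity calculation follows the description above, but the reward, penalty, and cutoff values in `polar_heptad_score` are placeholders chosen for illustration and are not the weights used in the study; the function names, the heptad register, and the two toy sequences are likewise hypothetical.

```python
POLAR_SITES = "bcefg"  # core positions a and d are deliberately left out


def heptad_site_identity(seq_a, seq_b, register):
    """Fraction of identical residues at each heptad position.

    seq_a, seq_b : equal-length, already-aligned coiled-coil segments
    register     : one heptad letter ('a'..'g') per residue
    """
    if not (len(seq_a) == len(seq_b) == len(register)):
        raise ValueError("sequences and register must have equal length")
    tallies = {}  # site -> [identical count, total count]
    for x, y, site in zip(seq_a, seq_b, register):
        tally = tallies.setdefault(site, [0, 0])
        tally[0] += x == y
        tally[1] += 1
    return {site: same / total for site, (same, total) in tallies.items()}


def polar_heptad_score(identity_by_site, cutoff=0.5, reward=1.0, penalty=1.0):
    """Sum of per-site rewards and penalties over the polar positions.

    ASSUMPTION: cutoff, reward, and penalty are illustrative placeholders,
    not the published scoring weights.
    """
    score = 0.0
    for site in POLAR_SITES:
        identity = identity_by_site.get(site)
        if identity is None:
            continue  # this heptad position does not occur in the segment
        score += reward if identity >= cutoff else -penalty
    return score


# Toy example (hypothetical sequences spanning two heptads).
register = "abcdefg" * 2
vaccine = "LKELEDKLQELKDK"
nonvaccine = "LKDLEDRLQDLKNK"
identity = heptad_site_identity(vaccine, nonvaccine, register)
score = polar_heptad_score(identity)
```

In this toy pair the core positions a and d are fully conserved, which is exactly why they carry no discriminating information and are ignored by the score.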

Table 1 (column headings): Vaccine type; Nonvaccine type predicted to be covered

score was then calculated by summing up the scores at each of the polar heptad positions. As an example, the pairwise sequence identity of nonvaccine M238 with vaccine peptides at each of the heptad positions is shown in Table 4. The empirical pairwise score for M238 with vaccine peptide M1 is thus calculated:

Low sequence identity at heptad positions is correlated with low antibody cross-reactivity against M peptides that share high overall sequence identity
Empirical pairwise scores based on the sequence identity at the heptad sites between all vaccine and nonvaccine peptides for NTC6 are given in Table S2. Table 5 shows the maximum empirical pairwise score between nonvaccine and vaccine M types for the NTC6 peptides based on heptad homologies. The score distinguished between cross-reactive and marginally cross-reactive M peptides. When the score for a pair was ≥10.5, antibody cross-reactivity was observed. Conversely, when a score was <10.5, cross-reactivity was not observed. Fig. 7 shows the geometric mean antibody titers obtained after immunizing three rabbits with the NTC6 vaccine plotted against the maximum heptad identity score of each nonvaccine peptide. A score of <10.5 thus identified all five marginally cross-reactive peptides in the NTC6 cluster. The Spearman rank-order correlation between score and titer was ρ = 0.75 (p = 0.001).
An illustration of the use of heptad identity is the correct prediction of the lack of cross-reactivity between M238 and vaccine peptide M1, despite the high overall sequence identity between the two (54%). These results indicate that mutual heptad identity between the vaccine and nonvaccine M types is an important indication of the degree of shared epitopes, as well as an important determinant of antibody cross-reactivity among sequence-similar M peptides.
Figure 4B legend (fragment): Peptide-specific and cross-reactive antibodies evoked in rabbits by the NTC6 vaccine. Antibody titers from the immune sera of three rabbits are shown. For this analysis, the nonvaccine peptides were considered to be cross-reactive when two of the three immune sera displayed antibody titers of at least 800, which is an 8-fold increase over preimmune antibody levels. As a group, the marginally cross-reactive peptides resulted in geometric mean titers <700 (range 126-635), whereas the cross-reactive peptides resulted in geometric mean titers of >1,200 (range 1,270-25,600).
Table 2. The pairwise sequence identity matrix between 21 M types. The asterisk denotes a vaccine M type. The highest sequence identity that an M type can share with a vaccine M type is highlighted with a gray background. The five marginally cross-reactive M types are shown with a black background.

Discussion
The M protein of Strep A is a major protective antigen and a leading vaccine target. A significant challenge to the development of M protein-based Strep A vaccines has been the number of different M types (>200) identified to date. Each emm type is defined by the 5′ sequence that encodes the variable N terminus of the mature protein (28). Recently it has been shown that the majority of M proteins can be clustered based on similar structural and functional characteristics (15), leading to a new paradigm suggesting that M antibody responses may be M cluster-specific as well as type-specific. The overall goal of structure-based approaches to vaccine design is to focus on the structural and functional similarities of M proteins to identify the fewest number of M peptides needed to formulate vaccines that will elicit immune responses against the majority of epidemiologically important M types of Strep A. In the present study, we have used both 3D structure-based and sequence-based approaches to analyze the antibody cross-reactivity among Strep A N-terminal M peptides from one sequence-related cluster. Because our subunit vaccines contain peptides from the N-terminal regions of the M proteins (11), we have redefined the M clusters using only the first 50 amino acid residues, as opposed to our previous studies that clustered the sequences of the entire protein (15).
In principle, if a complete antibody-binding matrix were available, it could simply be clustered to obtain immunologically related M types. However, Table S1 is incomplete because the human intravenous immunoglobulin (IVIG) antibodies did not react with all peptides, so we identified structural features that correlated with immunological data and then used those features to cluster M types. The end result was the identification of clusters of peptides that shared immunologically relevant structural features. This also allowed us to test the hypothesis that M types that share similar structural features are immunologically related.
The starting point for the current structure-based analysis was sequence similarities among M peptides within the same cluster. However, because sequence similarity does not directly take into account conformational similarity of putative epitopes, we then extended the analysis to determine 3D structures of the peptides and to compare their shape-dependent properties using cheminformatics methods. All 21 peptides within the NTC6 cluster were then subclustered based on similarities among 3D models and functional antibody inhibition studies. Six peptides, each representing one of the subclusters, were selected to construct the recombinant NTC6.1 vaccine. The vaccine elicited significant levels of cross-reactive antibodies against 10 of the 15 nonvaccine peptides within NTC6.
In an a posteriori analysis of the cross-reactivity results, we found that a method for distinguishing cross-reactive from non-cross-reactive pairs that considers only the coiled-coil domain within the N-terminal region and that calculates the homology between the residues at the corresponding polar heptad sites was more discriminating than predicting cross-reactivity from calculated 3D monomer structures. This method makes the assumption that the peptide epitopes are in a helical conformation. Although coiled-coil structures are strongly predicted for the M proteins, there is some evidence that the N-terminal ~15 residues are less likely to be coiled coils (24). The heptad identity-based scoring scheme incorporates simple structural data into sequence-based considerations and is a computationally efficient and effective way of predicting antibody cross-reactive immunogenicity among Strep A M peptides. Two factors that must be considered while determining antibody cross-reactivity are that the primary requirement for antibody binding or antigenicity is surface accessibility (29,30) and that 80% of all naturally occurring antibody epitopes studied so far are discontinuous in sequence, as expected because the antibody sees a contiguous surface. The heptad identity-based scoring scheme takes into account both of these factors by assigning precedence to surface-accessible polar residues in forming the dominant epitopes that stimulate the immune response and by searching for homology in the surface-accessible regions between nonvaccine peptides and vaccine peptides. The determination of the helical wheel homology helps ascertain the probability of a nonvaccine M type sharing the same epitopes as the vaccine M type and the extent to which it will be recognized by the same antibody that recognizes the vaccine M type. We have shown that an analysis of the conservation of exposed residues between the vaccine and nonvaccine peptides aided by the construction of a helical wheel yields a positive correlation with observed antibody cross-reactivity. Future Strep A structural vaccinology work will include expanding epitope analysis with solution biophysical experiments and detailed molecular dynamics simulation of the native M proteins and vaccines. Our previous studies of multivalent vaccines containing N-terminal peptides of M proteins have shown that these complex vaccines elicit type-specific and cross-reactive antibodies that promote opsonization and killing of Strep A bacteria (11,16). Future experiments will rely on several structure-based approaches to identify the fewest number of M peptides to include in broadly protective vaccines that will be tested for functional opsonic antibodies against multiple epidemiologically important M types of Strep A.
Figure legend (fragment): The range of the Spearman rank correlation is between −1 and +1, with +1 indicating that the Y variable is a perfectly increasing monotone function of the X variable and −1 indicating that the Y variable is a perfectly decreasing monotone function of the X variable. A Spearman correlation of 0 signifies no correlation between the X and Y variables.
Figure 6 legend (fragment): Only heptad repeat regions with probability percentages of >2% were considered for helical wheel representation. The view is from the N-terminal region to the C-terminal region. Heptad repeat positions are labeled a-g. The following color scheme is followed: polar positive, blue; polar negative, red; polar neutral, orange; and nonpolar aliphatic, gray.

Vaccine design strategy
The workflow for the selection of vaccine candidates is presented in Fig. 8. Sequence-based clustering of the N-terminal 50 residues was performed, and one cluster, NTC6, containing 21 M types was selected for study. We aimed to select a minimum number of M peptides that would elicit antibody responses against the majority of streptococcal M peptides within NTC6. The sequences of the HVRs of the 21 M types are provided in Table S3, and their pairwise sequence identities were obtained using the EMBOSS program Needle (Tables 2 and 3). 3D structural models were created of monomeric forms of these peptides upon which cheminformatic analysis was performed. Partial data on experimental antibody binding were also derived (antibody cross-reactivity and the development of a functional matrix). Combining the experimental antibody-binding data with the cheminformatic results involved multiple regression calculations and resulted in the generation of experimentally informed cheminformatics descriptors. The candidate experimental vaccine was chosen based on the experimental and computational results.
Antibody cross-reactivity and the development of a functional matrix. A functional matrix of antibody binding and cross-reactivity among the NTC6 peptides was developed using naturally acquired human antibodies in commercially available IVIG (Gammagard liquid; Baxalta US Inc., Westlake Village, CA). IVIG contains purified IgG that is pooled from the serum of multiple donors and contains antibodies against many of the common M antigens. Functional relatedness was assessed by ELISA-inhibition experiments performed in duplicate (16) using all 21 NTC6 peptides as inhibitors of IVIG antibody binding to 12 of the NTC6 peptides (Table S1). The 12 peptides were selected based on direct ELISA results, indicating that the IVIG contained levels of antibodies sufficient to perform inhibition assays. Relative functional "distances" were expressed as 100 − (percentage of inhibition), where 0 = complete functional identity and 100 = no functional activity, i.e. no inhibition. k-means clustering of antibody binding between NTC6 peptides was used to identify "subclusters" of functionally related peptides, and the optimal number of clusters was determined by a silhouette analysis (20).
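The conversion from percent inhibition to functional distance described above is a one-line transformation. The sketch below applies it to a small nested mapping; the function names, the peptide labels, and the inhibition values are hypothetical and are not taken from Table S1.

```python
def functional_distance(percent_inhibition):
    """Distance = 100 - percent inhibition.

    0 means complete functional identity (full inhibition);
    100 means no inhibition.
    """
    return 100 - percent_inhibition


def distance_matrix(inhibition):
    """Convert inhibition[inhibitor][target] into functional distances."""
    return {
        inhibitor: {target: functional_distance(value)
                    for target, value in row.items()}
        for inhibitor, row in inhibition.items()
    }


# Hypothetical inhibition values (percent) for two inhibitors and two targets.
toy_inhibition = {
    "pepA": {"pepA": 100, "pepB": 35},
    "pepB": {"pepA": 20, "pepB": 100},
}
toy_distances = distance_matrix(toy_inhibition)
```

Note that such a matrix need not be symmetric, because inhibition of binding to one peptide by another is measured separately in each direction.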
Structural models of monomeric M peptides. In designing the vaccine we used our previous approach (16) of modeling the structures of the linear peptides in monomeric form in aqueous solution and identifying physicochemical similarities in the resulting computed structures as an indication of cross-reactivity. PEP-FOLD3 (21) is a computational framework that allows de novo structure prediction for linear peptides between 5 and 50 amino acids. PEP-FOLD3 was used to predict the 3D structures of the N-terminal regions (50 residues long) of all 21 NTC6 M peptides as monomers (Fig. S2A). Five top-scoring conformations per M peptide were retained for calculation of structural descriptors. The Molecular Operating Environment program package (22) was used to calculate structure-based physicochemical descriptors (44 nonzero-variance descriptors) such as peptide surface area, patches of excess charge, hydrophobicity, volume, and shape of the 1-50 residue 3D models of M peptides that were obtained using PEP-FOLD3 with default parameters. For an initial quality check, clusters based on all the nonzero-variance descriptors are shown in Fig. S3.
Machine learning to identify clusters of structurally and immunologically related M types. Instead of using all of the nonzero-variance descriptors as in our previous study (16), we chose only the top descriptors that correlated the PEP-FOLD3-predicted structures with antibody binding. We assumed that the use of these descriptors would improve the probability of identifying shared structural similarities and shared epitopes. We used supervised machine learning employing multiple regression to rank and identify the descriptors that contributed the most to the M peptide structure-activity relationship. The experimental inhibition of antibody binding to each of the 12 NTC6 M types in the relational matrix (Table S1) was used as a response or dependent variable, and the 44 scaled protein descriptors were used as predictors/independent variables in univariate multiple regression models. The univariate regression allowed us to rank the 44 descriptors per regression model and identify a subset of higher-ranked descriptors that appeared more frequently than others in the 12 regression models.
Following the multiple regression analysis, the subset of the 20 top-ranked cheminformatic descriptors from the supervised regression approach was used in k-means clustering of the NTC6 M peptides to obtain subclusters. The new subclusters thus obtained were compared with those that were obtained from clustering experimental antibody-binding data, i.e. the clusters based on the 12 response variables. We observed some correlation between the subgroups of M peptides defined by the k-means clustering based on the 20 physicochemical descriptors and those based on immunological relatedness. We were thus able to use antibody-binding data in a supervised learning approach to identify important descriptors that were then used in an unsupervised k-means clustering scheme to identify groups of M peptides that were predicted to be both structurally and functionally related.
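The step of picking descriptors that rank highly in many of the regression models can be sketched as a simple frequency count. This is an illustration under stated assumptions: each model is assumed to have already produced a best-first ordering of descriptors, and the function name, the `top_n` and `keep` parameters, and the descriptor names are hypothetical. The ranking criterion used within each regression model is not specified here.

```python
from collections import Counter


def frequent_top_descriptors(rankings, top_n, keep):
    """Descriptors that most often appear among the top `top_n` of each model.

    rankings : list of per-model descriptor lists, ordered best first
    top_n    : how many leading descriptors to count from each model
    keep     : how many descriptors to return overall
    """
    counts = Counter()
    for ranked in rankings:
        counts.update(ranked[:top_n])
    # most_common sorts by count, highest first.
    return [name for name, _ in counts.most_common(keep)]


# Hypothetical rankings from three regression models.
toy_rankings = [
    ["area", "volume", "charge", "shape"],
    ["volume", "area", "shape", "charge"],
    ["volume", "hydrophobicity", "area", "charge"],
]
selected = frequent_top_descriptors(toy_rankings, top_n=2, keep=2)
```

The selected subset would then serve as the feature set for the unsupervised k-means step described in the text.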
NTC6 vaccine construction, immunization of rabbits, and detection of antibodies-Six NTC6 peptides containing residues 1-50 of M1, M77, M89, M2, M73, and M118 were selected as vaccine components based on the structural and antibody-binding predictions described above. A recombinant hybrid protein vaccine containing the six M peptides joined in tandem without linkers was produced from extracts of Escherichia coli containing pUC57, into which was inserted a chemically synthesized hybrid gene (Genscript, Piscataway, NJ), using methods previously described (11). The synthetic gene was also designed to encode an upstream T7 promoter and a 3′ polyhistidine motif followed by a stop codon. Three rabbits were immunized with 200 μg of protein on alum via the intramuscular route at time 0, 4 weeks, and 8 weeks. Serum was obtained prior to immunization and 2 weeks after the final injection. Antibody levels in preimmune and immune sera were assayed using the 21 NTC6 M peptides as solid-phase antigens, as previously described (16). Assays were repeated once to confirm the results of the initial ELISA. Antibody titer was defined as the reciprocal of the last serum dilution resulting in an A450 of ≥0.2. All research involving animals was reviewed and approved by the University of Tennessee Health Science Center Institutional Animal Care and Use Committee.
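The titer definition can be expressed as a short Python sketch. The function name `endpoint_titer`, the dilution series, and the absorbance readings are hypothetical; the sketch also assumes that "last" means the highest dilution reached before the signal first falls below the cutoff.

```python
def endpoint_titer(readings, cutoff=0.2):
    """Return the reciprocal of the last dilution with A450 >= cutoff.

    readings: iterable of (dilution_factor, a450) pairs, where a 1:400
    dilution is given as 400. Returns None if even the lowest dilution
    is below the cutoff.
    """
    titer = None
    for dilution, a450 in sorted(readings):
        if a450 < cutoff:
            break  # assumption: stop at the first reading below the cutoff
        titer = dilution
    return titer

# Hypothetical fourfold dilution series for one immune serum.
immune = [(100, 1.80), (400, 1.10), (1600, 0.55), (6400, 0.21), (25600, 0.08)]
# Hypothetical preimmune serum with no detectable binding.
preimmune = [(100, 0.05), (400, 0.04)]

print(endpoint_titer(immune), endpoint_titer(preimmune))
```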

Prediction of coiled-coil regions of M peptides to determine helical wheel homology among M types
After the vaccine design and testing phase, we searched for 3D structural correlates distinguishing cross-reactive from nonreactive peptide pairs. A simple approach makes use of the predominantly α-helical coiled-coil dimeric structure of the M protein (18,31). A canonical coiled-coil structure consists of two α-helices twisting around each other with their side chains interlocking in a "knobs-into-holes" packing. The regular meshing of knobs into holes requires recurrence of the side-chain residue types every seven residues along the helix interface (32). Various tools exist to predict the structures of coiled-coil regions in proteins to a level of detail that permits the assignment of individual residues to the positions of the heptad repeat. One such tool is MARCOIL, which calculates posterior probabilities of hidden Markov models and has been reported to offer the best combination of sensitivity and speed compared with other similar tools (26,33). Knowledge of the heptad register from tools like MARCOIL can eliminate the need for homology-based methods for modeling coiled-coil proteins, because the structure of a coiled coil, unlike that of almost any other known protein fold, can be computed from parametric equations if the heptad assignment is known.
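To make the notion of a heptad register concrete, the following Python sketch labels each residue with a position a-g once the register of the first residue is known. The toy sequence is an idealized heptad repeat invented for illustration, not an M protein sequence, and the functions `assign_register` and `residues_at` are hypothetical helpers rather than part of MARCOIL.

```python
HEPTAD = "abcdefg"

def assign_register(sequence, start_register):
    """Pair each residue with its heptad position, given the first residue's register."""
    offset = HEPTAD.index(start_register)
    return [(res, HEPTAD[(offset + i) % 7]) for i, res in enumerate(sequence)]

def residues_at(sequence, start_register, positions):
    """Collect, in sequence order, the residues that fall on the given heptad positions."""
    return "".join(
        res for res, pos in assign_register(sequence, start_register) if pos in positions
    )

# Idealized two-heptad toy sequence (hypothetical, not an M peptide).
toy = "LKKLEEKLKKLEEK"

# With the first residue at position a, the core a/d sites are all leucine.
print(residues_at(toy, "a", "ad"))
# Shifting the register to f places charged residues on the a/d sites instead.
print(residues_at(toy, "f", "ad"))
```

The two calls show why the starting register matters: the same sequence yields a different set of core residues when the register is shifted.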
We confirmed the MARCOIL prediction of coiled-coil domains and the assignment of individual residues in a sequence to the heptad by comparing it with the knobs-into-holes interactions in experimental 3D X-ray crystallographic structures recognized by the SOCKET program (34). The SOCKET program can both identify the repeating knobs-into-holes structural motif and use this information to assign oligomer order (number of helices), orientation (parallel, anti-parallel, or mixed), and heptad register for the coiled coil. We found the prediction by MARCOIL and the actual heptad assignment calculated by SOCKET for the three available M dimer crystal structures (PDB entries 2OTO, 5HZP, and 5HYT) to be identical, thus validating further use of MARCOIL (Table S4 and Fig. S4, A-C). Therefore, the heptad assignment of the HVR of all 21 M types was made using MARCOIL. Table S5 provides the sequence predicted to be coiled-coil within residues 1-50 of each M peptide, along with its starting heptad register and sequence length. Additionally, the sequences at the heptad sites for each of the M peptides are provided in Table S6.
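Once each peptide has a starting register, residues at corresponding heptad sites can be compared between two peptides. The Python sketch below illustrates one way to do this; it is not the scoring algorithm of this study. It assumes that the polar sites are the non-core positions b, c, e, f, and g, that the nth residue at a given site in one peptide corresponds to the nth residue at that site in the other, and it uses invented toy sequences. The functions `site_residues` and `site_identity` are hypothetical.

```python
HEPTAD = "abcdefg"

def site_residues(sequence, start_register):
    """Group residues by heptad position, preserving sequence order."""
    offset = HEPTAD.index(start_register)
    sites = {pos: [] for pos in HEPTAD}
    for i, res in enumerate(sequence):
        sites[HEPTAD[(offset + i) % 7]].append(res)
    return sites

def site_identity(seq_a, reg_a, seq_b, reg_b, positions="bcefg"):
    """Fraction of identical residues at matching heptad sites.

    Assumption: the nth residue at a site in one peptide is compared
    with the nth residue at the same site in the other peptide.
    """
    sites_a = site_residues(seq_a, reg_a)
    sites_b = site_residues(seq_b, reg_b)
    matches = total = 0
    for pos in positions:
        for res_a, res_b in zip(sites_a[pos], sites_b[pos]):
            total += 1
            matches += res_a == res_b
    return matches / total if total else 0.0

# Hypothetical toy peptides that differ at three non-core sites.
vaccine_like = "LKKLEEKLKKLEEK"
nonvaccine_like = "LKQLEEKLRKLEDK"

polar = site_identity(vaccine_like, "a", nonvaccine_like, "a")
core = site_identity(vaccine_like, "a", nonvaccine_like, "a", positions="ad")
print(polar, core)
```

In this toy pair the core a/d sites are fully conserved while the assumed polar sites are not, so the two measures diverge. This is the kind of distinction that overall sequence identity alone cannot capture.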