Machine Learning Prediction and Experimental Validation of Antigenic Drift in H3 Influenza A Viruses in Swine

ABSTRACT The antigenic diversity of influenza A viruses (IAV) circulating in swine challenges the development of effective vaccines, increasing zoonotic threat and pandemic potential. High-throughput sequencing technologies can quantify IAV genetic diversity, but there are no accurate approaches to adequately describe antigenic phenotypes. This study evaluated an ensemble of nonlinear regression models to estimate virus phenotype from genotype. Regression models were trained with a phenotypic data set of pairwise hemagglutination inhibition (HI) assays, using genetic sequence identity and pairwise amino acid mutations as predictor features. The model identified amino acid identity, ranked the relative importance of mutations in the hemagglutinin (HA) protein, and demonstrated good prediction accuracy. Four previously untested IAV strains were selected to experimentally validate model predictions by HI assays. Errors between predicted and measured distances of uncharacterized strains were 0.35, 0.61, 1.69, and 0.13 antigenic units. These empirically trained regression models can be used to estimate antigenic distances between different strains of IAV in swine by using sequence data. By ranking the importance of mutations in the HA, we provide criteria for identifying antigenically advanced IAV strains that may not be controlled by existing vaccines and can inform strain updates to vaccines to better control this pathogen. IMPORTANCE Influenza A viruses (IAV) in swine constitute a major economic burden to an important global agricultural sector, impact food security, and are a public health threat. Despite significant improvement in surveillance for IAV in swine over the past 10 years, sequence data have not been integrated into a systematic vaccine strain selection process for predicting antigenic phenotype and identifying determinants of antigenic drift. 
To overcome this, we developed nonlinear regression models that predict antigenic phenotype from genetic sequence data by training the model on hemagglutination inhibition assay results. We used these models to predict antigenic phenotype for previously uncharacterized IAV, ranked the importance of genetic features for antigenic phenotype, and experimentally validated our predictions. Our model predicted virus antigenic characteristics from genetic sequence data and provides a rapid and accurate method linking genetic sequence data to antigenic characteristics. This approach also provides support for public health by identifying viruses that are antigenically advanced from strains used as pandemic preparedness candidate vaccine viruses.

KEYWORDS antigenic drift, influenza A, machine learning, molecular epidemiology, swine, viral evolution

difficult due to rapid mutation that allows the virus to evade host immune defenses and impacts the efficacy of vaccination programs by antigenic drift (2). The best approach for effective IAV control has been the development of vaccines that reflect the antigenic diversity of circulating swine IAV strains (3). This is dependent on robust sampling and sequencing of contemporary strains, which is currently achieved primarily through passive surveillance, whereby clinically sick pigs are sampled and the hemagglutinin (HA) gene is sequenced and compared to vaccine antigens based on either genetic clade or sequence identity. Vaccines that include a well-matched HA can induce the production of antibodies that may provide sterilizing immunity, help reduce clinical signs, or reduce transmission (4,5). Conversely, mismatched vaccine antigens can result in vaccine failure or potentially cause enhanced disease, emphasizing the importance of careful vaccine strain selection (6).
In the United States, swine IAV is monitored by the U.S. Department of Agriculture (USDA) in collaboration with regional veterinary diagnostic laboratories in the National Animal Health Laboratory Network (7). These data are synthesized primarily using phylogenetic analysis (7,8), but there is no coordinated effort to characterize the phenotypic differences between circulating viruses (9). This contrasts with the approach for human IAV, whereby vaccine antigens are selected through comprehensive genetic and antigenic characterization of seasonally circulating IAV strains (10). Thus, the majority of vaccine antigens in use for IAV in swine are selected based solely on the genetic clade or amino acid identity. This effort is fraught with risk, as there are at least 16 distinct HA genetic clades of IAV in swine derived from multiple human-to-swine interspecies transmission events and subsequent evolution in the swine host (8,11). Further, there is evidence for regional patterns in HA clade persistence (8,12) and as few as six amino acid mutations within the HA may affect the antigenic phenotype of a virus (13,14). Consequently, there is a critical need to not only sequence and genetically characterize swine IAV but also determine what of the genetic diversity is meaningful for antigenic drift.
The antigenic properties of IAV are a manifestation of the structural interaction between IAV and host antibodies (15-18). Structural changes in the HA may alter the interaction with antibodies targeting the virus, and these changes are generally correlated with the number of accumulated amino acid mutations in the HA protein (19). Empirical data have also shown that certain amino acid mutations have a disproportionate effect on antigenic change based on the location of the amino acid in the protein structure (13,15). Though there are relatively few antigenically characterized swine IAV HA genes (9,13), these empirical data may be used to establish antigenic distances between multiple IAVs in swine and to gain insight into the contribution of site-specific amino acid mutations. These data can subsequently be used to predict antigenic drift and to rank the importance of specific amino acid mutations, which clarifies the biological relevance of the genetic diversity sampled during surveillance programs.
In this study, machine learning methods were used to model the antigenic properties of IAV in swine and predict the antigenic distance between different strains using HA sequences. Modeling methods, such as the ones we present, are able to overcome the prohibitive costs and logistical challenges associated with large-scale phenotypic characterization. These data can be used in combination with in-field surveillance platforms (20) as an approach for the early detection of antigenic variants and novel viruses. Additionally, these algorithms can be disseminated to swine practitioners in analytical pipelines (11,20,21) to facilitate the rational design of vaccines that include antigens that will likely protect against the circulating IAV strains. Understanding how genetic diversity relates to antigenic phenotype, and which amino acids within the HA are the most important, can allow for the simulation of the antigenic evolution of swine IAV and support predictions about the persistence and circulation of future IAV strains.

RESULTS
Machine learning model performance. Comparison of the empirical antigenic distances with the values predicted by random forest, AdaBoost decision tree, multilayer perceptron regression, and the ensemble of all three models indicated that the Pearson correlation for all regression models was within a range of 77% to 80% (Table 1). The root mean square error (RMSE) was between 1.21 and 1.60 antigenic units (AU) of error depending on the model. Tenfold cross validation of the random forest, AdaBoost decision tree, multilayer perceptron, and the ensemble of the regression models had RMSEs of 1.56 ± 0.29, 1.59 ± 0.33, 1.76 ± 0.39, and 1.58 ± 0.27, respectively. The leave-one-out cross validation demonstrated that for all models, 25% of predictions had ≤0.5 AU, 50% had ≤1.0 AU, and 75% had ≤1.7 AU distance error. The maximum observed error was 6.3 AU, with each model producing errors of >6.0 AU (Fig. 1).
Mapping antigenic predictions onto phylogenetic trees. Four trees were built with sequences genetically similar to four selected test antigens (Fig. 2). Trees were annotated with an amino acid motif based on positions 145, 155, 156, 158, 159, and 189, as these sites have been found to have a disproportionate effect on the observed
(Table 1, footnote a) Pearson correlation and root mean square error (RMSE) were determined using an 80%/20% split between training and test antigen data. A 10-fold cross validation based on the RMSE was applied. CV, cross validation.
FIG 1 Distribution of errors between antigenic distances predicted by machine learning models and those measured by hemagglutination inhibition assays. Three regression models were used to predict distances from empirically determined antigens using hemagglutination inhibition titers in a leave-one-out approach: random forest regression (rf), AdaBoost decision tree regression (ada), and multilayer perceptron (mlp) regression. All three predictions were combined into an ensemble (ens) to prevent overfitting and to minimize errant predictions by averaging across predictions from all models. Approximately 25% of the data have 0.5 antigenic units (AU) of error or less, and 50% of the data have 1 AU of error or less, with 75% of the data having less than 2 AU of error. Maximum error for outliers exceeded 6 AU.
Empirical validation of the predicted antigenic distances. The predicted ensemble distances of the four selected test antigens were validated via HI assay (Table 2). Test antigen A/swine/Nebraska/A01672826/2017 was predicted to be 0.15 AU from reference strain A/swine/Indiana/A00968373/2012, with 99.4% amino acid identity shared between the HA1 segments of the HA (Table 3). Both the reference and test antigens were from the H3 cluster IVA clade (Fig. 2A), and this pairing represents a near identity and near antigenic distance prediction. The amino acid differences between the reference strain and the test antigen were at M10T and R208I (Table 3). The HI assay demonstrated that the antigenic distance between the reference strain antiserum and test antigen was 0.5 AU (Table 4), with an error between the predicted distance and the empirical distance of 0.35 AU.
Test antigen A/swine/Indiana/A02214844/2017 was predicted to be 3.39 AU from reference strain A/swine/Iowa/A01480656/2014, with 98.5% amino acid identity shared between the HA1 segments. Both the reference strain and test antigen are from the H3 cluster IVA clade (Fig. 2B), and this pairing represents a near identity but far antigenic distance prediction. There were 5 amino acid differences between the reference strain and test antigen (Table 3). The HI assay found a distance of 4.0 AU between the test antigen and reference antiserum and an error of 0.61 AU between empirical and predicted distances (Table 4).
Test antigen A/swine/North Carolina/A01732197/2016 was predicted to be 0.81 AU from reference strain A/swine/Pennsylvania/A01076777/2010, with 93.9% amino acid identity shared between the HA1 segments. The test antigen was selected from the H3 cluster IVA clade, and the reference strain was selected from the H3 cluster IV clade (Fig. 2C); this pair represents a far identity, but the antigen and reference strain were predicted to be antigenically similar. There were 20 amino acid differences between the reference strain and test antigen (Table 3). The HI assay demonstrated an average antigenic distance between reference antiserum and test antigen of 2.5 AU, with a prediction error of 1.69 AU (Table 4).
Test antigen A/swine/Iowa/A01733626/2016 was predicted to be 6.37 AU from reference strain A/swine/Indiana/A01202866/2011, with 91.2% amino acid identity shared between the HA1 segments. The test antigen is from the H3 cluster IVA clade of virus, and the reference strain is from the H3 cluster IVC clade (Fig. 2D). This pairing represents a far identity and far antigenic distance prediction. There were 29 amino acid differences between the reference strain and the test antigen (Table 3). The HI assay demonstrated 6.5 AU between test antigen and reference antiserum, giving an error of 0.13 AU between empirical and predicted distances (Table 4).
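The four reported prediction errors can be checked directly from the predicted and measured distances given above. The following is a minimal Python sketch of that arithmetic; the function name `absolute_errors` is ours and is not part of the published analysis code.

```python
# Predicted (ensemble) and HI-measured antigenic distances in AU,
# copied from the validation results reported in the text.
pairs = {
    "A/swine/Nebraska/A01672826/2017": (0.15, 0.5),
    "A/swine/Indiana/A02214844/2017": (3.39, 4.0),
    "A/swine/North Carolina/A01732197/2016": (0.81, 2.5),
    "A/swine/Iowa/A01733626/2016": (6.37, 6.5),
}


def absolute_errors(pairs):
    """Return {test antigen: |predicted - measured|}, rounded to 2 decimals."""
    return {name: round(abs(pred - meas), 2) for name, (pred, meas) in pairs.items()}


errors = absolute_errors(pairs)
for name, err in errors.items():
    print(f"{name}: {err:.2f} AU")
```

The values reproduce the errors of 0.35, 0.61, 1.69, and 0.13 AU stated in the abstract and results.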
TABLE 3 (partial) Test antigen, reference strain, and HA1 amino acid differences
…/A01076777/2010: T10M, E83K, V106A, S107T, V112I, T117N, N124S, K142S, A163E, L164Q, M168V, N173K, I196V, T203I, P273H, G275D, N276E, K278N, R299K, V304A
A/swine/Iowa/A01733626/2016, A/swine/Indiana/A01202866/2011: I29L, G50R, E83K, S107T, T117N, S124N, A131D, D133G, R137N, S138T, R140K, G144V, N145S, H156K, G158N, H159Y, A163E, L164Q, T167A, N173K, E189K, S193N, V196A, I203V, R220V, R269K, S273H, N276E, R299K
Ranking of predictor features. Random forest regression, one of the regressors composing the ensemble model, ranks user-selected features by a metric of importance, calculated by the decrease in the node variance per tree and normalized across the forest for a single model run so that the sum of importance scores is equal to 1 (22) (Table S2). The highest-ranking features were stable across runs, as they had a consistent decrease in their average variance, although these metrics were susceptible to starting conditions (data provided at https://github.com/flu-crew/antigenic-prediction). The most important feature in predicting the antigenic distance between two strains was amino acid identity within the HA1, accounting for 31.4% of the importance score. Transitions between K and N at position 145 accounted for 8.1% of the model's importance score, and this change was ranked as the most important amino acid mutation. However, transitions between K and S and between N and S at the same position received a much lower ranking (0.2% of the importance score cumulatively), demonstrating that the context of the positional mutation is important. Features I202V and R222W (representing bidirectional mutations) accounted for 5.4% and 5.2% of the importance score, respectively. Each of the remaining features accounted for less than 3% of the importance score (Fig. 3; see Table S2 in the supplemental material), with the next 10 bidirectional mutations in order of importance being H75Q, R137Y, D101Y, E62K, I25L, P289S, D133N, E189K, K92T, and H159Y (Fig. 3). Projecting the cumulative importance of each amino acid position on an H3 crystal structure indicated that position 145, the most important position in the model, is located in the groove of the active site (Fig. 4). Other sites of higher importance in the model were more likely to be observed on the solvent-facing side of the trimer. Amino acid position 202 was an exception, as it was ranked as having high importance but was located on the inside of the trimer.
Of the 728 features included in the model, amino acid identity and the sum of the top 10 amino acid mutation features of the model accounted for 58.3% of the model's importance. The top 100 features, including percent identity within the HA1 and amino acid mutations, accounted for 83% of the calculated importance. The top 253 amino acid mutation features and percent identity accounted for 95% of the calculated importance. The model required 397 features along with percent identity to account for 99% of the calculated importance.
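The cumulative importance figures above amount to sorting the importance scores and counting how many of the top-ranked features are needed to reach a given fraction of the total. A minimal Python sketch of that calculation follows; the function name `features_needed` and the toy scores are hypothetical and are not the values in Table S2.

```python
def features_needed(importances, threshold):
    """Count the top-ranked features needed for cumulative importance >= threshold.

    `importances` are scores assumed to sum to 1, as in a normalized
    random forest importance ranking.
    """
    total = 0.0
    for count, score in enumerate(sorted(importances, reverse=True), start=1):
        total += score
        if total >= threshold - 1e-12:  # small tolerance for float rounding
            return count
    return len(importances)


# Hypothetical importance scores (illustration only).
toy_scores = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]
needed_50 = features_needed(toy_scores, 0.50)
needed_95 = features_needed(toy_scores, 0.95)
print(needed_50, needed_95)
```

Applied to the full ranking, the same count would be expected to reproduce the reported thresholds (for example, the number of features required to reach 95% or 99% of the calculated importance).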

DISCUSSION
In this study, a model was developed to computationally estimate antigenic distances between different IAVs in swine based on amino acid sequence using nonlinear machine learning methods. The method leverages data that were generated from previously characterized IAV strains in swine to train regression models. After in silico validation, the models were used to predict the antigenic distance between paired IAV strains based on amino acid identity and mutations present between each strain. The antigenic predictions were experimentally confirmed by comparing the distances between homologous and heterologous hemagglutination inhibition (HI) titers. Predicting antigenic distances from genetic sequence data can identify strains that require further antigenic characterization, reduce the number of HI assays required to describe circulating antigenic diversity, and aid in the selection of candidate strains for vaccines when genetic diversity surveilled in the field does not have an adequate antigenic match in current vaccine formulations. This work adds to a growing body of literature that aims to quantitatively predict antigenic phenotypes of IAV from the sequence without requiring HI titers for each IAV strain (19, 23-26). To the best of our knowledge, earlier approaches to calculate antigenic distances between IAV strains were trained and tested on human IAV strains, where the HA genes are characterized by phylogenetic trees with a single thick trunk with short interspersed branches with far less cocirculating genetic diversity (27-29). Compared to IAVs circulating in humans, HA gene phylogenetic trees from endemic IAVs in swine demonstrate multiple genetic clades within the same subtype that are derived from multiple human-to-swine spillover events across the last 100 years (7,30). The large genetic diversity of strains coevolving within the swine population has resulted in a similarly large breadth of antigenic diversity and evolution. 
Consequently, a broad range of HI assays including many genetically different IAVs are needed to assess the antigenic diversity of IAVs circulating within swine. The scale of these studies has been difficult, and there is a sparsity of antigenic characterization of IAV in swine, frequently with large gaps of time between characterizations. This has the unfortunate consequence of potentially misrepresenting the antigenic diversity of swine IAVs and can make it difficult to improve our understanding of antigenic evolution of IAV in swine (19,26,31).
We experimentally validated our model using four test antigens, with the empirical data demonstrating that predictions generally had an error of less than 1 AU. These four strains were selected to represent the full spectrum of observed diversity within the H3 cluster IV genetic clade. Our model performed very well on sequences with high sequence identity that were predicted to be antigenically similar (near identity/ near distance = 0.5 AU, with 0.35 AU error) (Tables 2 and 4). Similarly, the model performed well when making predictions on sequences that were genetically similar but predicted to be antigenically distinct (near identity/far distance = 4 AU, with 0.61 AU error) (Tables 2 and 4) and those that were very genetically different and predicted to be antigenically distinct (far identity/far distance = 6.37 AU, with 0.13 AU error) (Tables 2 and 4). On sequences that were genetically dissimilar but were predicted to be antigenically similar, the model had a nonnegligible error (1.69 AU); however, the ensemble prediction was able to discern that these two strains were more antigenically similar than would be predicted based on sequence similarity alone. The large error in this prediction, despite all features being accounted for in the model (see Table S2 in the supplemental material), suggests limitations in our approach. We parameterized the model with empirical data, and sequences that fit the "far identity/near antigenic distance" are very sparse in the training set, resulting in a higher prediction error. As new empirical data are generated, they can be used to refine and improve the model. This point is also valid for the "near identity/far antigenic distance" predictions that were parameterized by a small number of empirical observations. It should be noted that the HI assay is a discrete measure whereas the prediction is continuous, and thus an error of less than 1 AU is not biologically meaningful. 
Additionally, because of the discrete nature of the HI assay, a 0.5 AU error is negligible, as the true antigenic distance is somewhere between 0 and 1 AU. Consequently, our approach, which was developed using a relatively small empirical data set of IAV in swine, made predictions that are useful in biological applications.
An additional benefit of machine learning methods is that they can assign an importance score to the position and context of amino acid mutations, allowing biological interpretation. This importance score is calculated by the decrease in the node variance after fitting the random forest model. While sequence amino acid difference had the highest importance score, further assessment of the model revealed that both the position and the context of the amino acid mutation contributed to the observed antigenic phenotype. An example of this dynamic was H3 HA position 145, where a mutation between K and N bidirectionally was ranked as the most important amino acid mutation feature. Other observed mutations at position 145 between K and S and N and S were less important, matching the biological nuances that have been observed with empirical testing and other computational predictions (15,24). Earlier literature has suggested that conservation of biochemical properties of the amino acid mutation may also have some effect on the observed antigenic change (15,19). Sites other than these were identified as important in determining phenotype and were located on the solvent-exposed surface of the HA protein and in antibody epitopes (Fig. 4) (32, 33).
The positions in our model demonstrated overlap with those of a human IAV machine learning algorithm (23), the joint random forest regression (JRFR) algorithm (positions 62, 121, 131, 133, 135, 137, 142, 144, 145, 155, 156, 158, 159, 172, 173, 189, 193, 196, and 276) (Table S2), but the relative importance of the predictor features varied between this model and ours. Specifically, position 189 was the most important site in human H3 with ferret antisera, whereas our model identified position 145 as the most important position in swine H3 with swine sera (23). These differences are likely to be reflective of host-specific interactions, and there is evidence that the source of antisera may impact HI results (34). Additionally, our importance ranking demonstrated that a relatively small number of sites had a disproportionate importance for the phenotype (Fig. 3). Consequently, these data suggest that incorporating the identity of amino acid mutation alongside sequence homology will help improve vaccine antigen selection, as this likely has a critical influence on antigen-antibody interactions.
There are other in silico approaches that link genetic sequence data to antigenic phenotype. Using 10-fold cross validation, our ensemble model had a higher RMSE (1.21 AU) than JRFR, a random forest-based model that consistently has an RMSE of <1.0 (23). Similarly, the linear mixed-effects model employed by Harvey had very strong performance (mean absolute error, 0.75 U) (26). However, a direct comparison between these and similar methods used in human IAV with our approach is difficult because of the major differences between extensive training data sets and our own and the observed genetic diversity of swine IAVs with multiple cocirculating lineages (26). Our approach does have utility, as the robust leave-one-out cross validation demonstrated that 54% of the predictions made with the ensemble model were at or below 1 AU of error, and 86% were below 2 AU of error (a distance of <2 AU is frequently used to indicate biological equivalence), and we were able to experimentally validate our in silico predictions with strains that represented the full spectrum of genetic diversity in H3 cluster IV swine IAVs.
Our ensemble of nonlinear regression methods was chosen due to a nonlinear relationship that is not strictly additive between amino acid changes and antigenic phenotype. The nonlinear regression techniques used are robust against collinearity, and the tree methods have the benefit of ranking the contribution of each feature to the predictive power of the model, designated through an importance score (22,35). These data can subsequently be used to inform in vitro or in vivo studies that determine molecular features associated with antibody recognition and drift (14). Several earlier methods implement linear regression, despite the relationship between amino acid mutation and antigenic phenotype being nonlinear and not strictly additive (19,25). Linear models can mitigate issues of collinearity by implementing approaches such as ridge regression in antigen bridges (24) or lasso regression used by Nextstrain (19,31), but these approaches may result in models that are more difficult to interpret biologically. Consequently, our empirically validated models, although not as computationally accurate, performed in a biologically meaningful manner and were also able to identify the top 10 features accounting for 58.3% of the antigenic phenotype (253 features were needed to account for 95% importance). These data have now generated explicit predictions on when specific mutations in the HA gene may result in antigenic drift and reduce vaccine efficacy.
Our experimental validation using test antigen and reference strains demonstrated that this approach can be used to determine antigenic differences between IAVs without requiring extensive HI testing in laboratories. It is currently impractical to antigenically characterize all strains of IAV isolated from swine, and our work shows that antigenic phenotype can be reasonably predicted from genetic sequence. The performance of our approach was sufficient even though it was parameterized with a limited empirical data set; it is feasible that prediction can be improved as more empirical data are made available. Due to multiple introductions of IAV into swine from human and avian sources, the genetic diversity of IAV in swine exceeds what is observed for human IAV strains (11,30,36). The genetic diversity of IAV in swine is also confounded by transportation patterns that move regional IAV strains with swine to new geographic locations, where additional antigenic drift and reassortment with endemic strains may occur (37,38). Consequently, this method can aid in vaccine design efforts for IAV in swine, which currently do not have an integrated and comprehensive system such as the World Health Organization's (WHO) global influenza surveillance program for IAV in humans (39). Providing accurate methods such as ours that predict antigenic distances of IAV in swine increases the ability of swine producers and veterinarians to make informed decisions regarding vaccine antigens to help maintain swine herd health.

MATERIALS AND METHODS
Swine IAV H3 antigenic reference data set. The antigenic properties of two influenza viruses can be quantitatively compared using a hemagglutination inhibition (HI) assay. The assay is based on the ability of the hemagglutinin to agglutinate red blood cells, which express sialic acid on their cell surface (40,41). The HI antibodies raised against a homologous IAV can block the agglutination of red blood cells, even at low concentrations. Genetically different viruses often need a higher concentration of HI antibodies to prevent agglutination than the homologous titer. The antigenic distance between two viruses is calculated as $D_{ij} = \log_2(H_{jj}) - \log_2(H_{ij})$, where each unit represents a 2-fold loss in HI antibody cross-reactivity between the homologous and heterologous HI antibody titers (42) ($H_{ij}$ represents the titer between heterologous serum i and antigen j, and $H_{jj}$ represents a homologous titer). These data have traditionally been used to generate pairwise antigenic distances between IAVs in swine that are then visualized using multidimensional scaling to form an antigenic map (9,43,44).
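As an illustration of the distance defined above, the following Python sketch converts a homologous and a heterologous HI titer into antigenic units. The function name and the example titers are hypothetical; the ratio form used in the code is algebraically identical to the difference of the two log2 terms.

```python
import math


def antigenic_distance(homologous_titer, heterologous_titer):
    """Antigenic distance in AU: log2(H_jj) - log2(H_ij).

    Computed as log2(H_jj / H_ij), which is the same quantity.
    One unit corresponds to a 2-fold drop in HI titer.
    """
    if homologous_titer <= 0 or heterologous_titer <= 0:
        raise ValueError("HI titers must be positive")
    return math.log2(homologous_titer / heterologous_titer)


# Hypothetical titers: homologous 1280, heterologous 160 (an 8-fold drop).
distance = antigenic_distance(1280, 160)
print(distance)  # 3.0
```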
The HI titers were collected from prior swine H3 HA virus characterization studies that used HI assays (41,45,46). To expand the data set, HI titers for new IAVs selected as reference strains were collected at the time of the experiment using methods described in earlier literature, for a total of 128 reference antigens tested against 47 reference antisera in various combinations from combined experiments (40). Distances between available HI titers were calculated by subtracting the log2 of the heterologous titer from the log2 of the homologous titer (42). Distances corresponding to the same antigen-antiserum pair were calculated as the log2 of the geometric mean by the following equation, where the subscripts 1 and 2 index the repeated measurements: $$D_{ij} = \log_2 \sqrt{\frac{H_{jj,1}\,H_{jj,2}}{H_{ij,1}\,H_{ij,2}}}$$
Training and validation of machine learning regression models. Full-length HA amino acid sequences for each antigen represented in the data set were aligned using MAFFT v7.311 (47) and then trimmed to the HA1 domain (amino acids 1 to 328 using the H3 HA numbering with the signal peptide removed) for subsequent analyses. Percent amino acid difference (100% − amino acid identity) was calculated between each HA pair for all combinations of sequences. Specific amino acid substitutions were not weighted, both to minimize model assumptions and because prior research in human IAV has suggested that weighting may add noise to analysis (23,48). All observed site-specific amino acid substitutions in the reference data were identified and treated as bidirectional.
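The feature construction described above can be sketched as follows. This is a minimal illustration in Python, not the published pipeline: the function names, the label format (residues sorted alphabetically so that both directions of a substitution share one label), and the toy sequences are our assumptions, and alignment gaps are not handled.

```python
def pairwise_features(seq_a, seq_b):
    """Return (percent amino acid difference, set of bidirectional substitution labels).

    Sequences are assumed to be aligned, equal-length HA1 strings with
    position 1 as the first residue. K145N and N145K map to the same label.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    labels = set()
    for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a != b:
            first, second = sorted((a, b))  # order-independent label
            labels.add(f"{first}{pos}{second}")
    percent_difference = 100.0 * len(labels) / len(seq_a)
    return percent_difference, labels


def encode(percent_difference, labels, vocabulary):
    """Build one feature row: continuous difference plus binary mutation flags."""
    return [percent_difference] + [1 if name in labels else 0 for name in vocabulary]


# Toy 10-residue sequences (not real HA1 sequences).
seq_a = "MKTNIALSYI"
seq_b = "MKTKIALSHI"
diff, labels = pairwise_features(seq_a, seq_b)
row = encode(diff, labels, ["H9Y", "K145N", "K4N"])
print(diff, sorted(labels), row)
```

Sorting the two residues is one simple way to make a substitution bidirectional; any convention works as long as it is applied consistently to training and prediction data.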
The regression model data were constructed with the antigenic distance calculated from the HI titer as the training value, with the percent amino acid difference as a continuous predictor feature and site-specific mutations as binary predictor features. Three different machine learning regression models were trained using scikit-learn (49): random forest, AdaBoost decision tree, and multilayer perceptron. For each regression model, hyperparameters were tuned using a random search optimization (see Table S1 in the supplemental material). A fourth regression model was created by averaging the three prior machine learning model predictors and is referred to as the ensemble model.
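A minimal sketch of such an averaged ensemble in scikit-learn is shown below. It assumes NumPy and scikit-learn are installed; the hyperparameters are placeholders rather than the tuned values in Table S1, the data are synthetic, and the helper names `fit_ensemble` and `predict_ensemble` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor


def fit_ensemble(X, y, seed=0):
    """Fit the three regressors; hyperparameters are placeholders, not Table S1."""
    models = [
        RandomForestRegressor(n_estimators=100, random_state=seed),
        AdaBoostRegressor(random_state=seed),  # default base learner is a decision tree
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed),
    ]
    for model in models:
        model.fit(X, y)
    return models


def predict_ensemble(models, X):
    """Average the per-model predictions (unweighted mean)."""
    return np.mean([model.predict(X) for model in models], axis=0)


# Synthetic stand-in data: column 0 mimics percent amino acid difference,
# the remaining columns mimic binary site-specific mutation flags.
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(0, 10, size=(60, 1)), rng.integers(0, 2, size=(60, 5))])
y = 0.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.2, size=60)

models = fit_ensemble(X[:48], y[:48])             # 80% of rows for training
ensemble_pred = predict_ensemble(models, X[48:])  # 20% of rows held out
print(ensemble_pred.shape)
```

An unweighted mean is used here because the text describes the ensemble as an average of the three model predictions; scikit-learn's `VotingRegressor` offers an equivalent built-in alternative.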
Data were split into 80% training and 20% testing data groups to calculate the Pearson correlation and root mean square error. Additionally, 10-fold cross validation was used to assess the root mean square error (Table 1). Given the sparsity of antigenic data available, a leave-one-out cross validation approach was employed to generate a distribution of prediction errors for each model (Fig. 1). Each antigen included in the training set (n = 128) was iteratively excluded from the training set, and distances were predicted by using each of the four regression models. The error was calculated as the absolute value of the difference between the predicted distance and the empirical distance.
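The leave-one-antigen-out procedure can be sketched with scikit-learn's `LeaveOneGroupOut`, grouping antigen/antiserum pairs by their test antigen so that all pairs for one antigen are withheld together. The data below are synthetic, only a random forest is shown, and grouping by test antigen alone is an assumption of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)

# Synthetic antigen/antiserum pairs: 12 antigens with 5 pairs each.
n_antigens, pairs_per_antigen = 12, 5
antigen_ids = np.repeat(np.arange(n_antigens), pairs_per_antigen)
X = rng.uniform(0, 12, size=(antigen_ids.size, 4))
y = 0.3 * X[:, 0] + rng.normal(0, 0.2, antigen_ids.size)

abs_errors = np.empty_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=antigen_ids):
    # Refit from scratch with the held-out antigen's pairs removed.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Error = |predicted distance - empirical distance| for the held-out pairs.
    abs_errors[test_idx] = np.abs(model.predict(X[test_idx]) - y[test_idx])

rmse = float(np.sqrt(np.mean(abs_errors ** 2)))
print(f"median absolute error: {np.median(abs_errors):.2f} AU, RMSE: {rmse:.2f} AU")
```

Collecting the per-pair absolute errors in one array gives the error distribution for a model; repeating the loop for each regressor would allow the distributions to be compared.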
Mapping antigenic predictions onto phylogenetic trees. Maximum-likelihood phylogenetic trees were created to assess the antigenic distance predictions for sequences genetically similar to the test antigen sequence, relative to the reference sequence. Sequences were aligned using MAFFT v7.311 (47), and phylogenetic trees were inferred using FastTree v2.1.10 (50). Trees were annotated using FigTree v1.4.3 (51), with each tree rooted to a reference strain and sorted in ascending order relative to the inferred evolutionary relationship. Each tip within the tree was color coded based on the antigenic motif designated by H3 numbering of positions 145, 155, 156, 158, 159, and 189, as earlier work had identified these sites as significant for antigenic phenotype (15). Branches were annotated with the ensemble-predicted antigenic distance relative to the root. Trees were pruned to 30 leaves to facilitate viewing.
Determining the relative importance of genetic mutations. Random forest regression models provide a natural ranking system of feature importance (22,35). The importance of each predictor feature was calculated by the decrease in the node variance after fitting the random forest model. The feature importance rankings for the random forest regression model were analyzed to assess the biological importance of observed mutations in the swine H3 antigenic reference data set. The significance of each amino acid position in the HA was determined by summing the mutation-based features grouped by the position they represented. The resultant significance of each amino acid was projected onto a protein model of a human H3 HA gene from strain A/Victoria/361/2011 obtained from the Research Collaboratory for Structural Bioinformatics (4O5N) (52).
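The per-position aggregation described above can be sketched as follows. In scikit-learn, a fitted `RandomForestRegressor` exposes impurity-based importances through its `feature_importances_` attribute; here a small hand-written list stands in for that array. The feature labels (position, colon, residue pair) and the importance values are hypothetical and chosen only to show the summation.

```python
from collections import defaultdict


def importance_by_position(feature_names, importances):
    """Sum mutation-feature importances by HA position, highest total first."""
    totals = defaultdict(float)
    for name, value in zip(feature_names, importances):
        position, _, _ = name.partition(":")
        if position.isdigit():  # skip non-site features such as percent difference
            totals[int(position)] += value
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical labels and importances (not results from the study).
names = ["pct_diff", "145:K/N", "145:N/S", "159:N/Y", "189:K/R"]
values = [0.40, 0.20, 0.10, 0.25, 0.05]

ranked = importance_by_position(names, values)
for position, total in ranked.items():
    print(f"position {position}: {total:.2f}")
```

With these made-up numbers, position 145 ranks first because its two substitution features are summed, which is the grouping step that turns mutation-level importances into a per-residue score suitable for projection onto a structure.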
Empirical validation of machine learning regression models. The H3 HA amino acid sequences of uncharacterized IAVs in swine submitted to NCBI GenBank from the Iowa State University Veterinary Diagnostic Lab from January 2016 to August 2018 were collected and clustered by phylogenetic clade (7,11). The HA gene sequences were trimmed to the HA1 domain (positions 1 to 328 using H3 numbering with the signal peptide removed). The HA1 sequences were compared against all antigenically characterized sequences to calculate percent amino acid difference and to compare the presence or absence of site-specific amino acid mutations. Site-specific amino acid mutations absent from the training set were not considered in additional analyses. The antigenic distance from each uncharacterized HA gene to each reference antigen was predicted using the previously described four trained regression models.
Four contemporary IAVs were selected as test antigens to be antigenically characterized with in vitro HI assays to validate the regression models by using their HA genes. We selected these HA genes from within the H3 cluster IVA genetic clade, since (i) this is a significant genetic clade that is frequently detected in diagnostic submissions to the Iowa State University Veterinary Diagnostic Lab (11), (ii) this genetic clade was responsible for more than 300 zoonotic infections from 2012 to present, and (iii) there was a significant amount of uncharacterized data for this clade within the last 2 years (n = 299 from 2018 to present, representing 8% of sequenced HA genes). Since the ensemble predictions demonstrated the least error in the analyses above, antigenic distances of 106 H3 cluster IVA viruses were predicted against a panel of 44 available antisera using this model. We selected four test antigen/antiserum prediction pairs within this genetic clade based on the following criteria: near amino acid sequence identity (≥98%) and near predicted ensemble antigenic distance measured in antigenic units (AU) (≤2 AU); near identity and far antigenic distance (≥3 AU); far identity (≤95%, ≥90%) and near antigenic distance (≤2 AU); or far identity (≤95%, ≥90%) and far antigenic distance (≥3 AU) (Fig. 2; Table 4).
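The four selection categories can be expressed as a small classification function, sketched below. The thresholds are those stated above; the function name, the category labels, and the example values are assumptions for illustration, and pairs falling between the thresholds are simply reported as unclassified.

```python
def selection_category(pct_identity, predicted_distance_au):
    """Return the selection category for a test antigen/antiserum pair, or None.

    Identity: near is >= 98%, far is 90% to 95% inclusive.
    Predicted ensemble distance: near is <= 2 AU, far is >= 3 AU.
    """
    if pct_identity >= 98:
        identity = "near identity"
    elif 90 <= pct_identity <= 95:
        identity = "far identity"
    else:
        return None

    if predicted_distance_au <= 2:
        distance = "near distance"
    elif predicted_distance_au >= 3:
        distance = "far distance"
    else:
        return None

    return f"{identity}, {distance}"


# Hypothetical candidate pairs: (percent identity, predicted distance in AU).
print(selection_category(98.5, 1.2))  # near identity, near distance
print(selection_category(93.0, 3.4))  # far identity, far distance
print(selection_category(96.5, 1.0))  # None
```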
The four selected antigen/antiserum pairs were tested in parallel with antigens homologous to the antisera via HI assay. HI assays were conducted as previously described (41), with empirical distances calculated by subtracting the log2 of the heterologous titer from the log2 of the homologous titer. Empirical distances were compared against predicted values by subtraction.
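The distance arithmetic can be illustrated with a short sketch. The titers and the predicted value below are made up for the example, and the function names are hypothetical. The replicate helper averages log2 titers, which is equivalent to taking the log2 of the geometric mean, in line with the treatment of repeated antigen-antiserum pairs described for the HI data.

```python
import math


def antigenic_distance(homologous_titer, heterologous_titer):
    """log2(homologous titer) minus log2(heterologous titer), in antigenic units."""
    return math.log2(homologous_titer) - math.log2(heterologous_titer)


def replicate_distance(homologous_titers, heterologous_titers):
    """Distance from repeated titers, using the mean of log2 values.

    The mean of log2 titers equals the log2 of their geometric mean.
    """
    hom = sum(math.log2(t) for t in homologous_titers) / len(homologous_titers)
    het = sum(math.log2(t) for t in heterologous_titers) / len(heterologous_titers)
    return hom - het


# Made-up titers: an 8-fold drop corresponds to 3 antigenic units.
empirical = antigenic_distance(1280, 160)
print(f"{empirical:.2f}")                                     # 3.00
print(f"{replicate_distance([1280, 640], [160, 160]):.2f}")   # 2.50

# Comparison with a hypothetical predicted distance, by subtraction.
predicted = 2.65
print(f"{abs(predicted - empirical):.2f}")                    # 0.35
```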
Data availability. Data and code used in this research are available in a GitHub repository (https://github.com/flu-crew/antigenic-prediction).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
We gratefully acknowledge pork producers, swine veterinarians, and laboratories for participating in the USDA Influenza A Virus in Swine Surveillance System and publicly sharing sequences in NCBI GenBank.