Comparative evolution of influenza A virus H1 and H3 head and stalk domains across host species

ABSTRACT The influenza A virus hemagglutinin protein (HA) has been studied extensively in humans but less so in other host species. Here, we compared the rates of nucleotide substitution, protein evolution, and glycosylation in the H1 and H3 head and stalk domains among five host classes (avian, canine, equine, human, and swine). For further resolution, we separately analyzed 49 lineages differentiated by host and geography. HA evolution in non-human hosts was more dynamic than expected. For example, the classical swine H1 head domain accumulated glycosylation sites in the antigenically important head domain at a rate as high as in humans. However, whereas the stalk domain was highly conserved in humans, with a low ratio of non-synonymous to synonymous changes, this ratio was approximately 2.5-fold higher in canines. Together, these findings highlight the need for further study of HA evolution in non-human hosts. IMPORTANCE For decades, researchers have studied the rapid evolution of influenza A viruses for vaccine design and as a useful model system for the study of host/parasite evolution. By performing an exhaustive analysis of hemagglutinin protein (HA) sequences from 49 lineages independently evolving in birds, swine, canines, equines, and humans over the last century, our work uncovers surprising features of HA evolution. In particular, the canine H3 stalk, unlike human H3 and H1 stalk domains, is not evolving slowly, suggesting that evolution in the stalk domain is not universally constrained across all host species. Therefore, a broader multi-host perspective on HA evolution may be useful during the evaluation and design of stalk-targeted vaccine candidates.

Spillover of avian viruses to humans caused the 1918 H1N1 (21) and 1968 H3N2 pandemics (22).Avian viruses (H3N8) also became established in equines between the 1950s and 1960s (23).Human viruses have been transmitted dozens of times to pigs, establishing several independently evolving H1N1, H1N2, and H3N2 swine lineages on multiple continents (24).In the early 2000s, two canine lineages emerged independently in the United States and in Asia through introductions from horses (H3N8) and birds (H3N2), respectively (8,25).More recently, a H1N1 virus jumped from swine to humans and caused the 2009 H1N1 pandemic (26), with subsequent reverse zoonosis events back to pigs on multiple continents (27).
Mutations occur frequently during IAV replication by an error-prone RNA polymerase.The rate of synonymous substitution varies by host species, presumably due largely to differences in mutation rate and transmission modes (28).This is affected by host-spe cific factors, including the number of viral replications per unit time and biochemical differences affecting replication fidelity.The rate of non-synonymous evolution, relative to the mutation rate, is determined by the forces of natural selection.Non-synonymous mutations, which change the amino acid encoded, are frequently deleterious but may be adaptive.For example, they may diminish binding by existing antibodies (29) or improve binding to host cell receptors (30), particularly after a host switch.Research on influenza evolution and adaptation has focused on the HA, the membrane glycoprotein present on the surface of the virus which is responsible for receptor binding and membrane fusion and which is under constant selection due to the host immune response.
Each subunit of the HA homotrimer consists of two domains.The immunodominant head domain is the principal target of antibody-mediated immunity and evolves more rapidly than the stalk domain, which is relatively conserved within and among sub types (31)(32)(33).Mutations in the HA head domain can help the virus evade antibodies, increasing its ability to infect individuals with immunity elicited by earlier strains.Some HA mutations create or eliminate glycosylation sites (34).Although glycosylations can shield HA from antibodies, they can also interfere with receptor binding and are targets of innate immunity (35,36).Due to the different strengths of immune selection in different host species, the optimal number of glycosylations is expected to vary among them (37).
Human influenza vaccine strains are updated twice a year to reflect antigenic changes in viruses circulating globally.However, due to limitations of the global surveillance, in the influenza vaccine-manufacturing process, and in the ability to predict the dominant strains 6 months to 1 year in the future, vaccine strains are periodically mismatched to viruses in circulation, leading to reduced vaccine effectiveness (38)(39)(40).Vaccines are also available for swine (41,42), canines (43), equines (44)(45)(46), and poultry (47), although there is no systematic surveillance to monitor their effectiveness nor standard procedure for updating strains, limiting their effectiveness and utilization (48,49).Utilization of poultry vaccines is further reduced in the United States, European Union, and other exporters because international trade regulations prevent the export of products from vaccinated chickens.Studies of antigenic evolution rely on serological assays [hemagglu tination inhibition (HI) or virus neutralization assays].However, the assays are not always well standardized or reproducible across laboratories.Aside from US swine, there is no systematic antigenic evaluation of non-human influenza, which impedes comparison of rates of protein and antigenic evolution of influenza in non-human hosts.Although seasonal influenza vaccines rely primarily on antibodies generated against the HA head, novel vaccine approaches aim to provide broader, longer lasting protection (50) by targeting more conserved regions, including the HA stalk domain, which has not been studied as intensively to date.
Here, we compare the evolutionary dynamics of H1 and H3 head and stalk domains in humans, swine, birds, equines, and canines, including rates of synonymous and nonsynonymous change, the locations of changes in three-dimensional structures, and rates and patterns of gains and losses of glycosylation sites.Our results suggest that the head domain, which evolves rapidly in humans due to selection by the immune system, also does so in other mammals, though not to the same extent.Particularly, H3 evolution within the canine clades and along the host-jump branches leading to them reveals several phenomena that appear to reflect adaptation of the protein to canine hosts.

Evolutionary dynamics of influenza A/H1 and A/H3 Across Host Species
A combination of Bayesian mixed effects molecular clock modeling and stochastic substitution mapping was used to estimate the rates of non-synonymous and synony mous change in several sets of positions in all lineages described above.Consistent with previous studies (33), dN/dS is higher in the head domain than the stalk domain in all hosts (Fig. 2).In both H3 and H1, the highest dN/dS in the head domain are found in the human viruses (Fig. 2; Table S10 to S13).The next highest in H3 is found in canine viruses, which unexpectedly have the highest value in the stalk.Equine viruses also have a high value for the head compared with swine and avian viruses.The dN/dS estimates tend to be lower in swine and avian viruses of either subtype.In human viruses, dN/dS is higher in H3 than H1 in both domains.

Evolution of the H3N2 and H3N8 canine lineages
We sought to investigate the high dN/dS value associated with the stalk of canine viruses.They form two clades in the H3 phylogeny.One consists of H3N8 viruses and was introduced into canines from equines in the 2000s in the United States (canine USA H3N8 lineage).The second consists of H3N2 viruses and was introduced into canines from Eurasian birds in the 2000s in Asia (canine Asia H3N2 lineage) and subsequently disseminated to the United States in around 2015, establishing the canine USA H3N2 lineage.The existence of two independently derived canine clades allows confirmation that some evolutionary patterns are related to canine hosts, rather than being idiosyn cratic features of a sequence type.
A total of 14 amino acid changes were inferred to occur on the branches containing the host-switches to canines from avians or equines (Fig. 3).
On the host-switch branch leading to the canine USA H3N8 lineage, there are five non-synonymous changes and one synonymous change.The 5:1 non-synonymous:syn onymous (N:S) ratio on this branch is statistically distinguishable from the ratio in the ancestral equine lineage (366:537 = 0.68:1; P = 0.044, Fisher's exact test) but not from the ratio in the derived canine clade (101:111 = 0.91:1; P = 0.11).Some non-synonymous changes occurred in sites with known functional importance.One of the substitutions (W222L) is located adjacent to the sialic acid binding site.I328T is a reversal of an equine-specific change: the equine sequences have mostly I and occasionally L (6%), but 97% of other H3 viruses have T, and none has I or L. This position immediately precedes Arg329, after which proteolytic cleavage splits HA into HA1 and HA2 (the peptide bond between 329 and 330).The N483T change eliminates a glycosylation site that was present in the equine lineage (confirmed by crystal structures, Protein Data Bank (PDB) entries 4uo0 and 4unw) (Fig. 3).Independent loss of this site occurred in a close equine relative of the ancestor of the canine USA H3N8 virus within about 3 years [according to the maximum clade credibility (MCC) tree, 95% highest posterior density (HPD) branch lengths] and possibly simultaneously (Fig. S11).The parallel losses are clearly distinguishable: that on the host-shift branch results from a second-position change of codon 483 from AAT (asparagine) to ACT (threonine), while the other results from a third-position change to AAG (lysine).This conclusion comes from the maximum parsimony reconstruction but is also supported by the sequences of the earliest canine USA H3N8 isolate and the affected equine isolate, which bear ACT and AAG, respectively.
The canine Asia H3N2 host-switch branch contains 9 non-synonymous and 19 synonymous changes.The 0.47:1 N:S ratio is intermediate between that in the progenitor avian Eurasia lineage (472:2,217 = 0.21:1) and that in the derived canine H3N2 clade (147:157 = 0.94).However, it is not statistically distinguishable from either due to the low absolute counts.The canine clade and the avian progenitor are clearly distinguishable from each other (P = 2e −30 , Fisher's exact test).
No changes at the same amino acid positions occur on the two host-shift branches, but in some cases (indicated by red in Fig. 3), the amino acid resulting from a change in one is identical to that inherited at the same position by the other canine clade.In two additional cases (indicated by orange), a change results in a serine at a position where the other last common ancestor (LCA) has the chemically similar threonine or vice versa.In both cases, the observed change is due to a transition mutation, whereas a change to the amino acid present in the other LCA (threonine or serine, respectively) would require a transversion of a type that is much less frequent, as determined by analysis of fourfold-degenerate synonymous sites.
A total of 21.4% of the amino acid changes on these branches result in changes of glycosylation state.This is larger than the fraction (157/3,755 or 4.1%) for the remainder of the tree (P = 0.019, Fisher's exact test).A similar conclusion holds for the ratio of glycosylation changes to synonymous changes (P = 0.0070, Fisher's exact test).
The non-canine ancestors of both canine clades bear two glycosylation sites near the membrane-proximal portion of HA, at positions 8 and 483 (HA2 154).One of these sites is lost in the genesis of the canine H3N2 clade (8), and the other is lost in the genesis of canine USA H3N8 (483).Canine USA H3N8 inherits a glycosylation site at position 63, which is absent from H3N2, and the canine H3N2 clade gains a glycosylation at 81.These sites lie at nearly the same position along the HA axis and may cover similar regions of the protein's surface (Fig. S12).Additional evidence for their interchangeability comes from human H3N2, in which a gain of glycosylation at 63 is quickly followed by a loss of glycosylation at 81 in the "trunk" of the tree.

Changes within canine clades
Figure 4 shows the distribution of amino acid changes within the canine clades across the protein sequence.The variance of these counts is too large to be explained, except by extremely improbable occurrences, unless the rate at some of the sites exceeds the neutral expectation substantially.It would be explained if the counts for the seven sites with at least four changes reflected high rates, implying positive selection, rather than chance events.Unless rates at the remaining sites were mostly close to the neutral expectation or to zero, more sites would have to be subject to positive selection.Positive selection on a site might be obscured by purifying or weak selection in other parts of lineages, as appears to be the case for these changes (see below).
The locations of these seven positions in hemagglutinin crystal structures appear to lie disproportionately at interfaces between protein chains.This impression is confirmed by a quantitative analysis.We define interface residues as those for which any atom is within 5 angstroms of an atom of a different protein chain.This includes the interface between HA1 and HA2 chains derived from the same full-length HA.The set of positions meeting this criterion varies among HA crystal structures, presumably in part because of differences in the protein sequences, and also between monomers not related by crystallographic symmetry.We consider four structures: one from each canine clade (PDB entries 4uo4 for canine USA H3N8 and 6n4f for canine Asia/USA H3N2) and two of equine hemagglutinins (4uo0 and 4unw), which retain some ancestral residues that have changed in the canine USA H3N8 viruses.
Of the seven rapidly changing positions, six (all but 464) are interface positions in all monomers of all four structures.All seven are at interfaces in at least one monomer of these structures.The corresponding fractions for all positions in the protein are ~35.5% (interface in all monomers of all structures) and ~42.9% (interface in at least one).Thus, the probability that at least six of seven positions chosen at random lie at interfaces in all structures is 1.0%, and the probability that all seven do in at least some cases is less than 0.3%.This result is not due to more rapid evolution of interface residues in general.In fact, in non-canine lineages, interface positions evolve more slowly on average than non-interface positions by nearly a factor of 2. The C7 positions are also not particularly rapidly evolving in other lineages.Thus, the over-representation of interface residues among the rapidly evolving sites is even more striking than that suggested by the probabilities given above.
This phenomenon is not restricted to the seven most rapidly evolving sites.At the remaining interface positions, the average number of changes (0.44) is larger than that at non-interface positions (0.37) (P = 0.041, randomization test).In non-canine lineages, the reverse is true.The ratio of interface changes to non-interface changes, excluding the C7 sites, is higher in canines by a factor of 2.6 [P = 1e −10 , 95% CI = (1.9,3.5), Fisher's exact test].
The structures demonstrate that changes at position 135 of HA2 (464 of HA), the only C7 position that is non-interface in some of the structures, can have large effects on inter-subunit interfaces.Changes at this position in canines are restricted to the equine-derived H3N8 clade.Both equine structures have the ancestral glycine at this position.It lies within 5 angstroms of the associated HA1 and within 5.5 angstroms of the Arg at 124 (another C7 position) of another HA2 chain.The canine USA H3N8 structure, in contrast, has an aspartate at position 135, and it is further than 5 angstroms from any other chain.The backbone configuration surrounding this position is altered, and the positioning of several nearby residues is affected.Likely as a consequence, an inter-chain bidentate salt bridge between Glu131 and Arg163, found in both equine structures, does not form in the canine structure.Changes of Arg124 will also eliminate this salt bridge.The ancestral glycine at position 218 has backbone torsion angles, corresponding to right-handed beta-strand, that are "forbidden" for non-glycine residues, which suggests that changes there will disrupt the structure.At five of the seven positions, the ancestral residue is inferred to have changed to several different residue types, which is more consistent with disruption of favorable interactions than creation of new ones.

Synonymous rates of H1 and H3 across host species
Given the unique mechanisms that led to the emergence of the canine lineages and the high dN/dS estimate associated with their stalk domain, we sought to investigate how canine viruses evolved in the context of other human and animal lineages.Absolute rates of sequence change can be assessed because BEAST estimates branch lengths in units of time, which is possible because of the different isolation dates of the viruses.The absolute rates of synonymous change (substitutions/site/year) observed in H3 viruses in swine, human, and avian hosts were higher than those for the equine and canine lineages (Fig. 5).As expected, they were similar across codon subsets (Fig. S2).They were generally slightly lower in H1 than in H3, but patterns across host species were similar, with higher rates for swine and avian viruses (Fig. S3).

Protein evolution of H3 across host species
The estimates of the ratio of non-synonymous to synonymous substitutions [dN/dS (dN/ dS)] were higher in mammalian viruses than avian viruses across all subsets of the H3 sequence (Fig. 6).As expected, dN/dS was approximately 1-4 times higher in the H3 head compared with the stalk domain (Fig. S4; Table S3 to S6).Estimates for dN/dS in antigenically relevant sites in the H3 were 2.2-13.5 times higher in humans compared with canines, swine, equines, and avian species (Table S4).Second to human viruses, canine Asia H3N2 and canine USA H3N8 lineages had the highest dN/dS estimates across all sequence subsets.However, we observed a strikingly different pattern in the stalk domain.Estimates for dN/dS in the stalk domain are 1.8-7.4times higher in canines (Table S6) compared with human, equine, swine, and avian lineages.

Protein evolution of H1 across host species
Patterns of dN/dS in H1 viruses were generally similar to those observed for H3.Again, the estimates of dN/dS were higher in H1 mammalian viruses than in H1 avian viruses in all subsets of positions (Fig. 7).As expected, dN/dS was approximately 1.5-5 times higher in the H1 head compared with the stalk domain (Fig. S5; Table S7 to S9).We estimated the highest overall dN/dS for the H1 human lineage that circulated from 1918 to 1957 for all sites and head domain (Fig. 7; Table S7 and S8).
Estimates for dN/dS for the H1 head domain were 1.5−5.8times higher in humans than in swine and avian species (Table S8).The dN/dS of the head domain of human-like "delta" swine lineages circulating in the USA in 2005-2019 was between 1.4 and 1.9 times higher than several other swine H1 lineages, including delta, classical, and Eurasian swine viruses (Fig. 7; Table S8).The US delta swine viruses also had an elevated dN/dS in the stalk domain compared with other lineages (Fig. 7; Table S9).

Comparison of non-synonymous and antigenic rates of evolution
Rates of antigenic evolution (antigenic changes per unit time) of the corresponding human and swine HA lineages, derived from HI assays (51,52), are compared with nonsynonymous rates per unit time in Fig. S6.In all cases where confidence intervals are not too wide to permit comparison, human H3 and H1 have higher antigenic rates than their swine counterparts.Human H3 has a higher non-synonymous rate per unit time than swine H3, but human H1 has a lower rate than many types of swine H1.The large uncertainties of antigenic rate estimates for some types of swine HA limit the value of comparisons.

Glycosylation patterns across H3 lineages
We investigated trends in glycosylation for each lineage using the motif Asn-X-Ser/Thr-X, where X is any amino acid other than proline.The average yearly number of glycosyla tions for H3 (Fig. 8) varies markedly across host species with an average of 6.2 glycosyla tions in avian lineages, 7.0 in canine, 7.8 in equine, 9.2 in swine, and 11.5 in humans.While the number of glycosylations in avian, canine, and equine hosts generally remained constant over time, the number of glycosylations in human and swine viruses increased over time.The human H3N2 lineage inherited 7 glycosylations from its avian ancestor.It subsequently gained 6 glycosylations (Table 1), but the avian lineage did not gain or lose glycosylations in the decades that followed.During 1968-2019, the human H3 lineage gained an average of 1.1 glycosylation per decade, peaking at 13 glycosylations in 2017.Subsequent host-switches of H3N2 viruses from humans to swine seeded additional human-like H3 lineages in pigs.After establishment in swine, the number of glycosylations typically decreased over time to between 8 and 9 glycosyla tions (Fig. 8; Table 1).Two swine lineages in China and Vietnam gained glycosylations over time, but there are only a few years of data available for these lineages and longterm trends are unclear.
The avian H3 lineages have stably maintained the lowest number of glycosylations (6-7) in both the Americas and Eurasia.The equine lineage maintained a low glycosyla tion average of 7 for approximately 30 years after emergence, but in the late 1980s, it gained a single glycosylation that has been maintained as of 2019 (Fig. 8).One site was lost in the genesis of canine H3N8, which mostly maintained the resulting glycosylation state.The avian-origin canine H3N2 clade has also maintained 7 glycosylations, like the Eurasian avian lineage from which it derived (Fig. 8), though one of the glycosylations is at a different position.In H3, glycosylations are accumulated in the HA structural domains at different rates across lineages (Figs.S7 and S8; Table S14 and S15).Glycosylations were mainly gained and lost in the HA stalk domain in Avian viruses circulating in the Americas (Fig. S8), with no net change in the number (Table S15).There were gains of 1-2 glycosylations in the stalk domain of swine viruses circulating in China and Vietnam.The number of glycosyla tions in the HA head domain of human viruses increased by 3 during the period between late-1990s and early-2000s and was overall stable during approximately 15 years, followed by the loss of at least 1 glycosylation.The stalk domain of the human lineage kept a stable number of 5 glycosylations from its emergence to 2010 but since then has gained 2 glycosylations.

Glycosylation patterns across H1 lineages
The average number of glycosylations (Fig. 9; Table S16; File S4) varies across H1 lineages in a manner similar to H3.The avian lineages have maintained a stable, lower number of 7 glycosylations over time (Fig. 9).Like human H3, the human 1918-1957 H1 lineage, Lineage estimates with wider violin plots (lower probability of observations taking a given value) are presented in Fig. S6.
which is thought to have originated in birds, initially had 7 glycosylations.By 1957, the lineage had 11 glycosylations, a net gain of 4 (Fig. 9; Table S16).As expected, the human 1977-2009 H1 lineage, which began with a virus genetically similar to H1N1 from the 1950s, inherited 11 glycosylations.This number dropped to 10 by the time the virus went extinct in 2009 during the swine-origin H1N1 pandemic.The currently circulating swine-origin H1N1 pandemic (PDM) lineage in humans inherited a relatively low number of glycosylations from swine (n = 8), but this number increased to 9 within a decade (Fig. 9; Table S16).Surprisingly, the initial classical swine virus circulating in North America experienced a dramatic gain in glycosylations since 1970.The virus emerged in US swine during the 1918 pandemic with 7 glycosylations, like early human H1N1, but this number increased to 11 by 2019, reaching the same maximum as humans.In contrast, the Eurasian avian-like H1 swine lineage only gained 1 glycosylation in the decades since the virus switched hosts from birds to pigs in the late 1970s.The swine delta lineages (Fig. 9) resulted from host jumps from Human1977-2009 viruses.The average number of glycosylations in swine delta lineages and PDM lineages has remained relatively stable over time (Table S16).There have been multiple spillovers of human pandemic H1N1 into swine since 2009.Between 1992 and 2005, six of the swine lineages (DTswEUR1994, DTswVNM, DTswUSA2005OK, DTswCHLRancagua, DTswCHLMelipilla, and EAswEUR) have mostly lost glycosylation at position 94 (File S4).
As in H3, the accumulation of glycosylations in H1 occurred more frequently in the head domain in the human lineages (1.5-5.8 times more for H1 human lineages than in other swine and avian lineages) (Fig. S9 and S10; Table S17 and S18), whereas swine tended to gain and lose glycosylation at similar proportions in either HA domain, irrespective of subtype.
No gains or losses in glycosylation were observed in the stalk domain of human H1 lineages, avian, and pandemic swine viruses (Fig. S11).

DISCUSSION
Studying the evolution of influenza sequences in non-human hosts is important for several reasons: the burden of non-human infections, the ever-present possibility of a jump to human hosts, the knowledge gained about viral evolution in general, and the light it sheds on evolution of human influenza.A molecular evolutionary approach may be particularly informative for non-human influenza for which we are lacking antigenic data.
To study the evolution of H1 and H3 in all major hosts, we first built time-scaled phylogenetic trees for representative sets of coding sequences.Previous studies have paved the way for this type of analysis and have shown that independent rates of molecular evolution are necessary for reliable phylodynamic inference of influenza A viruses across hosts (53).However, a host-specific molecular model does not account for other sources of evolutionary rate variation.In this work, we apply an extension of this model by combining lineage-specific rates of evolution with random effects that accommodate within-lineage variation of the evolutionary rate.We determined rates of non-synonymous and synonymous change in the sequences and in portions correspond  ing to the two domains of HA.We also reconstructed the specific changes that occurred on all branches of the tree.
The ratio of non-synonymous to synonymous changes, relative to the ratio expected in the absence of selection, reflects the selective forces acting on the protein sequence.Figure 2 shows this quantity for all hosts for the head and stalk of H1 and H3, and Tables S10 to S13 compare them among hosts.In all cases, the avian dN/dS is lower than all of the mammalian values.Where the mammalian values are statistically distinguishable from each other, the human value is highest, except for the special case of canine stalk.Where differences exist, they might reflect differences in the strength of purifying selection, differences in positive selection, or some combination of the two.
High dN/dS on the head of human HA has long been attributed to selection for antigenic change, which allows escape from antibodies elicited by previous infections.Values for the head in other mammalian hosts are indistinguishable from human dN/dS or intermediate between human and avian values.The simplest explanation of this pattern is that there is selection for antigenic change in these other mammals as well, albeit weaker, but little or none in avians.Influenza infection in birds differs from that in mammals, mainly affecting the gastrointestinal tract, and birds may be less likely to be reinfected with the same subtype in their lifetimes.These differences, and other possible explanations of weak antigenic selection, may be due to the long association of these subtypes with avian hosts (5).
The N:S ratio on the H1 and H3 stalks is also higher in mammalian than avian viruses.The very high canine value can be explained by adaptations to canines, discussed further below.The higher value in other mammals might reflect weaker purifying selection or the operation of positive selection.There is no obvious reason for purifying selection on the stalk to be generally weaker in mammals, but the possibility cannot be excluded.Adaptive evolution of the stalk might involve escape from rare anti-stalk antibodies.The same force responsible for higher mammalian rates in the head would then also explain higher rates in the stalk.This possibility would be a concern for universal influenza vaccines designed to elicit antibodies that target the HA stalk (54).Some significant differences exist in N:S ratios among lineages of the same subtype that infect the same host (Table S5, S6, S8, and S9).The dN/dS of the head is lower in the 2009 human H1N1 pandemic lineage than in the other human H1N1 lineages.This likely reflects the small amount of time since the introduction of the 2009 lineage into humans and the consequent weakness of selection for antigenic change over much of its history (55).The vaccine strain for 2009pdm was not updated until 2016 because of a paucity of antigenic change (56).There is significant variation among both H3 and H1 swine lineages on both the head and the stalk.Differences in agricultural practices and swine vaccination rates over time and among countries may explain much of this variation.
We analyzed changes at 131 H3 positions commonly described as "antigenically relevant" to allow comparison to other results (57)(58)(59)(60).These positions are based on dubious premises: that all changes in the membrane-distal portion of the HA head that became widespread among human H3N2 were caused by selection for antigenic change and that regions of the protein not observed to have changed are not bound by antibodies.Analysis based on these sites can lead to circular reasoning, since positions that have changed in one part of the tree are more likely to change in other parts, even if the changes are not due to antigenic selection.This potential for this effect is highest in the human lineage, since these sites were selected based on human virus data.
Estimates of the rate of antigenic change, based on laboratory measurements, are available for human and swine H1 and H3 (51,52).Large uncertainties for several of the swine rates limit the conclusions that can be drawn from comparisons ( Fig. S7).In all cases where a conclusion is possible, the rate of antigenic change in human viruses is larger than that in swine.This is consistent with the generally higher non-syn onymous:synonymous ratio on the head of human viruses and its interpretation as an indication of stronger selection for antigenic change.Perhaps more appropriate is consideration of the rate of non-synonymous change per unit time.By this measure, human H3 again has a higher rate than swine H3, but human H1 is slower than most types of swine H1.The implication is that there is more antigenic change, on average, for a non-synonymous change in humans than swine H1.This is expected if there is stronger selection for antigenic change.Even in the absence of such selection, HA sequences, like those of most proteins, are expected to change over time.This will result in some antigenic change but less than that due to amino acid changes selected for the antigenic difference that they cause.
The number and locations of glycosylation sites vary across the HA trees, especially on the head domain.Glycosylation of the head can shield the protein from antibodies (61)(62)(63).It also comes with various costs (35,63,64).The optimal glycosylation state reflects a trade-off between the costs and the benefits, both of which may vary considerably among hosts and due to other factors.
As previously described (34,63,65), there has been net acquisition of glycosylations on the head since the introductions of H1 and H3 to humans in 1918 and 1968.This phenomenon is attributed to the selection for the shielding effects of glycosylation.The 2009 pandemic H1N1 may be in the early stages of a similar trajectory, having gained a head glycosylation after a few years.In contrast, avian H1 and H3 have generally maintained just one glycosylation on the head over long periods of time.
Equine and canine H3 typically have two or three head glycosylations.Swine H3 lineages converge to around four, despite being derived from human progenitors with a variable number.Loss of glycosylation is common in the swine lineages derived from highly glycosylated human H3 and on the host-switch branches leading to them.Presumably, this reflects weaker antigenic pressure in swine compared with humans, such that the costs of glycosylation outweigh its benefits, leading to selection for loss.
The pattern in swine H1 is similar in that the average number of head glycosylations is intermediate between the low number in avians and the higher numbers attained over time in humans.There is, however, considerable variation among swine lineages, mostly ranging from a single head glycosylation (like avian viruses) to three.The head has no glycosylations for a few years in the avian-derived EAswEUR.As discussed above, the N:S ratio also varies significantly among swine lineages.
The overall pattern of head glycosylation is similar to that of the non-synonymous to synonymous rate on the head:highest in human viruses, lowest in avian, and intermedi ate in non-human mammals.Differences in antigenic pressure among hosts would explain both patterns.
Considering only the total number of glycosylations hides the complexity of their evolution.This complexity is exemplified by position 94 of H1.Loss of this glycosylation site occurs in the ancestors of most isolates of six swine lineages.The site is present throughout most of the tree, including the avian lineages, so it would seem to have a cost that is specific to swine.Nevertheless, it is regained six times in EAswEUR, suggest ing that it has a net advantage in swine under some circumstances.It is never regained, however, in EAswCHN, which descends from EAswEUR.The explanation may be that glycosylation at position 94 provides protection against antibodies in those swine that have received certain vaccines (49,66).Vaccination of swine is not common in China (67).
The sequence data sets compiled in this study were subjected to substantial downsampling to mitigate the computational burden of Bayesian phylogenetic inference.The sequences used were chosen so as to represent the diversity through space, time, and host as well as possible.Some of the branches connecting viruses with different hosts were long, perhaps reflecting a lack of sampling of the earliest viruses infecting the new host.This made the characterization of early adaptation events difficult, particularly for swine and canine lineages.
This work sheds light on the evolution of influenza viruses circulating in a variety of non-human hosts, which has not been studied as well as human influenza evolution.Higher dN/dS on the HA head in all mammalian hosts compared with the avian host and the retention or accumulation of large numbers of head glycosylations in some swine lineages (as occurs in human HA) suggest that mammals in general impose greater selection for antibody escape.Mammals also had higher dN/dS ratios in the stalk domain compared with birds.The reasons for this require further investigation, but it may reflect antigenic evolution in the stalk in mammals.In addition, evolutionary analysis of the sequences identified host adaptation in canines involving a particular phenotype.These results, based entirely on sequence data, provide information about influenza evolution in non-human hosts and place human influenza evolution in a larger context.

Data set assembly
We downloaded all available hemagglutinin sequences for influenza A H3 and H1 from GenBank and downloaded specific geographical isolates, not available in GenBank, from GISAID on 8 December 2018.The sequences were aligned using the MAFFT (68) multiple sequence alignment tool.The aligned sequences were manually edited and cleaned in AliView version 1.26 software (69).The resulting data set was trimmed at the 5′ and 3′ ends to include solely the coding sequence.This data set was subjected to multiple iterations of phylogeny reconstruction using FastTree version 1.0 (70) with a general time-reversible (GTR) model and exclusion of outlier sequences whose genetic divergence and sampling date were incongruent using TempEst (71).
We defined lineages from the same host and geographical region and only consid ered those circulating for at least 3 years in the region (Table S1).Each lineage includes only viruses that infect a single host species or closely related host species.Each inferred introduction into a new species creates a new lineage.In some cases, what could be a lineage is subdivided on the basis of geography.Besides the H1 pandemic lineage circulating in humans since 2009, we defined two human H1 lineages circulating during the periods of 1918-1957 and 1977-2009.The H1 lineage reintroduced in 1977 derives directly from the viruses circulating in 1957, so studying the two lineages together would disrupt the temporal signal and consequently the molecular clock model, due to the lack of accumulation of mutations during the 20-year period where H1 did not circulate (72)(73)(74)(75).For the non-avian lineages, we excluded sequences with mixed subtypes, as marked by "Hx" or "mixed." For some analysis, we combine lineages for the same host.
The number of sequences available makes a full Bayesian inference using all of them impractical due to the computations required.For this reason, we analyzed a representa tive subset, which excluded those that cover less than 75% of the coding sequence with unambiguous bases and retained only one of identical sequences collected in the same country on the same day or in the same month or year in the case of incomplete date information.
Next, we applied a subsampling strategy that selected a specific number of sequen ces per country per time, in order to obtain similarly sized lineages that represented the overall circulating diversity.When sequences had the same country and collection date, we chose, in a decreasing order of priority, those that were longer, those that had at least the month of collection rather than just the year, those with a smaller number of internal gaps with lengths not divisible by 3 (codons), and those with full dates of collection (https://github.com/jlcherry/SAMPI).We combined all subsampled lineages for each subtype and performed a final iteration of phylogeny reconstruction and exclusion of outlier sequences as described above.The final H3 data set consisted of 873 sequences from viruses isolated between 1963 and 2019 from the following hosts: avian (n = 178), human (n = 101), swine (n = 370), equine (n = 87), and canine (n = 137).For H1, there were 964 sequences, with collection dates between 1918 and 2009, from avian (n = 136), human (n = 186), and swine (n = 642) hosts.The H3 sequences were divided into 16 lineages and the H1 sequences into 33 (Fig. S1; Table S1; Files S1 and S2).

Phylogenetic analysis
Phylogenetic relationships were inferred separately for the H3 and H1 data sets, with a Bayesian phylogenetic approach using Markov chain Monte Carlo (MCMC) available via the BEAST v1.10.5 package (76) and the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD, USA (http:// biowulf.nih.gov).
We divided the sequences into lineages based on host, geography, and a preliminary phylogeny inferred using FastTree with a GTR substitution model (77), retaining clusters with a minimum of 10 sequences and a minimum of three different collection years .Some lineages are paraphyletic with respect to other lineages.We partitioned the coding sequences into first, second, and third codon positions and applied a separate Hasegawa-Kishino-Yano 85 [HKY85 (78)] substitution model with gamma-distributed rate variation among sites to all partitions (79).To infer time-scaled trees, we apply a mixed-effects molecular clock model that combines a host-specific local clock (53) with random effects that allow for branch-specific rate variation similar to an uncorrelated relaxed clock model.We specified a Bayesian Skygrid coalescent tree prior (80).For sequences with an incomplete date of viral collection available, the lack of tip date precision was accommodated by sampling uniformly across a 1-year window from 1 January to 31 December.If the month of collection was available, the lack of tip date precision was accommodated by sampling uniformly across a 1-month window from day 1 to day 28, 30, or 31.
We applied an approach that relies on counting methods, set in a Bayesian frame work, to estimate lineage-specific synonymous (synonymous substitutions/all sites/year) and non-synonymous (non-synonymous substitutions/all sites/year) substitution rates as well as their ratio relative to the neutral expectation (dN/dS) in large data sets, while integrating over the posterior distribution of phylogenies and ancestral reconstructions to quantify the uncertainty on the lineage-specific estimates (81).
The ratios that we refer to as such are not, strictly speaking, dN/dS, and the sequence changes are not properly referred to as substitutions, because many of the changes have not in any sense gone to fixation and never will.We nevertheless continue a looser use of these terms that has become common in similar contexts.Terminology aside, these changes may have characteristics that differ from those that reflect the full effects of selection because they have come to predominate in their lineages.
The MCMC chain was run separately at least 20 times for each of the data sets and for at least 50 million iterations with variables recorded every 50,000 iterations, using the BEAGLE (82) library to improve computational performance (Files S1 and S2).All parameters reached convergence (effective sample sizes greater than 200), as assessed visually using Tracer v.1.7.1, with statistical uncertainty reflected in values of the 95% HPD.At least 10% of the chain was removed from the beginning as burn-in.MCC trees were summarized using TreeAnnotator v1.10.5 and visualized in FigTree v1.4.4.
We summarized the synonymous and non-synonymous rates and dN/dS estimates for all protein sites (AS), head and stalk domains (83), and a set of 131 positions claimed to constitute the antigenically relevant sites for H3 (ARS, analyzed only for H3) (84,85).All sites investigated are presented in Table S2.We evaluated the statistical significance of differences in dN/dS by applying Fisher's exact test to the counts of synonymous and non-synonymous changes in each pair of lineages inferred from a most parsimonious reconstruction on the MCC tree.We then applied the Benjamini-Hochberg Procedure to find the set of pairwise comparisons giving a false discovery rate of 5%.

Ancestral sequence reconstruction
Ancestral state reconstruction of nucleotides was performed on the MCC tree using maximum parsimony, implemented as a Python program (File S3 at https://github.com/jlcherry/HA_Evolution).In cases where the reconstruction was ambiguous, it was disambiguated arbitrarily for simplicity.Changes to the protein sequence were inferred from translations of the reconstructed nucleotide states.Raw ratios of non-synonymous to synonymous changes (N:S) were calculated from the reconstructions.

Glycosylation patterns
We scanned all sequences (ancestral and contemporaneous) for each data set to identify potential N-linked glycosylated sites, based on the motif Asn-X-Ser/Thr-X (N-X-S/T-X), where X is any amino acid other than proline (86) using a program written in Python.We then plotted the number of glycosylated sites over time for each lineage and estimated their rate of accumulation or loss.

FIG 4
FIG 4 Amino acid changes across the H3 protein of canine isolates.Total number of amino acid changes across the H3 protein of canine isolates (top).Changes in canine H3N2 (red) and canine USA H3N8 (blue) are stacked.The seven positions with at least four changes are labeled.Six of these are interface positions in all monomers of all four structures considered (PDB entries 4uo4, 6n4f, 4uo0, and 4unw).The seventh (464) is an interface position in all monomers of the two equine structures.Interface positions can be found in File S5.The seven positions evolving most rapidly in canine H3 shown on the HA structure (bottom).Three views together show all seven positions.The six protein chains (three each of HA1 and HA2) are given different light colors, and the seven positions are shown in darker versions of the same color.In panel A, three 216 and 218 residues appear (one of each on each monomer), but only one of each is labeled.In panel C, one of the HA monomers (chain C) is made invisible to reveal the buried position 107.Positions 216 and 218 in chain A are not colored in this view because they are at the interface with the invisible chain C. HA2 positions are given in parentheses.The HA structure is that of A/Equine/Richmond/07, PDB entry 4UO0.

FIG 6
FIG 6 dN/dS for all H3 sites, antigenically relevant sites, head and stalk domains across lineages.Viral lineages are colored by host species: green (human), blue (swine), red (avian), purple (equine), and orange (canine).Lineage estimates with wider violin plots (lower probability of observations taking a given value) are presented in Fig. S5.

FIG 7
FIG7 dN/dS for all H1 sites, head and stalk domains across lineages.Viral lineages are colored by host species: green (human), blue (swine), and red (avian).

FIG 8
FIG8 Average number of H3 glycosylations over time.For each lineage, the average number of glycosylation sites is plotted for each calendar year for which numbers are available.The average is taken over all isolates and internal nodes (reconstructed ancestral sequences) dated to that year.

FIG 9
FIG9 Average number of H1 glycosylations over time.For each lineage, the average number of glycosylation sites is plotted for each calendar year for which numbers are available.The average is taken over all isolates and internal nodes (reconstructed ancestral sequences) dated to that year.

TABLE 1
Average number and rate of accumulation of glycosylation in H3 lineages a Only 3-4 years of data is available for these lineages.