Genomic Signatures of Influenza A Pandemic (H1N1) 2009 Virus

Amino acid signatures at host species–specific positions may affect virus transformation.

A recent outbreak of pandemic (H1N1) 2009, previously known as the swine-origin influenza A, has infected >296,000 persons worldwide; 3,486 deaths have been reported (1). An increased number of infected humans can potentially alter virulence in the human population. The genomic sequences of many of the new strains of pandemic (H1N1) 2009 virus have revealed important information for promoting medical diagnosis, drug-resistance monitoring, clinical and basic research, and vaccine development. Nevertheless, analyzing adaptive mutation of the new pandemic (H1N1) 2009 virus is a priority so that researchers can evaluate the likelihood that viruses from other nonhuman species will further adapt to humans. Pandemic (H1N1) 2009 virus consists of multiple reassorted virus genes from different origins. Of its 8 segmented genomic RNAs, 2 polymerase genes, PB2 and PA, were from the avian virus of North American lineage and were introduced into swine populations around 1998. The other polymerase gene, PB1, also evolved recently from a human seasonal influenza (H3N2) virus around the same year. This particular H3N2 PB1 gene is known to have originated from an avian virus that entered humans in 1968. However, hemagglutinin (HA), nucleoprotein (NP), and nonstructural (NS) protein genes of pandemic (H1N1) 2009 virus descended directly from the classic swine influenza A virus of North American lineage, which can be traced back to the 1918 virus. Originating from the Eurasian swine virus, the remaining 2 genes, neuraminidase (NA) and matrix (M), were introduced from birds around 1979 (2,3). Limited information is available as to how this unique combination of gene segments evolved from 1998 until it was identified in April 2009 or on the molecular transitions or evolutionary path of this virus before it was transmitted among humans.
Our previous study developed an entropy-based computational scheme to identify host-specific genomic signatures of human and avian influenza viruses (4). This method is based on an entropy threshold computed from the amino acid composition at the well-known PB2-627 position of avian influenza viruses (entropy value of 0.4 was based on 95 avian influenza genomes, as of early 2006), which contains mostly glutamic acid in the native avian hosts of the viruses. This threshold was then used to identify the 52 species-associated positions at which each of the 2 viruses settles as a distinct amino acid residue that is characteristic of the host. Although the origin of the gene segments in pandemic (H1N1) 2009 virus has been determined (2,3), the mechanism of transformation of the host-specific amino acid signatures is unclear, because the new viral genes evolved after they were introduced into the swine population some years ago.
By adopting the entropy profiling approach, this study attempts to update influenza A viral signatures on the basis of all influenza sequences from the National Center for Biotechnology Information (NCBI). In addition to providing an updated list of human-avian signatures, this study also computes the human-swine signatures and analyzes the amino acid sequences of pandemic (H1N1) 2009 virus at the host species-specific positions to elucidate the adaptive mutation of influenza A viruses in these host species. As more new influenza virus isolates are collected and their sequences analyzed, the signatures at the host species-specific positions serve as predictors of adaptive mutation, subsequently providing valuable information to help in preparing for potential pandemics.

Influenza Virus Sequences
All influenza A virus protein sequences from the NCBI, as of May 28, 2009, were downloaded and analyzed. These full-length or partial sequences were grouped according to the hosts from which the viruses were isolated: human, avian, and swine. In particular, to observe how these viruses vary in terms of residues, the newly deposited pandemic (H1N1) 2009 virus sequences were considered separately from the human isolates. For each host-specific group, sequences belonging to each viral protein were aligned by using the program ClustalW (5). Based on the proposed signature identification procedure, 2 surface proteins, HA and NA, were not analyzed because their extensive genetic diversity prevents satisfactory multiple alignment within either human or avian viruses. As an alternatively translated protein product from the PB1 gene, PB1-F2 is also not included in the analysis because it terminates prematurely at position 12. For each of the 4 groups of data, i.e., human, avian, swine, and pandemic (H1N1) 2009, 8 alignments were analyzed: PB2, PB1, PA, NP, M1, M2, NS1, and NS2. The total number of sequences varied from gene to gene and from host to host, subject to their availability at the NCBI. For human-isolated viruses (excluding strains of pandemic (H1N1) 2009 virus), >3,000 sequences of the 8 proteins were analyzed. For avian-isolated, swine-isolated, and pandemic (H1N1) 2009 viruses, the numbers of sequences were ≈3,500, 350, and 70, respectively.

Recent Ancestors of Pandemic (H1N1) 2009 Viruses
Smith et al. (6) performed evolutionary analysis of the early development of the pandemic, indicating that sporadic infection of humans with triple reassortant and other subsequent reassortant swine viruses occurred before the 2009 human outbreak. To elucidate the transition of amino acid residues along this evolutionary course, we collected and analyzed the protein sequences of 18 recent ancestral swine viruses of the new H1N1 viruses (hereinafter termed "recent ancestral swine viruses") for 1999-2009 from the ancestral lineages of the new pandemic (H1N1) 2009 strains. The sampling was based on the phylogenetic trees published in a study by Smith et al. (6). Although a number of swine virus origins have been reported, resulting in various genetic lineages and subtypes, we are most interested in identifying a swine virus population from which the current pandemic (H1N1) 2009 virus might have evolved directly. Not only are those 18 strains chronologically closer (after the years 1997-1998) to the pandemic (H1N1) 2009 viruses but their PB2 and PA genes are also descendants of avian viruses, which complies with the conclusion drawn from recent publications. The online Appendix Table, available from http://www.cdc.gov/EID/content/15/12/1897-appT.htm, summarizes the strain names and accession numbers of recent ancestral swine viruses included in this study.

Entropy-based Signature Identification
For each amino acid position of the aligned sequences of the same virus type, i.e., avian, human, swine, or pandemic (H1N1) 2009, an entropy value was computed by using the formula $-\sum_{i} P_i \ln(P_i)$, as described by Chen et al. (4). This formula follows the definition of Shannon entropy (7), which has been used to evaluate the diversity of a system. In this study, entropy was used to measure the variability of aligned amino acid residues at a given genomic position, where i = 1 to 20 indexes the 20 different amino acid residues, and $P_i$ represents the probability of the respective residue at that position. An entropy value ranges from 0 (only 1 residue present at that position) to 2.996 (all 20 residues equally represented). A position at which the entropy is less than or equal to a prespecified threshold is assumed to have a consensus residue for that virus type. When viruses isolated from 2 host species are compared, a species-specific signature position is one at which the 2 viruses have different consensus amino acid residues. In this study, an entropy threshold of 0.33 was used, based on the PB2-627 position of 3,391 avian influenza sequences.
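The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, gaps and ambiguous characters are assumed to be ignored, and the two alignments are assumed to share the same column coordinates, none of which is specified in the text.

```python
import math
from collections import Counter

# The 20 standard amino acids; anything else (gaps, X, etc.) is ignored.
# Ignoring such characters is an assumption of this sketch.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def column_entropy(residues):
    """Shannon entropy (natural log) of the residues at one aligned position."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return None  # no usable residues at this position
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


def consensus(residues, threshold=0.33):
    """Most common residue if the entropy is <= threshold, otherwise None."""
    h = column_entropy(residues)
    if h is None or h > threshold:
        return None
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    return counts.most_common(1)[0][0]


def signature_positions(alignment_a, alignment_b, threshold=0.33):
    """Positions (1-based) where both groups have a consensus residue and
    the two consensus residues differ.

    Each alignment is a list of equal-length aligned sequences; both
    alignments are assumed to use the same column numbering.
    """
    length = min(len(alignment_a[0]), len(alignment_b[0]))
    signatures = {}
    for pos in range(length):
        a = consensus([s[pos] for s in alignment_a], threshold)
        b = consensus([s[pos] for s in alignment_b], threshold)
        if a is not None and b is not None and a != b:
            signatures[pos + 1] = (a, b)
    return signatures
```

With this sketch, a column containing a single residue type has entropy 0 and a column with all 20 residues equally represented has entropy ln(20), matching the range stated above.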

Results
In 2006, we reported 52 avian-human signatures based on a small set of influenza sequence data of 15,785 protein sequences. The selection was based on an entropy threshold value of 0.4 set at position 627 in the PB2 gene (82 Es and 13 Ks from 95 avian PB2 sequences) because that position has been considered associated with host-restriction (8)(9)(10)(11). Of the 52 positions, 45 are in the genes PB2, PB1, PA, NP, M1, M2, NS1, and NS2 examined in this work.
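As a check on the 2006 threshold, the entropy at PB2-627 can be worked out directly from the 82 Es and 13 Ks quoted above, using the Shannon entropy formula with the natural logarithm. The rounding to 3 decimal places is ours.

```latex
\begin{aligned}
P_E &= \tfrac{82}{95} \approx 0.863, \qquad P_K = \tfrac{13}{95} \approx 0.137,\\
-\sum_i P_i \ln(P_i) &= -\left(0.863 \ln 0.863 + 0.137 \ln 0.137\right)\\
&\approx 0.127 + 0.272 = 0.399 \approx 0.4 .
\end{aligned}
```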
Today, >100,000 influenza protein sequences are available at NCBI, and a new entropy threshold of 0.33 was set based on the currently available avian sequences of PB2-627, which contain 3,113 Es, 228 Ks, 46 Vs, 2 As, and 2 Gs. This threshold was adopted to update the list of 47 avian-human signatures in Table 1 [...] reported are PB1-375 and PA-382; the latter has already been mentioned above. The other missing position in Table 1 [...] Table 1 presents the updated avian-human signatures for influenza A viruses; Table 2 summarizes the swine-human signatures. The literature documents that the swine virus population comprises distinct evolutionary lineages: one that originated from the 1918 virus, referred to as classic or North American swine virus, and others consisting of the post-1979 Eurasian swine virus and subsequent triple reassortants. Because the distinct origins of these swine viruses markedly increased the residue diversity at many positions, only 8 swine-human signatures met the 0.33 threshold. Unlike the avian-human signature positions, at some of which human-like signatures of pandemic (H1N1) 2009 were found (Table 1), all 8 swine-human signature positions of this new virus are characteristic of swine. Notably, Table 1 lists all 8 positions in Table 2, with each having the same signature as in the avian virus. Restated, avian and swine viruses contain the same amino acid residue at the 8 human-swine signature positions.
We attempted to further elucidate the transition of amino acid residues at the positions where pandemic (H1N1) 2009 virus has human signatures by sampling 18 recent ancestral swine viruses (online Appendix Table). Doing so enables us to examine more closely the prevalence of amino acid residues specifically associated with pandemic (H1N1) 2009 viruses. Table 3 summarizes the amino acid statistics of these recent ancestral swine viruses together with avian, human, and pandemic (H1N1) 2009 sequences at the 8 positions containing human residues for pandemic (H1N1) 2009 virus in Table 1. [...] Table 3. The change in amino acid that may be associated with the transformation of pandemic (H1N1) 2009 virus is summarized in Table 4. In addition to PA-356, already shown in Table 2, two additional positions, PB2-684 and PA-204, showed the same dominant amino acid residue in avian and recent ancestral swine viruses, but a different dominant residue in pandemic (H1N1) 2009 viruses and human viruses. Dominance is defined here as 1 residue having the largest sequence count compared with other residues at a particular aligned position. The entropy measurement used in Tables 1, 2, and 3 does not apply to the positions listed in Table 4, in which we emphasize the amino acid transition of dominant residues instead of highly conserved ones subject to the prescribed entropy threshold of 0.33. Other than those 3 positions, PB1-216 was found to contain a human residue G in 8 of 9 recent ancestral swine viruses that are closer to pandemic (H1N1) 2009 viruses in the phylogenetic tree published in a study by Smith et al. (6). However, for the other 7 recent ancestors that are more distant from pandemic (H1N1) 2009 viruses, PB1-216 maintains an avian residue S in 6 of 7 viruses. Our results show that the position-specific transition may serve as a molecular marker for monitoring such adaptive mutations in the future.
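The dominance criterion and the transition pattern used for Table 4 can be sketched as follows. This is an illustrative reading of the definition above, not the authors' code; the function names are hypothetical, and the way ties are broken is an assumption because the text does not address ties.

```python
from collections import Counter


def dominant_residue(residues):
    """Residue with the largest count at one aligned position.

    Ties are resolved in favor of the residue seen first, which is an
    assumption of this sketch.
    """
    counts = Counter(residues)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def shows_transition(avian, ancestral_swine, pandemic, human):
    """True if avian and recent ancestral swine viruses share one dominant
    residue while pandemic and human viruses share a different one."""
    a, s, p, h = (
        dominant_residue(col)
        for col in (avian, ancestral_swine, pandemic, human)
    )
    if None in (a, s, p, h):
        return False
    return a == s and p == h and a != p


# Hypothetical residue columns for one position; the letters are made up
# for illustration and are not taken from Tables 3 or 4.
example = shows_transition("KKKKR", "KKK", "RRR", "RRRK")
```

Unlike the entropy criterion, this test does not require a position to be highly conserved; it only compares the most frequent residue in each group.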

Discussion
Although most studies concur that the death rate associated with pandemic (H1N1) 2009 infection is more moderate than that of subtype H5N1 infection, adaptive mutations in viral genes may alter the virulence of the new virus in its new host species. Many of the previously identified virulence factors are apparently not involved. For instance, no E to K mutation is observed at position 627 of PB2, a change that has been considered an important factor for avian virus to efficiently replicate in mammalian systems (8)(9)(10)(11). Previous studies have indicated that PB1-F2 contributes to viral pathogenesis in the mammalian system (13,14). No PB1-F2, however, is predicted in pandemic (H1N1) 2009 viruses because it terminates prematurely at position 12. The NS1 protein of the virus is truncated at position 220 and, therefore, lacks a PDZ ligand interacting domain. As suggested recently, the presence of this PDZ ligand domain increases the pathogenicity of avian influenza A viruses (15). Although these known factors are absent, a previous study has demonstrated that the virulence of pandemic (H1N1) 2009 virus is higher than that of seasonal influenza A viruses (16). Although a virulence marker and a host range factor may not necessarily be tightly linked, recent investigations have also demonstrated that altering PB2-627 from E to K in avian viruses increases their virulence in the mammalian experimental system (9)(10)(11). For example, avian influenza virus subtype H7N7 reportedly infects humans (17). A human isolate from a fatal case had its PB2-627 changed from the avian-characteristic E to K. Correspondingly, the species-associated signatures identified in this study may serve as potential molecular targets for further evaluating how they affect the virulence of pandemic (H1N1) 2009 viruses in humans.
As shown in Tables 1 and 2, the number of signature positions decreases significantly from 47 (human vs. avian) to 8 (human vs. swine), and the positions of the latter are a subset of those of the former. These observations may have the following implications. First, the 3 host species of interest differ, with each providing a unique environment for infection by the influenza virus. When the avian virus enters humans or swine, its genetic features are shaped by a particular evolutionary path. The viruses, therefore, have different signatures. Second, some avian-like signatures are preserved in swine viruses, suggesting that avian species and swine may provide similar conditions for harboring influenza A viruses. Body temperature may be a determinant. As is generally known, many avian species have a body temperature exceeding 40°C; for most pigs it is variable but still higher than the human body temperature, which is 37°C. Consequently, the signatures are retained when an avian virus enters the swine population, with similar signature-related viral replication mechanisms in both species. Third, the 39 signature positions shown in Table 1, but absent from Table 2, may be correlated with certain functional domains that interact with host factors unique to humans and that differ significantly from those of avian species and swine. Finally, the number of signature positions of swine versus humans is substantially lower than that of avian versus humans, suggesting that the species barrier to humans is easier for a swine virus to cross than for an avian virus.
The entropy-based computation depends strongly on a good multiple sequence alignment. The 2 surface proteins HA and NA are excluded from this analysis because both comprise many subtypes within a given host species whose sequences have diverged substantially; locating conserved residues at particular positions on the basis of such alignments is extremely difficult. The entropy threshold is the other parameter requiring attention when locating a signature position. In this study, the entropy determined from PB2-627 of the aligned residues of all avian viruses is used because PB2-627 is the host-restriction marker best supported by laboratory evidence (8)(9)(10)(11). A completely new set of signatures can be reproduced rapidly by using a different entropy threshold based on other factors. The diverse genetic origins of influenza viruses would also have great impact on the reported signatures. The 8 positions listed in Table 2 were obtained by applying the proposed entropy-based method to all swine viruses of different origins, including North American (classic 1918)-origin strains, Eurasian (post-1979 avian)-origin strains, and recent triple reassortants. A comparison of, for example, all human viruses versus classic 1918-origin swine viruses before 1978 (≈75 strains, or 20% of our swine sequence population) would report 60 signature positions (data not shown). In this work, we included all swine viruses of multiple origins in producing Table 2 to consider only host-specific genomic signatures that have been shaped by the same swine species regardless of origin. For the same reason, we did not subdivide avian or human populations into lineages when reporting avian-human signatures in Table 1.
This study analyzed a complete collection of species-specific influenza A viral sequences, including the long-evolving avian, recent ancestral swine, and human viruses, as well as pandemic (H1N1) 2009 viruses, which are still in their infancy. The amino acid sequence transition of pandemic (H1N1) 2009 virus at the signature positions was also elucidated by applying the entropy-based signature analysis to these sequences. The residues of pandemic (H1N1) 2009 virus at these positions were found to be mostly characteristic of avian species, as presented in Table 1. Notably, 8 of them changed from avian-like signatures to human-like signatures. Close examination of the residue transition at these 8 positions in Table 3 showed that PA-356, unlike the other 7 positions, retained an avian-like signature in the recent ancestral swine population and changed to a human-like signature only in pandemic (H1N1) 2009. This finding suggests that PA-356 may be related to host-restriction factors from swine to human species. Similarly, all ribonucleoprotein positions were scanned for the same transitioning pattern as in PA-356, i.e., a retained avian-like residue in the recent ancestral swine population and a change to the human residue in pandemic (H1N1) 2009 viruses. Table 4 lists them all. Although 1 of the positions, PB1-216, was not dominated by the residue S in recent ancestral swine viruses as we would have expected, it exhibited a mixture of 2 residues involving a transition from avian to human viruses. In summary, Table 4 provides a list of candidate host-restriction factors that we believe are important to adaptive mutation of influenza A viruses among the 3 host species. Continuous monitoring of these signatures in nonhuman species will help in influenza surveillance and in evaluating the likelihood of further adaptation to humans.