Genomic Signatures of Human versus Avian Influenza A Viruses

Fifty-two species-associated amino acid residues were found between human and avian influenza viruses.


Pandemic influenza A virus infections have occurred 3
times during the past century; the 1957 (H2N2) and 1968 (H3N2) pandemic strains emerged from a reassortment of human and avian viruses (1). Recently, all 8 genome segments from the 1918 (H1N1) influenza A virus were completely sequenced. The results indicate that the 1918 pandemic virus may not have emerged by a reassortment of avian and human virus as did the 2 other pandemic strains. Although the 1918 H1N1 is not considered an avian virus, it is the most avianlike of all mammalian influenza viruses (2,3). The recent circulation of highly pathogenic avian H5N1 viruses in Asia from 2003 to 2006 has caused >90 human deaths and has raised concern about a new pandemic (4). Therefore, we need to understand what genetic variations could render avian influenza virus capable of becoming a pandemic strain. Genomewide comparison of human versus avian influenza A viruses would show the evolutionary similarities and differences between them and thus provide information for studying the mechanism of influenza viral infection and replication in different host species.
Although many research efforts have focused on the molecular evolution of specific genes of influenza viruses, comprehensive comparisons among the nucleotide sequences of all 8 genomic segments and among the 11 encoded protein sequences have not been extensively reported. In this study, we used several computational approaches for finding specific genetic signatures characteristic of human and avian influenza A viral genomes. We subsequently validated the robustness of those signatures with human and avian protein sequences downloaded from Influenza Virus Resources at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html).
[...] automated DNA sequencer. Sequence editing and processing were performed with Lasergene, version 3.18 (DNASTAR, Madison, WI, USA). Multiple sequence alignment was performed with ClustalW version 1.83 (ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw). Global sequence comparison that yielded the pairwise sequence identities used in histogram analysis was done with the program Needle in the EMBOSS package (5). Amino acid sequences were translated from coding sequences and aligned by BioEdit (6). An entropy value was defined at an aligned amino acid position according to the formula $\sum_i P_i \log(P_i)$, in which $P_i$ is the observed probability of each of the 20 amino acids (aa) (7). A graphic tool was developed in Java for displaying the entropy plot used in this work. All amino acid numberings are based on influenza virus A/Puerto Rico/8/1934 (PR8).
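The entropy definition can be illustrated with a minimal Python sketch. It is not the authors' tool (their plotting program was written in Java); the function name `column_entropy` and the choice to ignore gap characters are assumptions made for illustration. The natural logarithm is also an assumption, chosen because it reproduces the value -0.379 that the text reports for PB2 position 627 in the 95 avian genomes (83 Glu, 12 Lys).

```python
import math
from collections import Counter

def column_entropy(residues):
    """Return sum_i P_i * ln(P_i) over the residues seen in one alignment column.

    Gap characters ('-') are ignored, which is an assumption of this sketch.
    The result is 0 for a perfectly conserved column and negative otherwise.
    """
    observed = [r for r in residues if r != "-"]
    if not observed:
        raise ValueError("column contains no residues")
    counts = Counter(observed)
    total = len(observed)
    return sum((n / total) * math.log(n / total) for n in counts.values())

# PB2 position 627 in the 95 avian genomes: 83 Glu (E) and 12 Lys (K).
avian_pb2_627 = "E" * 83 + "K" * 12
print(round(column_entropy(avian_pb2_627), 3))  # -0.379

# The same position in the 306 human genomes is all Lys, so the entropy is 0.
print(column_entropy("K" * 306) == 0)  # True
```

With this sign convention, more variable positions have more negative values, which is why the threshold of -0.4 acts as a lower bound on conservation.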

Sequences Used in Study
To show the host-associated amino acid signatures, we retrieved full genome sequences (as of August 22, 2005) from the genome browser at Influenza Sequence Database (ISD) (8). To differentiate between avian and human influenza viruses, we excluded human-isolated avian influenza viruses from the human dataset and examined those sequences separately. Altogether, we had 95 avian and 306 human influenza viral genomes, henceforth termed "primary dataset." All 11 viral proteins encoded by the 8 genomic RNA segments were compared: PB2, PB1, PB1-F2, PA, HA, NP, NA, M1, M2, NS1, and NS2.
Avian influenza viruses from human influenza patients were separately retrieved from NCBI as well as from ISD. Altogether, we had 417 protein sequences from 60 avian influenza strains, of which 21 strains contained sequences (full or nearly full length) from all 8 genomic RNA segments.
For validating the signatures obtained from analyzing the primary dataset, we further retrieved 15,785 human or avian influenza A viral protein sequences from NCBI's Influenza Virus Resources. Details for the sequences used can be found in online Appendix, Supporting Materials and Methods (available from http://www.cdc.gov/ncidod/EID/vol12no09/06-0276.htm#app), as well as in online Appendix Table 1 (http://www.cdc.gov/ncidod/EID/vol12no09/06-0276_appT1.htm) and online Appendix Table 2 (http://www.cdc.gov/ncidod/EID/vol12no09/06-0276_appT2.htm). Eleven Taiwanese genomes produced in this work have been deposited in GenBank with accession numbers DQ415283 through DQ415370.

Differing Amino Acid Residues
Using previously described methods (7), we separately calculated an entropy value for every aligned amino acid position for 95 avian influenza viruses and 306 human influenza viruses. Those amino acid residues with an entropy value between 0 and -0.4 for both the human and avian strains were identified as most highly conserved. We chose this entropy threshold on the basis of the entropy value -0.379, calculated at position 627 of PB2 for the 95 avian viruses. This widely reported, species-associated residue is highly conserved; it has E (Glu) in 83 and K (Lys) in 12 avian isolates and Lys in all 306 human isolates. We then selected those conserved positions with distinct amino acid residues between human and avian influenza viruses as potential host-associated signatures. An entropy plot for identifying such signature residues for avian versus human influenza virus NP segments is shown in the Figure.
In addition to the previously mentioned 3 positions with distinct amino acid residues between avian and human strains, we found 225 additional positions with nearly distinct amino acid residues, with their computed entropy values less negative than -0.4 in both the 306 human and 95 avian strains that we analyzed. To assess the robustness of those 228 residues used in differentiating human from avian influenza viruses, we further examined 15,785 influenza A protein sequences from NCBI. After validation, 52 positions still showed an entropy value less negative than -0.4 and remained conserved as distinct amino acid residues between human and avian viruses (Table 1). From this entropy analysis, we identified an additional 51 aa positions that may be as important as the well-known position 627 of PB2. We designated these 52 positions as "species-associated" signatures. Among the 11 ORFs, NP contains the highest number of such signatures (15 positions) (Table 1).
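The selection step described above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the inputs are assumed to be pre-aligned, gap-free protein strings of equal length, and "distinct residues" is interpreted here as the most frequent residue differing between the two hosts. The toy alignment is invented; only its middle column mirrors the PB2 627 counts given in the text.

```python
import math
from collections import Counter

THRESHOLD = -0.4  # entropy cutoff used in the text

def entropy(column):
    """Sum of P_i * ln(P_i) for one alignment column (0 = fully conserved)."""
    counts = Counter(column)
    total = len(column)
    return sum((n / total) * math.log(n / total) for n in counts.values())

def find_signatures(avian_seqs, human_seqs, threshold=THRESHOLD):
    """Return (position, avian_residue, human_residue) for candidate signatures.

    A position qualifies when its entropy is no more negative than the
    threshold in BOTH hosts and the dominant residues differ.
    Positions are reported 1-based.
    """
    signatures = []
    columns = zip(zip(*avian_seqs), zip(*human_seqs))
    for idx, (avian_col, human_col) in enumerate(columns):
        if entropy(avian_col) < threshold or entropy(human_col) < threshold:
            continue  # too variable in at least one host
        avian_major = Counter(avian_col).most_common(1)[0][0]
        human_major = Counter(human_col).most_common(1)[0][0]
        if avian_major != human_major:
            signatures.append((idx + 1, avian_major, human_major))
    return signatures

# Toy 3-column alignment (hypothetical, not real influenza data).
# Column 2 mimics PB2 627: 83 E + 12 K in avian, all K in human.
avian = ["AEM"] * 83 + ["AKM"] * 12
human = ["AKM"] * 200 + ["AKL"] * 106

print(find_signatures(avian, human))  # [(2, 'E', 'K')]
```

Column 1 is conserved but identical in both hosts, and column 3 is too variable among the human sequences, so only column 2 is reported.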
The complete results of genome scanning and validation can be found in online Appendix Table 3 (http://www.cdc.gov/ncidod/EID/vol12no09/06-0276_appT3.htm) and online Appendix Table 4 (http://www.cdc.gov/ncidod/EID/vol12no09/06-0276_appT4.htm).

Amino Acid Signatures in Human Viruses
We examined how the amino acid sequences varied at those proposed signature positions for avian influenza viruses isolated from humans. At 9 of these 52 positions, residue changes were characteristic of human rather than avian viruses (Table 2). For example, 34 sequences (27 H5N1, 3 H9N2, and 4 H7N7) were available for inspection at position 199 of PB2 (data not shown). Aside from 10 sequences with gaps (sequences did not cover this position), 19 of the remaining 24 still have Ala, which is typical for avian viruses. Five of them (all H5N1), on the other hand, have this residue changed to Ser, which is mostly seen in human viruses. At the well-known position 627 of PB2, 5 sequences had gaps, 22 (32).
To understand how mutations had accumulated within a specific virus, we summarized the amino acid changes for 21 of these avian viruses that contained full or nearly full-length sequences for each segment (Table 3). We found that 19 of 21 strains contained ≥1 species-associated amino acid change, and 7 of them contained ≥2 substitutions; A/Netherlands/219/2003(H7N7) had the highest count for mutation accumulation (3 positions). Among these 52 species-associated signatures, the mutation combinations at positions PB2 199 and PA 409 were most commonly seen in H5N1 human isolates from Hong Kong in 1997.
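Tallying such changes for a single isolate can be sketched in Python as shown here. The lookup table lists only the 2 positions whose avian and human residues are stated in the text (PB2 199 and PB2 627); the function name, the data layout, and the example isolate are hypothetical.

```python
# (protein, position) -> (typical avian residue, typical human residue).
# Only positions whose residues are stated in the text are listed here.
SIGNATURES = {
    ("PB2", 199): ("A", "S"),
    ("PB2", 627): ("E", "K"),
}

def human_like_positions(isolate_residues, signatures=SIGNATURES):
    """Return signature positions at which an isolate carries the human-type residue.

    `isolate_residues` maps (protein, position) to the observed residue.
    Positions absent from the mapping (for example, sequence gaps) are skipped.
    """
    hits = []
    for key, (_avian, human) in signatures.items():
        if isolate_residues.get(key) == human:
            hits.append(key)
    return hits

# Hypothetical avian isolate from a human patient: Ser at PB2 199, Glu at PB2 627.
isolate = {("PB2", 199): "S", ("PB2", 627): "E"}
print(human_like_positions(isolate))  # [('PB2', 199)]
```

Counting the length of the returned list per strain gives the kind of per-isolate summary reported in Table 3.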

RNA Segment 5
Our observation that NP contained the highest number (15 of 52) of species-associated amino acids suggested that NP might serve as a molecular target for differentiation between human and avian influenza A viruses. To indicate such host specificity, or the "genetic boundary" between these 2 viruses at the nucleotide level, we performed a pairwise sequence comparison for all 11 ORFs on our 401-genome primary dataset and produced histograms of their computed pairwise identities. In online Appendix Figure 2 (http://www.cdc.gov/ncidod/EID/vol12no09/06-0276_appG2.htm), pairs with 2 sequences of the same host species (human to human, or avian to avian; termed homopairs) and pairs for sequences that cross host species (human to avian, or avian to human; termed heteropairs) are shown. HA and NA genes exhibited considerable sequence differences between strains, with identities as low as 47%. Also noted was a wide spectrum of percent identities (e.g., 55%-95% on the horizontal axis) containing few sequence pairs for these 2 genes. For both of these proteins, some strains from the same species can have identities as low as 50%. However, the ORF of another surface protein, M2 ion channel protein, is relatively conserved (>74% identity for viruses across species). The histograms for the polymerase genes (PB2, PB1, and PA), NP, and M1, on the other hand, are much less varied (mostly <20% variation). In particular, the NP gene was found to exhibit a fairly clear boundary between homopairs and heteropairs, at ≈86%.
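The homopair versus heteropair comparison can be sketched in Python under simplifying assumptions. The study computed identities with the EMBOSS Needle program; the sketch below instead assumes the sequences are already aligned and counts matching columns, and all names and the toy sequences are hypothetical.

```python
from itertools import combinations

def percent_identity(seq_a, seq_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)

def pair_identities(records):
    """Split all pairwise identities into homopairs and heteropairs.

    `records` is a list of (host, aligned_sequence) tuples.
    """
    homopairs, heteropairs = [], []
    for (host_a, seq_a), (host_b, seq_b) in combinations(records, 2):
        identity = percent_identity(seq_a, seq_b)
        (homopairs if host_a == host_b else heteropairs).append(identity)
    return homopairs, heteropairs

# Hypothetical 10-nt toy sequences, not real influenza data.
records = [
    ("human", "ATGGCGTCTC"),
    ("human", "ATGGCGTCTA"),
    ("avian", "ATGGCATCAC"),
    ("avian", "ATGGCATCAC"),
]
homo, hetero = pair_identities(records)
print(sorted(homo))    # [90.0, 100.0]
print(sorted(hetero))  # [70.0, 70.0, 80.0, 80.0]

# A crude boundary check: every heteropair is less similar than every homopair.
print(max(hetero) < min(homo))  # True
```

Binning the two lists into histograms and looking for a gap between them corresponds to the boundary the text reports for NP; for genes such as HA and NA, the two distributions would be expected to overlap.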

Discussion
The glutamic acid residue at PB2 627, which is commonly seen in avian viruses, restricts viral growth in humans and monkeys, but a change to lysine restores virus replication in mammalian cells (33). In this study we computed for every amino acid position (distributed in the 11 known influenza viral ORFs) an entropy value that represents how conserved an amino acid residue is at that given position. We found the entropy value -0.379 at 627 of PB2 and therefore used -0.4 as a threshold to discover other amino acid residues that might be potential determinants of host-cell tropism. Another 51 positions were found to be distinct or nearly distinct between human and avian viruses by this entropy threshold. Most of these (40 of 52) are located in viral ribonucleoproteins (RNPs) (PB2, PB1, PA, and NP), which are essential for viral replication. Taubenberger et al. reported 10 amino acid residues that distinguish human and avian influenza viral polymerases (3). Six of them were also identified in this study. The entropy values of the 4 missing ones were also found close to the preset threshold (-0.4). For example, PB2 567 showed a human entropy of -0.039 and avian entropy of -0.490, PB1 375 showed a human entropy of -0.165 and avian entropy of -0.693, and PA 100 showed a human entropy of -0.061 and avian entropy of -0.406. All 3 positions were eliminated early, at the stage of analyzing the 401-genome primary dataset. The fourth position, PB2 702, although in the first-round list, marginally failed in the subsequent validation with human entropy -0.057 and avian entropy -0.404.
We proposed a computational approach capable of indicating species-associated signatures in studying human versus avian influenza viral genomes. Although we intended to analyze a comprehensive set of avian versus human influenza A viral genomes, the available sequences are dominated by H5N1 in avian viruses and H3N2 in human viruses. The short supply of sequences other than those 2 subtypes may inevitably cause a certain amount of bias in our results. At the completion of this study, we noticed a recent article by Obenauer et al., who had made 169 newly sequenced avian influenza viral genomes available to GenBank on January 26, 2006 (34); these were not included in our analysis. We checked our 52 signature positions against these new genomes and found only 2 that showed an entropy value slightly beyond (more negative than) our threshold of -0.4. These are PB1-F2 87 and HA 237, with entropy values of -0.522 and -0.692, respectively. The choice of entropy threshold would also affect the number of signatures found. Originally we chose -0.4 on the basis of the value -0.379, computed from PB2 627 by using 95 avian genomes. We noticed that this entropy value shifted to -0.299 at PB2 627 (see online Appendix Table 4) at the later validation stage, when we found 197 E and 19 K from a total of 215 avian PB2 sequences. If we chose to use a more stringent entropy threshold of -0.3, our analysis still showed 46 of those 52 reported signatures; missing were positions 73, 79, and 82 from PB1-F2, 409 from PA, and 237 and 389 from HA.
In addition to the data limitations, this approach of looking for species-associated signatures by entropy is less useful for HA and NA genes. The genetic diversity that exists in either human or avian viruses for these 2 gene segments can markedly boost their respective entropy to more negative values, thus making it difficult to find residues conserved enough for identifying such signatures.
We additionally performed the analysis on human H1, H2, and H3 versus avian HA (online Appendix Figure 1). For NA we performed the analysis on human N1 and N2 versus avian NA. We compared 10 human H1, 3 human H2, and 293 human H3 with 95 avian HA sequences and found 13, 13, and 69 signatures (with entropy values for both human and avian within -0.4), respectively. This finding indicates that the human H1 and H2 strains are less distinct from avian strains (H5 dominant) than H3. For NA we found only 6 signatures when we compared 8 human N1 with 95 avian (N1-dominant) sequences, and only 5 signatures when we compared 298 human N2 with 95 avian sequences. Entropy plots for these analyses can be seen in online Appendix Figure 1.
Two genetic alleles (allele A and B) have been described for the NS gene in avian influenza A virus. We decomposed those 95 avian NS genes into 43 in allele A and 52 in allele B and compared their amino acid sequences with 306 human NS genes. For NS1, 6 signatures were found between human viruses and avian allele A viruses, and 35 signatures were found between human viruses and avian allele B viruses. For NS2, 3 signatures were found between human viruses and allele A viruses, and 6 signatures were found between human viruses and allele B viruses. These results suggest that avian allele B viruses are more distinct from human viruses than are allele A viruses. Entropy plots and histograms for these analyses can be seen in online Appendix Figure 1 and online Appendix Figure 3 (http://www.cdc.gov/ncidod/EID/vol12no09/06-0276_appG3.htm).
From the histograms, we found that some of the 11 genes vary greatly between human and avian viruses, while some others vary little. No boundaries were found between homopairs and heteropairs for HA, NA, and PB1 for human versus avian viruses. This finding seems reasonable because the 2 recent pandemic strains, the 1957 H2N2 and the 1968 H3N2, both originated from reassortment with avian influenza viruses (HA, NA, and PB1 gene segments were from avian influenza). On the other hand, because histograms of NP, followed by PA and PB2, may be used to distinguish human influenza viruses from avian influenza viruses, perhaps some biologic constraints against the occurrence of reassortment exist for these 3 genes. Both the M and NS genes are less differentiable between these 2 types of influenza A viruses.
NP not only displays a clear boundary between human and avian viruses from histogram analysis but also contains more species-associated amino acid signatures (15 of 52) than other ORFs. In addition to NP, polymerase proteins PB2, PB1, and PA also contain abundant species-associated signatures. Most signatures in these viral RNPs are located on the functional domains related to RNP-RNP interactions that are necessary to form replicase/transcriptase complex (3P and NP), which suggests that specific combinations of polymerase complex and NP would allow an influenza virus to replicate itself efficiently (Table 1). In addition to RNA-interacting domains, many species-associated amino acid signatures of 3P and NP are located in regions related to nuclear localization signals. Influenza viral replication is highly dependent on nuclear function (35), making it worthwhile to further examine the roles of those amino acid signatures on nuclear localization of viral RNP in avian versus human cells. We also noticed that several amino acid signatures in NP are located in the regions that interact with cellular proteins, such as splicing factor (BAT1/UAP56) or MxA, which plays a certain role in cellular antiviral mechanisms. What species-specific host factors may affect influenza viral replication rates is not clear. Biologic experiments are required for further understanding the roles of those amino acid residues and related functional domains in the mechanism of interspecies infection.
PB1-F2 is a novel influenza viral protein translated from alternative initiation of PB1 gene. PB1-F2 of PR8 (H1N1) has been shown to target mitochondria and then trigger host cell apoptosis (36). Our previous research has found that several strains contain truncated PB1-F2 (37). In this study, 379 of 401 PB1 sequences (in the primary dataset) contained PB1-F2 ≥87 and ≤90 aa. For the other 22 sequences, 2 H3N2 strains missed a start codon, 3 H3N2 had the translation stopped at 11 aa, 1 H9N2 stopped at 8 aa, 5 H1N1 stopped at 57 aa, and 3 H9N2 and 7 H3N2 stopped at 79 aa. One H5N1 contained extra residues; its PB1-F2 was 101 aa. We also noted 5 species-associated signatures on PB1-F2; all of them are within the C-terminal domain, which is important for mitochondria targeting (15,16). Further investigation of the mitochondria localization of those PB1-F2 variants and their abilities for triggering apoptosis in cells derived from different species is warranted.
How many mutations would make an avian virus capable of infecting humans efficiently, or how many mutations would render an influenza virus a pandemic strain, is difficult to predict. We have examined sequences from the 1918 strain, which is the only pandemic influenza virus that could be entirely derived from avian strains. Of the 52 species-associated positions, 16 have residues typical for human strains; the others remained as avian signatures. The result supports the hypothesis that the 1918 pandemic virus is more closely related to the avian influenza A virus than are other human influenza viruses (2). From the 21 avian viruses isolated from humans in this study, we found 19 (90.5%) that contain ≥1 change at the species-associated sites. Upon examining signature changes from similarly sized sets of randomly selected human viruses, randomly selected avian viruses, and randomly selected viruses (avian plus human), we found that 29.4%, 71.4%, and 47.1%, respectively, contain species-associated mutations.
Although predicting the emergence of a pandemic strain is difficult, close monitoring of how those species-associated signature positions have changed from bird-specific to human-specific signatures may provide a measurement for the prediction of such events.

Dr Chen is an assistant professor at the Department of Computer Science and Information Engineering, Chang Gung University. His research interests include viral bioinformatics, biological sequence analysis, data mining, and software development.