Emergence and Genetic Variation of Neuraminidase Stalk Deletions in Avian Influenza Viruses

When avian influenza viruses (AIVs) are transmitted from their reservoir hosts (wild waterfowl and shorebirds) to domestic bird species, they undergo genetic changes that have been linked to higher virulence and broader host range. Common genetic AIV modifications in viral proteins of poultry isolates are deletions in the stalk region of the neuraminidase (NA) and additions of glycosylation sites on the hemagglutinin (HA). Even though these NA deletion mutations occur in several AIV subtypes, they have not been analyzed comprehensively. In this study, 4,920 NA nucleotide sequences, 5,596 HA nucleotide and 4,702 HA amino acid sequences were analyzed to elucidate the widespread emergence of NA stalk deletions in gallinaceous hosts, the genetic polymorphism of the deletion patterns and association between the stalk deletions in NA and amino acid variants in HA. Forty-seven different NA stalk deletion patterns were identified in six NA subtypes, N1–N3 and N5–N7. An analysis that controlled for phylogenetic dependence due to shared ancestry showed that NA stalk deletions are statistically correlated with gallinaceous hosts and certain amino acid features on the HA protein. Those HA features included five glycosylation sites, one insertion and one deletion. The correlations between NA stalk deletions and HA features are HA-NA-subtype-specific. Our results demonstrate that stalk deletions in the NA proteins of AIV are relatively common. Understanding the NA stalk deletion and related HA features may be important for vaccine and drug development and could be useful in establishing effective early detection and warning systems for the poultry industry.


Introduction
Wild aquatic birds, such as Anseriformes (ducks and swans) and Charadriiformes (gulls and shorebirds) are reservoir hosts for avian influenza viruses (AIV) [1,2,3]. However, AIVs can cause outbreaks in poultry [4,5,6,7,8]. In some instances, AIV strains from poultry hosts have increased pathogenicity for poultry species [9,10], and have acquired an ability to infect mammalian hosts [11,12,13], and/or have caused fatal infections in humans [14,15]. Therefore, to minimize adverse effects in humans and poultry from AIV infections, it is important to understand what evolutionary changes occur in AIVs when they are transmitted from wild birds to poultry. Understanding these evolutionary changes can lead to better detection, prevention and control strategies.
AIVs interact with their hosts mostly through two glycoproteins, Hemagglutinin (HA) and Neuraminidase (NA). HA recognizes receptors on target cells and NA, a sialidase, assists virus entry and release [16,17]. One observation that has been reported in viruses isolated during separate poultry outbreaks is a deletion in the stalk region of the NA [18,19,20,21]. The stalk is a structure that separates the enzymatically and antigenically active ''head'' from the hydrophobic domain embedded in the viral membrane [22,23,24]. Little is known about the biological function of the NA stalk. Previous studies have shown that deletions in the NA stalk region influenced the virus' replication efficiency in vivo, increased its host range, reduced its NA enzymatic activity, and in some cases increased the virus' virulence [13,25,26]. Stalk deleted NAs (referred to as SDNA hereafter) were reported sporadically in some AIV subtypes, e.g. H5N1, H6N1, H7N1, H7N3 and H9N2 [8,9,27,28,29]. SDNA are often accompanied by observations of variants on the HA protein, such as the addition of glycosylation sites, presumably to maintain functional balance between HA and NA which is necessary for viral infectivity [30,31,32]. These HA variants could further influence viral antigenicity, virulence and pathogenicity [32,33].
Although the SDNA has been identified in different subtypes of poultry isolates, it is unclear whether there is a general correlation of SDNAs with species in the order of Galliformes across different NA subtypes as claimed in previous publications [8,9]. Galliformes (gallinaceous hosts) is an order of birds that includes important domestic and game birds, such as chickens, turkeys, pheasants, and quails. The aim of this study is to provide a broad understanding of the emergence of SDNA through a comprehensive analysis of NA sequences. Our analysis showed SDNA prevalence by subtypes, host and time period and demonstrated -with some exceptions -a general correlation between SDNA and gallinaceous hosts. In addition, we analyzed the statistical correlation between SDNA and variants on HA proteins.
The site of deletion and the frequency of each missing amino acid vary among NA subtypes. For each NA subtype, the most commonly dispensed positions are from 55 to 70 among all SDNA subtypes except N5 ( Figure S1).

Distribution of NA stalk deletions
The phylogenetic trees of NA nucleotides show that deletion patterns usually group into monophyletic clades that contain sequences belonging to a single deletion pattern (Figures 2, S2, S3, S4, S5 for N1, N2, N3 and N7 genes). Exceptions to monophyletic deletions are instances in which the same deletion pattern arose multiple times independently (deletion patterns 9, 12, 22, and 29, Figures 2, S2 and S3), or patterns that are nested within a larger clade with a different deletion pattern (patterns 6, 7 and 10 within pattern 5, and pattern 17 within pattern 14). The patterns in the latter exception were most likely derived from existing patterns with additional deletions since they have the same deleted amino acid positions as those in the clade they are nested in (Figures 1, S2 and S3).
Most (38/47, 79%) of the SDNA patterns were associated with a single HA subtype and a few patterns combined with multiple HA subtypes (9/47, 19%) (Table S1). Most mutants were limited to small geographic areas and a few patterns were found in isolates from multiple locations (10/47, 21%) ( Table S1). The majority of SDNAs (67%) persisted for less than a year ( Figure 3). However, some patterns (4, 5, 14, 19, and 22) existed for many years ( Figure 3,   Table S1). Pattern 22 persisted for 25 years and appeared in both Asian and North American isolates (Figure 3 and S3).

Gallinaceous hosts and NA stalk deletions
The percentage of SDNA virus was higher among gallinaceous (1,355/1,955, 69%) than non-gallinaceous hosts (893/2,424, 36%), mainly in the orders Charadriiformes and Anseriformes. A high percentage of SDNA mutants was observed among non-gallinaceous isolates in N1 deletions in the 2000s (804/1,086, 74%) due to the large number (,800) of HPAIV H5N1 samples from infected wild birds. Apart from the HPAI H5N1 viruses, there were 97 SDNA isolates from non-gallinaceous hosts. Among them, only two SDNAs of the non-gallinaceous isolates had no obvious connection to domestic poultry. One was an H5N3 virus from a migrating mallard in Japan [45] and the other an H6N5 virus from a shearwater in Australia [46]. Most other SDNAs of nongallinaceous isolates were either from domestic birds or linked to poultry for the following reasons: (i) 33 were isolated from farm ducks [8,47,48,49], (ii) ten were isolated from birds in live bird markets [50,51,52], (iii) 18 were isolated from hosts species that were not indigenous to the geographic region of the isolate, therefore, domestic, and (iv) the remaining 34 clustered within gallinaceous isolates in the phylogenetic tree.
Our analysis of the joint transition rates between all combinations of host types and NA states showed that estimated transition rates generally created a positive correlation between gallinaceous hosts and SDNA. The rates leading to character combinations supporting a positive correlation (the combinations of gallinaceous/SDNA and non-gallinaceous/full-length-NA) were subtracted from the rates leading to opposing character combinations (non-gallinaceous/ SDNA and gallinaceous/full-length-NA). A posterior distribution for this rate difference (DQ) was estimated. From this posterior distribution, a posterior probability that the rate difference exceeds zero was calculated ( Table 2). A positive rate difference indicates that the positive correlation between gallinaceous hosts and SDNA is due to uneven transition rates. For subclade 1 of N1, subclade 1 of N2 and N3, the posterior probabilities for a positive rate difference were greater than 0.95 (Table 2). However, this is not true for the following two groups, subclade 2 in N1 and subclade 2 in N2. Subclade 2 of N2, composed of N2 genes of H9N2 isolates of poultry outbreaks in China, includes 89% viruses from gallinaceous hosts but contains only 31% SDNA mutants ( Figure S3). The reverse is true for the HPAIV H5N1 subclade that contains 53% viruses from non-gallinaceous hosts yet has 99% SDNA sequences ( Figure S2).
Models with a zero transition rate from deleted to full-length NA in Galliformes have high posterior probabilities among N1 and N2 sequences. This result implies that restoration of full-length-NA from SDNA is unlikely for N1 and N2 in Galliformes (Table 2). These posterior probabilities are generally low in non-Galliformes (Table 2).

Geographic area and NA stalk deletions
We tested for a correlation between SDNA and Asia for N1, N2 and N3. In subclade 1 of N1 and in both N2 subclades, there is a significant positive correlation between SDNA and an isolate being from Asia (Table 3). In subclade 2 of N1, the correlation is negative and no correlation was found for N3 (Table 3).

HA modifications and NA stalk deletions
To understand the potential impact of SDNA on other viral properties, such as antigenicity, we tested whether any amino acid residues of HA protein features, such as glycosylation sites, are statistically correlated with the SDNA genotype. Seven HA features were positively associated with SDNA as indicated by the posterior probability of a positive rate difference (i.e. a rate difference that generates a positive correlation) exceeding 0.95. The features include five glycosylation sites, one deletion and one insertion that are identified in H5, H6 and H7 ( Figure 4). All of these associations are NA-subtype-specific because no HA variant is significantly associated with SDNA in more than one NA subtype ( Figure 4, Table 4). The distribution of the seven features on the HA phylogenies is shown in the supplemental material ( Figures S6, S7, S8 for H5-H7).

Discussion
In this study, we conducted a comprehensive analysis of the polymorphic AIV NA stalk regions using a large set of sequences of natural isolates (as opposed to laboratory-adapted isolates). Forty-seven different NA stalk deletion (SDNA) patterns were identified in six NA subtypes, N1-N3 and N5-N7. An analysis that controlled for phylogenetic dependence due to shared ancestry showed SDNA to be positively correlated with gallinaceous hosts and with some amino acid features on the HA. The analysis of the HA features estimated rates at which SDNA were gained and lost on the HA tree depending on the presence of an HA feature. This analysis concerned only the overall rate of transitions and did not differentiate between transitions on the NA due to reassortment or due to mutation. It was therefore not necessary to estimate the reassortment rate between HA and SDNA which is unknown for the analyzed dataset. Five glycosylation sites, one insertion and one deletion on the HA were identified to be statistically SDNA-associated and all of these associations were specific to HA-NA subtype combinations. Our results further suggested that SDNA mutations essentially cannot be restored to full-length in gallinaceous hosts. One challenge for our analysis was the unevenness of the sequence data due to varying sampling schemes. By fitting transition rates based on phylogenies, we could account for the uneven relatedness and the tendency of shared characteristics among closely related sequences. This was especially important for HPAIVs that are the source of a large number of very closely related sequences. However, when no SDNAs were observed in non-gallinaceous hosts, the fitted transition rates could not distinguish between a restoration to a full-length NA stalk and an extinction of SDNA mutants. Despite the limitation of our analysis, the findings shed light on the origin, spread and persistence of SDNA as well as the functional balance between HA and NA [53]. The results of our analysis that controlled for phylogenetic dependence due to shared NA ancestry suggest that the positive correlation between SDNA and gallinaceous hosts is caused by host-dependent transition rates between full-length-NA and SDNA. Previous studies have indicated that naturally-occurring SDNA mutants tend to appear in gallinaceous hosts but did not  demonstrate a general correlation between host and SDNA [8,9,28]. Laboratory studies showed that SDNA mutations arose from duck-origin viruses after these viruses were passaged through gallinaceous hosts [13,19,54]. Our results provide further evidence that the processes demonstrated in laboratory studies generated the patterns observed in naturally-occurring viruses. The emergence of SDNA in AIV infected gallinaceous hosts seems to be a general phenomenon that is widely detected in several NA subtypes. However, the correlation between gallinaceous hosts and SDNA mutations is not ubiquitous. In the N2 subclade 2, only 31% of the   H9N2 isolates from poultry outbreaks were SDNA mutants (Tables 2 and S1, Figure S3). As for the N1 subclade of HPAIV H5N1, almost all non-gallinaceous isolates had SDNA ( Figure S2). Besides the HPAIV H5N1 N1 subclade, there were 97 SDNA sequences of other subtypes from non-gallinaceous isolates. Hence, SDNA mutants appear to be generated in Galliformes but can be transmitted to other hosts. Whether the SDNA of other subtypes can persist at low prevalence in non-gallinaceous birds is unknown. The high diversity of deletion patterns found in several NA subtypes suggests that these deletions convey a general advantage for the viruses in gallinaceous hosts that does not depend on the particular NA amino acid sequence. The fact that the same deletion pattern arose more than once in several instances indicates that the number of possible deletion patterns that confer this advantage is limited. Our results showed a positive correlation between Asian locations and SDNAs in three instances (both subclades of N2 and subclade 1 of N1). This correlation could be due to different farm practices in Asia (lower compliance with biosecurity measures and more mixed-species farms) and/or AIV sampling schemes biased towards poultry species in Asia. Among HP H5N1, there is a negative correlation between Asian locations and SDNAs because all HP H5N1 that spread from Asia to other regions carried SDNAs whereas within Asia two forms of NA (SDNA and full-length-NA) of HP H5N1 co-existed ( Figure S2).
This study also uncovered amino acid features on HA proteins that are statistically correlated with SDNA. Previous studies demonstrated that shortened NA stalk length reduces NA activity [25] and that an efficient virus replication requires a matching reduction of HA activity [32,55]. It has been shown for H1N1 and H7N1 that glycosylation sites on the HA globular head structure reduce HA affinity for receptor binding and make the virus less dependent on NA function [32,56,57,58]. Viruses with these additional HA glycosylation sites replicated efficiently when combined with SDNA and were less sensitive to NA inhibiting drugs [32,33,56,57,58]. In H7N1, the glycosylation site that conferred these effects was Asn149 [32,33,57] (or 133 according to the H3 numbering [59]), the same glycosylation site that is significantly correlated with SDNA on N1 according to our results ( Figure 4, Table 4). Our analysis showed other putative glycosylation sites in HA of other subtypes, as well as insertion and deletion that are statistically correlated with SDNA ( Figure 4, Table 4). Given the requirement of functional balance between HA and NA [32,56,57,58], we speculate that the HA features identified by our methods could also reduce the HA affinity for host cell receptors and therefore, the virus' sensitivity to NA inhibitors. The fact that SDNAs are widely associated with isolates of gallinaceous hosts while each HA feature occurs only with a single NA subtype could be an evidence that SDNAs are adaptations to gallinaceous hosts whereas the accompanying HA features are secondary adaptations to SDNAs. In theory, some of the correlations between HA features and SDNAs could be the by-product of two pair-wise correlations with a third feature on an internal gene. Investigating such effects would require techniques to analyze multiple correlations while controlling for phylogenetic dependencies, which will be topics for future studies.
In summary, our results showed that SDNAs are widely observed in AIVs. They are repeatedly associated with poultry outbreaks but are also found in non-poultry hosts. SDNA mutants should be of special concern for the poultry industry since they could imply an adaptation of a virus to gallinaceous hosts. Previous researchers suggested that these viruses bear the risk of a pandemic [60,61,62]. AIV with SDNA and SDNA-associated HA features might be less sensitive to NA inhibiting drugs or reduce the efficacy of vaccines developed using a similar virus with full-length NA. Therefore, we believe it is important to closely monitor the emergence of SDNA mutants in poultry and other species and prevent extended AIV circulation in poultry.

Sequence retrieval
AIV NA and HA nucleotide sequences and HA amino acid sequences were retrieved on December 17, 2009 from public influenza database (http://www.flu.lanl.gov/) using keywords type A and avian host. Nucleotide sequences of all lengths were retrieved while amino acid sequences were restricted to full length. NA nucleotide sequences were excluded from the analysis if the stalk region was not fully sequenced, i.e. if the first sequenced nucleotide position was after position 90 or the last sequenced position was before 270, counting adenine (A) in the start codon (ATG) as position 1. Furthermore, HA and NA sequences were excluded if they were : Difference of all rates that lead to the character combinations feature present/SDNA and feature absent/full-length-NA minus all rates that lead the character combinations feature absent/SDNA and feature present/full-length-NA (refer to Methods). b : Blanks in HA sequences that are associated with SDNA are designated as deletions and blanks in HA sequences that are associated with full-length-NA are designated as insertions in HA sequences associated with SDNA. c : NA deletion pattern IDs that are associated with the HA feature that has a higher proportion among AIVs with SDNA than among AIVs with full-length-NA. d : There is no exact corresponding position on the H3 sequence. This position is on an insertion between position 159 and 160 on the H3 sequence. doi:10.1371/journal.pone.0014722.t004 acquired from viruses that were manipulated in laboratories (e.g. being expanded through laboratory animal or modified for the purpose of vaccine development), or if sequences were duplicates of an earlier submission of the same gene of the same strain, or erroneous as identified in a previous publication [63], or from nonavian isolates. A total of 4,920 NA, 5,596 HA nucleotide and 4,702 HA amino acid sequences were included in this study.

Sequence alignment and identification of stalk deletions
All qualified nucleotide sequences of the same NA subtype were aligned using Clustal W (BioEdit v7.0.5, Ibis Therapeutics, Carlsbad, CA). All nucleotide sequences were translated into amino acid sequences within each subtype alignment using the same program. Deletion patterns in the stalk region (positions 30 to 90 for amino acids and 90 to 270 for nucleotides) were adjusted manually at both nucleotide and amino acid levels by moving nucleotides or amino acids between both positions that flank the deleted region. Nucleotides were adjusted to avoid stretches of missing nucleotides being flanked by incomplete codons. After translation, amino acids flanking a deletion were assigned to the flanking positions that best matched the consensus sequence. If permitted by the consensus sequences, deletion patterns were further adjusted to minimize the numbers of distinct deletion patterns. Deletion patterns were ordered by NA subtype and by size of the deleted region within each NA subtype. An identification (ID) number was assigned to each deletion pattern by numbering the ordered deletion patterns consecutively.
Phylogenetic trees were constructed for each NA subtype from aligned NA nucleotide sequences using weighted neighbor-joining (function bionj in the R-package ape) [64]. Trees were rooted with corresponding gene segments of the 1918 H1N1 human virus. The tree branches were labeled with deletion patterns. Due to their sizes, all trees are included in the supplemental material ( Figures  S2, S3, S4, S5). The first and last observation of each separate emerging SDNA was recorded. Each different SDNA pattern and distinct clades of the same SDNA pattern were counted as separate emerging SDNA. Two clades of the same SDNA pattern were counted as distinct if they were separated by non-deleted sequences and had an estimated distance of more than 1% nucleotide changes between their nodes.
Identification of HA features correlated with NA stalk deletion HA nucleotide and amino acid sequences were aligned using MUSCLE with two alignment iterations [65,66]. Aligned HA amino acid sequences, excluding positions within cleavage site, were screened for three possible features, namely putative glycosylation sites (NXT/S, where X could be any amino acid residues except Proline) [28], deletions or insertions. A script was written in R [67] that identifies the positions of all three features in all HA sequences (Script S1). HA and NA sequences were linked by strain names. HA and NA sequences with matching location, host species, serial number and year of sampling were assumed to be from the same isolate. The H3 numbering system was used for all HA subtypes as described in a previous study [59].

Estimation of correlations between two characters
We tested for correlations of SDNA with host, with HA features and with geographic region while controlling for phylogenetic dependence due to shared NA or HA ancestry. The associations between NA stalk state and host or region were analyzed based on NA trees. The association between NA stalk state and HA features was analyzed based on HA trees. Characters of interest (i.e. state of NA stalk, host type, presence of HA feature and geographic region) were coded as binary characters and estimating the transition rates among the four possible combinations of two binary characters based on phylogenetic trees using the software package BayesTraits [68,69] (Figure 5). Character A represents NA stalk state (with or without stalk deletion) and character B can be either host (gallinaceous or non-gallinaceous), HA amino acid feature (present or absent) or region (Asia or non-Asia). Independent character evolution would imply that the transition rates between two states of one character do not depend on the state of the other character.
For a given association between character A and B, there are two supporting and two opposing character state combinations described in Figure 5. For example, an association between gallinaceous hosts and SDNAs is supported by the character combinations Galliformes/SDNA and non-Galliformes/full-length-NA and opposed by non-Galliformes/SDNA and Galliformes/fulllength-NA. The difference between all rates leading to the two supporting character state combinations minus all rates leading to the opposing character state combinations, DQ, was calculated as a measure for whether the character correlation is generated by interdependent transition rates ( Figure 5).
Using a Bayesian framework, posterior distributions of the transition rates were estimated given a distribution of phylogenetic trees. The distributions of phylogenetic HA and NA trees were estimated by fitting a general time reversible (GTR) model with gamma distributed rates to nucleotide sequences using MrBayes 3.1.2 [70]. The Markov chains for the trees were run for 3 Mio time steps with a burn-in period of 250,000 and trees were sampled every 2,000 steps leading to 1375 sampled trees per chain. Two independent chains were run for each tree and it was verified that the chains converged to the same tree by comparing the topological distances between trees within and between the chains. Each BayesTraits analysis sampled from a total of 2750 trees per subtype.
Analysis using large trees can lead to log-likelihood values of the transition model below the smallest number that the BayesTraits code can represent. To avoid such low log-likelihood values, sequences were divided into groups and a separate parameter estimation was performed per group. NA sequences were grouped into NA subtypes and N1 and N2 were divided into two subclades within each subtype. N1 sequences were distinguished into the subclade that contains all HPAIV H5N1 NAs or all other N1 sequences. N2 sequences were separated into those from isolates of H9N2 outbreaks in China and those from other isolates. For HA sequences, trees were constructed for each HA-NA subtype, e.g. a separate tree was constructed for HA sequences from H5N1 viruses and from H5N2 viruses. HA sequences were grouped in this fashion since HA trees ( Figures S6,S7, S8) suggested that the association between HA features and SDNA depended on NA subtype. The N1 and H5 trees of the HPAIV H5N1 clade were further reduced by randomly selecting a single sequence from all clades whose maximum distance from its basal node was less than 0.01. Representative sequences belonging to closely related branches were retained by this trimming method.
A reversible-jump Monte Carlo Markov Chain was used to simultaneously estimate the posterior distributions of transition rates and the posterior probabilities of constraints imposed on the transition rates [69]. A posterior distribution of DQ was calculated. A range containing 95% of DQ values and a posterior probability that DQ exceeds zero was determined from the posterior distribution. A high posterior probability of DQ exceeding zero was interpreted as evidence that observed positive correlations between characters were due to correlated transition rates and not simply due to shared ancestry.
The correlation between SDNA and HA features were analyzed based on HA trees. HA trees were chosen because in that case SDNA can be acquired or lost by mutation and/or reassortment. Hence the NA states are a more labile character on the HA tree than on the NA tree. Larger volatility of the NA state is more likely to reveal functional correlations if they are present. A single rate was fitted for each transition, without distinguishing whether transitions were due to mutations or reassortment.  Figure S2 Neighbor-joining tree of N1 sequences. Branch colors indicate host types, pink for Galliformes, yellow for Anseriformes, green for Charadriiformes and grey for birds from other orders. Genes of isolates from gallinaceous hosts are shown by tree branches with extended grey lines. The first column from the left shows a black dash for each isolate with SDNA. The second column shows the deletion pattern ID. Sequences with the same deletion pattern are denoted by square brackets numbered with pattern ID. Deletion patterns that are nested within larger deletion pattern clades are indicated by deletion pattern IDs on the left side of the brackets. The third column indicates the HA subtype of each isolate. The fourth column shows the continent each isolated is from. The two N1 subclades are shown by brackets in the last column. The box indicates the part of the tree that is shown in Figure 2.  Author Contributions