Evolutionary dynamics of human and avian metapneumoviruses

Received 29 August 2008 Accepted 19 September 2008 Human (HMPV) and avian (AMPV) metapneumoviruses are closely related viruses that cause respiratory tract illnesses in humans and birds, respectively. Although HMPV was first discovered in 2001, retrospective studies have shown that HMPV has been circulating in humans for at least 50 years. AMPV was first isolated in the 1970s, and can be classified into four subgroups, A–D. AMPV subgroup C is more closely related to HMPV than to any other AMPV subgroup, suggesting that HMPV has emerged from AMPV-C upon zoonosis. Presently, at least four genetic lineages of HMPV circulate in human populations – A1, A2, B1 and B2 – of which lineages A and B are antigenically distinct. We used a Bayesian Markov Chain Monte Carlo (MCMC) framework to determine the evolutionary and epidemiological dynamics of HMPV and AMPV-C. The rates of nucleotide substitution, relative genetic diversity and time to the most recent common ancestor (TMRCA) were estimated using large sets of sequences of the nucleoprotein, the fusion protein and attachment protein genes. The sampled genetic diversity of HMPV was found to have arisen within the past 119–133 years, with consistent results across all three genes, while the TMRCA for HMPV and AMPV-C was estimated to have existed around 200 years ago. The relative genetic diversity observed in the four HMPV lineages was low, most likely reflecting continual population bottlenecks, with only limited evidence for positive selection.


INTRODUCTION
Human metapneumovirus (HMPV) is a respiratory pathogen that was first described in 2001 (van den Hoogen et al., 2001). HMPV causes respiratory illnesses ranging from mild upper respiratory symptoms to severe lower respiratory tract disease (van den Hoogen et al., 2001(van den Hoogen et al., , 2003Williams et al., 2004). HMPV is an enveloped, non-segmented, negative-strand RNA virus and was classified as a member of the family Paramyxoviridae, the subfamily Pneumovirinae, the genus Metapneumovirus. HPMV seroprevalence in the human population reaches 100 % by the age of five (van den Hoogen et al., 2001). Experimental HMPV infections in macaques have been shown to induce transient protection , which is in agreement with relatively frequent occurrences of reinfection in humans (Ebihara et al., 2004). HMPV can be divided in two main genetic lineages (A and B), each consisting of at least two sublineages -A1, A2, B1 and B2 (van den Hoogen et al., 2004). Annual circulation of all four lineages has been observed worldwide, with reports of a predominating strain changing on a yearly basis in some studies, but not in others (Galiano et al., 2006;Sloots et al., 2006;van den Hoogen et al., 2004).
The only other member of the genus Metapneumovirus is avian metapneumovirus (AMPV). AMPV has been found to infect domestic poultry worldwide, causing acute respiratory infections (Cook, 2000b). AMPVs have been classified into four subgroups, A through to D (Bayon-Auboyer et al., 1999;Eterradossi et al., 1995;Juhasz & Easton, 1994;Seal, 1998). AMPV subgroups A and B were first detected in South Africa in 1978 and were subsequently found in Europe, Israel and Asia (Cook, 2000a). AMPV subgroup D was first detected in France (Bayon-Auboyer et al., 1999). Finally, AMPV subgroup C was first detected in the USA in 1996 and is more closely related to HMPV than to any other AMPV subgroup (Govindarajan & Samal, 2004Govindarajan et al., 2004;Toquin et al., 2003 The genome organizations of HMPV and AMPV are similar: from the 39 to 59 ends the genomes encode nucleoprotein (N), phosphoprotein (P), matrix protein (M), fusion protein (F), M2-1 and M2-2, small hydrophobic protein (SH), attachment protein (G) and the large polymerase protein (L). Overall, HMPV and AMPV-C share 80 % amino acid identity, with N being most conserved protein and SH and G the most variable proteins (Biacchesi et al., 2003;van den Hoogen et al., 2001van den Hoogen et al., , 2002. The viral RNA is encapsidated by the N protein, which, together with the P and L proteins, forms the ribonucleoprotein complex required for genome replication. With analogy to respiratory syncytial virus (RSV), M2-1 is a transcriptional elongation factor that enhances synthesis of readthrough mRNAs (Collins et al., 1996;Fearns & Collins, 1999;Hardy & Wertz, 1998), although for HMPV its function is not completely understood as M2-1 is not essential for virus viability Herfst et al., 2004). HMPV has three surface glycoproteins: SH, G and F. The F protein is synthesized as an inactive precursor (F 0 ) that is cleaved by host proteases into the functional subunits F 1 and F 2 . The F protein of HMPV is highly immunogenic and protective, whereas G and SH are not Skiadopoulos et al., 2006).
The close genetic relationship between HMPV and AMPV-C is not reflected in host range. To date, no AMPV-C infections have been reported in humans and it has been shown that HMPV does not replicate in turkeys (van den Hoogen et al., 2001). However, using both chimeric viruses and minireplicon systems it has been shown that the ribonucleoprotein complex proteins of HMPV and AMPV-C are functionally interchangeable (de Graaf et al., 2008;Govindarajan et al., 2006;Pham et al., 2005). Spill-over events of pathogens from birds to humans are not uncommon (Reed et al., 2003). Since HMPV is most closely related to AMPV-C, it is plausible that an AMPV-Clike virus was the ancestor of HMPV. Retrospective analysis of serum samples collected from humans in 1958 revealed that HMPV has been widespread in the human population for at least 50 years. Thus, any cross-species transmission event must have taken place before 1958 (van den Hoogen et al., 2001). Herein, we test this hypothesis using sequence data from the N, G and F genes of HMPV and AMPV-C collected over a 25-year-sampling period. In addition, to understand the evolutionary dynamics of HMPV in more detail we estimated rates of nucleotide substitution, times to common ancestry and relative genetic diversity of all HMPV lineages using a Bayesian coalescent approach, as well as gene-and site-specific selection pressures.
Estimating evolutionary dynamics. Overall rates of evolutionary change (nucleotide substitutions per site, per year), the time to the most recent common ancestor (TMRCA) in years and relative genetic diversity (expressed as N e t, where N e is the effective population size and t the generation), were estimated using the BEAST program (http://beast.bio.ed.ac.uk/). This program employs a Bayesian Markov Chain Monte Carlo (MCMC) approach, utilizing the number and temporal distribution of genetic differences among viruses sampled at different times (Drummond et al., 2005(Drummond et al., , 2006Drummond & Rambaut, 2007). To estimate the substitution rates with as much accuracy as possible, we compared a variety of models of demographical history, namely constant population size, exponential population growth, and Bayesian skyline reconstruction, and with both strict (constant) and relaxed (variable) molecular clocks, and using a variety of prior values for the substitution rate. As similar rate/ date estimates were obtained under all demographical models, we only report those from the Bayesian skyline model as this employs the least constrained coalescent prior. Similarly, a preliminary analysis of a subset of the data collected here (data not shown) revealed that the uncorrelated lognormal relaxed clock model gave the best estimation, although rate/date estimates were again highly congruent across all clock models, suggesting that they are robust. Because the sequences analysed here were very closely related, sampled over a relatively narrow time-span and exhibited little multiple substitution at single nucleotide sites, we used the simple HKY85 model of nucleotide substitution in each case as more complex models sometimes failed to converge (data not shown). All MCMC chains were run for a sufficient length of time to ensure convergence (allowing a 10 % burn-in) and uncertainty in parameter estimates is reported as values of the 95 % highest probability density (HPD). Finally, the MCMC samples were summarized to infer the maximum clade credibility (MCC) trees of each dataset using TreeAnnotator v1.4.7 (Drummond & Rambaut, 2007), with posterior probability values shown for the major branches. The MCC trees were depicted with estimated branch lengths in time (years).
Analysis of selection pressures. We estimated overall (i.e. genespecific) and site-specific pressures in all datasets using the Datamonkey web interface of the HY-PHY package (www.datamonkey.org) (Pond & Frost, 2005). This method quantifies selection pressures as the ratio of non-synonymous (d N ) to synonymous substitutions (d S ) per site, using the random effects likelihood (REL) method for datasets smaller than 40 sequences, the fixed effects likelihood (FEL) method for the datasets up to 100 sequences and the single likelihood ancestor counting (SLAC) for the datasets containing more than 100 sequences and the mean d N /d S (v) values. These were run with three different nucleotide substitution models -HKY85, TrN93 and REV -although for most alignments the TrN93 model was the best-fitting model of nucleotide substitution.
For an additional analysis of selection pressures we used the CODEML program (Yang et al., 2000) to estimate the maximum-likelihood v value for all branches of each phylogeny (the 'one-ratio' model), and compared this to v values estimated separately for the external (v e ) Journal of General Virology 89 and internal (v i ) branches using the 'two-ratio' model. Input phylogenetic trees for this analysis were estimated under the general time reversible (GTR)+I+C 4 substitution model, using PAUP* (Swofford, 2003), and employing tree bisection-reconnection (TBR) branch-swapping.

Standing genetic diversity
The demographical history of the HMPV dataset was inferred using the Bayesian skyline plot (Fig. 2). Notably, the relative genetic diversity (N e t) in all HMPV lineages was very low, ranging from only 3.3 to 96.8. In addition, the Bayesian skyline plot also showed that there was an increase and subsequent decrease in genetic diversity between~1996 and~2001 (although the 95 % HPD interval is wide), a pattern that was seen for all lineages and in both the G and F genes.

Selection pressures
Our analysis of selection pressures through computation of the ratio of non-synonymous (d N ) to synonymous (d S ) substitutions per site d N /d S (v) revealed limited positive selection and abundant negative selection for the G (mean v50.524) and F proteins (mean v50.075) ( Table 1). It should be noted that for the HMPV set the G sequences were shortened to ensure proper alignment. Neither positive nor negative selection could be detected for the N protein (mean v50.038). Overall, this analysis revealed that the mean v is roughly 10-fold higher for the G than for the F and N genes, compatible with the localized action of immune selection on the former. Positively selected sites in the F protein were only found for the A1 (aa 34) and A2 (aa 34 and 50) sets. Both sites are located in the extracellular domain of F. Positively selected sites for the G protein were found for the A1 (aa 82, 125, 156 and 158), A2 (aa 146 and 151), B2 (aa 85) and HMPV (aa, 85, 86, 93, 111, 112, 113, 140 and 142) sets and were all located in the extracellular domain. By using the NetOglyc (version 3.1) program it was predicted that all but one (aa 86) of the positively selected sites found in the G protein corresponded to potential O-linked glycosylated sites in one or more strains of that dataset (Hansen et al., 1998).
To obtain a more precise picture of selection pressures we also estimated v separately for the internal (v i ) and external (v e ) branches of each phylogeny using the CODEML program (Table 2). In case of the F and N genes the v values were higher on the tips compared with the internal branches of the trees, resulting in an increased v e /v i value, and strongly suggesting that most of the non-synonymous mutations in these proteins are transiently deleterious (so that they are purged by purifying selection before they can reach high frequencies). However, values of both v i and v e values were high in the case of the G gene, resulting in a low v e /v i value, and again suggestive of immune selection.

DISCUSSION
In the present study we explored the evolutionary dynamics of HMPV and AMPV through analyses of a large set of HMPV and AMPV-C G, F and N gene sequences. For HMPV, our results were highly consistent across all three datasets, indicating that the admittedly sparse sampling did Evolutionary dynamics of metapneumoviruses Evolutionary dynamics of metapneumoviruses not introduce major biases. In particular, both the highly variable G sequences and the less variable F and N sequences gave similar demographical signatures and times to common ancestry, although the nucleotide substitution rates were higher for the G gene. The results for the G gene were comparable to those described previously (Padhi & Verghese, 2008). Overall, evolutionary rates were high and of the level expected for RNA viruses (Hanada et al., 2004;Jenkins et al., 2002). In addition, the genetic diversity found within the different sublineages of HMPV was of a very recent origin, and HMPV as a whole was characterized by low levels of standing genetic variation (as reflected in estimates of N e t). Such limited genetic diversity, which is also characteristic of human influenza A virus (Rambaut et al., 2008), as well as other paramyxoviruses (Pomeroy et al., 2008) is most likely due to repeated (annual) population bottlenecks, which periodically purge genetic variation.
Overall, the Bayesian skyline plots for HMPV suggest that the population dynamics of this virus are rather complex, with both increases and decreases in population size, and again expected for a virus where there are distinct peaks and troughs in prevalence. The general trend for all four lineages was an increase and subsequent decrease in population size between~1996 and~2001. Unfortunately the resolution of these Bayesian skyline plots was not sufficient to allow detailed analyses of the population dynamics in each viral lineage, or their precise cause. However, for the future it will be interesting to determine whether the increases and decreases in genetic diversity are in-phase or out-of-phase, as demonstrated in some other RNA viruses (Adams et al., 2006). At present it is not possible to compare these demographics to epidemiological studies published, as none of the studies cover a comparable period of time and include data of viruses isolated from patients and healthy individuals, including children as well as adults. Our analysis also provides new insights into the timescale of HMPV evolution. Although HMPV had already been shown to be widely spread in the human population in 1958, our coalescent analysis reveals that the TMRCA of HMPV existed approximately 70 years before this, so that HMPV was circulating in human populations long before its detection. Since HMPV and AMPV-C are highly similar in terms of protein composition and sequence identity it has been speculated that HMPV is the result of a cross-species transmission event from birds to humans (van den Hoogen et al., 2001). Further, since there are a wider variety of AMPV subgroups compared with HMPV subtypes it seems most likely that the direction of transmission is from birds to humans. As the TMRCA for HMPV and AMPV-C was calculated to be around 200 years ago, we suggest that the cross-species transmission event from birds to humans occurred at about this time. Despite its capacity to cause human disease, we found few positively selected sites for the F and G genes and none for the N gene of HMPV. Indeed, negatively selected sites were abundant for both the F and G genes. The positively selected sites found for the F protein were located in the extracellular domain of the F1 chain in the cysteine-rich region. Although the location of the antigenic sites of the F protein of HMPV have yet to be identified, some neutralizing monoclonal antibodies specific for the F protein have been described previously (Ulbrandt et al., 2006). The epitopes of these monoclonal antibodies do not correspond to the location of the positively selected sites described here. For RSV, most antigenic sites are located in the cysteine-rich region (Lopez et al., 1998). However, alignment of HMPV and RSV F sequences showed that the positively selected sites did not correspond to any of these antigenic sites. All positively selected sites found for the G protein were located in the extracellular domain, and all but one corresponded to potential O-linked glycosylated sites in one or more HMPV strain. For RSV, there is a strong association between the positively selected sites and neutralizing epitopes (Zlateva et al., 2004). However, the G protein of HMPV is only weakly immunogenic and protective in animal models in contrast to RSV (Olmsted et al., 1986;Skiadopoulos et al., 2006). The mean v was approximately 10-fold higher for the G gene than for the N and F genes. In addition, separate estimates of the v ratio on internal versus external branches revealed a major difference between the G, F and N genes, with a far greater v e in the former protein. Hence, the N and F genes are evidently more constrained than the G gene, so that mutations in these genes are more likely to be deleterious and result in an elevation in the number of (young) nonsynonymous mutations towards the tips of the trees. In contrast, a greater proportion of the mutations occurring in the G gene are likely to be either neutral or positively selected (although this latter process is difficult to detect using the analytical methods employed here), resulting in a higher level of non-synonymous diversity and lower v e /v i values. Similar results have been observed in the G gene of bovine respiratory syncytial virus (Pybus et al., 2007;Zwickl, 2006).
In conclusion, we were able to elucidate the evolutionary history and dynamics of metapneumoviruses by using dated genomic AMPV and HMPV sequences. Metapneumoviruses have high substitution rates like most RNA viruses due to high mutation rates, short generation times and large population sizes. Although such high mutation rates are not necessarily adaptive, and most often result in deleterious mutations, as reflected in the purifying selection on the F and N genes, they may occasionally facilitate viral emergence in novel hosts. Genetic heterogeneity of G and F protein genes from Argentinean human metapneumovirus strains. J Med Virol 78, 631-637. Govindarajan, D. & Samal, S. K. (2004). Sequence analysis of the large polymerase (L) protein of the US strain of avian metapneumovirus indicates a close resemblance to that of the human metapneumovirus. Virus Res 105, 59-66. Govindarajan, D. & Samal, S. K. (2005). Analysis of the complete genome sequence of avian metapneumovirus subgroup C indicates that it possesses the longest genome among metapneumoviruses. Virus Genes 30, 331-333.