Molecular Evolution of Classic Human Astrovirus, as Revealed by the Analysis of the Capsid Protein Gene.

Classic human astroviruses (HAstV) are major global viral agents for gastroenteritis, but the molecular characteristics of classic HAstVs are not well understood. Here, we presented the molecular evolution of all classic HAstV serotypes by the analysis of the capsid protein sequences. Our results show that classic HAstVs can be divided into four groups with the most recent common ancestor (TMRCA) of 749. The overall evolutionary rate of classic HAstVs on the capsid gene was 4.509 × 10−4 substitutions/site/year, and most of the serotypes present a clock-like evolution with an amino acid accumulation of mutations over time. The mean effective population size of classic HAstVs is in a downward trend, and some positive and more than 500 negative selection sites were determined. Taken together, these results reveal that classic HAstVs evolve at the intra-serotype level with high genetic heterogeneity and are driven by strong purifying selection. Long-term surveillance of classic HAstVs are needed to enrich the genomic data for further analysis.


Introduction
Astroviruses are non-enveloped, positive sense, single-stranded RNA viruses [1]. Their genome is 6.8-7.9 kb in length and consists of three open reading frames (ORFs), designated ORF1a, ORF1b, and ORF2. ORF1a and ORF1b, at the 5 end of the genome, encode nonstructural proteins, including the RNA-dependent RNA polymerase (RdRp), while ORF2, at the 3 end, encodes the capsid protein precursor [2]. Astroviruses are classified into the genera Mamastrovirus and Avastrovirus [3] and can infect various hosts from birds to mammals, including humans [4]. Human astroviruses (HAstVs) were first recognized in children stool samples with diarrhea in 1975 [5]. Since then, HAstVs have been the well-established viral agents of gastroenteritis globally, and the number of astrovirus-related publications increase steadily [6].
In recent years, two novel astroviruses clades, namely, Melbourne (MLB) and Virginia/Human-Mink-Ovine-like (VA/HMO), emerged and have been detected in stool samples from humans with gastroenteritis worldwide [7][8][9]. However, the association between novel HAstVs and gastroenteritis is to be confirmed. The overall detection rate of novel HAstVs in stool was lower [10], and classic HAstVs (classified into eight serotypes: HAstV-1 to HAstV-8) are still the second or third most common viral agents responsible for gastroenteritis in young children [11]. Serum antibodies to at least one classic HAstV serotype were detectable in 90% of children by the age of 5 years [6]. In the US, approximately 10% of reported acute gastroenteritis outbreaks in childcare centers were caused by classic HAstVs [12].
The global disease burden of classic HAstVs is high. However, owing to the unavailablity of cell culture systems and robust small animal models, classic HAstVs are among the least studied enteric RNA viruses [6]. Happily, with the development of bioinformatics technologies, molecular analysis using the sequences from public databases has been successfully conducted in various viruses, like norovirus, rotavirus and influenza virus [13][14][15]. It is becoming valuable to elucidate the molecular characteristics of viruses from genomic data for epidemic prediction and management. But the molecular analysis of classic HAstVs is inadequate. Here, in order to gain a better understanding of the molecular characteristics of classic HAstVs, we presented a comprehensive description of the molecular evolution of classic HAstV serotypes by analyzing the complete capsid gene for all strains.

Dataset
In order to obtain classic HAstV ORF2 sequences, we searched the corresponding taxonomy ID of HAstV (Taxonomy ID: 1868658) in NCBI's GenBank Database. Interrogation of the database was terminated on March 2019, and the isolated time and location for each strain were retrieved from the GenBank database or the associated publications. The complete or nearly complete sequences (nucleotide position 4328 to 6691 according to HAstV-1/Oxford-1/1993/UK, GenBank accession No. L23513) were selected to analyze the characteristics of molecular evolution.

Genetic Diversity Analysis
Multiple alignment was performed by ClustalW as implemented in MEGA v7.0.26 [16], and the recombination event was determined by the Recombination Detection Program (RDP) v4.56 with p-value < 0.05 in 3 or more methods [17]. The nucleotide and amino acid identities were calculated using BioEdit 7.1.3.0 [18]. The inter-serotype mean amino acid distance was estimated based on Poisson model by MEGA v7.0.26 [16].

Root-to-Tip Divergence Analysis
The root-to-tip divergence was calculated by TempEst v1.5 [19] based on the inferred maximum likelihood (ML) tree which was constructed using MEGA v7.0.26 [16]. The best-fitting root option was selected to ensure the best correlation of the root-to-tip divergence. Then the root-to-tip divergence against the isolation year of each sequence was plotted and visualized to evaluate the evolutionary clock-like nature of classic HAstVs.

Accumulation Pattern of Amino Acid Substitutions
In order to visualize the accumulation of amino acid substitutions over time, we calculated the pairwise amino acid difference by MEGA 7.0.26 [16] among sequences with the same serotype. Then, the mean amino acid difference was calculated with the same time-span of isolation for each serotype. Finally, the mean amino acid difference and the time-span of isolation was plotted and visualized to evaluate whether amino acid substitution accumulated over time. The fitting line was also estimated to discuss the possible linear accumulation and the accumulative rate.

Evolutionary Analysis
We used the Bayesian Markov Chain Monte Carlo (MCMC) method in BEAST package v1.8.3 to estimate the time-scale maximum clade credibility (MCC) tree and evolutionary rate (nucleotide substitutions/site/year) of all complete classic HAstV ORF2 sequences [20]. Briefly, the best-fit nucleotide substitution model was estimated by IQ-TREE web server on the basis of corrected Akaike's Information Criterion (AICc) score [21]. Three clock models (strict clock, uncorrelated lognormal relaxed clock, and uncorrelated exponential relaxed clock) and Bayesian skyline coalescent tree were selected and compared by Akaike's Information Criterion through MCMC (AICM) [22] using Tracer v1.6 (http://tree.bio.ed.ac.uk/software/tracer/), and the model with the lowest AICM value was used. The convergence of parameters was evaluated using Tracer v1.6 and an effective sample size value >200 was considered acceptable. The MCC tree was obtained after 10% burn-in using TreeAnnotator v1.8.2 and visualized by FigTree v1.4 [23]. The Bayesian skyline plot (BSP) of all complete classic HAstV ORF2 sequences was constructed using Tracer 1.6. Furthermore, according to the aforementioned methods, the evolutionary rate and Bayesian skyline plots of HAstV-1, -3, -4 and -5, which have more than 10 sequences, were also estimated.

Selection Pressure Analysis
The selection pressure on the capsid gene of classic HAstVs was evaluated by estimating the nonsynonymous (dN) and synonymous (dS) substitutions ratio (dN/dS) using the Datamonkey server [24]. The site under positive selection (dN>dS) was determined with a p-value threshold of 0.1 using the single-likelihood ancestor counting (SLAC), fixed effects likelihood (FEL) and mixed effects model of evolution (MEME) methods. The site under negative selection (dN < dS) was determined with a p-value threshold of 0.1 using the SLAC and FEL methods.

Genetic Diversity of Classic HAstV ORF2 Sequences
After excluding the possible recombination sequences determined by RDP 4.5.6, a total of 116 complete (or nearly complete, 2316-2340 bp) capsid sequences of classic HAstVs were retrieved from the GenBank database and analyzed for molecular evolution in this study (Supplemental Table  S2). These sequences were isolated from 1971 to 2015, and all serotypes had a range of collection years more than 15. The nucleotide and amino acid similarity of the sequences ranged from 58.8 to 100% and 57.5 to 100%, respectively (Table 1). HAstV-2 and -4 had a higher amino acid distance when compared with other types. The minimum inter-serotype mean amino acid distance was between HAstV-3 and HAstV-7 (0.167, Table 1), and the maximum inter-serotype mean amino acid distance was between HAstV-4 and HAstV-7 (0.431, Table 1).

Root-to-Tip Divergence Analysis
To investigate the evolutionary clock-like nature of classic HAstVs on the ORF2 region, the root-to-tip divergence plots were conducted based on the inferred ML trees. The results showed that classic HAstVs evolved with a poor clock-like signal with a coefficient of determination (R 2 ) value of 0.109 (Figure 1a). The root-to-tip divergence for each serotype was also analyzed except for HAstV-7, which has only three sequences and could not allow us to infer an ML tree by bootstrapping with 1000 times. The plots (Figure 1b-h) revealed that classic HAstVs presented a linear evolution at the intra-serotype level. HAstV-2, -3 and -5 presented a stronger clock-like evolution, with R 2 values of 0.994, 0.851 and 0.850, respectively. Nevertheless, HAstV-1, -4, -6 and -8 presented a moderate clock-like pattern with R 2 values of 0.678, 0.444, 0.587 and 0.613, respectively.

Accumulation Pattern of Amino Acid Substitutions
We developed an algorithm to evaluate the relationship between amino acid diversity and time-span among strains from a given serotype (HAstV-7 was also not analyzed in this part because the number of sequences was relatively small). Firstly, the pairwise amino acid differences were calculated and averaged according to the time-span of isolation for each serotype. Afterwards, the algorithm generated a diagram in which the mean amino acid difference plotted against the timespan of isolation. Our results showed that the mean pairwise amino acid difference of the capsid protein from HAstV-6 and HAstV-8 were not influenced by the time-span of isolation (Figure 2f,g). Nevertheless, the mean amino acid difference of the capsid protein from other serotypes accumulated continually over time (Figure 2a-e), and HAstV-1, -2, -4 and -5 sequences presented moderate linear accumulation (R 2 : 0.408-0.793). In addition, the amino acid substitutions of HAstV-2 accumulated faster than other serotypes (Slope = 2.050), and HAstV-5 owned the slowest accumulative rate (Slope = 0.495).

Time-Scale Phylogenetic Tree
The time-scale phylogenetic tree of the complete ORF2 sequences of classic HAstVs was constructed using the Bayesian MCMC method (Figure 3), which was in a balanced branching pattern and showed that classic HAstVs can be divided into four groups. Group I only contained one serotype (HAstV-1), and the other groups contained more than two serotypes (Group II: HAstV-

Phylodynamics of Classic HAstVs Strains
The Bayesian skyline coalescent model was selected as the tree in this study to evaluate the changes in the effective population size of classic HAstVs on the ORF2 gene. On the whole, the effective population sizes of classic HAstVs have actually fallen and descended rapidly around 2000 (Figure 4a). For each serotype, the mean effective population sizes of HAstV-1 remained unstable and presented drastic changes in the past 50 years, which began to grow around 1975 and reached the peak around 1985. After that, it decreased slowly and presented a sharp fall from 2005 to 2012. Subsequently, it began growing again (Figure 4b), whereas the mean effective population sizes of HAstV-3 and -5 were in a slow decline (Figure 4c,e). The mean effective population size of HAstV-4 decreased slightly around 1975 and began to increase from 1985 to 1995, then remained constant (Figure 4d).

Phylodynamics of Classic HAstVs Strains
The Bayesian skyline coalescent model was selected as the tree in this study to evaluate the changes in the effective population size of classic HAstVs on the ORF2 gene. On the whole, the effective population sizes of classic HAstVs have actually fallen and descended rapidly around 2000 (Figure 4a). For each serotype, the mean effective population sizes of HAstV-1 remained unstable and presented drastic changes in the past 50 years, which began to grow around 1975 and reached the peak around 1985. After that, it decreased slowly and presented a sharp fall from 2005 to 2012. Subsequently, it began growing again (Figure 4b), whereas the mean effective population sizes of HAstV-3 and -5 were in a slow decline (Figure 4c,e). The mean effective population size of HAstV-4 decreased slightly around 1975 and began to increase from 1985 to 1995, then remained constant (Figure 4d).

Selective Pressure Analysis
To investigate the selective pressure on each site in the capsid gene of classic HAstVs, we calculated the ratio of nonsynonymous to synonymous substitution. The mean dN/dS value was 0.142, and the SLAC and FEL methods both recognized more than 500 negative selected sites.

Discussion
Classic HAstVs are the leading viral agents for gastroenteritis. However, the molecular evolution of classic HAstVs has not been discussed in detail. A complete sequence can provide us with more detailed information about the molecular evolution. In this study, a large number of the capsid protein sequences (complete or nearly complete) of classic HAstVs in the GenBank database were retrieved and analyzed to obtain a picture of their molecular characteristics.
A root-to-tip divergence analysis was conducted in this study, which is an approach to explore the clock-like manner of the evolution [25,26]. It has been reported that norovirus non-GII.4 genotypes presented a linear evolution at the intra-variant level, and the emergence of a novel norovirus GII.17 variant in 2014-2015 was speculated to be caused by this evolution pattern [26]. Classic HAstVs can also be classified into many variants for each serotype [27]. Nevertheless, our results show that classic HAstVs present a linear evolution at the intra-serotype level, indicating that each serotype of classic HAstVs evolves as a whole. Recombination and mutation are the main factors determining the molecular evolution of RNA viruses [28,29]. Recombination events of classic HAstVs were often identified at the ORF1b-ORF2 junction, which can contribute to the acquisition of a novel polymerase and change the evolution of the capsid protein [30][31][32][33]. The co-evolution of classic HAstVs at the intra-serotype level may also suggest that polymerase types have a minimal impact on the long-term clock-like evolution of the capsid protein of classic HAstVs, just like non-GII.4 norovirus [26].
Previous research by Parra et al. identified two different evolutionary patterns of norovirus: evolving and static. The evolving genotype (represented by the GII.4) presents amino acid accumulation of mutations over time, whereas static genotypes do not and have low prevalence due to their highly conserved and possible genetic fragility [34]. Our data show that several serotypes of classic HAstVs including HAstV-1 present an evolving pattern and these serotypes almost have relatively higher prevalence when compared with others [35]. This indicates again that the evolutionary patterns may be a signal to evaluate the relative prevalence of genotypes (or serotypes) in some viruses. In this study, the results of TMRCA for each serotype reveal that these serotypes persisted and co-circulated for a long time in humans. The time-scale MCC tree of the capsid protein gene of classic HAstVs results in four separated groups and serotypes in each group may have a common ancestor. Nevertheless, phylogenetic analysis based on ORF1a gene showed that classic HAstVs were only divided into two groups [36,37]. These may indicate that different ORFs of classic HAstVs evolve independently. Furthermore, skewed or ladder-like phylogenetic topology means the repeated occurrences of punctuated immune escape and there is a temporal replacement of predominant variants driven by the immune response of the host, such as norovirus GII.4, influenza H3N2 viruses [38]. In this study, the time-scale MCC tree of classic HAstVs presents as well balanced and lacks significant temporal structures at the variant level for each serotype, which reiterates that classic HAstVs may evolve at the intra-serotype level.
The nucleotide evolutionary rate of positive-strand RNA viruses can range from 10 −9 to 10 −2 substitutions/site/year determined by their genome and replication strategies [39][40][41]. A previous study has reported the evolutionary rate of classic HAstVs was approximately 3.7 × 10 −3 substitutions/site/year based on the genome fragments without recombination breakpoints [42], which was much higher than our estimate (4.509 × 10 −4 substitutions/site/year). But a limited number of sequences (16 sequences belonging to five serotypes) were included in that study, and we believe that our estimate result is more accurate. In addition, our results reveal that classic HAstVs evolve relatively slower when compared with other gastroenteritis viruses, such as norovirus GI and GII which evolve with similar evolutionary rates of about 10 −3 substations/sites/year at the capsid level [43,44]. This may partially explain the prevalent discrepancy between norovirus and classic HAstVs, since substitution rates are thought to be associated with the rates of inter-host transmission [39]. To examine the change in the effective population sizes in classic HAstVs, BSP analyses were performed in this study. BSP is the most widely applied demographic inference method, using standard MCMC sampling procedures to predict the relative effective population size over time directly from a sample of gene sequences, where changes in effective population size reflect a change in genetic diversity and help illustrate a demographical history [45,46]. We found that the effective population size of HAstV-1 was growing in the past few years. The effective population size of HAstV-4 was stable after it reached the peak around 1995. Such BSP data may predict that the prevalence of HAstV-1 and HAstV-4 will be still relatively higher in the future.
Next, selection pressure analysis was performed. Selection pressure analysis allows identifying putative sites with positive selection for immune escape. In this study, the discrepancy of detected sites under positive selection by three methods are attributed to the algorithmic models [47,48], and the sites prone to positive selection are located at the C-terminal half of capsid protein, which is similar to a previous report that included at least one sequence from each serotype [49]. This position is deemed to comprise the outer surface of the viral capsid and these positively selected sites may be associated with the residues exposed to the immune pressure or involved in receptor recognition [49,50]. Furthermore, the mean dN/dS value was relatively low, and a large number of negatively selected sites were recognized, revealing that positive selection at the codon level is not the dominant mechanism driving diversity and classic HAstVs are driven by strong purifying selection.
In summary, although the number of sequences analyzed in this study is limited, and selection bias or the origin of sequences (i.e. immunocompromised vs. immunocompetent), which was not provided in each analyzed sequence in the GenBank database, may affect the accuracy of the data, our results still provide valuable information. We found that HAstV-1 and HAstV-4 were the two most predominant serotypes in the GenBank database. Classic HAstVs can be divided into four groups, and most of the serotypes evolve as a whole with a higher evolutionary rate and amino acid accumulation of mutations over time. The mean effective population size of classic HAstVs is in a downward trend, and purifying selection is a dominant force among the capsid genes of classic HAstVs. This study also highlights that genomic sequences from public databases can be analyzed to increase the understanding of the molecular evolution of viruses, and molecular epidemiological study should be enhanced, especially in developing countries, to enrich the genomic databases.
Supplementary Materials: The following are available online at http://www.mdpi.com/1999-4915/11/8/707/s1, Table S1: All partial and complete sequences of classic HAstV ORF2 gene used in this study. Table S2: Complete sequences of classic HAstV ORF2 gene used in the study. Table S3

Conflicts of Interest:
The authors declare no conflict of interest.