Automated, phylogeny-based genotype delimitation of the Hepatitis Viruses HBV and HCV

Background. The classification of hepatitis viruses still relies predominantly on ad hoc criteria, i.e., phenotypic traits and arbitrary genetic distance thresholds. Given the subjectivity of such practices, coupled with the constant sequencing of samples and discovery of new strains, this manual approach to virus classification becomes cumbersome and impossible to generalize.

Methods. Using two well-studied hepatitis virus datasets, HBV and HCV, we assess whether computational methods for molecular species delimitation that are typically applied to barcoding biodiversity studies can also be successfully deployed for hepatitis virus classification. For comparison, we also used ABGD, a tool that, in contrast to other distance methods, attempts to automatically identify the barcoding gap using pairwise genetic distances for a set of aligned input sequences.

Results and Discussion. We found that the mPTP species delimitation tool, even with its default parameters, identified taxonomic clusters that correspond either to the currently acknowledged genotypes or to known subdivisions of genotypes (subtypes or subgenotypes). In the cases where a delimited cluster corresponded to a subtype or subgenotype, concerns had previously been raised that its status may be underestimated. The clusters obtained from the ABGD analysis differed depending on the parameters used. However, under certain values the results were very similar to both the taxonomy and the mPTP clusters, which indicates the usefulness of distance-based methods in virus taxonomy under appropriate parameter settings. The overlap of predicted clusters with taxonomically acknowledged genotypes implies that virus classification can be successfully automated.

Introduction

The continuous advances in next generation sequencing technologies make the production of genome and metabarcoding data increasingly easy and inexpensive. The wealth of available data has triggered the development of new models of molecular evolution, algorithms, and software that aim to improve molecular sequence analyses in terms of biological realism, computational efficiency, or a trade-off between the two. In response to such technological and technical advancements, several fields of biology have undergone a substantial transformation. [...] Batovska et al., 2017). Among others, the development of novel species delimitation tools has substantially advanced the study of the biodiversity of microorganisms that are often hard to isolate and study (Taberlet et al., 2012; Gibson et al., 2014; Thomsen & Willerslev, 2015). The sequencing of environmental samples in conjunction with algorithms for genetic clustering has led to the identification of a plethora of previously unknown organisms and a re-assessment of the microbial biodiversity in several settings.

In a similar context, genetic data have been a rich source of information for viral species. Several studies show how phylogenetic information can be deployed for identifying the spatial and temporal origin of a virus, potential factors that trigger its dispersal, and other key epidemiological parameters (Stadler et al., 2012a; Stadler et al., 2014b; Gire et al., 2014). In an era of high human mobility, such methods are important, as the increase of emerging and re-emerging epidemics is even more prominent than in the past (Balcan et al., 2009; Meloni et al., 2011; Pybus et al., 2015). Nevertheless, phylogenetic information is still not used in the context of virus species classification or identification. As we have witnessed for other microorganisms, using or adapting already available methods for fast and automated delimitation or identification of virus species can greatly contribute to a better understanding of their evolution.

To date, the official taxonomy of viruses (ICTV, i.e., International Committee on Taxonomy of Viruses) has mainly been based on established biological classification criteria as [...]

[...] sampling, except the split of the 3k subtype from its sister group (Fig. 1), which may be due to the limited number of corresponding sequences.

The number of clusters inferred with ABGD ranged from 1 to 208 depending on the value of the maximum intraspecific divergence threshold (Fig. 2). The most reasonable result (i.e., the one closest to the current standard taxonomy) comprised 19 clusters and was obtained for a prior intraspecific divergence of 5.99% (i.e., p=0.0599). Under this threshold, the delimitation is largely identical to the delimitation obtained with mPTP (Fig. 1), with three differences: i) genotype 3 was split into four clusters instead of three, ii) genotype 6 was divided into nine clusters instead of eight, and iii) genotype 7 was divided into two clusters. When the prior intraspecific divergence was increased to 10%, all sequences were grouped in a single cluster. When the threshold was set to a lower value (3.6%), the number of clusters increased to 135 (Fig. 2). Nevertheless, the delimitation with the 5.99% threshold is largely congruent with the current taxonomy and the clusters obtained with mPTP, thus indicating the usefulness of distance-based methods in virus taxonomy under well-informed parameters.
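The dependence of the cluster count on the divergence threshold can be illustrated with a minimal sketch. This is not ABGD, which additionally searches for the barcode gap and partitions recursively; it is plain single-linkage clustering on uncorrected p-distances, and the function names and toy sequences are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences,
    skipping columns where either sequence has a gap or an N."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 0.0


def threshold_clusters(seqs: dict, threshold: float) -> list:
    """Single-linkage clusters: two sequences share a cluster if a chain of
    pairwise distances, each <= threshold, connects them (union-find)."""
    parent = {name: name for name in seqs}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy alignment (assumed data, not from the HCV or HBV datasets).
toy = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",  # 1 difference from s1
    "s3": "TTGTACGGAC",  # 3 differences from s1
    "s4": "TTGTACGGAA",  # 1 difference from s3
}

for t in (0.05, 0.15, 0.5):
    print(t, len(threshold_clusters(toy, t)))
```

In this toy case a strict threshold leaves every sequence in its own cluster, an intermediate one recovers two groups, and a permissive one merges everything, which mirrors the qualitative behaviour reported for the 3.6%, 5.99%, and 10% settings.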

The classification of HCV into genotypes and subtypes has so far been defined mostly by visual identification of clades in phylogenetic inferences of HCV sequences (Simmonds et al., 2005; [...]). Specifically, the genotypes correspond to the seven major, highly supported phylogenetic HCV clusters, while subtypes were defined as the secondary hierarchical clusters found within each genotype. This classification scheme has been widely adopted (Combet et al., 2007; Yusim et al., 2015) and has been shown to be robust (in terms of stability of the HCV phylogeny) and relevant for clinical practice, since response rates to immunomodulatory treatment for chronic hepatitis C differ across genotypes. [...] identification of the HCV genotypes could be of clinical importance in providing the appropriate medical treatment (Strader et al., 2004; Ge et al., 2009).

HBV clustering
In the case of HBV, the mPTP clustering is almost identical to the current classification of the virus (Norder et al., 2004; Kramvis et al., 2007), which comprises eight genotypes, except for subgenotype C4, which formed a new cluster (Fig. 3, Suppl. Figs. 3 and 4, Suppl. Appendix). This is in line with the greater genetic divergence of C4 compared to the other subgenotypes due to its ancient origin in native populations in Oceania (Paraskevis et al., 2013). However, the split of C4 from its sister cluster (genotype C) is not supported by the MCMC sampling, potentially reflecting the lack of adequate sampling. On the other hand, the number of clusters identified by ABGD varied from 1 to 85 under different prior intraspecific divergence thresholds, while the delimitation for a threshold of 1.29% exactly matched the eight genotypes of the HBV classification (Figs. 2 and 3). Both ABGD and mPTP identified seven of the genotypes (A-F) as distinct genetic clusters.
The only difference was that mPTP split genotype C into two distinct clusters (Fig. 3), i.e., subtype C4 was recovered as a distinct cluster from the remaining seven subtypes of genotype C.

The application of mPTP to the HCV and HBV data sets shows that automated viral species delimitation using phylogeny-aware methods yields clusters that largely agree with the current standard taxonomy. The additional clusters identified for HCV by mPTP are not surprising, as they have previously been considered divergent sub-clusters within genotypes 3 and 6. Analogously, for HBV, mPTP yielded results almost identical to the current nomenclature system, with the exception of a single sub-genotype, C4, which was previously reported to be more genetically divergent within genotype C (Paraskevis et al., 2013). In both cases, these new clusters indicate the potential need for taxonomic revision. However, given the wide use of the current nomenclature in the medical field and the lack of other sources of information, such as recombination (particularly for HBV) and response to treatment, we would not suggest taxonomic changes at present. Regarding distance methods, the example of HCV and HBV shows that meaningful parameter values for distance-based methods may differ substantially among [...]

Figure 1: Clustering of the HCV samples into genotypes; the first bar of colors corresponds to the genotypes currently acknowledged by ICTV, the second to the mPTP ML clustering and the third to the ABGD clustering (p=0.0599, X=1.5). The numbers indicate the support for a particular node being a speciation node obtained by the MCMC sampling under the mPTP model (support < 0.5 not shown, but see Suppl. Fig. 2). The phylogenetic relationships were inferred using RAxML under the GTR+Γ model.
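The side-by-side comparison of the taxonomy bar with a delimitation bar can also be expressed programmatically. The following is a minimal sketch under stated assumptions: the two input mappings (sequence identifier to genotype, and sequence identifier to delimited cluster) and all function names are hypothetical and do not correspond to the output format of mPTP or ABGD.

```python
from collections import defaultdict


def splits_per_genotype(reference: dict, delimited: dict) -> dict:
    """For each reference genotype, count the distinct delimited clusters
    that its sequences fall into (1 means the genotype was recovered intact)."""
    clusters = defaultdict(set)
    for seq_id, genotype in reference.items():
        clusters[genotype].add(delimited[seq_id])
    return {genotype: len(found) for genotype, found in clusters.items()}


def merged_clusters(reference: dict, delimited: dict) -> dict:
    """Delimited clusters that contain sequences from more than one genotype."""
    genotypes = defaultdict(set)
    for seq_id, cluster in delimited.items():
        genotypes[cluster].add(reference[seq_id])
    return {c: g for c, g in genotypes.items() if len(g) > 1}


# Toy labels (assumed data): genotype C is split in two, genotype A is intact.
reference = {"a1": "A", "a2": "A", "c1": "C", "c2": "C", "c3": "C"}
delimited = {"a1": "k1", "a2": "k1", "c1": "k2", "c2": "k2", "c3": "k3"}

print(splits_per_genotype(reference, delimited))
print(merged_clusters(reference, delimited))
```

A genotype with a split count above one corresponds to the kind of subdivision described for HBV genotype C, while a non-empty result from `merged_clusters` would flag a delimitation that lumps acknowledged genotypes together.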