CpG di-nucleotide odds ratio measures drive Unsupervised Clustering to characterize the Andes Hantavirus

Hantaviruses belong to the Bunyaviridae family with small mammals hosting them. Humans are infected either by inhaling virus-containing aerosols or through contact with animal droppings. Even if rodents host the pathogenic species and humans are dead-end hosts, they still get accidentally infected. The Andes Orthohantavirus (ANDV) seems to be the only species with documented person-to-person transmission. Hemorrhagic fever with renal syndrome (HFRS) and Hantavirus cardiopulmonary syndrome (HCPS) are both serious syndromes associated with hantavirus infections. For both syndromes, the mortality rate is near 40%. Decades of studies have already highlighted the CpG repression in RNA viruses, and both the estimation of the CpG odds ratio and the correlation with their genome polarity were dominant factors in figuring out the CpG bias. We conducted the differential analysis of the CpG odds ratio for all the orthohantaviruses on the full segmented genomes (L, M, S). The results suggested the statistical significance of the three groups. The “Small” genomes were more informative from the CpG odd ratio point of view. We calculated the CpG odds ratio for all the Orthohantaviruses within these segments and furthermore estimated the correlation coefficient with the relative coding sequences (CDS). Preliminary results first confirmed the CpG odds ratio as the lowest among all the nucleotides. Second, the Andes virus was highlighted as the one with the highest CpG odds ratio within CDS. The use of these two measures as features for unsupervised clustering algorithms has allowed us to identify four different subgroups within the Orthohantaviridae family. The evidence is that the Andes Hantavirus exhibits a peculiar CpG odds ratio distribution, probably linked to its unique characteristic of passing from person to person.


Introduction
Hantaviruses are enveloped RNA viruses with the negative-sense, tri-segmented genome. The large (L), medium (M), and small (S) segments code for viral transcriptase or polymerase, glycoprotein precursors (GPC), and the N protein that makes up the nucleocapsid, respectively [1]. Infected rodents transmit Hantaviruses to humans without causing any significant illness.
Hantavirus cardiopulmonary syndrome (HCPS) is an acute, severe, and sometimes fatal respiratory disease caused by an infection from the Andes orthohantavirus. Initial symptoms are linked to the respiratory apparatus (shortness of breath, progressive cough, and tachycardia), muscle-ache, fatigue, and fever, making it difficult to distinguish from simple flu. HCPS symptoms can quickly evolve and, in extreme cases, infected individuals may have to be incubated and receive oxygen therapy [2]. Complications of cardiogenic shock, lactic acidosis, and hemoconcentration can cause death within hours of hospitalization. In South America, the Andes Hantavirus (ANDV) is the primary etiologic agent. In Chile, in the period 2001-2009, the authorities reported over 600 cases of ANDV-related Hantavirus with a 36% mortality rate.

The Andes Hantavirus
The Andes Orthohantavirus (ANDV) is a major causative agent of hantavirus cardiopulmonary syndrome [3]. The hantavirus cardiopulmonary syndrome is a severe respiratory disease with a fatality rate of 35-40% [4]. The Andes orthohantavirus is the only Hantavirus that can spread from person to person either by bodily fluids or long-term contact [5][6][7]. The Andes virus causing HPS in human hosts was first identified in 1995 in samples from patients in southern Argentina [8], even if sporadic cases of HPS (Hantavirus Pulmonary Syndrome) were retrospectively identified [9] from as early as 1987. In 1995 doctors first identified the Andes virus in the lungs of a patient from El Bolson; the outbreak studied in a past dispatch began on September 22, 1996. Oligoryzomys spp. rodents appear to be the principal reservoirs for most Andes viruses [10]. A previous study [11] presented the N. spinosus mice as a reservoir for the Andes virus variant found in Madre de Dios and Puno.

CpG di-nucleotides in RNA viruses
The CpG sites are regions of DNA or RNA where a guanine nucleotide follows a cytosine nucleotide in the linear sequence of bases along its 5'  3' direction. CpG sites occur with high frequency in genomic regions called CpG islands. CpG di-nucleotides occur with a much lower frequency in the sequence of vertebrate genomes than would be expected due to random chance. This underrepresentation is a consequence of the high mutation rate of methylated CpG sites. The spontaneously occurring deamination of methylated cytosine results in thymine and the resulting G: T mismatched bases are often improperly resolved to A: T; whereas the deamination of cytosine results in uracil, which, as a foreign base, is quickly replaced by a cytosine (base excision repair mechanism). The transition rate at methylated CpG sites is tenfold higher than at unmethylated sites. Thus, we consider the over-representation of CpA and TpG as a consequence of the underrepresentation of CpG. CpG has also been observed to be predominantly under-represented in RNA viruses [12,13] and the mechanism that contributes to the deficiency in the case of riboviruses (RNA nucleic acid) is largely unknown. Because riboviruses do not form DNA intermediates during genome replication, the methylation-deamination model is unlikely to apply, whereas the host innate immunity model evasion seems to be more appropriate. The CpG odds ratio values of mammalinfecting riboviruses are lower than the riboviruses infecting other taxa and the CpG motif in an AUrich oligonucleotide can significantly stimulate the immune response of plasmacytoid dendritic cells [14]. Previous research also pointed out the huge variations of CpG bias in RNA viruses and brought out the observed under-representation of CpG in RNA viruses as not caused by the biased CpG usage in the non-coding regions but determined mainly by the coding regions [15].
This study aimed to understand whether the CpG odds ratio of the Andes Hantavirus is peculiar in some way. From a cluster perspective, the study also intended to verify whether the Andes Hantavirus constitutes an isolated-cluster. The confirmation that this di-nucleotide odds ratio is so distinctive, in that it is actually able to discriminate between groups of viruses in the same family, might well help define the role of CpG islands in orthohantaviruses. Both the recurrent manifestation of the acute pulmonary syndrome in America and the urgency to understand why the Andes virus is the unique anthroponotic orthohantaviridae make research necessary.

Materials and Methods
The genomic data needed for this study were downloaded from the ViPR [16] database. Tables 6-8 in the Appendix section give the complete list of the RNA sequences we treated. In detail, we collected 27 RNA sequences of the large genome, 39 sequences of RNA from the medium-sized genome, and 170 small genomic RNA sequences, for a total of 236 genomic segments from the Hantaviridae family. We used R version 3.6.2 and Bio Python version 1.71 to respectively perform the statistical analysis and calculate the CpG odds ratio. Figure 16 in the Appendix section shows the steps taken to obtain the CpG odds ratio for all the segmented genomic sequences. Figures 17-18 in the Appendix section give the scripts used for the ANOVA analysis and the unsupervised clustering in R.

Statistical significance
Taking as null hypothesis ( 0 ) that the means values of the CpG odds ratio from the three groups (L, M, and S) are equal, we applied for the analysis of variance (ANOVA) to accept or reject 0 . To check the normality property, we based on the formality tests of Shapiro-Wilk with the α=0.05, while the QQ plot-chart supported our analysis as a graphical method.
Levene's Homogeneity Variance test was performed using both traditional mean-centered methodology and the R default median centered methodology. Dunn's Multiple Comparison Test [17,18] is a post hoc (e.g. after ANOVA) nonparametric test. The function used (Dunn. Test) performed multiple pairwise comparisons based on Dunn's z-teststatistic approximations to actual rank statistics. Several options were available to adjust p-values for multiple comparisons, including methods to control the family-wise error rate (FWER) and check the false discovery rate (FDR). We used the Bonferroni adjustment (FWER) to verify Dunn's test results and adjusted p-values = max (1, pm).

Statistical clues to identify the most significant group concerning CpG frequency
We searched for extra statistical clues to identify the more significant group for the CpG odds ratio. In our perspective, one group is more meaningful when presents a wider range of variation for the CpG odds ratio than the others.

Average and median variances for the di-nucleotide odd ratio
Let us introduce two measures that we will use in the coming section. Equation 1 defines the average value of the variances for the odds ratio. We calculated the value over all the di-nucleotides (n=16) in each group as:

Unsupervised clustering
To find the optimal number of clusters, we used the following four different approaches: the Elbow Curve Method, Silhouette Score Method, Gap Statistic Method, and Clustree Discovery [19][20][21]. We carried out the K-means, DBSCAN, and HCA algorithms to identify the groups of similar Hantaviruses. The CpG odds ratio, both from the full genome and from the CDS regions and their median values from the group of small genomic segments determined the degree of similarity.

Normality and homogeneity tests suggest the use of the non-parametric method
Our results in checking for the normality property suggested we perform an equivalent nonparametric test such as a Kruskal-Wallis Test [22], which does not require normality assumption. Table 1 reports the results for the normality test. It shows the P-value < 0.05 for the three groups, indicating that the data are not distributed normally. The QQ plots in Figure 1 give the distribution of the CpG odds ratio for all three groups. As an assumption, the vast majority of points should follow the theoretical normal reference line and fall within the curved 95% bootstrapped confidence bands to be considered normally distributed, but this is not the case here.

Figure 1 Normality QQ plots, 1 stay for group L, 2 for Medium and 3 for Small respectively
Results from Levene's Homogeneity Variance Test indicate that the null hypothesis must be rejected and conclude that variances are not equal for at least one of our groups. Table 2 displays the test statistics for two different versions of Levene's test. A p-value = 0.0003128 or 0.0001199 tends to accept the alternative hypothesis of inequality variances. The boxplot reported in Figure 2 also indicates some major outliers, giving enough evidence to use the Kruskal-Wallis ANOVA as a nonparametric test.

Kruskal-Wallis test suggests the null hypothesis be rejected
The Kruskal-Wallis test results in a two-sided test − < 2.2 −16 , indicating rejection of the null hypothesis, meaning ranks are equal across groups and conclude that there is a significant difference in CpG odds ratio distribution. Descriptive statistics indicate that the median value with 95% confidence intervals for group L is 0.277, group M is 0.187, and group S is 0.314. That is to say, the difference between the median values of each segment L and M is about 0.09 (p=1.137969e-04), segments L and S are about 0.037 (p=7.471942e-04), and segments M and S is about 0.127 (p=2.173163e-21).

Dunn test result: significant difference in the CpG odds ratio between the L, M, and S groups
Dunn's Multiple Comparison method tests stochastic dominance and gives the results among multiple pairwise comparisons after a Kruskal-Wallis test among k groups. The null hypothesis for each pairwise comparison is that the probability of observing a randomly selected value from the first group is larger than a randomly selected value from the second group equals one-half, and so rejecting 0 based on p≤α/2. Table 3 reports the result from Dunn's test providing all possible pairwise comparisons. In the table, the adjusted p-values will have an asterisk, so, we would reject the null hypotheses at the specified significance level, comparisons rejected with the Bonferroni adjustment at the α level (two-sided test). Figure 7 shows the test output between groups. If we consider the total distance from each group to the others, then it suggests that the difference between group n. 3 (Small segments) and the other groups is more significant.

Figure 3 Boxplots representation of the Dunn's test
The variance of the CpG di-nucleotide's odd ratio in the group of small genomic segments will be biologically relevant To our way of thinking one group might be more meaningful than another one when it presents a wider range of variation for the CpG odds ratio. Variance (σ 2 ) is a measurement of the spread between numbers in a data set. It measures how far each number in the set is from the mean and consequently from every other number in the collection. Figure 8 reports the variance of the odds ratio for every di-nucleotide in each group of genomic segments. Results show how the CpG dinucleotide tends to be more conservative compared to the other ones. Also, the group of small genomic segments, with a value close to 0.002, presents the lowest CpG odds ratio variation. The outcomes suggest that possible variation inside of this group should be biologically relevant.

Figure 4 Variance of the di-nucleotide frequency for the three genomic groups (L, M and S)
The group of small genomic segments is the most informative from the CpG odds ratio point of view Figure 5 compares the odds ratio variance of CpG di-nucleotide, the average and median variance for any di-nucleotide, grouped by genomic segments (L, M, and S). While the measurements do not represent meaningful differences for the large and medium genomic segments, the small genomic group depicts a more unusual situation. The value of the variance for the small genomic CpG is far from the median and average values of the di-nucleotides from the other groups. The observation becomes more evident from the diagram in Figure 6, indicating the distance values. For each group of genomic segments (L, M, and S), we estimated the following measurements: 1.

/ 2
, the average of the variance for all the di-nucleotides 2. ∆ , the distance of CpG odds ratio variance from / 2 3. ∆ , the distance of CpG odds ratio variance from / 2 The consideration of those measures for the three groups (L, M, and S) brings us to the following observations: 1. A higher value than the average value of the variances of the odds ratio over all the di-nucleotides compared to the other groups indicates that within the small group the odds ratio tends to change more frequently; 2. A higher value than the distance between the variance of the odds ratio for the CpG di-nucleotide and the average value of all the variances of all the frequencies for all the di-nucleotides compared to other groups shows that within the small group CpG hold the highest variability rate; 3. A higher value than the distance between the median of the odds ratio for the CpG di-nucleotide and the median value of all the variances of all the frequencies for all the dinucleotides compared to the other groups shows that within the small group even the CpG median has the most representative value. These results indicate that the group of small genomic segments is the most informative from the CpG odds ratio point of view.

The CpG Odds ratio inside CDS regions of short genomic segments is the lowest for the Hantaviridae family
Previous studies have already emphasized the CpG odds ratio as the lowest compared to those of the other di-nucleotides, even in the case of RNA viruses [15]. The calculation of the odds ratio for all the di-nucleotides around the CDS regions restricted our study to ten different RNA viruses from the Hantaviridae family: Andes, Tunari, Bayou, Choclo, Dobrava-Belgrade, Hantaan, Hantaanvirus, Puumala, Seoul, and Tula. This computation confirmed the CpG odds ratio in CDS as the lowest also for a group of small genomic segments, as given in Figure 7. It shows that the odds ratio for CpG in CDS regions is the lowest compared to the odds ratio of other di-nucleotides for the ten considered viruses.

The Andes Hantavirus and CpG frequency from CDS regions
Dealing with the frequency of CpG inside the coding regions, clearly show how the Andes Hantaviridae virus can be considered a specific case. We calculated the Pearson correlation coefficient from the data reported in Table 4 and got a rate near 0.98. Such a result supports the positive correlation between the CpG odds ratio over the full genome and the CpG odds ratio in the CDS. This result highlights the CpG di-nucleotides performing a function in the coding regions. The data collected in Table 4 shows how the CpG frequency in CDS for the Andes Hantaviridae represents the highest value, 7.58% greater than the second-highest value (Hantaviridae Dobrava-Belgrade). The odds ratio bars in Figure 8 prove that the Andes H. compared with the other Hantaviruses, has the highest CpG odds ratio in CDS regions.

Figure 8 Odds ratio of CpG into CDS regions
Examining the odds ratio of CpG along the full genome, the CpG odds ratio in the CDS regions, and the median value for Hantaviruses pointed out above, the Hantaviridae Andes has the highest rates for all three measurements. Table 5 and Figure 9 show that the CpG odds ratio value in CDS of Hantaviridae Andes is 7.58% greater than the same value from Hantaviridae Dobrava-Belgrade (the second virus sorted by CpG odds ratio in CDS values). And again, Hantaviridae Andes is 3.08% and 5.78% greater than Hantaviridae Puumala (the second virus for CpG in the full genome and CpG median values), compared to the CpG in the full genome and CpG median values. In conclusion, the Andes Hantaviridae has the highest value of CpG odds ratio in the CDS regions. The Pearson correlation close to 0.98 confirms that the CpG odds ratio along the full small genomic segment and the CpG odds ratio in the CDS regions of the same genomic segment are positively related. This clue stresses the possible roles carried out by the CpG islands in the coding regions. Lastly, by facing the CpG odds ratio from the full genome from the CDS regions as well as the median values, it draws attention to a stronger concentration of CpG islands both along the full small genomic segment and in the CDS regions for the Andes virus.

Optimal number of clusters for Hantaviruses
As mentioned in the Method section, we used the Elbow Curve, Silhouette Score, Gap Statistic, and Clustree Discovery to investigate the optimal number of clusters (k).
The Elbow Curve Method looks at the total within-cluster sum of squares (WSS) as a function of the number of clusters. We considered as a suitable k value indicator the location of a knee in the plot. The Elbow Method suggested k=4 as the optimal partitioning. The Silhouette Score Method measures the quality of clustering and determines how well each point lies within its cluster, and in our case, it suggested k=2. The optimal k is the one that maximizes the Gap Statistic. Approaching the problem by the GAP statistical method, only 1 cluster was suggested (which is a useless clustering). Figure 10 gives all three results.
All three approaches suggested a different number of clusters, so we used the discovery Clustree approach to consider how samples change groupings as the number of clusters increases. This approach is useful for showing which clusters are distinct and unstable as well as for exploring possible choices.
In Figure 11, the size of each node corresponds to the number of samples in each cluster. It also colors the arrows according to the number of samples each group receives. In this graph, passing from k=2 to k=3, several viruses are reassigned from the lookers-left cluster to the third cluster on the right. Moving from k=4 to k=5, two nodes present multiple incoming edges, indicating overclustering data. Results show enough reasons to set k=4 as the optimal number of clusters for our dataset.

K-means, DBSCAN and HCA vs Hantavirus
We performed the K-means, DBSCAN, and HCA algorithms to find the more similar groups of Hantaviruses according to the CpG odds ratio (both from the full genome and the CDS regions), and to their median values (from the group of small genomic segments). We focused attention on the Andes Hantavirus, being the sole Hantavirus able to pass from person to person. The K-means algorithm showed the Andes H. as an element of the 4th cluster with the Puumala H. but showing a relevant distance from it ( Figure 12). The DBSCAN algorithm showed four groups of viruses, even if it did not clearly show the distance between them ( Figure 13). The HCA agglomerative and divisive gave the same dendrogram, showing that the Andes H. is a "borderline" virus like the Tula H., even if belonging to two different clusters ( Figure 14). Representing the clustering obtained by the hierarchical methods, we again got evidence that the Andes H. looks like an isolated cluster (like the Tula H.), suggesting the critical contrast with the other viruses from the same Hantaviridae family (Figure 15)

Discussion and Conclusions
In the current study, we analyzed the Orthohantaviridae family from the CpG odds ratio point of view. The first result proved the statistical difference between the three groups of segmented genomes. We have identified the small genomic segments group as the more informative, giving us the chance to reduce the research area. To avoid the influence of the CpG odds ratio from the noncoding regions, we first calculated the CpG odds ratio in the coding regions for the small segments of all viruses. As a result, we confirmed that the CpG frequency is the lowest compared to the other di-nucleotides also that the Andes Hantavirus showed its highest CpG odds ratio in CDS. Secondly, we considered the correlation coefficient between the CpG odds ratio and the CpG odds ratio of the coding regions. We performed the calculation from the small genomic groups of all the viruses, bearing in mind that a positive correlation implies a more significant CpG odds ratio from the small genomic segment group. The correlation analysis between the CpG odds ratio from the full size of the segmented small genome and the CDS regions resulted in a positive index. This result emphasizes the possible function of the CpG islands inside the coding regions. Comparing the CpG over the full genome, the CpG over the CDS and the median values over the ten viruses suggested a stronger concentration of the CpG islands both along the full-size genome and the CDS regions in the Andes virus. Using both the CpG odds ratio measurements (based on the CDS and full genome size) from the group of small genomic segments as features, the unsupervised clustering analysis identified four different sub-groups inside the Orthohantaviridae family. The unsupervised clustering corroborated the evidence that the Andes Hantavirus (similar in some way to the Tula H.) exhibits a peculiar CpG odds ratio distribution, perhaps linked to its uniqueness in being able to pass from person to person. Previous research already pointed out the huge variations of CpG bias in RNA viruses and brought out the observed under-representation of CpG in RNA viruses as not caused by the biased CpG usage in the non-coding regions, but determined by the coding regions [12,13]. The current study suggests that the Andes H. characteristic of being transmitted from person to person could be linked to its distribution of CpG di-nucleotides. The research pointed out that in any case, the frequency of CpG islands in the Andes H. virus is such as to be identified as a cluster in its own right. In the case of Tula orthohantavirus (infections being rarely found in humans [23][24][25]) and even if there is no current evidence to suggest diversification of this virus from the rest of the family, it is questionable whether this similarity suggests a potential anthroponotic capacity in this virus. We can certainly assert that even in this case the distribution of CpG di-nucleotides suggests further research is needed. As a possible step forward in the research carried out, surely the use of further features related to the distribution of CpG di-nucleotides as a relationship index with the CpG distribution of the host or with the distribution of the CpG islands in the regions within the codons and between the codons could provide more detailed clustering results. The research carried out has already given many important results, such as the significant statistical difference between the distributions of CpG di-nucleotides in the different genomic segments (S, M, and L), the identification of numerical indices useful when applying unsupervised clustering algorithms as well as the identification of subgroups within the family of orthohantaviruses. These subgroups included the Andes and Tula H. as cases worthy of particular attention, especially in reference to the Andes H. whose peculiar anthroponicity is particularly dangerous for humans.