Rampant C→U Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short- and Long-Term Evolutionary Trajectories

The wealth of accurately curated sequence data for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), its long genome, and its low substitution rate provides a relatively blank canvas with which to investigate effects of mutational and editing processes imposed by the host cell. The finding that a large proportion of sequence change in SARS-CoV-2 in the initial months of the pandemic comprised C→U mutations in a host APOBEC-like context provides evidence for a potent host-driven antiviral editing mechanism against coronaviruses more often associated with antiretroviral defense. In evolutionary terms, the contribution of biased, convergent, and context-dependent mutations to sequence change in SARS-CoV-2 is substantial, and these processes are not incorporated by standard models used in molecular epidemiology investigations.

S evere acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged late in 2019 in the Hubei province, China, as a cause of respiratory disease occasionally leading to acute respiratory distress syndrome and death   (1)(2)(3)(4). Since the first reports in December 2019, infections with SARS-CoV-2 were reported from a rapidly increasing number of countries worldwide and led to its declaration as a pandemic by the World Health Organization in March 2020. To understand the origins and transmission dynamics of SARS-CoV-2, sequencing of SARS-CoV-2 directly from samples of infected individuals worldwide has been performed on an unprecedented scale. These efforts have generated many thousands of high-quality consensus sequences spanning the length of the genome and have defined a series of geographically defined clusters that recapitulate the early routes of international spread. However, as commented elsewhere (5), there is remarkably little virus diversity at this early stage of the pandemic, and analyses of its evolutionary dynamics remain at an early stage.
The relative infrequency of substitutions is the consequence of a much lower error rate on genome copying by the viral RNA polymerase of the larger nidovirales, including coronaviruses. This is achieved through the development of a proofreading capability through mismatch detection and excision by a viral encoded exonuclease, Nsp14-ExoN (6)(7)(8). Consequently, coronaviruses show a low substitution rate over time, typically in the range of 1.5 ϫ 10 Ϫ4 to 10 ϫ 10 Ϫ4 substitutions per site per year (SSY) (9)(10)(11)(12)(13)(14). Applying a midrange estimate to the 3-to 5-month timescale of the SARS-CoV-2 pandemic indicates that epidemiologically unrelated strains might show around 6 to 10 nucleotide differences from each other over the 30,000-base length of their genomes.
In the present study, we have analyzed the nature of the sequence diversity generated within the SARS-CoV-2 virus populations revealed by current and ongoing virus sequencing studies. We obtained evidence for a preponderance of driven mutational events within the short evolutionary period following the zoonotic transmission of SARS-CoV-2 into humans. Sequence substitutions were characterized by a preponderance of cytidine-to-uridine (C¡U) transitions. The possibility that the initial diversity within a viral population was largely host induced would have major implications for evolutionary reconstruction of SARS-CoV-2 variants in the current pandemic as well as in our understanding both of host antiviral pathways against coronaviruses and of the longer-term shaping effects on their genome composition.

RESULTS
Sequence changes in SARS-CoV-2. Four separate data sets of full-length (near-) complete genome sequences of SARS-CoV-2 collected from the start of the pandemic to those most recently deposited on 24 April 2020 were aligned and analyzed (accession numbers listed in Table S1A in the supplemental material). Each data set showed minimal levels of sequence divergence, with mean pairwise distances ranging from 5.5 to 9.5 nucleotide differences between each sequence. However, several aspects of the frequencies and sequence contexts of the observed changes were unexpected. First, the ratio of nonsynonymous (amino acid changing) to synonymous substitutions (dN/dS) was high, in the range of 0.57 to 0.73 among the different SARS-CoV-2 data sets. This contrasts with a much lower ratio (consistently below Ͻ0.22) in sequence data sets assembled for the other human coronaviruses (Table 1). Including a range of coronaviruses in the analysis, there was a consistent association between dN/dS ratios and the degree of sequence divergence (Fig. 1).
We next estimated the frequencies of individual transitions and transversions occurring during the short-term evolution of SARS-CoV-2. Sequence differences between each SARS-CoV-2 full-genome sequence and a majority rule consensus sequence generated for each of the four SARS-CoV-2 data sets were calculated. The directionality of sequence change underlying the observed substitutions was inferred by restricting the analysis to polymorphic sites with a minimal number of variable bases (typically singletons). In practice, because of the scarcity of substitutions, variability thresholds of 10%, 5%, 2%, and 1% yielded similar numbers and relative frequencies of each transition and transversion. Equivalent evidence for directionality was obtained through comparison of each sequence in the data set with the first outbreak sequence (MN908947; Wuhan-Hu-1), approximately ancestral to the currently circulating SARS-CoV-2 strains (data not shown). For the purposes of the analysis presented here, a consensus-based 5% threshold was used.
A listing of the sequence changes revealed a striking (approximately 4-fold) excess of sites where C¡U substitutions occurred in SARS-CoV-2 sequences compared to the other three transitions ( Fig. 2A). This excess was the more remarkable given there was an almost 2-fold greater number of U bases in the SARS-CoV-2 genome than Cs (32.1% compared to 18.4%, respectively). To formally analyze the excess of C¡U transitions, we calculated an index of asymmetry (frequency [f][C¡U]/f[U¡C]) ϫ (fU/fC) and compared this with degrees of sequence divergence and dN/dS ratios in SARS-CoV-2 and other coronavirus data sets ( Fig. 2B and C). This comparison showed that the excess of C¡U substitutions was most marked among very recently diverged sequences associated with the SARS-CoV-2 and SARS-CoV outbreaks and was reduced significantly in sequence data sets of the more divergent human coronaviruses (NL63, OC43, 229E, and OC43) as sequences accumulated substitutions. A parallel analysis of the full-genome sequences of Ebola virus (EBOV) was performed to determine whether the compositional abnormalities observed in SARS-CoV-2 arose as artifacts of the next-generation sequencing (NGS) methods used to generate the data or indeed occurred in a different RNA virus with distinct entry, replication, and packaging strategies. Available sequences of EBOV on GenBank were divided into three groups, corresponding to those associated with the most recent outbreak in the Congo, in Sierra Leone, and elsewhere in West Africa in the 2014 outbreak, and finally a collection of older strains (see Table S1B). These showed mean levels of within-group sequence divergence of 0.1%, 1.1%, and 0.5%, respectively, spanning the range of divergences in the analyzed SARS-CoV-2 and other coronavirus data sets. In marked contrast to that of SARS-CoV-2, sequences consistently showed dN/dS ratios of Ͻ0.1 ( Fig. 1) and no mutational asymmetry of C¡U/U¡C (Fig. 2B), irrespective of their sequence divergence.
C¡U substitutions were scattered throughout the SARS-CoV-2 genome (Fig. 3). Long bars representing more polymorphic sites were frequently shared between replicate data sets, but unique substitutions (occurring once in the data set [short bars]) showed largely separate distributions. Substitutions were not focused toward any particular gene or intergenic region, although all three data sets showed marginally higher frequencies of substitutions in the N gene. A selection of sequences showing C¡U changes in different genome regions was plotted in a phylogenetic tree containing sequences from the SARS-CoV-2 data set ( Fig. 4). With the resolution possible in the tree generated from such a minimally divergent data set, many sequences with shared C¡U changes were not monophyletic (e.g., those with substitutions at positions 5784, 10319, 21575, 28657, and 28887). This lack of grouping is consistent with multiple de novo occurrences of the same mutation in different SARS-CoV-2 lineages.
The abnormally high dN/dS ratios of 0.6 to 0.7 in SARS-CoV-2 sequences (Table 1; Fig. 1) indicated that around 50% of nucleotide substitutions would produce amino acid changes (if approximately 75% of nucleotide changes are nonsynonymous). On analysis of amino acid sequence changes, a remarkable 52% of nonsynonymous transitions in the SARS-CoV-2 sequence data set were the consequence of C¡U transitions (Fig. 5), compared to 26%, 10%, and 7% for G¡A, U¡C, and A¡C transitions, respectively. These ratios are comparable to those at all sites ( Fig. 1) apart from the greater proportion of nonsynonymous G¡A changes. Some variability might be expected given the potential fitness effects of specific amino acid changes and their likelihood of fixation. Notwithstanding this, the underlying mechanism that leads to C¡U hypermutation therefore also drives much of the amino acid sequence diversity observed in SARS-CoV-2.
The context of cytidines within a sequence strongly influenced the likelihood of it mutating to a U (Fig. 6). The greatest numbers of mutations were observed if the upstream (5=) base was an A or U. There was also a similar approximately 4-fold increase in transitions if these bases were located on the downstream (3=) side. The effects of the 5= and 3= contexts were additive: C residues surrounded by an A or U at both 5= and 3= sides were 10-fold more likely to mutate than those flanked by C or G residues (mean of 31.9 transitions compared to 3.6). Splitting the data down into the 16 combinations of 5= and 3= contexts, a 5= U far more potently restricted non-C¡U substitutions than a 5= A (see Fig. S1), while 5= G or 5= C almost eliminated substitutions irrespective of the 3= context. No context created any substantial asymmetry in G¡A compared to A¡G transitions. The GϩC content of coronaviruses varied substantially between species, with highest frequencies in the recently emerged zoonotic coronaviruses (Middle East respiratory syndrome [MERS]-CoV, 41%; SARS-CoV, 41%; and SARS-CoV-2, 38%) and lowest in HKU1 (32%). Collectively, there was a significant relationship between C depletion and U enrichment with GϩC content (Fig. 7). The difference in GϩC content was indeed almost entirely attributable to changes in the frequencies of C and U bases: the 9% difference in GϩC content between MERS-CoV and HKU1 arose primarily from the 20% to 13% reduction in frequencies of C. There was a comparable 8% increase in the frequency of U. Their combined effects left frequencies of G and A relatively unchanged. It has been proposed that the asymmetry in C and U frequencies may originate in part through the selective loss of CpG dinucleotides in the genome (15). To investigate this, the degree of suppression in SARS-CoV-2, other sarbecoviruses, and other coronaviruses was compared with representative sequences of each currently classified mammalian RNA virus species (excluding double-stranded RNA [dsRNA] viruses). Mammalian RNA viruses (Fig. 8, gray circles) demonstrate the previously described relationship between GϩC content and CpG suppression (16). The data points for the separately labeled SARS-CoV-2, other SARS-like viruses in bats (sarbecoviruses; red), and the remainder of the coronaviruses (blue) and arteriviruses (green) overlap these values (Fig. 8). Overall, SARS-CoV-2 and other coronaviruses actually show less suppression of CpG for a given GϩC content than is typical for other RNA viruses. SARS-CoV-2 and other coronaviruses are therefore not compositionally unusual by these metrics, providing no evidence that CpG suppression per se is associated with their mutational and compositional abnormalities.

DISCUSSION
The wealth of sequence data generated since the outset of the SARS-CoV-2 pandemic, the accuracy of the sequences obtained by a range of NGS technologies, and the long genomes and very low substitution rate of coronaviruses provided a unique opportunity to investigate sequence diversification at very high resolution. The findings additionally provide insights into the mutational mechanisms and contexts where sequence changes occur. Third, it informs us about the longer-term evolution of viruses and potential effects of the cell in molding virus composition.
The mechanism of SARS-CoV-2 hypermutation. The most striking finding that emerged from the analysis of more than 1,000 SARS-CoV-2 genomes was the preponderance of C¡U transitions compared to other substitutions in the initial 4 to 5 months of its evolution. These accounted for 38% to 42% of all changes in the four SARS-CoV-2 data sets. In seeking alternative nonbiological explanations for this observation, they cannot have arisen through misincorporation errors in the nex-generation sequence methods used to produce the data set because the analysis in the present study was restricted to consensus sequences. These are generally assembled from libraries that typically possess reasonable coverage and read depth; error frequencies of Ͻ10 Ϫ4 per site (17) would therefore improbably create a consensus change in a sequence library. There was, furthermore, no comparable increase in G¡A mutations (Fig. 2), and the sequence context in which sequencing errors occur (a 5= or 3= C or G [17]) did not match the favored context for mutation observed in our data set (Fig. 6). Finally, the parallel analysis of a similarly large collection of consensus sequences of EBOV demonstrated very low dN/dS values and no excess of C¡U mutations over U¡C ( Fig. 1 and  2). As EBOV and SARS-CoV-2 data were generated by comparable NGS methods, it is therefore extremely unlikely that the mutational and compositional abnormalities of SARS-CoV-2 described in the study were the results of methodological artifacts.
This asymmetric mutation, furthermore, cannot have arisen through a mutational effect of the coronavirus RNA-dependent RNA polymerase during virus replication. By definition, a coronavirus RNA genome descends from any other through an equal number of copies of the positive and negative strands-any tendency to misincorpo- rate a U instead of a C would be reflected in a parallel number of G¡A mutations where this error occurred on the minus strand (or vice versa). As demonstrated, however, G¡A mutations occurred at a much lower frequency than C¡U mutations and similarly to A¡G ( Fig. 2A and 6).
The most cogent explanation for C¡U hypermutation is the action of RNA editing processes within the infected cell. A well-characterized antiviral pathway involves the interferon-inducible isoform of adenosine deaminase acting on RNA type 1 (ADAR1) (18). This edits A to inosine in regions of viral double-stranded RNA, which is subsequently copied as a G. Irrespective of its widely demonstrated antiviral role in a range of typically minus-stranded RNA viruses, the mutations it creates do not match those observed in SARS-CoV-2 or other coronaviruses. First, ADAR1 targets dsRNA, and so editing effects tend to be symmetric with A¡G substitutions being matched by U¡C mutations. Second, the direction of mutation is wrong. The focus of the analysis in the present study was on infrequent or unique polymorphisms where ancestral and mutant bases can be inferred. The excess of C¡U transitions is the opposite of those induced by ADAR1.
A second interferon-inducible pathway edits retroviral DNA during transcription and is strand specific; its typical antiviral activity is to mutate single-stranded proviral DNA formed after first-strand synthesis from genomic RNA (19)(20)(21). The deamination of Cs to Ts leads to the observed excess of G¡A changes in the complementary positivestranded RNA virus genome (22). This editing function is performed by members of the apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC) family, many of which possess defined antiviral functions against retroviruses, hepatitis B viruses, small DNA viruses, and intracellular mobile retroelements (reviewed in reference 23). The APOBEC3 gene family members that are primarily involved in antiviral defense show evidence of extensive positive selection and expansion over the course of mammalian evolution, particularly in the primate lineage. Humans possess 7 active antiviral proteins (A3A, A3B, A3C, A3D, A3F, A3G, and A3H) that contrast with the single A1 gene in rodents and a range of other mammals (24)(25)(26). Other mammals possess a diverse range of largely independently duplicated APOBEC3 genes, with four paralogs in cats, three in cows and sheep, six in horses, and often more than 10 in different bat species. However, their comparative activities and editing capabilities for different DNA and RNA substrates remain functionally largely unexplored.
While deamination of cytidines in single-stranded DNA sequences is a hallmark of APOBEC function, APOBECs show binding affinities for single-stranded RNA templates that may mediate antiviral functions. A3B and A3F have been shown to block retrotransposition of a LINE-1 transposon mRNA through a nondeamination pathway (27), potentially through binding to single-stranded RNA. Direct editing of HIV-1 RNA by the The associations between C depletion and U enrichment with GϩC content were both significant by linear regression at P ϭ 5 ϫ 10 Ϫ7 and P ϭ 5 ϫ 10 Ϫ6 , respectively. No significant associations were observed between GϩC content and G (P ϭ 0.05) or A (P ϭ 0.62) frequencies. Arrows are color coded as for Fig. 1. rat A1 APOBEC and the accumulation of C¡U hypermutation verified that RNA could also be used as a substrate for deamination (28). This suggested to the authors at that time that APOBEC-mediated RNA editing was a potential antiviral activity mechanism against RNA viruses as well as retroviruses.
Since then, evidence supporting this conjecture has been difficult to obtain; the virus inhibitory effect of APOBECs against enterovirus A71, measles, mumps, and respiratory syncytial viruses were not shown to be associated with the development of virus mutations (29,30). Similarly, A3C, A3F, or A3H, but not A3A, A3D, and A3G, were shown to inhibit the replication of the human coronavirus, HCoV-NL63, but their expression did not lead to de novo C¡U (or G¡A) mutations on virus passaging (31). On the other hand, it has been demonstrated that A3A and A3G possess potent RNA editing capability on mRNA expressed in hypoxic macrophages (32), natural killer cells (33), and transfected A3G-overexpressing HEK 293T cells (34). These latter findings verify that APOBECs possess RNA editing capabilities but do not provide any mechanistic context for the potential inhibition of RNA virus replication by this mechanism. Nevertheless, the pronounced asymmetry in C¡U transitions in SARS-CoV-2 and the preferential substitution of Cs flanked by U and A bases on both 5= and 3= sides (Fig. 6) that broadly matches what is known about the favored contexts of A3A, A3F, and A3H (35) provides strong circumstantial grounds for suspecting a role of one or more APOBEC proteins in coronavirus mutagenesis.
The findings of C¡U mutations at the consensus genome sequence level have also been observed within virus populations in a recent analysis of intrahost sequence diversity in lung-derived COVID-19 samples (36). Mutations showed the 5= and 3= A/U contexts as observed in the present study and were proposed by the authors as representing editing sites for APOBEC1. Intrahost diversity in this study was, however, dominated by minor populations generated from G¡A and U¡C substitutions; their symmetry and lack of 5= or 3= context led the authors to propose the editing effects of ADAR in viral dsRNA. These and other mutations such as A¡U and U¡A transversions mediated through an as-yet-uncharacterized mechanism hint at the complexity of host effects on virus sequence change. The combination of exceptionally long genomes  Table S1. CpG frequencies were expressed as the ratio of their observed frequency to the expected frequency based on their GϩC content (y axis).
(Ϸ30,000 bases), an otherwise low mutation rate, and the unprecedented size of the present data set of accurate minimally divergent SARS-CoV-2 sequences assembled postpandemic has enabled these mutational signatures to be so clearly observed. RNA editing may indeed represent a powerful antiviral mechanism with potentially lethal effects of even single mutations introduced into the genomic sequence. These make APOBEC-mediated anticoronaviral activity plausible in virological terms.
The key findings in the study were the combined evidence for an APOBEC-like editing process driving initial sequences changes in SARS-CoV-2 and that the observed substitutions have not arisen through a typical pattern on random mutation and fixation that is assumed in evolutionary models. A specific problem for evolutionary reconstructions would be the existence of highly uneven substitution rates at different sites; APOBEC-mediated editing (and indeed the pattern of C¡U transition in SARS-CoV-2 sequences) is strongly dependent on sequence context and, for at least two APOBECs, additionally influenced by their proximity to RNA secondary structure elements in the target sequence (32,35). Sequence changes in SARS-CoV-2 and other coronavirus genomes may therefore be partially or largely restricted to a number of mutational hot spots that may promote convergent changes between otherwise genetically unlinked strains. As demonstrated in Fig. 4, these can conflict with relationships reconstructed from phylogenetically informative sites. Furthermore, the substitution rate reconstructed for SARS-CoV-2 and potentially other coronaviruses may represent an uncomfortable amalgam of both the accumulation of neutral changes and forced changes induced by APOBEC-like editing processes that may obscure temporal reconstructions. A recent analysis of SARS-CoV-2 genomes illustrates these problems (5); only a tiny fraction of variable sites (0.34%) were found to phylogenetically informative, while a high frequency of unresolved quartets demonstrates further the lack of phylogenetic signal in SARS-CoV-2 evolution reconstructions. The occurrence of multiple driven changes under host-induced selection is consistent with these cautionary observations.
The other important consequence of C¡U hypermutation is that most of the amino acid sequence diversity observed in SARS-CoV-2 strains originates directly from forced mutations and therefore cannot be regarded in any way as adaptive for the virus (Fig. 5). An RNA editing mechanism of the type discussed above evidently places a huge mutational load on SARS-CoV-2 that may underpin the abnormally high dN/dS ratios recorded in SARS-CoV-2 and SARS-CoV sequence data sets (Fig. 1). It is likely that many or most amino acid changes are mildly deleterious and transient; repeated rounds of mutation at favored editing sites followed by reversion may therefore contribute to the large numbers of scattered substitutions in SARS-CoV-2 sequences that conflict with their phylogeny.
Finally, it is intriguing to speculate on the long-term effects of the C¡U/U¡C asymmetry and the extent to which this may contribute to the previously described compositional abnormalities of coronaviruses (15, 37). As described above in connection with mutation frequencies, the compositional asymmetries cannot directly arise through viral RdRp mutational biases, because any resulting base frequency differences would be symmetric (i.e., G Ϸ C and A Ϸ U). Instead, it appears that the observed imbalances in frequencies of complementary bases reflect the progressive depletion of C residues and accumulation of Us by the APOBEC-like mutational process on the genomic (ϩ) strand of coronaviruses. Culminating in the compositionally highly abnormal HKU1 sequences (15), this appears to have driven the GϩC content of coronaviruses as low as 32% while remarkably leaving G and A frequencies more or less unaltered (Fig. 7). Intriguingly, the bat-derived coronaviruses along with the recently zoonotically transferred viruses into humans show the least degree of compositional asymmetry.
The expansions in APOBEC gene numbers, extensive positive selection, and the consequent variability in APOBEC nucleic acid targeting (23) may indeed create distinct selection pressures on coronaviruses in different hosts. The immediate appearance of C¡U hypermutation in SARS-CoV-2 and SARS-CoV genomes in humans may therefore represent the initial effects of replication in a more hostile internal cellular environment than that found in what might be a better coadapted virus-tolerized immune system of a bat (38). Zoonotic origins are suspected for other human coronaviruses but at more remote times (39); perhaps they have taken their mutational and adaptive journeys already.

MATERIALS AND METHODS
SARS-CoV-2 and other coronavirus data sets. The 1,000 closest matched sequences to the prototype strain of SARS-CoV-2, NC_045512, were downloaded on 24 April 2020. Sequences with large internal gaps, ambiguous bases, and other markers of poor sequence quality were excluded, leaving a total of 865 sequences for analysis. These were divided into three data samples, corresponding to sequences 1 to 300, 301 to 600, and 601 to 865 (sequences listed in Table S1A in the supplemental material). An additional data set of 117 well-curated SARS-CoV-2 sequences was downloaded from Konsiliarlabor für Coronaviren (https://civnb.info/sequences/) on 13 April 2020 and represents a further independent sample set. A listing of further data sets of SARS-CoV, MERS-CoV, and other human coronaviruses is provided in Table 1. All available complete genome sequences of EBOV were downloaded from GenBank on 3 May 2020, of which 1,193 were used for mutational analysis after removal of incomplete, poor quality, and synthetic sequences (Table S1B).
Sequence analysis. Calculation of pairwise distances, nucleotide composition, and listing of sequence changes were performed using the SSE package version 1.4 (http://www.virus-evolution.org/ Downloads/Software/) (40).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. FIG S1, DOCX file, 0.3 MB.

ACKNOWLEDGMENT
The work was supported by a Wellcome Investigator Award Grant (WT103767MA).