Excessive G–U transversions in novel allele variants in SARS-CoV-2 genomes

Background SARS-CoV-2 is a novel coronavirus that causes COVID-19 infection, with a closest known relative found in bats. For this virus, hundreds of genomes have been sequenced. This data provides insights into SARS-CoV-2 adaptations, determinants of pathogenicity and mutation patterns. A comparison between patterns of mutations that occurred before and after SARS-CoV-2 jumped to human hosts may reveal important evolutionary consequences of zoonotic transmission. Methods We used publically available complete genomes of SARS-CoV-2 to calculate relative frequencies of single nucleotide variations. These frequencies were compared with relative substitutions frequencies between SARS-CoV-2 and related animal coronaviruses. A similar analysis was performed for human coronaviruses SARS-CoV and HKU1. Results We found a 9-fold excess of G–U transversions among SARS-CoV-2 mutations over relative substitution frequencies between SARS-CoV-2 and a close relative coronavirus from bats (RaTG13). This suggests that mutation patterns of SARS-CoV-2 have changed after transmission to humans. The excess of G–U transversions was much smaller in a similar analysis for SARS-CoV and non-existent for HKU1. Remarkably, we did not find a similar excess of complementary C–A mutations in SARS-CoV-2. We discuss possible explanations for these observations.

INTRODUCTION SARS-CoV-2 is a novel coronavirus that causes an infectious respiratory disease called COVID-19 (Wu et al., 2020). SARS-CoV-2 is closely related to the bat coronavirus RaTG13 with around 96% whole genome nucleotide sequence identity . At the complete genome scale it also shares 93.3% identity with the bat-derived coronavirus RmYN02, with 97.2% identity in the 1ab gene . Since the discovery of SARS-CoV-2, its evolution became of particular interest to biologists because of the amount of sequencing data that was produced and the importance of this data to provide insights on the determinants of viral pathogenicity (Gussow et al., 2020) and adaptations to human hosts (Van Dorp et al., 2020). SARS-CoV-2 is also quite interesting from a genomics prospective, with an extreme deficiency of genomic CpG dinucleotides of debatable origin (Xia, 2020).
Comparative analysis of genomic data favors the natural origin of SARS-CoV-2, accompanied by natural selection either before or after zoonotic transfer directly from bats or through an intermediate host (Andersen et al., 2020). In either scenario, SARS-CoV-2 would be exposed to both a novel evolutionary landscape that affects the fitness of its genetic variants and novel cellular conditions that could affect its mutation rates directly. For example, it is known, that bats have evolved a number of adaptations including superior resistance to oxidative stress  that allow them to harbor multiple viruses without getting the corresponding diseases (Schountz et al., 2017). In light of this, we decided to investigate if the relative mutation frequencies in SARS-CoV-2 changed after its transmission to human hosts.
We compared the relative frequencies of single nucleotide variations (which we will refer to as mutations) in SARS-CoV-2 with the relative frequencies of substitutions that it acquired since the divergence with its last common ancestor with a closely related coronavirus from bats RaTG13. In our terminology, substitutions occurred before zoonotic transmission, while mutations were acquired after. A similar analysis was performed for SARS-CoV and HKU1 coronaviruses.
We found that SARS-CoV-2 has an over 9-fold excess of G-U mutations over substitutions. This effect was much weaker in a similar analysis for SARS-CoV and was not present for HKU1. On the other hand, the substitution profile of SARS-CoV-2 turned out to be quite similar to that of the other coronaviruses, lending further support to existing scenarios of its natural origin (Andersen et al., 2020) and suggesting that the changes in SARS-CoV-2 mutation frequencies have accompanied its transition to human hosts.

Mutation data
We obtained 1,271 SARS-CoV-2, 194 SARS-CoV and 38 HKU1 publicly available complete genomes from the NCBI NR database. To ensure similarity in data acquisition, we did this by using the BLASTn (Altschul et al., 1990) program with the three reference human coronavirus genomes as queries (NCBI Reference Sequences: NC_045512.2, NC_004718.3 and NC_006577.2). SARS-CoV-2 hits were filtered using the following strings: "Severe acute respiratory syndrome-related coronavirus" and "complete genome". SARS-CoV hits were filtered using the strings: "SARS coronavirus" and "complete genome". HKU1 genomes were filtered using strings "HKU1" and "complete genome". The final list of obtained accessions is available in Table S1.
We used Clustal Omega (Sievers et al., 2011) to create three multiple alignments: one for SARS-CoV-2, one for SARS-CoV and one for HKU1 genomes. In each alignment, we established the consensus sequence. The most frequent nucleotide in each position was used as the consensus nucleotide. If any individual sequence contained a nucleotide N2 that is different from the consensus N1, we considered that a N1->N2 mutation has occurred in that position. We assume that these mutations have occurred in viruses after zoonotic transfer. Note that if several sequences contained the same nucleotide N2 that is different from the consensus N1, this would still count as only one mutation. The three final alignments are available in fasta format in Supplemental Materials.
As additional controls, we performed the following (and only the following) subanalysis. For SARS-CoV-2: 1. SARS-CoV-2 genomes that were sequenced with Ion Torrent 2. SARS-CoV-2 genomes that were sequenced with Oxford Nanopore 3. SARS-CoV-2 genomes that were sequenced in USA 4. SARS-CoV-2 genomes that were sequenced in China 5. Exclude mutations that occurred in less than two sequences 6. Substitute the consensus sequences with the reference sequence 7. Mask the first and last 100 nucleotides of the alignment 8. Remove genomes with "N" sequences

Substitution data
For each of the three human coronaviruses we obtained the genome of a closely related animal coronavirus and a more distant (outgroup) coronavirus from NCBI NR. For SARS-CoV-2 we used RaTG13 (GenBank: MN996532.1) as the close relative and pangolin coronavirus PCoV_GX-P5L (GenBank: MT040335.1) as an outgroup. For SARS-CoV we used Bat SARS-like coronavirus isolate Rs4231 (GenBank: KY417146.1) and Bat coronavirus BtCoV/273/2005 (GenBank: DQ648856.1). For HKU1 we used Betacoronavirus sp. strain VZ_BetaCoV_16715_52 (GenBank: MH687968.1) and Camel coronavirus HKU23 Ry123 (GenBank: KT368891.1). Multiple alignments were performed using Clustal Omega with default parameters (Sievers et al., 2011). See Fig. S1 for a simple Neighbor-joining tree for all nine different complete coronavirus genomes created by Clustal Omega on the basis of the alignment (default parameters). See Jaimes et al. (2020) for a detailed phylogenetic analysis of SARS-CoV-2 and related coronaviruses.
Substitutions that occurred during human coronavirus evolution were identified by maximum parsimony. If a human coronavirus sequence contained a nucleotide N1 and the two animal coronaviruses contained a different nucleotide N2, we considered that an N2->N1 substitution has occurred (Method 1). We assume that these substitutions have occurred in viruses before zoonotic transfer. We did not count positions in which all three coronaviral genomes differed.
We also used a different measure of substitutions based on single nucleotide differences between each reference human coronavirus sequence and RaTG13, Rs4231 and Betacoronavirus sp. strain VZ_BetaCoV_16715_52 for SARS-CoV-2, SARS-CoV and HKU1 respectively (Method 2). This method does not allow us to establish a direction for substitutions, but provides a larger sample size. We used the same alignment as in Method 1.

RESULTS AND DISCUSSION
We identified 1,251, 1,128 and 2,039 single nucleotide variants in available SARS-CoV-2, SARS-CoV and HKU1 genomes. We assumed that these are mutations that occurred after zoonotic transfer to human hosts. Mutations are deviations of individual genomes from the consensus sequences, but it should be noted that our SARS-CoV-2 consensus had only five nucleotide differences from the reference genome.
In the corresponding genomes we also identified 450, 499 and 3,029 (Method 1, with parsimony-reconstructed ancestral states) and 1,141, 1,237 and 6,514 (Method 2) single nucleotide substitutions. We assume that these substitutions occurred before zoonotic transfer. Substitutions are deviations in the reference genome from the predicted ancestral states (Method 1) or deviations in the reference genome from a closely related animal coronavirus genome (Method 2). Figure 1 shows the proportion of each mutation/substitution type (substitutions based on Method 1) among all mutations/substitutions. The relative substitution frequencies between SARS-CoV-2 and RaTG13 appear to be similar to those between other human coronaviruses and their relatives, consistent with the natural emergence of the SARS-CoV-2 virus (Andersen et al., 2020).
Tables 1 and 2 give a more detailed view of the number of mutations and substitutions in SARS-CoV-2 (Methods 1 and 2 respectively). P-values are based on a two-tailed Fishers exact test.
Most SARS-CoV-2 genomes are currently sequenced using Illumina platforms. To rule out possible sequencing artifacts we performed several subanalysis. First, we searched for mutations in SARS-CoV-2 genomes that were sequenced using Ion Torrent (30 genomes) or Oxford Nanopore (192 genomes). Although the mutation sample size was small, the fraction of G-U mutations was similar to the rest of the data: 13/79 or 16.4% (Ion Torrent) and 30/199 or 15.1% (Oxford Nanopore) G-U mutations. Rayko & Komissarov (2020) have reported a lower transition/transversion ratio in singleton SARS-CoV-2 genomic variations (that can only be seen in one genome submission). As a separate control, we looked at mutations that were identified in two or more independent genome assemblies. This yielded a similar high proportion of G-U mutations (49/360 or 13.6%). Masking of the first and last 100 nucleotides of the alignments or removing all sequences with at least one "N" letter resulted in 16% (188/1,173) or 14.87% (132/888) G-U mutations correspondingly. De Maio et al. (2020) reported other sequencing issues of SARS-CoV-2 genomes, such as particular highly-mutable sites that might be recurring artifacts. However, the number of such reported sites is too low to affect our results as we counted several single nucleotide variations of the same type in one site as the result of one mutation.
We wanted to see if the effect is robust to other subsamples, such as SARS-CoV-2 genomes sequenced in USA or China. For USA genomes, we obtained 15.1% G-U mutations (159/1,053). For Chinese genomes, the mutation sample size was very low, however the effect was similar (20.5% or 17 out of 83 mutations are G-U).
Among the 193 G-U mutations in SARS-CoV-2, 21 are outside of coding regions, 21 are synonymous and 151 are nonsynonymous. We found no nonsense G-U mutations. Coordinates of all G-U mutations are available in Table S2.
There are two remarkable observations regarding the excess of G-U transversions in SARS-CoV-2. One is that it probably reflects a change in SARS-CoV-2 mutation rates after zoonotic transfer to humans, since the proportion of G-U substitutions measured between the SARS-CoV-2 and the bat coronavirus RaTG13 is unremarkable.
The second remarkable feature is that this excess of mutations is asymmetric: there is no similar effect for C-A mutations. SARS-CoV-2 is a positive (+) RNA strand virus. The copying of positive and negative strands of coronavirus RNA is executed by the same enzymes (Sola et al., 2015). If RNA copying was prone to G-U errors when creating the positive strand, the same mechanism would be expected to introduce G-U errors when copying the negative strand, resulting in additional C-A errors on the positive strand. Note that in SARS-CoV-2 the G (19.6%) and C (18.4%) content are similar, as are A (29.9%) and U (32.1%) content. Recently it was suggested that SARS-CoV-2 mutation rates could be affected by variations in its RNA-dependent RNA polymerase (Pachetti et al., 2020), but it is unclear how this could explain the asymmetric increase of G-U transversions.
The most known example of mutation bias is the excess of C-T mutations in the CpG context in the genomes of many animals (Cooper & Krawczak, 1989), although other important mutation contexts exist (Panchin et al., 2011). Usually in such cases, the excess of complementary mutations (such as G-A in the case of CpG context) is also present and is of the same magnitude.