Mutation density changes in SARS-CoV-2 are related to the pandemic stage but to a lesser extent in the dominant strain with mutations in spike and RdRp

Since its emergence in Wuhan, China in late 2019, the origin and evolution of SARS-CoV-2 have been among the most debated issues related to COVID-19. Throughout its spread around the world, the viral genome continued acquiring new mutations and some of them became widespread. Among them, 14408 C>T and 23403 A>G mutations in RdRp and S, respectively, became dominant in Europe and the US, which led to debates regarding their effects on the mutability and transmissibility of the virus. In this study, we aimed to investigate possible differences between time-dependent variation of mutation densities (MDe) of viral strains that carry these two mutations and those that do not. Our analyses at the genome and gene level led to two important findings: First, time-dependent changes in the average MDe of circulating SARS-CoV-2 genomes showed different characteristics before and after the beginning of April, when daily new case numbers started levelling off. Second, this pattern was much delayed or even non-existent for the “mutant” (MT) strain that harbored both 14408 C>T and 23403 A>G mutations. Although these differences were not limited to a few hotspots, it is intriguing that the MDe increase is most evident in two critical genes, S and Orf1ab, which are also the genes that harbor the defining mutations of the MT genotype. The nature of these unexpected relationships warrants further research.


Introduction
COVID-19 (coronavirus disease 2019) is an ongoing pandemic that has been observed in 7,553,182 patients and responsible for 423,349 deaths as of 13 June 2020. It is characterized by respiratory system problems and slow onset fever, and is caused by SARS-CoV-2, a novel betacoronavirus of presumably zoonotic origin. It has first been identified in the Hubei province of China in December 2019, with confirmed human transmission in January 20201,2, and is currently a global concern. COVID-19 has high transmissibility3,4, a capacity of asymptomatic cases to spread the disease5, and poses a high degree of danger to both vulnerable individuals and the healthcare systems via such widespread and invisible transmission6,7. Therefore, despite the disease having a current mortality rate of <6%, lower than similar diseases caused by betacoronaviruses such as SARS or MERS, a fuller understanding of and cure for the underlying pathogen is a high priority for researchers and clinicians everywhere.
Soon after its spread to Europe and the US, a strain of SARS-CoV-2 with two nonsynonymous mutations in the RdRp and Spike (S) proteins, namely 14408 C>T (P323L) and 23403 A>G (D614G) became the dominant form particularly in Europe and to some extent in the US, as well. Claims of increased transmissibility of this new strain due to D614G mutation by Korber et al. was met with caution, as other factors such as founder effect, drift etc. could be responsible for its dominance, as well8. Our previous study suggested that RdRp 14408 C>T mutation is associated with SARS-CoV-2 genome evolution and could even be working synergistically with the 23403 A>G (D614G) mutation9. Although these and other studies provided suggestive evidence that indeed this new strain could have altered phenotypic characteristics, only after mechanistic studies are performed in cell culture and animal models, we will be able to know whether these inferences are true. In this respect, findings of a recent study by Zhang et al. support the increased transmissibility hypothesis for viruses carrying the 23403 A>G mutation: authors pseudotyped retroviruses with the mutant spike protein, infected cells in culture, and compared to those with the wild-type spike10. They observed increased infectivity of the mutant viruses due to increased stability of the spike protein. Other studies also pointed to the importance of the spike protein for the unique properties of the SARS-CoV-211,12, and therefore, its mutations should probably be monitored closely and analyzed in-depth.
In our previous study, where we analysed 11,208 SARS-CoV-2 genome sequences, we showed that RdRp mutations, particularly 14408 C>T mutation, were associated with SARS-CoV-2 genome evolution and higher mutation density9. In the current study, we analyzed the time-dependent changes in the mutation densities of several SARS-CoV-2 genes and asked whether the patterns of change were different between the strain with the 14408 C>T / 23403 A>G mutations compared to those with neither mutation.

Materials and Methods
Genome sequence filtering, retrieval, and preprocessing SARS-CoV-2 isolate genome sequences were obtained from the GISAID EpiCoV database and their variants were annotated as previously described9,13. Briefly, the obtained alignments were filtered for full length genomes with high coverage and to remove environmental or nonhuman host samples. The remaining genomes were edited to mask low-quality base calls and aligned against the reference SARS-CoV-2 genome using the MAFFT multiple sequence alignment program. Variant sites were retrieved using snp-sites, and their effect on peptide sequences were annotated using the ANNOVAR suite of tools. Due to the low quality and gap-heavy content in a majority of isolates, the 5' untranslated regions (1-265) and the 100 nucleotides at the 3' end were also removed from analysis. In addition, we removed isolate genomes without well-defined time of sequencing or geographical location data, to ensure the time and location variables can be clearly associated with the mutations, for a final count of 19,705 genomes.

Mutation density calculation
The annotated variants were separated into synonymous and nonsynonymous groups, with variants in nucleotides not located in gene loci considered synonymous. Gene mutation densities were calculated separately for synonymous and nonsynonymous mutations, by dividing the number of variant sites in the gene locus of an isolate by the length of the locus.
Densities for each full open reading frame, including the surface glycoprotein (S), envelope glycoprotein (E), nucleocapsid phosphoprotein (N), and membrane glycoprotein (M) genes were calculated, as well as the RdRp coding region of ORF1ab and the whole genome.

Statistical Analysis
The categorical variables were analyzed with frequency tables and descriptive statistics were

C>T and 23403 A>G mutations are associated with mutation density increase over time
In order to determine whether the SARS-CoV-2 strain with 14408 C>T / 23403 A>G mutations was different from those that carried neither mutation, with respect to timedependent changes in mutation density, first, we determined the number of synonymous and non-synonymous mutations. The number of sites carrying mutations, after the low quality filters were applied to isolates and the 5' and 3' ends, was 7870, with 16563 potential variants. 11095 of these variants were characterized as synonymous, as they did not alter any known peptide sequence, while 5468 of them were non-synonymous, as they either changed amino acid residues in the translated proteins, or altered the protein length by changing start or stop codon locations. To eliminate the bias that would be caused by the 14408 C>T and 23403 A>G mutations on the nonsynonymous mutation density of isolates they were present in, which would be substantial considering their high frequency, these two nucleotides were also masked during calculations of non-synonymous mutation density.
For further analysis, we focused on the two countries with the highest number of isolates sequenced, the United Kingdom (UK) and the United States (US), which in total contributed 58.76% of all isolates that passed the quality filters. Furthermore, we focused only on isolates that were sequenced after the first genome with both 14408 C>T and 23403 A>G mutations were identified in the respective geographic regions (26 February 2020 for UK, 28 February 2020 for US). Isolates that carried both of the mutations were named "mutant", or "MT" in short, and those that were wild-type at both positions were named "wild-type", or "WT in short; few isolates that carried only one of the mutations was excluded. In total, the isolates were separated into four categories: UK-WT, UK-MT, US-WT, and US-MT.
We calculated the "average mutation density per day per genome" (hereafter referred as mutation density, or MDe for short) for both synonymous and non-synonymous mutations in all four categories, as well as the correlation of MDe with time, using Spearman correlation ( Fig. 1). We identified a strong positive correlation between non-synonymous MDe and time in both UK and US in MT samples (ρ = 0.70, p-value < 0.001), however, a much weaker correlation was observed in WT samples (ρ = 0.27, p-value = 0.002), indicating a potential relationship between non-synonymous mutations and 14408 C>T / 23403 A>G genotype. A similar correlation was identified for synonymous mutations in MT samples, albeit weaker than for non-synonymous mutations (ρ = 0.64, p-value < 0.001). No significant correlation was identified for synonymous mutations in WT samples (ρ = 0.15, p-value = 0.09). One possible explanation for this tight correlation is that MT genomes accumulate both synonymous and non-synonymous mutations over time, whereas WT strains show more variation in their mutation accumulation rate.
However, closer inspection of the plots points to further differences between MT and WT SARS-CoV-2 genomes: Rather than a monotonic change in MDe over time, certain time points seem to be critical milestones at which the mutation densities take sharp turns. One of them is roughly Day 100, which is defined as the 100th day following the first SARS-CoV-2 genome sequenced, and corresponds to 2 April 2020 (Fig. 1  showed differences in non-synonymous MDe between MT and WT genoypes for UK, this number was much higher for US, 9 of 12 genes showed differential non-synonymous MDe.
This marked difference between the two countries is likely due to the differences in the composition of the WT strains, which has much higher variety in the US compared to UK owing to heavier influence of Asian SARS-CoV-2 genomes and multiple independent lines of viral spread around the US15, 16. Indeed, such differences are expected, even in the absence of multiple founders, and justify our approach of comparing results of two countries to reach a consensus.
As non-synonymous mutations are subject to stronger selection than synonymous mutations and could have different patterns, we also determined possible differences in their MDe values between MT and WT SARS-CoV-2 genomes. Only 2 (S, N) of 12 genes showed consistent differences between the two genotypes in both countries (Table S1)

Non-synonymous mutation densities increase in later samples
After characterization of MDE for the time period that MT genotype existed in the UK or US, we next tested whether genome-level differences between MT and WT genotypes in terms of time dependent changes in MDe are also present at gene level. Therefore, we aimed to understand whether those differences at genome level are merely due to possible founder effects, or due to differences between MT and WT strains' accumulation of mutations over time. Since Day 100 appeared to be a turning point, we divided the SARS-CoV-2 genomes into two groups by the time of isolation: early (days 60-100) and late periods (days 101-140), and examined the relationships between the average MDe of the two groups for each gene and country (Figs. 2-3, Table S2). While none of the 12 genes in the WT genotype showed any significant difference in MDe between the early and late periods consistently between UK and US, it was the opposite for the MT genotype: S, RdRp, M, Orf1ab, and Orf3a genes showed increased MDe in the late period compared to the early period. It was particularly evident for the S and Orf1ab genes, as both synonymous and non-synonymous MDe in both genes were significantly increased in the late period in both countries. Interestingly, M gene showed a consistent decrease in the non-synonymous MDe only in the MT genomes, against the common trend of increased MDe in this genotype; however, no such association was observed for the synonymous MDe, it was even the opposite in the UK-MT isolates. On the other hand, the RdRp non-synonymous MDe was increased significantly in the late period in UK-MT isolates, and trended towards increase in US-MT isolates (Mean Rank: 29.76 vs. 38.64, pvalue = 0.058); RdRp non-synonymous MDe was also increased in late period UK-MT isolates, however, there was no such increase in the US isolates, consistent with timeindependent MDe values ( Figure S2), and with the marginally insignificant increase in synonymous MDe in the same isolates. Among changes that were significant for a singlecountry, there was an overall tendency towards increased MDe in MT genomes, but the opposite was true for WT genomes: 12/14 of single-country changes in MT genomes were associated with increased MDe, whereas 12/16 of such changes in WT genomes were associated with decreased MDe in different genes. These findings suggest that the steady increase in MDe specific to MT isolates even after Day 100 cannot simply be explained by founder effects, or linkage. One could speculate that at least some of the WT strains accumulated mutations that reduced their fitness under lock-down conditions and therefore led to their elimination in the late period (101-140 days), whereas the MT strain was fit enough to keep most or all mutations and even add new ones in this period. However, almost equal contribution of synonymous and non-synoymous mutations may point to other factors in addition to possible differences in fitness. Although S and Orf1ab genes seem to be the main drivers of the obverved genome-wide increase the MDe of MT genomes, it seems that other genes are also contributing to this trend.

Isolates with 14408 C>T & 23403 A>G mutations (MT genotype) show steady increase in mutations in both S and RdRp over time
To further investigate whether the shift in MDe happens due to a few distinct sites that are linked with the MT mutations, and could possibly directly affect the host adaptation process, or whether it is caused by a variety of mutations that spread widely both temporally and spatially in the genome, we further divided the early and late time periods into a total of four periods (60-80, 81-100, 101-120, 121-140), and analyzed the distributions of mutations in the S and RdRp coding regions in these four time periods (Figs. 4, S4-5). To do so, we calculated the number of mutations in 100-nucleotide bins per isolate sharing a genotype (MT or WT) in that each time period, and observed their distributions, divided into two groups as synonymous and non-synonymous mutations. We can see that in the 60-80 day period, the distribution of mutations across the regions were largely comparable between WT and MT genomes, with the exception of the synonymous 14805 C>T mutation in the RdRp, which is exclusive to WT isolates (Fig. 4A). The ratio of mutations in MT to WT steadily increase over genes. Therefore, two related questions remain to be answered: 1) Is there any causal relationship between mutation density changes and viral transmission in population? If the answer is 'yes', what is the nature of this relationship? 2) What is the biological basis of differences between the 'mutant' and 'wild-type' strains, if there is any? It is intriguing that the main drivers of the genome-level increase in MDe were two critical genes, S and Orf1a, which are also the genes that harbor the defining mutations of the MT genotype.
It should also be noted that the MDe of the MT genomes started showing signs of decline after Day 130 (2 May 2020), although no obvious change in the epidemiological data is seen around this date. Further studies are warranted that will investigate the relationship between mutations and the clinical phenotypes, as more data will become widely available.
In conclusion, we propose that monitoring the average mutation densities over time and relating to the epidemiological and clinical data may help establish genotype/phenotype relations, and possibly predict the increases and decreases in new cases.     The left barplots show the RdRp coding region, while the right barplots show the S gene.