Analysis of SARS-CoV-2 genome evolutionary patterns

ABSTRACT The spread of SARS-CoV-2 virus accompanied by public availability of abundant sequence data provides a window for the determination of viral evolutionary patterns. In this study, SARS-CoV-2 genome sequences were collected from seven countries in the period January 2020–December 2022. The sequences were classified into three phases, namely, pre-vaccination, post-vaccination, and recent period. Comparison was performed between these phases based on parameters like mutation rates, selection pressure (dN/dS ratio), and transition to transversion ratios (Ti/Tv). Similar comparisons were performed among SARS-CoV-2 variants. Statistical significance was tested using the GraphPad unpaired t-test. The analysis showed an increase in the percent genomic mutation rates in the post-vaccination and recent periods across all countries relative to the pre-vaccination sequences. Mutation rates were highest in NSP3, S, N, and NSP12b before vaccination and increased further after vaccination. NSP4 showed the largest change in mutation rates after vaccination. The dN/dS ratios showed purifying selection that shifted toward neutral selection after vaccination. N, ORF8, ORF3a, and ORF10 were under the highest positive selection before vaccination. The shift toward neutral selection was driven by E, NSP3, and ORF7a in the post-vaccination set. In recent sequences, the largest dN/dS change was observed in E, NSP1, and NSP13. The Ti/Tv ratios decreased with time. C→U and G→U were the most frequent transitions and transversions, respectively. However, U→G was the most frequent transversion in the recent period. The Omicron variant had the highest genomic mutation rates, while Delta showed the highest dN/dS ratio. Protein-wise dN/dS ratio was also seen to vary across the different variants.
IMPORTANCE To the best of our knowledge, there exists no other large-scale study of the genomic and protein-wise mutation patterns during the time course of evolution in different countries. Analyzing the SARS-CoV-2 evolutionary patterns in view of the varying spatial, temporal, and biological signals is important for diagnostics, therapeutics, and pharmacovigilance of SARS-CoV-2.

The S protein is responsible for SARS-CoV-2 attachment and entry by binding to ACE2 (7-10). Continuous mutations in the Spike increase virus adaptability for escaping vaccine treatment, resulting in high survival rates and spread of the virus (11-14). Multiple mutations in the Omicron variant S protein may also influence its interaction with ACE2, thereby leading to antibody escape (15,16). Mutations in the S region of SARS-CoV-2 have led to epitope loss, resulting in escape from the vaccine treatment, the most frequent mutation being D614G (17,18). D614G is also found in the S region of clades G/GR/GRY/GH/GV, which have a high human host infectivity rate due to efficient transmission (19,20).
The SARS-CoV-2 genome contains 14 ORFs. Of these, ORF1a encodes the NSPs 1-11, while ORF1b encodes NSP12-16. Together, NSPs 1-16 form the replicase-transcriptase complex. This is followed by 13 ORFs encoding the four main structural proteins, namely, S, E, M, and N, interspersed by 9 accessory factors (21). Various studies have been performed on genomic mutations in the SARS-CoV-2 virus. A dN/dS ratio > 1 indicates positive selection and has been reported in the S glycoprotein (22,23). Comparative analysis of SARS-CoV-2, SARS-CoV, and MERS-CoV showed a positive evolution model along with higher dN-dS due to the dominance of dN (24). Studies of mutations in the diagnostic targets in COVID-19 have suggested that the N gene has the highest number of mutations (25,26).
Analysis of 469 genome sequences from Indian patients led to the identification of 536 dN and dS mutations in six genes: ORF1ab, S, N, ORF3a, ORF7a, and ORF8 (27). A broad analysis showed 33 different mutations in 837 Indian SARS-CoV-2 whole-genome sequence isolates, of which 18 were unique to India. S, N, NSP3, NSP12, and NSP2 coding genes showed novel mutations, and dN was found to exceed dS by approximately threefold (28). Modeling of the epidemic with different strains and mutations showed the emergence of a virus with higher transmissibility and evolutionary adaptations (29).
In this study, we calculated the genomic rates of mutation and dN/dS ratios in SARS-CoV-2 genome sequences taken from seven countries and showed that they increased with time, whereas the Ti/Tv decreased. Similarly, these parameters were also estimated for different known SARS-CoV-2 variants, where Omicron variant sequences showed the highest mutation rate as compared to the other variants, but Delta showed the highest dN/dS ratio. The highest mutating proteins, along with their individual dN/dS ratios and mutation rates, were determined within each country. NSP3, S, N, and NSP12b had the highest genomic mutation rates in both the pre-vaccination and post-vaccination phases. NSP3 showed the highest genomic mutation rates before vaccination and was replaced by S after vaccination, while a significant rise was observed in the NSP4 genomic mutation rate. N, ORF8, ORF3a, and ORF10 were under strong positive selection pressure before vaccination, whereas after vaccination, E, ORF7a, and NSP3 showed the highest increase in the dN/dS ratio. The recent sequences showed the highest dN/dS ratios in E, NSP1, and NSP13. The estimated properties showed similar patterns in genetic variability across all geographic regions despite the use of different vaccine technologies. This implies that the forces of evolution have been uniform across multiple parameters. However, there is a definite change in the pattern of mutations with time. While the highly mutating and positively selected genes are important for pharmacovigilance and vaccination, the negatively selected ones are important for diagnostics. This is the only study that has compared these different mutational parameters comprehensively with time and geographical regions, and in different variants both at the whole genome and gene level.

Collection of genomic sequences of SARS-CoV-2
To perform a critical analysis of the data, genome sequences of SARS-CoV-2 were retrieved in FASTA format for different countries using the GISAID database (30,31). The countries taken into consideration included India, England, Canada, Italy, France, USA (Washington), and the Netherlands. Additionally, genome sequences of different SARS-CoV-2 variants like Alpha, Beta, Gamma, Lambda, Mu, Delta, and Omicron were also retrieved.
The accuracy of genome sequences retrieved from the database was ensured using filters like complete sequence, high coverage, and complete collection date. The sequences downloaded were high coverage, having <1% Ns (unidentified nucleotides) and <0.05% unique amino acid mutations. In all the sequences, insertions and deletions were accepted only when verified by the submitter.
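The <1% N criterion described above is applied by the GISAID filters themselves, but it can be re-checked locally on downloaded FASTA files. The following is a minimal sketch, not the procedure used in this study; the function names `read_fasta`, `fraction_ambiguous`, and `passes_n_filter` are hypothetical, and only the 1% threshold comes from the text.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} mapping."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


def fraction_ambiguous(seq: str) -> float:
    """Return the fraction of positions that are 'N' (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    return seq.upper().count("N") / len(seq)


def passes_n_filter(seq: str, max_fraction: float = 0.01) -> bool:
    """True when strictly fewer than max_fraction of the bases are N."""
    return fraction_ambiguous(seq) < max_fraction
```

A sequence would be retained only when `passes_n_filter` returns True; the other GISAID criteria (complete sequence, complete collection date, submitter-verified indels) are metadata checks and are not modelled here.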

Classifying the genome sequences retrieved
SARS-CoV-2 genome sequences submitted in the GISAID database from the seven countries were downloaded from three distinct periods. The genome sequences were classified into three categories as follows:
a. The sequences considered pre-vaccination sequences had a sample collection date at least 1 month before the start date of vaccination, taken from the WHO website and from the Government vaccination websites of the countries concerned. The data sets were taken in triplicates.
b. The start date of vaccination was different for each country considered. After the start of vaccination, a buffer period is required for the vaccine to reach the population. Therefore, sequences collected 5-6 months after the start date of vaccination were considered post-vaccination sequences. The data sets were taken in triplicates.
c. A set of sequences that constitute a recent data set was also taken. As fewer recent sequences were available for many countries, triplicate data sets for this period could not be taken.
d. Additionally, approximately 1,000 sequences of each SARS-CoV-2 variant were obtained using the "Variant" filter of GISAID.
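The date-based rules in (a) and (b) can be illustrated with a short sketch. It is an approximation under stated assumptions: "1 month" is taken as 30 days and "5 months" as 150 days, the function name `classify_phase` is hypothetical, and the recent data set (selected by a fixed calendar window) and the variant set are not modelled.

```python
from datetime import date, timedelta


def classify_phase(collection: date, vaccination_start: date,
                   pre_gap_days: int = 30,
                   post_min_days: int = 150) -> str:
    """Assign a sample to a phase from its collection date.

    pre_gap_days and post_min_days are assumed day-count equivalents of
    "at least 1 month before" and "5 months after" the vaccination start.
    """
    if collection <= vaccination_start - timedelta(days=pre_gap_days):
        return "pre-vaccination"
    if collection >= vaccination_start + timedelta(days=post_min_days):
        return "post-vaccination"
    # Samples inside the buffer window belong to neither data set.
    return "excluded"
```

For example, with a hypothetical vaccination start of 16 January 2021, a sample collected on 1 December 2020 falls in the pre-vaccination set, one collected on 1 January 2021 is excluded, and one collected on 1 July 2021 falls in the post-vaccination set.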

Sequence alignment and mutational data analysis
The first reported sequence, Wuhan-Hu-1 (Accession ID: NC_045512), was taken as the reference for pre-vaccination, post-vaccination, and recent sequences. Thus, the Wuhan-Hu-1 sequence was added as a reference for the alignment. The mutation rates of all the genomic sequences were calculated using the COVID-19 genome annotator, an online web-based tool that aligns the input sequences with the "nucmer" alignment tool and processes the alignment output using UNIX along with R scripts (32). The aligned sequences were compared with the reference genome (Wuhan-Hu-1) and amongst each other to find information about the mutational sites. The Jupyter notebook (33) was used in association with the "pandas" Python library (34) [...] to the length of the data set taken, "gs" refers to the length of the genome sequence and the total no. of mutations estimated with respect to number of single nucleotide mutations (35). Mutation rate is the measure that refers to the frequency of mutations per generation in the population or in an organism (36).
b. Selection pressure has been calculated as the dN/dS ratio. If the dN/dS ratio exceeds unity (dN/dS > 1), the mutations are said to be occurring under positive selection, which promotes the accumulation of beneficial mutations, whereas if the dN/dS ratio is below unity (dN/dS < 1), the mutations are said to be occurring under negative selection, which favours the selective removal of deleterious alleles (37,38).
The Ti/Tv in the RNA virus has been calculated as the ratio of Ti (C↔U and A↔G) to Tv (A↔C, A↔U, G↔C, and G↔U), determined for the pre-vaccination, post-vaccination, and recent sequence groups using the data obtained from the COVID-19 genome annotator. The mutation table data from the genome annotator provided the refvar (sequence at the mutation site on the reference genome) and qvar (sequence at the mutation site on the sample sequence) mutation information.
The number of Ti mutations C-U, U-C, A-G, and G-A was counted using Microsoft Excel. Similarly, C-A, A-C, U-A, A-U, G-U, U-G, C-G, and G-C changes were counted as the total number of Tv. Therefore, the Ti/Tv ratio was calculated as:
$$\mathrm{Ti/Tv} = \frac{\text{Total no. of transitions}}{\text{Total no. of transversions}}$$
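The counting step above was carried out in Microsoft Excel; an equivalent calculation can be sketched in Python as follows. This is an illustration only: it assumes the refvar and qvar values have already been read into (reference, sample) pairs, treats T as U, and skips anything that is not a single-base substitution. The names `ti_tv_ratio` and `normalise` are hypothetical.

```python
# The four transitions (purine<->purine or pyrimidine<->pyrimidine);
# every other single-base change is a transversion.
TRANSITIONS = {("C", "U"), ("U", "C"), ("A", "G"), ("G", "A")}
BASES = set("ACGU")


def normalise(base: str) -> str:
    """Upper-case a base and write T as U, as is usual for RNA viruses."""
    return base.upper().replace("T", "U")


def ti_tv_ratio(pairs):
    """Return (Ti count, Tv count, Ti/Tv) for (refvar, qvar) pairs."""
    ti = tv = 0
    for ref, alt in pairs:
        ref, alt = normalise(ref), normalise(alt)
        # Skip indels, ambiguous bases, and non-changes.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions: Ti/Tv is undefined")
    return ti, tv, ti / tv
```

Passing the pairs for one sequence group (for example, all pre-vaccination mutations) gives the group-level Ti/Tv ratio defined by the formula above.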

Statistical analysis
The statistical significance of the differences between the means was assessed. The unpaired t-test was performed on the % mutation rates, selection pressure, and Ti/Tv ratios using GraphPad QuickCalcs (https://www.graphpad.com/quickcalcs/ttest1.cfm).
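For readers who wish to reproduce the test outside GraphPad, the statistic behind an unpaired t-test can be sketched with the Python standard library. The sketch assumes the classical equal-variance (pooled) form of the test; whether this matches the exact option used in this study is an assumption. The function name `unpaired_t` is hypothetical, and the P-value is not computed here because it requires the t-distribution (available, for example, through `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance
    unpaired t-test comparing the means of two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    df = n_a + n_b - 2
    # Pooled estimate of the common variance (sample variances, n - 1).
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    if pooled_var == 0:
        raise ValueError("zero variance in both samples")
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df
```

In the setting described here, `sample_a` and `sample_b` would hold the per-country values of one parameter (for example, % mutation rate) for two of the three phases.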

RESULTS AND DISCUSSION
In this work, we have carried out a comparison of the SARS-CoV-2 genome sequences from seven different demographic regions, namely, India, France, England, Canada, Italy, Netherlands, and USA (Washington, D.C.). A total of 74,870 retrieved sequences were classified into three different time periods:
a. Pre-vaccination phase: Broadly, the pre-vaccination sequences ranged from January 2020 to December 2020 (GISAID Identifier: EPI_SET_230517nt).
b. Post-vaccination phase: The post-vaccination sequences ranged from May 2021 to April 2022 (GISAID Identifier: EPI_SET_230517va).
c. Recent period: The recent period sequences were taken from June 2022 to December 2022 (GISAID Identifier: EPI_SET_230518oq).
d. Additionally, a total of 7,209 sequences of different variants were also analyzed. As new variants emerged sequentially over time, collection dates of the sequences considered for the SARS-CoV-2 variants ranged from November 2020 to January 2022 (GISAID Identifier: EPI_SET_230518xv).
The sequences were compared for % mutation rates, dN/dS ratio, and Ti/Tv ratio. The analysis was carried out for the whole genome as well as for every individual gene.

Evolutionary patterns in different geographical locations and time periods
a. Increase in mutation rates: The % genomic mutation rates estimated for each country in the pre-vaccination, post-vaccination, and recent periods are listed in Table 1. The genomic mutation rate was (0.04 ± 0.02)% in the pre-vaccination phase and increased to an average of (0.17 ± 0.05)% in the post-vaccination period. A comparable increase was seen in all seven countries studied. The mutation rates for sequences from the recent period were higher still, at (0.28 ± 0.01)%, consistently across all seven countries. Thus, relative to the pre-vaccination sequences, the % genomic mutation rates increased on average three- to fourfold after vaccination and six- to sevenfold in the recent period. Figure 1a compares the percent mutation rates within the countries in the pre-vaccination, post-vaccination, and recent periods. Despite the large variation in mutation rates within each country, there was a definite increase in the post-vaccination sequences, and the recent sequences showed a still higher trend in all seven countries. The significance of the difference between the means was verified using the unpaired t-test (95% confidence interval) and is shown in Table 2. The differences in percent genomic mutation rates between the three phases were found to be extremely statistically significant, with a two-tailed P-value of less than 0.0001.
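As a rough check on the fold increases quoted above, the ratios of the reported mean rates can be computed directly. This is only an illustration: it uses the three overall means from the text, whereas an average of per-country fold changes (Table 1) need not equal the ratio of the means. The function name `fold_change` is hypothetical.

```python
def fold_change(later: float, baseline: float) -> float:
    """Return how many times larger `later` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return later / baseline


# Mean % genomic mutation rates reported in the text.
pre, post, recent = 0.04, 0.17, 0.28

post_fold = fold_change(post, pre)      # roughly a fourfold increase
recent_fold = fold_change(recent, pre)  # roughly a sevenfold increase
```

The ratio of means comes out at about 4.25 for the post-vaccination period and about 7 for the recent period, which is of the same order as the ranges stated in the text.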
(Table note: The prominent results reflecting change in mutation rates (%), dN/dS, and Ti/Tv ratio from the pre-vaccination period for each country are shown in bold along with the overall change.)
In RNA viruses, the lack of proofreading by RNA polymerases accounts for their high mutation rates. Natural selection further results in increased adaptability and faster replication of viruses owing to their high mutation rates (40,41). However, the highly conserved nsp14 in the Coronaviridae family has a proofreading function that may be a crucial factor for variation in the large and complex viral genome (42). Absolute estimates of mutation rates in coronaviruses ranged from 0.67 to 1.33 × 10⁻⁵ per site per year in infectious bronchitis virus to 0.44 to 2.77 × 10⁻² per site per year in mouse hepatitis virus.
Comparatively, the global SARS-CoV-2 mutation rate was estimated to be moderate at 6 × 10⁻⁴ per site per year (43). The more than twofold increase in mutation rates with time in SARS-CoV-2 observed in this study would bring the genomic mutation rates on par with those reported for non-coronaviruses and other RNA viruses like Influenza virus and HIV-1 (44).
• Mutation rates in SARS-CoV-2 proteins: The pattern of mutations in proteins within the SARS-CoV-2 genome sequences was analyzed for each protein in the pre-vaccination, post-vaccination, and recent period (Table 3). As shown, NSP3, S, N, and NSP12b coding regions had the highest % mutation rates in the pre-vaccination period. There was little variation in the highest mutating genes among the countries studied. Previously, NSP3 was reported as the mutation hotspot of the SARS-CoV-2 genome and was mostly found to be co-mutating with NSP12b due to its involvement in replication and transcription complexes (45-49). In the post-vaccination and recent periods, the highest mutating genes remained the same, but S uniformly had the highest % mutation rates, followed by N and NSP3. From the pre-vaccination sequences onward, S showed a gradual increase in the % mutation rates up to the recent period across all the countries. S contains the receptor-binding domain (RBD), which is highly variable, and through its mutational changes, it is known to affect viral replication and transmission and is also involved in immune escape (50,51). In comparison with S, N has lower mutation rates and is comparatively stable (52). N is considered a vital hotspot for mutations, mainly in its serine-rich domain, due to its involvement in viral replication and packaging (53,54) and can be considered a novel target for vaccine design (55). Comparison of mutation patterns of all the proteins in Table 4 showed that NSP4 had low mutation rates before vaccination but rose to be the fifth
highest mutating protein post-vaccination. In recent sequences, it uniformly occupied the fourth position and overtook NSP12b in all the countries. NSP4, along with the NSP6 and NSP3 transmembrane proteins, induces the development of double membrane vesicles (DMVs) by reorganizing the endoplasmic reticulum of the host cell, and therefore it may be co-mutating with NSP3 (56,57). In contrast, ORF10, NSP7, and NSP10 were the lowest mutating in all three periods for all the countries taken into consideration. This was in line with previous studies where they were found to be conserved within the SARS-CoV-2 genome with few or no mutations (47, 58-60).
b. Increasing selection pressure: The dN/dS ratios in each country for the pre-vaccination, post-vaccination, and recent periods are listed in Table 1. The global dN/dS ratio in the SARS-CoV-2 genome during the pre-vaccination period was calculated as 0.48 ± 0.06, thus indicating a purifying selection pressure. The observed value is similar to previous studies showing that the SARS-CoV-2 genome depicts overall purifying selection pressure (61-63) and to a study reporting an overall SARS-CoV-2 genome dN/dS of around 0.55 (61). The dN/dS ratio increased to 0.99 ± 0.11 in the post-vaccination period in various demographic regions, suggesting that SARS-CoV-2 genomes were under overall neutral selection. The recent period sequences showed an average dN/dS of 0.90 ± 0.05. The comparison of dN/dS ratios in all the countries for the pre-vaccination, post-vaccination, and recent periods is shown in Fig. 1b. There was more variation in the dN/dS value of sequences taken from the post-vaccination period than before vaccination. Taken together, a definite increase in dN/dS is seen in post-vaccination sequences, but the data from recent sequences were inconclusive because of the small sample size.
The differences in the mean dN/dS ratios were found to be extremely statistically significant between the three phases, with an unpaired two-tailed t-test P-value of less than 0.0001 (Table 2).
The findings are in line with previous studies suggesting significant purifying selection which has been the primary factor driving the SARS-CoV-2 evolution and ongoing strong positive selection observed within certain sites of the SARS-CoV-2 genome (64).In a previous study, it has been suggested that increase in the selection pressures can cause increase in the mutation rates measured (65), which is also justified in our study as both mutation rates and dN/dS ratios are increasing with time.Combined with the persistence of recurrent infections in immunocompromised individuals, this may have induced the selection of viruses with lower pathogenicity or virulence and higher transmission (66).
• Selection pressure in SARS-CoV-2 proteins: The dN/dS ratios of the proteins in all the countries are shown in Table 5. SARS-CoV-2 proteins with the strongest positive selection pressures in the pre-vaccination period were N, ORF8, ORF3a, and ORF10. In a change from the pre-vaccination period, the post-vaccination period was marked by a substantial increase in E, ORF7a, and NSP3. In the recent phase sequences, E, NSP1, and NSP13 had the highest dN/dS ratios. A continuous trend of positive selection was identified in S in all three phases. Previous studies during the year 2020 identified ORF10 to be under strong positive selection, along with ORF3a, ORF8, and N, which also showed a much higher number of dN (62,67,68). ORF8 is involved in host-pathogen interactions through its nine encoded proteins (69), ORF3a is involved in interference with host ion channels by encoding viroporin (70), while overexpression of ORF10 is known to downregulate IFN-1 expression, leading to suppression of the antiviral innate immune response (71). However, most viral proteins are multifunctional, and their other roles are likely to be discovered in future. ORF7a is responsible for triggering the NF-kappa B pathway and pro-inflammatory expression of cytokines (72). Also, evidence of strong positive selection pressures was observed in NSP3 in a 2021 study (73). NSP3 and its interaction with N are known to affect SARS-CoV-2 replication and its pathogenesis (74). In the pre-vaccination phase, NSP1 was under negative selection pressure with dN/dS < 1. A previous study likewise found only negatively selected sites in NSP1, which is thought to be evolutionarily conserved because of its functional requirement for binding the host ribosomal complex (75,76). E and S indicated strong positive selection pressures in a 2021 study (73). E encodes a small multifunctional protein that has an important role in host and virus interaction through ion-channel activity. Mutations within E are associated with a
reduction in virulence (77,78). NSP13 has also shown strong evidence of positive selection, which may be explained by its role in inhibiting antiviral immunity by hijacking the host deubiquitinase USP13 and suppressing the type I IFN response through contact with TBK1 (64,79). S has shown continuous strong positive selection in our study in all three phases, as supported by other studies (64,80). NSP1 showed a large jump in the dN/dS ratio with a drastic increase in the number of dN after vaccination. NSP1 is suggested to suppress the host innate immunity by interacting with the 40S ribosomal subunit (76,81,82). NSP9 was observed to be the most common gene having dN/dS < 1, showing purifying selection. NSP9 is said to be highly conserved and involved in viral RNA synthesis (83).
c. Decrease in Ti/Tv ratio: The estimated Ti/Tv ratios for each country in the pre-vaccination, post-vaccination, and recent periods are listed in Table 1. The Ti/Tv ratio in the pre-vaccination data set was found to be 3.77 ± 1.10. The value is in agreement with the range of Ti/Tv of 2.0 to 5.5 calculated in previous studies (84,85). The Ti/Tv ratio decreased in the post-vaccination period to 2.00 ± 0.12 and remained similar in the recent period sequences, i.e., 2.15 ± 0.05. The differences in the mean Ti/Tv ratios were found to be highly statistically significant between the three phases, with a P-value of less than 0.0001 (Table 2). [...] related to an increase in dN. Thus, a decrease in Ti/Tv ratio is correlated with an increase in the dN/dS ratio. Previous studies have shown that Ti saturate more rapidly with time in comparison with Tv, thus causing Ti/Tv ratios to decline with evolutionary time (86-89).
• Effect of Ti and Tv patterns on CpG ratios: The percentages of Ti and Tv in the pre-vaccination, post-vaccination, and recent periods are shown in Fig. 2a through c, respectively. The most frequent Ti and Tv observed were C→U and G→U, respectively, in the pre- and post-vaccination periods. The C→U and G→U substitutions result in the depletion of CpG content. In the post-vaccination period, the C→U substitutions decreased by 12% from the pre-vaccination period, whereas in the recent period, a slight decrease of 4% was observed from the post-vaccination period. In comparison with C→U, G→U substitutions declined at a slower rate, i.e., a 5% reduction post-vaccination and about a 7% reduction in the recent period from the post-vaccination period. In the recent period sequences, C→U remained the most frequent Ti, whereas U→G was the most frequent Tv. A previous study that evaluated the sequences of the period January 2020 to March 2021 reported a drastic increase in C→U and G→U substitutions in the initial phase of SARS-CoV-2 infection, which eventually stabilized and then decreased till March 2021 (90). Table 1 shows the %GC content in the genome sequences from each country for the pre-vaccination, post-vaccination, and recent periods. It can be observed that the %GC content reduced slightly after vaccination, whereas a negligible reduction was seen in the recent period genome sequences. As a result, the overall decline in GC content is minimal and appears to be stabilizing. This is supported by another study, which depicted minute variations in the CpG content in the SARS-CoV-2 genome sequences and concluded that CpG content reduced at a faster rate in the initial time of evolution, which may slow down to become steady with the evolution in the human host (91). Thus, these mutations were attributed to rapid evolution after transmission to the human host (92-94). The differences in the mean % GC content were found to be highly significant, with a two-tailed P-value of
less than 0.0001 (Table 2). It has also been reported that the CpG motifs are lost in the SARS-CoV-2 sequences, which may enable antiviral response escape through TLR7 (91,95). Higher U→G has been reported in the SARS-CoV-2 genome sequences collected between January 2022 and July 2022 (96). This further suggested that the amount of C→U and G→U substitutions is likely to decrease with increasing divergence, indicating that some portion of these mutations is removed by purifying selection, irrespective of the origin of SARS-CoV-2 (90,97). Reduction in C→U and G→U substitutions together with the increase in U→G may eventually lead to an increase in the CpG content in the SARS-CoV-2 genome. Higher CpG content has been linked to the attenuation of the virus (98,99).
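The link between these substitutions and CpG content can be made concrete with a small sketch. It is illustrative only and is not the procedure used to produce the %GC values in Table 1; the function names `gc_percent` and `cpg_count` are hypothetical, and ambiguous bases are simply left out of the denominator.

```python
def gc_percent(seq: str) -> float:
    """Return %GC over the canonical bases (A, C, G, T/U) of a sequence."""
    s = seq.upper()
    canonical = sum(s.count(base) for base in "ACGTU")
    if canonical == 0:
        raise ValueError("sequence contains no canonical bases")
    return 100.0 * (s.count("G") + s.count("C")) / canonical


def cpg_count(seq: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G)."""
    return seq.upper().count("CG")
```

For instance, the toy sequence `ACGCGU` contains two CpG sites; a C→U change at either C (giving `AUGCGU`) or a G→U change at either G (giving `ACUCGU`) leaves only one, and each change also lowers `gc_percent`, which is the sense in which C→U and G→U substitutions deplete CpG content.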

Evolutionary patterns across different SARS-CoV-2 variants
The genomic mutation rates (%), dN/dS and Ti/Tv ratios were also determined for each SARS-CoV-2 variant (shown in Table 6).
a. Mutation rate (%) analysis: The variant-wise genomic mutation rate analysis suggests that the Omicron variant has the highest mutation rate among all variants (0.22%), followed by Delta (0.14%). The Beta variant displayed the lowest mutation rate (0.10%). Previous studies showed Omicron to have a high genomic mutation rate along with the accumulation of the highest number of mutations (100-102).
• It has been suggested that the S mutations are positively selected to generate new variants of SARS-CoV-2 that have improved overall fitness (66,103). To evaluate this, the average number of mutations, mutation rates, and dN/dS ratios were determined for the S region in each variant. The average number of S mutations was calculated as a percentage of the total number of mutations in the genome. The resulting S mutation percentages for Alpha, Beta, Delta, Gamma, Lambda, Mu, and Omicron were 29.4%, 30.5%, 22.7%, 33.7%, 25.3%, 26.3%, and 47.8%, respectively. As seen in Table 6, the highest mutation rates in the whole genome as well as in S are seen in the Omicron variant. S mutation rates of all the variants are similar except for Omicron, which has the highest S mutation rate of 0.1. This has also been observed in a previous study (100). The mutation rate patterns for proteins in each SARS-CoV-2 variant are depicted in Table 7.
b. dN/dS ratio analysis: As dN/dS ratios exceed unity for each SARS-CoV-2 variant, positive selection is responsible for promoting dN in the sequences. From Table 6, the average genomic dN/dS of all SARS-CoV-2 variants is estimated to be around 0.86 compared to the Wuhan reference strain, which is similar to another study that estimated it to be 0.8446 (104). Also, it was observed that the Delta variant had the highest dN/dS ratio (1.43), showing positive selection, followed by the Omicron variant (1.01), showing neutral selection. Although the number of mutations is found to be higher in the Omicron variant, the ratio of dN/dS is higher in the Delta variant. The Delta variant was found to be highly virulent in comparison to the Omicron variant, which evolved to be more transmissible and less virulent (105-107). Therefore, the genomic dN/dS seems to correlate well with increased virulence. It has been argued that as virulence hinders transmission between hosts, dN/dS ratios may work to decrease virulence, thereby increasing transmission between hosts (108,109).
• Virulent genes have been observed to be highly subject to dN and, therefore, are under strong positive selection (110). Therefore, the protein-wise contribution to dN/dS was also determined in each of the variants and is shown in Table 8. The unique genes having the highest dN/dS in the highly virulent Delta variant were ORF7b and ORF7a, while the unique genes contributing most to dN/dS in the Omicron variant were E, NSP14, and NSP1. The accessory protein ORF7b has been reported to mediate the cellular apoptosis caused by Tumour Necrosis Factor α (111). The ORF7a protein initiates autophagy and helps in virus replication (112). Taken together, these proteins impair the host cell immune response and cellular function (113).
• The dN/dS ratio of S was greater than 1 for all the variants. This is in line with previous findings suggesting that the S protein coding gene had a dN/dS ratio greater than 1 in all the SARS-CoV-2 variants (114). The Alpha variant S protein showed the highest dN/dS ratio, whereas the Lambda variant S protein had the lowest. However, dN/dS values of both the genome and S are comparatively lower in the Omicron variant. Thus, there seems to be no correlation between dN/dS ratios of S and genomic dN/dS ratios, virulence, or transmission. That the Omicron and Delta variants have the highest and the lowest S mutations, respectively, was also remarked in a previous study (114). The contribution of the SARS-CoV-2 proteins toward the mutation rates and dN/dS ratios in each variant is shown in Tables 7 and 8, respectively.
c. Ti/Tv ratio analysis: The Lambda variant was found to have the lowest dN/dS ratio but the highest Ti/Tv ratio. A negative correlation has previously been observed between Ti/Tv ratio and dN/dS ratio under positive selection in the comparative analysis of SARS-CoV, SARS-CoV-2, and MERS-CoV for all the nucleotide substitution models (24). This could be explained by the fact that Tv substitutions favour dN. Previous studies also suggest that a higher number of Ti, as in the Lambda variant, would imply fewer dN (86-89).

Conclusion
To the best of our knowledge, this is the only study that has compared the mutation rates, dN/dS ratios and Ti/Tv ratios during pre-, post-vaccination, and the recent periods in different geographical areas after the emergence of COVID-19.While individual studies find support from literature, this remains a comprehensive study on genomic parameters of SARS-CoV-2 in different regions and time periods.The mutation rates were observed to be increased from the before vaccination period to the recent period in each country.The dN/dS ratio has increased with time, signify ing accumulation of non-synonymous mutations favoring dN.However, the Ti/Tv ratio depicted significant decrease over time and may be correlated with viral attenuation as suggested earlier.Based on these parameters, the unpaired t-test helped in confirming that differences amongst the three phases were extremely statistically significant (P-value < 0.0001).As S is the main target for the vaccine, it has been proposed that mutations in this protein are driven by natural selection (66).While the number of mutations was observed to be increased in S in the Omicron variant, most of them were dS mutations.
A higher dN/dS ratio is also seen in another structural protein N. A 2020 communica tion proposed N as a vaccine target in view of its low mutation rates (115).Indeed, the current diagnostic kits also target N recognition.This may need to be re-evaluated in the light of the high mutation rates as well as dN/dS ratios of N as observed in our study.Conversely, ORF10, NSP7, and NSP10 coding genes had the lowest mutation rates across all the considered demographic regions and could further be targeted and explored for SARS-CoV-2 diagnostics and therapeutics.
Amongst the SARS-CoV-2 variants, Delta had the highest dN/dS ratio and Omicron had the highest mutation rates. The high dN/dS ratio of the Delta variant correlates with its high virulence in comparison with the Omicron variant, which is less virulent but has higher transmissibility. It was also observed that the ordering of gene-wise dN/dS ratios from highest to lowest is unique to each variant genome.
The possible reasons for the observed increase in genomic mutation rates and selection pressure, along with the implications for vaccine efficacy against variants, were considered. It was noted that:
a. Vaccination is expected to reduce the number of infections, thus limiting the number of variants that the virus can explore (116). High rates of mutation may facilitate viral immune evasion and reduce the efficacy of the vaccine against infection and transmission, finally leading to reduced protection of the vaccine against several disease outcomes (117-119). The SARS-CoV-2 vaccines have indisputably lowered the COVID-19 disease burden in terms of infection, severity, and mortality (120-122). Therefore, it is likely that similar or higher mutation rates of the SARS-CoV-2 genome might have been recorded in the absence of a vaccine.
b. The premise that increases in genomic mutation rates and selection pressures cause increased virus fitness and immune escape variants has also been debated, as most of the mutations are deleterious (123, 124). It has been remarked that high mutation rates may even hinder the emergence of new highly adapted viruses, as they do not allow viruses with advantageous genotypes to linger long enough to become fixed in a viral population (125, 126). Other causes for high mutation rates in RNA viruses have been explored, such as selection for a more robust viral population or a faster replication time (127).
c. Differences in evolutionary parameters such as mutation rates and selection pressures in the SARS-CoV-2 genome before and after vaccination, as observed in this study, can be attributed to a range of factors associated with the virus, host immunity, or the environment. Furthermore, host immunity arising from natural causes as well as from vaccination is expected to affect the evolution. The generation of new mutations in a viral population is not favored when the virus is well adapted to its surroundings, since most mutations become harmful due to purifying selection. So, mutations that become more common can either be neutral and fixed by genetic drift or beneficial and fixed by positive selection (128).
d. Pathogen fitness is characterized by an increased rate of spread in a population and has been described as a combination of increased infectivity, increased transmissibility, and a long infectious period. The public availability of genome sequences of SARS-CoV-2 variants provides a unique window to examine the changes in genomic parameters vis-à-vis the pathogen fitness. Pathogens can adapt to immunity by many mechanisms, and such adaptation depends both on the new variants available in the environment at that time and on their fitness in the host type. The over-representation of a variant in primed hosts indicates that it is immunity-adapted. The evolution of the pathogen during vaccination campaigns has been modelled (129) by dividing it into two phases:
• There is an initial phase during which most of the population is immunologically naïve. This first short phase favors the selection of generalist immunity-adapted variants that are equally effective against naïve and primed individuals.
• During the later second phase, most of the population acquires immunity either by natural infection or by vaccination. This second phase favors specialist variants that are immunity-adapted to primed hosts.
The Alpha, Delta, and probably Omicron variants can be classified as generalists based on epidemiological studies. It is predicted that they would have spread regardless of vaccination. In the future, specialist variants are predicted to appear, whereas immune-facilitated variants are rare (129-131). However, the time required by the virus to acquire the number of mutations needed for adaptation to natural or vaccine-acquired immunity cannot be predicted a priori.
e. Vaccine-driven evolution is noted to occur in pathogens where the effects of the vaccine do not suppress infection, replication, or transmission sufficiently (132). Recent evidence for SARS-CoV-2 vaccines shows that the vaccine is not 100% effective against infection and disease severity. Furthermore, this effect has waned with time and with the emergence of new variants like Omicron (130). While an array of diverse sub-lineages of the Omicron variant has now emerged, with the BQ and XBB sub-variants showing increased evasion of neutralizing antibodies (133), COVID-19 vaccines also elicit a specific T-cell response that may have an additional role in host protection (134). Parallels have also been drawn between these and the human influenza vaccines that are seasonally updated due to ongoing antigenic drift (129, 134). Most of the novel variants, in this case, remain partially inhibited by vaccination (135).
f. While the currently approved SARS-CoV-2 vaccines continue to provide significant protection against severe disease and death (136), continuous monitoring and tracking for the emergence of new, fitter SARS-CoV-2 variants, combined with molecular epidemiological surveillance, is required.

FIG 1 Comparison of mutation rates, dN/dS, and Ti/Tv within the countries in the three phases. (a) Mutation rates with respect to before vaccination (green), after vaccination (blue), and recent period (red); (b) selection pressure (dN/dS ratio) with respect to before vaccination (violet), after vaccination (orange), and recent period (gray); (c) transition to transversion (Ti/Tv) ratio with respect to before vaccination (yellow), after vaccination (pink), and recent period (brown). Black margins in the graph represent the standard deviation in the before- and after-vaccination periods.
Table column headers: Country; Before vaccination (Jan 2020-Dec 2020); After vaccination (May 2021-April 2022); Recent period (June 2022-Dec 2022)

FIG 2 Percentage proportions of transition and transversion mutations in the countries in the three phases. (a) Before vaccination (pre-vaccination period), (b) after vaccination (post-vaccination period), and (c) recent period.
A mutation table was obtained after the multiple sequence alignment was performed by the COVID-19 genome annotator. The table included the varclass information, i.e., the type/class of mutation that occurred in the input sample sequences. Under varclass, the mutations described as SNP were taken as observed non-synonymous mutations (N_m), whereas those described as SNP_silent were taken as observed synonymous mutations (S_m). Further, the selection pressure dN/dS was determined by normalizing the estimated N_m and S_m with the overall expected number of non-synonymous (N_ref) and synonymous (S_ref) sites in the Wuhan-Hu-1 reference genome. The N_ref and S_ref sites were determined using a Biopython script in which the codon substitution table from T. Gojobori was implemented (39). Therefore, the dN/dS ratio was calculated as:

$$\frac{dN}{dS} = \frac{N_m / N_{ref}}{S_m / S_{ref}}$$

c. The mutation table obtained from the COVID-19 genome annotator also provided information about the mutation type/class (varclass) occurring in the respective protein region (protein). Both the "varclass" and "protein" information were considered together to calculate the number of SNP (N_m) and SNP_silent (S_m) mutations for each SARS-CoV-2 protein. Furthermore, the selection pressure dN/dS of each protein was determined by normalizing the estimated N_m and S_m of the protein with their respective expected number of non-synonymous (N_ref) and synonymous (S_ref) sites within the Wuhan-Hu-1 reference genome. These N_ref and S_ref sites were determined using a Biopython script. Therefore, the protein-wise dN/dS ratio was calculated with the same expression, using the N_m, S_m, N_ref, and S_ref values of the respective protein.
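The calculation described above can be sketched in plain Python as follows. This is an illustrative approximation, not the authors' Biopython script: it assumes the standard genetic code, counts expected sites per codon in the spirit of the Nei-Gojobori approach (the fraction of single-base changes at each codon position that are synonymous), treats changes to stop codons as non-synonymous, and skips stop and ambiguous codons. The function names `expected_sites`, `count_varclass`, and `dn_ds` are hypothetical; only the varclass labels SNP and SNP_silent come from the text.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order (third base varies fastest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_sites(cds):
    """Return (N_ref, S_ref) for an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    n_ref = s_ref = 0.0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE.get(codon)
        if aa is None or aa == "*":
            continue  # assumption: skip ambiguous codons and stop codons
        syn = 0.0
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                # Each position has 3 possible changes; count the synonymous share.
                if CODON_TABLE[mutant] == aa:
                    syn += 1 / 3
        s_ref += syn
        n_ref += 3 - syn
    return n_ref, s_ref


def count_varclass(varclasses):
    """Count observed non-synonymous (SNP) and synonymous (SNP_silent) calls."""
    n_m = sum(1 for v in varclasses if v == "SNP")
    s_m = sum(1 for v in varclasses if v == "SNP_silent")
    return n_m, s_m


def dn_ds(n_m, s_m, n_ref, s_ref):
    """dN/dS = (N_m / N_ref) / (S_m / S_ref)."""
    if s_m == 0 or n_ref == 0 or s_ref == 0:
        raise ValueError("dN/dS is undefined for zero S_m, N_ref, or S_ref")
    return (n_m / n_ref) / (s_m / s_ref)


# Hypothetical toy inputs, not data from the study.
n_ref, s_ref = expected_sites("ATGTTTGGG")
n_m, s_m = count_varclass(["SNP", "SNP_silent", "SNP", "deletion"])
print(n_ref, s_ref, dn_ds(n_m, s_m, n_ref, s_ref))
```

For a protein-wise estimate, the same functions would be applied to the coding sequence of one protein and to the varclass entries restricted to that protein.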

TABLE 1
Data comparison of different countries before and after vaccination along with the recent data

TABLE 2
Unpaired t-test calculations for the significance testing of the determined country-wise parameters

Table column headers: Group comparison; Parameters of comparison; t-value; Degrees of freedom (df); Mean difference; Standard error of difference; 95% confidence interval (C.I.); Two-tailed P-value
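To illustrate how the quantities named in the Table 2 column headers relate to one another, the following Python sketch computes the t-value, degrees of freedom, mean difference, and standard error of the difference for an unpaired t-test. It assumes the equal-variance (pooled) Student's form of the test and uses only the standard library; the study itself used GraphPad, so the function name `unpaired_t_test` and the sample values are hypothetical. The two-tailed P-value and confidence interval additionally require the t distribution, which a statistics package (for example, SciPy's `scipy.stats.ttest_ind`) can supply.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t_test(a, b):
    """Pooled-variance unpaired t-test statistics for two samples."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    df = n_a + n_b - 2
    mean_diff = mean(a) - mean(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    se_diff = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    if se_diff == 0:
        raise ValueError("t is undefined when both groups have zero variance")
    return {
        "t": mean_diff / se_diff,
        "df": df,
        "mean_difference": mean_diff,
        "se_difference": se_diff,
    }


# Made-up numbers for illustration, not results from the study.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
print(unpaired_t_test(group_a, group_b))
```

The sign of the t-value follows the order of the arguments: it is negative when the first group has the smaller mean.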

TABLE 3
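Data comparison of highly mutating genes before vaccination, after vaccination, and in the recent period in each country along with their respective selection pressures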

TABLE 4
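Highest (top) to lowest (bottom) protein-wise contribution to mutation rates in all the countries before, after vaccination and in the recent period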

TABLE 5
Highest (top) to lowest (bottom) protein-wise contribution to selection pressure (dN/dS ratio) in all the countries before, after vaccination and in the recent period

TABLE 6
Data comparison between genomic and spike protein among different SARS-CoV-2 variant sequences^a
^a The highest mutation rates, dN/dS, and Ti/Tv ratios are shown in bold, while the lowest are shown in italics.

TABLE 7
Highest (top) to lowest (bottom) contribution of the proteins toward the mutation rates (%) in SARS-CoV-2 variants

TABLE 8
Highest (top) to lowest (bottom) contribution of the proteins toward the selection pressure (dN/dS ratios) in SARS-CoV-2 variants