Ongoing Positive Selection Drives the Evolution of SARS-CoV-2 Genomes

SARS-CoV-2 is a new RNA virus affecting humans and spreads extensively throughout the world since its first outbreak in December, 2019. Whether the transmissibility and pathogenicity of SARS-CoV-2 in humans after zoonotic transfer are actively evolving, and driven by adaptation to the new host and environments is still under debate. Understanding the evolutionary mechanism underlying epidemiological and pathological characteristics of COVID-19 is essential for predicting the epidemic trend, and providing guidance for disease control and treatments. Interrogating novel strategies for identifying natural selection using within-species polymorphisms and 3,674,076 SARS-CoV-2 genome sequences of 169 countries as of December 30, 2021, we demonstrate with population genetic evidence that during the course of SARS-CoV-2 pandemic in humans, 1) SARS-CoV-2 genomes are overall conserved under purifying selection, especially for the 14 genes related to viral RNA replication, transcription, and assembly; 2) ongoing positive selection is actively driving the evolution of 6 genes (e.g., S, ORF3a, and N) that play critical roles in molecular processes involving pathogen–host interactions, including viral invasion into and egress from host cells, and viral inhibition and evasion of host immune response, possibly leading to high transmissibility and mild symptom in SARS-CoV-2 evolution. According to an established haplotype phylogenetic relationship of 138 viral clusters, a spatial and temporal landscape of 556 critical mutations is constructed based on their divergence among viral haplotype clusters or repeatedly increase in frequency within at least 2 clusters, of which multiple mutations potentially conferring alterations in viral transmissibility, pathogenicity, and virulence of SARS-CoV-2 are highlighted, warranting attention.

RNA viruses usually have high mutation rates and tend to evolve rapidly [5]. For a new RNA virus affecting humans, such as SARS-CoV-2, a recent host shift likely decreases its fitness and impels the virus to adapt to the new host environments and public health interventions [6,7]. Natural selection may act on the transmissibility and virulence of SARS-CoV-2 through specifically adaptive mutations, which has been observed in Ebola, Zika, and other viruses [6,7]. SARS-CoV-2 has circulated globally since its first outbreak in December, 2019, and accumulated plenty of genetic mutations within the short period. There are currently more than 5,124,203 complete SARS-CoV-2 genome sequences publicly available, of which 29,735 nucleotide substitutions have been identified (https:// bigd.big.ac.cn/ncov/), compiling the largest population genomic data of non-humans so far. The SARS-CoV-2 lineages have exhibited considerable variations in transmission and clinical characters. For example, the basic reproduction numbers (R 0 ) range from 2.2 to 5.9 [8,9], and the mortality rates are from 0.8% to 14.5% [10]. Genetic variants with elevated contagiousness and escape of vaccine-derived immunity emerge and circulate with the outbreaks of Alpha, Delta, and Omicron strains [11,12].
It is critical for epidemic trend prediction, disease control, and vaccine design to understand whether and how natural selection drives the evolution of transmissibility and virulence of SARS-CoV-2 during the pandemic. If under selection, further research is warranted to identify the functional mutants contributing to the evolving epidemiological and pathogenic characteristics. SARS-CoV-2 evolution is composed of two phases: evolution in animal hosts to obtain the transmission ability to affect human, and in human populations after zoonotic transfer [13]. So far, assessment of natural selection on SARS-Cov-2 is mainly focused on the host shifting phase from animals to humans by analyzing the sequence divergence between SARS-CoV-2 and some closely-related viruses such as BatCoV-RaTG13 [14][15][16][17][18]. In contrast, few studies target the latter phase due to lack of efficient methods that can handle indeterminate ancestral sequences, extensive sampling bias, and clustering infections of SARS-CoV-2 [13,14,[19][20][21]. Some studies have explored specific mutations that are of potential significance in evolution or molecular function [10,[22][23][24]. Nevertheless, the analyses are based on allele frequency change of individual mutations that is not necessarily due to natural selection. It is only useful for screening for the candidate mutant loci instead of serving as a persuasive proof of the presence of natural selection. Despite the individual functional mutants identified in the aforementioned studies, ''there is a lack of compelling evidence" of mutations that ''impact the progression, severity, or transmission of COVID-19 in an adaptive manner" [6]. Indeed, the evolutionary driving forces underlying the SARS-CoV-2 epidemiological dynamics and pathogenic changes remain elusive. Furthermore, a genomewide survey of the evolutionary landscape of the functional mutations and their implication of the epidemiological perspective were not fully accomplished either.
Herein, we assess natural selection on SARS-CoV-2 evolution during its pandemic in humans by adopting a novel strategy that is relatively robust to viral clustering infections, founder effects, and sampling bias commonly existing in viral genomic data. The analysis validates the hypothesis that ongoing positive selection is indeed actively acting on the SARS-CoV-2 genomes, shaping the epidemic dynamics of COVID-19. We then partition the viral worldwide samples into clusters according to genomic similarity as a result of global transmission and clustering outbreaks. A spatial and temporal landscape of mutations is constructed on top of the clusters, and the critical mutations potentially conferring pathogenic and clinical characteristics of SARS-CoV-2 are highlighted. Our results provide a reference of viral evolution and genomic mutations for epidemic prediction, surveillance, vaccine design, and clinical treatments of COVID-19.

Results
A total of 3,674,076 SARS-CoV-2 genome sequences publicly available from 169 countries, as of December 30, 2021, were interrogated, and 69,359 mutations were identified. Since the virus populations are undergoing multiple clustering infections and founder effects, it is hard to use allele frequencies straightforward to draw reliable conclusions on virus evolution [25]. On the premise of neutral evolution, the ratio of nonsynonymous vs. synonymous mutations (N m /S m ) should be relatively unaffected by the change in population sizes, since both N m and S m sites from the same gene region are with the common demographic history and should demonstrate identical population behavior when under neutrality. Accordingly in the following sections, our analysis is mainly based on the analysis of N m and S m numbers for different sets of mutations.

Purifying selection dominates the viral genomic evolution in humans
From diffusion theory for the allele frequency in a large population, the population dynamic of mutations with very low frequency is identical to that of mutations under neutrality, and also, when sample size is very large, the population behavior of deleterious or beneficial mutations when in very low derived frequency is essentially the same to that of neutral mutations [26]. Therefore, we first calculated the N m /S m ratio of mutations in very low frequency (f) < 0.0001 as 3.17, which represents the relative abundance of N m and S m when the genomes evolve under neutrality. We then checked the relative occurrence of N m and S m in higher frequency. As shown in Table 1, the N m /S m ratio of mutations with f > 0.001 decreases to 1.28 and is significantly lower than that with f < 0.0001 (Pa = 2.2EÀ16, Chi-squared test), indicating a negative selection against N m . We further carried out the same analysis for subsets of mutations with different allele frequency ranges (i.e., (0.0001, 0.001] and (0.001, 0.01]). The pattern of reduced N m /S m ratios along with increased f is consistently observed: the ratio is 1.39 for SNPs with 0.0001 < f 0.001, and is 1.20 for 0.001 < f 0.01. The trend is consistent with the fact that mutations with higher derived allele frequency are usually older than those with lower frequency and are under purifying selection for a longer duration.
We further compared numbers of N m and S m for widespread (defined as mutations observed in viral samples of more than 30 countries) and non-widespread (defined as mutations observed in samples of less than 15 countries) mutations. The widespread mutations tend to prevail in the populations for a longer time than non-widespread ones. The N m /S m ratio of widespread mutations is significantly lower (P = 2.2EÀ16, Chi-squared test; Table 1), suggesting that purifying selection has been acting on these mutations. We also grouped mutations according to their spanning time that was calculated as the duration between the earliest and the most recent collection time of viral genomes carrying the mutations. The N m /S m ratio of mutations with long-time spanning (> 300 days) is significantly lower than that of mutations with short-time spanning (< 150 days) (P = 2.2EÀ16, Chi-squared test; Table 1), indicating more selective constraints on long-spanning-time mutations, again, confirming the effect of purifying selection. All the analyses consistently reveal overall purifying selection on SARS-CoV-2 genomes.
Positive selection drives the adaptive evolution of genes conferring pathogenicity and infectivity Even though SARS-CoV-2 is under genome-wide negative selection, a small fraction of the viral genome may have undergone positive selection, of which the genetic polymorphism pattern may be diluent in the genome-wide N m /S m ratios. To detect positive or purifying selection acting on SARS-CoV-2 individual genes, we again adopted the fact that mutations with higher derived allele frequency usually undergo longer duration of natural selection and present increased/decreased N m /S m ratios compared to those with very low frequency, and their population dynamics is identical to those under neutrality. Instead of comparing the N m /S m ratios between two groups of mutations (the NSRF1 method), we now carried out stricter statistical tests that investigate the increasing or decreasing trend of N m /S m ratios with the increased allele frequencies (the NSRF2 method), as a more robust indicator of the footprint of natural selection. Three trend tests, including two nonparametric tests [the Mann-Kendall (M&K) and Cox-Stuart (C&S) tests] and a linear regression method (Lin-Regress), were applied to detect the trend of N m /S m ratios as a function of mutant allele frequencies (or mutant allele counts).
Six genes, including spike (S), nucleocapsid (N), open reading frame 3 (ORF3a), ORF8, non-structural protein 4 (NSP4), and NSP13, show strong signals of positive selection, i.e., a significantly increasing trend of N m /S m ratios with increased allele frequencies (P < 0.01; Figure 1, Figure 2A and B; Table S1). Of them, S, N, ORF3a, and ORF8 have been previously studied with elevated protein evolutionary rates in SARS-CoV-2 evolution [27]. These genes play critical roles in molecular processes involving pathogen-host interactions, including viral invasion into and egress from host cells, and viral inhibition and evasion of host immune response, contributing to divergent pathogenic outcomes. Intriguingly, S protein is under positive selection, indicating that it has experienced adaptive alterations in its binding affinity to human angiotensin-converting enzyme 2 (ACE2) to gain cellular entry efficiency and viral infectivity during pandemic. N protein represents one of the most crucial structural components that facilitate viral replication, assembly, and release, and acts as an important immunodominant antigen. It has been reported to promote NLRP3 inflammasome activation and induce excessive inflammatory responses [28]. N protein used to be highly conserved [29], while it is under positive selection during the SARS-CoV-2 pandemic. ORF3a has been demonstrated to induce cellular apoptosis, lysosomal exocytosis-mediated viral egress, type I IFN response inhibition, and potential cytokine storm, which belong to the key processes determining viral infectivity, pathogenicity, and virulence [30,31]. ORF8 mediates host immune evasion through down-regulation of MHC-1 and inhibition of type I IFN response, promotes viral replication, induces apoptosis, and modulates ER stress [32]. NSP4 possibly participates in membrane rearrangement to benefit the viral replication and transcription complex formation, which may have also experienced positive selection when shifting from non-primate hosts to humans with some mutations potentially contributing to unique biological, pathological, and epidemiological features of SARS-CoV-2 [14]. NSP13 inhibits type I IFN response by interaction with TBK1, and counteracts antiviral immunity through hijack of host deubiquitinase USP13 [33]. These findings highlight that during the COVID-19 global pandemic, positive selection is very likely an essential driving force acting on viral invasion, and interplay between infection and host immune system defense, thus reshaping the viral features of infectivity, pathogenicity, and virulence. In contrast, 14 viral genes demonstrate a significantly decreasing trend of N m /S m ratios with increased allele frequencies (P < 0.01; Figures 1, 2B and C; Table S1), being consistent with former section that purifying selection dominates. Ten of the negatively selected genes (NSP1, NSP2, NSP3, NSP5, NSP6, NSP7, NSP10, NSP12, NSP15, and NSP16), encode non-structural proteins of SARS-CoV-2, including components of replication and transcription complexes such as RNA-dependent RNA polymerase (RdRp), papain-like protease (PLpro), main proteinase (Mpro), and RNA primase, which are all essential to viral RNA replication, transcription, and translation [34,35]. NSP10 and NSP16 (2 0 -O-methyltransferase) form a complex during coronavirus life cycle, which can methylate 5 0 cap of viral RNAs, enhancing their translation and mimicking cellular mRNAs to prevent recognition by host innate immunity [36]. Other two accessory genes (ORF6 and ORF10) play roles in evading host immune restriction. ORF6 has been demonstrated to inhibit type I IFN response via blocking nuclear translocation of STAT2, STAT2, and IRF3, and prevent host immune response via nuclear imprisonment of host mRNAs, serving as an antagonist of host immunity [37]. The rest negatively selected structural genes encode viral envelope (E) and membrane (M) proteins, and are involved in the assembly of progeny virions [38,39]. These results indicate that proteins conferring coronaviral fundamental molecular functions, such as viral replication, translation, assembly, and functions in evasion of host innate and adaptive immune systems (like mimicking or imprisonment of host mRNAs), are under significant purifying selection.

Clustering pattern of viral lineages
As we have demonstrated, although the viral genomes are overall under purifying selection, positive selection has been driving the evolution of genes related to coronavirus infection and host immune system defense, probably shaping epidemic and pathogenic diversification of viral populations. A further step is to understand the spatial-temporal dynamics of the diversification and identify the putative functional mutations subject to positive selection. Some studies have provided a list of candidate mutations, most of which were identified according to the trends of allele frequency changes in the overall global samples. As we know, the viral populations have been evolving and spreading in heterogeneous rates, demonstrating a clustering pattern. Investigating the allele frequencies in the pooled samples from multiple populations has two limitations: first, it has limited power to identify mutants which arose in a local population recently while are in a low frequency in the global population; second, it provides little information on the spatial and temporal origins (when and where) of these functional mutations. To track the evolutionary dynamics of genomic variants in a fine scale, we partitioned the sample of 3,328,405 genomes into distinct clusters according to their sequence similarity and evolutionary relationship, and identified 138 worldwide predominant clusters of SARS-CoV-2, denominated as C10, C23, . . ., and C299, respectively (see the Materials and methods section for details of the partition approach). The clusters and their genealogical relationship on a haplotype network are presented in Figure 3.
The genealogical relationship reflects establishment and evolutionary routines of diversified viral clusters, consisting of repetitive processes of viral emergence, transmission across populations and countries, and mutation accumulation as well. As shown in Figure 3, the viruses represent extensive transmissions across continents and countries along with time. The haplotype clusters comprise the currently circulating variants of concern (VOCs) and variants of interest (VOIs) like the B.  [40]. Alpha and Delta variants are the most abundant and widely distributed haplotype clusters so far, which broadly ravage USA and Europe (especially in UK). Of note, the clusters are still subject to ongoing evolution and branching, which warrant further surveillance.
Tracking the spatial and temporal occurrence of putatively selected mutations along the pandemic dynamics Nucleotide mutations that are predominant in a cluster and absent or in low frequencies in others are potentially of functional importance for viral pathogenicity and transmissibility, of which some may be the targets of positive selection. Following this criterion, we identified 545 protein-coding variations differentiated among the 138 delineated haplotype clusters, including 361 N m and 175 S m sites. We mapped the occurrence of some of these mutations to branches connecting the clusters Figure 3 The genealogical relationship of worldwide haplotype clusters of SARS-CoV-2 The nodes represent different haplotype clusters, with the node sizes proportional to the counts of the belonged sequences. The number of line segments separated by dots between adjacent nodes indicates the hamming distance between clusters. Within each node, its geographical distribution is presented. The listed mutations differentiating adjacent clusters are marked in purple for those within genes under positive selection, in cyan for those within genes under purifying selection, and in red for those repeatedly occurring at least twice in distinct phylogenetic relationships. on the haplotype network ( Figure 3). The numbers of intercluster mutations per branch vary from 1 to 47. We should emphasize that the allele frequency of a single mutant locus is not informative or robust to test the effect of natural selection; the list of mutations identified in this section serves as the mostly putative candidate loci under positive selection for further functional investigation. The evidence of natural selection was discussed using the trend of N m /S m ratios in former sections.
The C112 is likely the earliest cluster if using sample collection dates as a criterion [also with the inferred Time to the Most Recent Common Ancestor (TMRCA), results not shown]. A N m site L84S in the ORF8 protein, together with a tightly linked S m variant (S76S) in the NSP4 protein, emerged on the branch, leading to other primary early clusters like C109 and C100 (Figure 3). The L84S replacement together with S76S were used in former studies to define two major haplotype groups in the early epidemic stage: the L and S lineages. The proportion of L lineage in samples collected before and after Wuhan lockdown showed distinct differentiation (99% vs. 70%), and it was hypothesized that the frequency change of L lineage may due to different pressures of negative selection from containment measures [41]. It is disputable for that the allele frequency change can be caused by sampling bias and clustering infection as well [25]. In the map of haplotype network, L84S is pinpointed to the branch connecting C112 and the ancestral node of C109 and C100 (Figure 3), with a very low frequency of 0.2627% in C112 and increased to 100% in both C109 and C100, demonstrating an obvious cluster expanding pattern in the fine scale. According to COVID-3D database [42], the L84S amino acid alteration was predicted to eliminate 4 hydrophobic bonds and lead to destabilization of ORF8 protein. Another study with computational protein modeling proposed that L84S can mitigate the binding of ORF8 to human complement C3b, which is negatively regulated by the C-terminal serine-protease catalytic domain of the human complement factor 1, and activates the host complement system [43]. Therefore, the L84S mutation possibly impacts the normal function of ORF8, and plays an important role in the host immune responses and infection outcome.
The N m site D614G in the epitope region (the receptor binding domain) of S protein occurred in the common ancestor of C100 and majority of the clusters (Figure 3), which has been demonstrated to be associated with enhanced binding to the human ACE2, and increased viral replication, transmissibility, and loads in upper respiratory tract, indicating a competitive fitness advantage in humans [10,23,24,44]. Another N m site P323L in NSP12 (a kind of RdRp) also occurred at this stage, coupling the evolution with D614G. This mutation might regulate the activity of RdRp, and is related to viral replication and fidelity, altering SARS-CoV-2 mutation rates [45].
Derived from C100, two N m sites, R203K and G204R, on the phosphoprotein domain of N protein arose to 100% in C178. Both clusters are widely distributed in European, Asian, and American countries. According to the structural prediction provided by the COVID-3D database [42], R203K and G204R both destabilize the N protein, and the predicted actual free energy value (DDG) using the mutation Cutoff Scanning Matrix (DDG stability mCSM) are À1.71 kcal/mol and À1.07 kcal/mol, respectively, resulting in the alteration of their molecular interactions with other amino acids, such as carbonyl, polar bonds, and hydrogen bonds ( Figure S1A-D). Mutations on N protein may be functionally relevant to viral replication and assembly, and participate in immune evasion and viral infections [39,[46][47][48]. Diverged from C100, the Q57H mutation in ORF3a, one of the positively selected genes, arose and fixed in the C212, C240, and C242 clusters. This variant was predicted to exert structural destabilization (DDG stability mCSM = À1.55 kcal/mol) ( Figure S1E and F) and deleterious effects on protein function [49]. Another mutation I82T within the third membrane spanning helices in M protein diverges C102 and C293 (the early clusters of Delta variant) from C100, which is implicated in viral glucose transport. This mutation is structurally predicted to destabilize the M protein with DDG stability mCSM of À2.9 kcal/mol, and it is observed that it may be associated with the emergence of clusters with an excess of mutations.
Mutations in NSP14, an error-correcting exonuclease protein, may lead to malfunction of ExoN proofreading activity, thus resulting in elevated mutation rates during viral replication [40,50]. In our case, the majority of mutations in NSP14 were associated with elevated mutation rates, especially A394V and I42V that were relevant to the emergence of Delta (C293) and Omicron (C281) variants. A394V was predicted to destabilize NSP14 with estimated DDG stability mCSM of À0.55 kcal/mol, and I42V with DDG stability mCSM of À1.06 kcal/mol. We proposed that mutations in NSP14 could be potential predictors for clusters with a high mutation rate.

Mutations with rapid increase of frequency within clusters are potential targets of selection
Other than the mutations demonstrating nearly fixed divergence among clusters, mutations presenting prominent frequency increasing trends over sampling times within a cluster are also potential sites with evolutionary or functional significance in COVID-19 epidemic. The mutations having increasing trends in frequency independently within at least two clusters are of significance in evolution and function with high confidence. We profiled those within-cluster mutations according to their frequency dynamics during a period of 30 sequential sampling times, and 11 mutations demonstrating a significant trend of increased frequency over sampling times (simultaneously tested by M&K, C&S, and LinRegress tests; P < 0.005) independently within at least two clusters were identified ( Figure S2).
Within C162 and C39 clusters that both predominate in Denmark, a N m site L85F in the positively selected gene ORF3a displayed independently increased frequencies from 0 to 0.62 and from 0 to 1 in two different periods of November 09, 2020-March 01, 2021 and March 29, 2021-August 18, 2021, respectively. This variation located in the second transmembrane segment of the ORF3a protein and potentially can affect the function of virus-induced cell apoptosis and viral egress of SARS-CoV-2, as well as host immune responses and clinical outcomes.
A N m site V1264L of S protein independently rises its frequencies from 0 to 0.30 within C276 and from 0 to 0.20 within C129. This variation is located in the cysteine rich intravirion region at the C-terminus of coronavirus S protein, in which the cysteine residues are targets of palmitoylation, necessary for efficiently S incorporation into virions and S-mediated membrane fusions that impact the efficiency of host cellular entry thus viral infectivity [51].
As mentioned above, mutations in NSP14, the errorcorrecting exonuclease protein, may be strongly associated with elevated mutation rates, and be of first priority to be monitored. We observed a substitution of M72I in NSP14 having independent increases in frequency from 0 to 0.60 within C295 and from 0 to 0.10 within Delta C129. Intriguingly, we observed consistent ongoing rises in frequency across multiple European countries including UK, Norway, Belgium, Italy, Germany, and Netherlands ( Figure S2). M72I is close to the sites at the heterodimer interface of NSP14/NSP10 complex which stimulates the ExoN activity of NSP14, which may elevate the mutation rates of SARS-CoV-2. Moreover, it is predicted to structurally destabilize NSP14 or NSP14/NSP10 complex with a significant DDG stability mCSM of À1.89 kcal/mol. Again, the findings suggest that mutations in NSP14 are supposed to be under constant surveillance in future, and the clusters of Delta variant are still under ongoing selection, which warrants further attentions.

Discussion
Identification of the evolutionary dynamics of SARS-CoV-2 during its pandemic in worldwide human populations, is confronted with great challenges. The dynamics of viral populations demonstrates a series of founder events caused by clustering infection or bursts of epidemic in local regions. Besides, the genomic samples are usually collected from different times and locations disproportionally (sampling bias). Both significantly impact the allele frequencies and bring challenges in analyzing the viral genomic data with most population genetic methods [25]. Comparing the relative excess of nonsynonymous and synonymous substitutions is relatively robust to population size changes, representing an efficient approach for evaluating the effects of natural selection on SARS-CoV-2. The approaches we used in this study are logically similar to the known McDonald-Kreitman test [52][53][54] in molecular evolution, which compares the ratio of nonsynonymous to synonymous substitutions of between-species divergence to that of within-species polymorphisms, and uses the latter as an internal control under neutrality. In contrast, the method we proposed here, referred as the NSRF1 method, is for only comparing genetic polymorphisms within a species. The method is novel in using the N m /S m ratio of mutations with very low frequencies as the internal control under neutral evolution. The fact that the population dynamics of mutants with very low frequencies is identical to those under neutrality is valid according to diffusion theory for the allele frequency in a large population [26]. When identifying natural selection acting on individual genes, we further investigate the increasing or decreasing trend of N m /S m ratios along with the increased mutant allele frequencies under the assumption that the mutations with higher frequencies tend to undergo a longer duration of natural selection and present proportionally increasing or decreasing N m /S m ratios (the NSRF2 method), which are more efficient and robust indicators of natural selection.
By adopting the strategies, we demonstrate with multiple lines of evidence that SARS-CoV-2 genomes are overall constrained under purifying selection during its pandemic. In spite of this, evidence of positive selection acting on specific genes that participate in coronavirus infection and host immune evasion are intriguingly observed. These results indicate that ongoing positive selection is actively driving tighter affinity with human and escape of host antiviral immunity, leading to high transmissibility and mild symptom in a long-run evolution of SARS-CoV-2. Such trend was supported by studies analyzing the immunological and epidemiological data on endemic human coronaviruses [55].
We further partition the viral genomic samples into 138 haplotype clusters according to their sequence similarity and genealogical relationship. Superimposing on the 138 worldwide transmission clusters, we provide a list of 556 mutations as putative target sites of natural selection. Whilst there is no concrete evidence supporting their functional significance during the outbreaks, mutations showing between-cluster divergence or within-cluster frequency boost may explain distinct pathogenicity and infectivity. Thus, the list of mutations provides a basis for further functional study and clinical treatment.

Identification of nucleotide mutations
All the sequences were aligned using MUSCLE [57] with default parameter settings. Then, 265 bp of the 5 0untranslated region (UTR) and 229 bp of the 3 0 -UTR region were trimmed out, with a final length of 29,409 nucleotides retained. Nucleotide mutations were called by comparing these sequences with the reference sequence (NC_045512).

Identification of selection on genomes or individual genes
Selection on viral genomes was detected by comparing the relative abundance of N m and S m between mutations with high and low allele frequencies (referred to as the NSRF1 method), between widespread and non-widespread mutations, and between mutations with long-and short-time spanning. Selection on individual genes was identified as follows. We divided mutations into 4000 bins corresponding to ! 5, ! 10, . . ., and ! 20,000 derived allele counts. Let r j denote the N m /S m ratio for mutations with the derived allele counts ! j. When under purifying selection, r j values are expected to decrease with j; while under positive selection, an increasing trend of r j values is expected. Thus, we applied three kinds of statistical methods to detect the increasing or decreasing trends of r j values as a function of j, including M&K and C&S tests, and LinRegress (referred to as the NSRF2 method). The false discovery rate correction was performed to correct for false positives.

Clustering definition of viral lineages based on the haplotype network analysis
Each viral haplotype was assigned to a cluster following the steps of the classification tree shown in Figure S3. A total of 545 SNPs were chosen as the features for the classification. Each of the haplotypes was assigned to different clusters according to their alleles on the 545 SNP loci following the order of the features.

Prediction of the effects of mutations on protein function
An online resource COVID-3D (http://biosig.unimelb.edu. au/covid3d/) was used to predict the effects of mutations of SARS-CoV-2 on protein structure [42].