Temporal-Geographical Dispersion of SARS-CoV-2 Spike Glycoprotein Variant Lineages and Their Functional Prediction Using in Silico Approach

ABSTRACT SARS-CoV-2 is a positive-sense single-stranded RNA virus with emerging mutations, especially on the Spike glycoprotein (S protein). To delineate the genomic diversity in association with geographic dispersion of SARS-CoV-2 variant lineages, we collected 939,591 complete S protein sequences deposited in the Global Initiative on Sharing All Influenza Data (GISAID) from December 2019 to April 2021. An exponential emergence of S protein variants was observed since October 2020 when the four major variants of concern (VOCs), namely, alpha (α) (B.1.1.7), beta (β) (B.1.351), gamma (γ) (P.1), and delta (δ) (B.1.617), started to circulate in various communities. We found that residues 452, 477, 484, and 501, the 4 key amino acids located in the hACE2 binding domain of S protein, were under positive selection. Through in silico protein structure prediction and immunoinformatics tools, we discovered D614G is the key determinant to S protein conformational change, while variations of N439K, T478I, E484K, and N501Y in S1-RBD also had an impact on S protein binding affinity to hACE2 and antigenicity. Finally, we predicted that the yet-to-be-identified hypothetical N439S, T478S, and N501K mutations could confer an even greater binding affinity to hACE2 and evade host immune surveillance more efficiently than the respective native variants. This study documented the evolution of SARS-CoV-2 S protein over the first 16 months of the pandemic and identified several key amino acid changes that are predicted to confer a substantial impact on transmission and immunological recognition. These findings convey crucial information to sequence-based surveillance programs and the design of next-generation vaccines.


RESULTS
Global dispersion of SARS-CoV-2. Up to 30 April 2021, more than 150 million confirmed cases were recorded worldwide (Fig. 1A). The global spread of SARS-CoV-2 was first recognized in March 2020. Another wave further attacked Europe and North America beginning October 2020, while the proportion of cases increased again in Asia and South America from early 2021 (Fig. 1B). By the end of April 2021, nearly 1.3 million SARS-CoV-2 complete genomes were characterized (Fig. 1C). Compared to the prototype (NC_045512, 614D), numerous nonsynonymous mutations accumulated across the viral genome, with 97.3% of SARS-CoV-2 sequences containing S-D614G in the S protein (Fig. 1D). Among them, 4 S-D614G variants of concern (VOCs), namely, a (B.1.1.7), b (B.1.351), g (P.1), and d (B.1.617), emerging with multiple additional amino acid changes in the S1-RBD, have caused community outbreaks since the end of 2020 in different countries (Fig. 1E).
Genomic diversity of SARS-CoV-2 S protein. Based on the amino acid sequence alignment of S protein, at least 14,479 unique variants that formed 1,257 Pangolin lineages (defined by Global Initiative on Sharing All Influenza Data [GISAID]) were clustered, with amino acid mutation sites ranging between 1 and 16 (including indel events). Before WHO declared the COVID-19 outbreak a pandemic (December 2019 to February 2020), the S gene was relatively conserved, and 75% (52/69) of unique variants belonged to the prototype (614D) ( Fig. 2A), which corresponded to one or two amino acid changes (  (Fig. 3A). However, it is worth noting that the lower level of mutation identified in the early stage of prepandemic could be due to the limited availability of virus sequences (see the bar chart in Fig. 2A).
The temporal dispersion and reported signature amino acid mutations of the four VOC lineages are described below.
(iv) Delta lineage, d (B.1.617). In early 2021, India faced a new wave of the pandemic. A novel variant lineage was detected within the local community that could be  traced back as early as the beginning of October 2020 (EPI_ISL_1415203) (Fig. 3E). This lineage was characterized with a signature mutation, L452R, within S1-RBD. Natural selection of S gene variations. When the S protein variants (N = 3,242, each represented by at least 11 reported identical sequences) were aligned, at least 693 amino acid variations (including indels) were observed (Fig. 4). Likelihood ratio tests for natural selection showed that at least 51 amino acids in the S protein were under positive selection, including 20 and 5 sites located in the S1-NTD (aa 13 to 305) and the S1-RBD (319 to 541) regions, respectively (Table 1). Among them, 4 amino acid mutations within the S1-RBD (aa 452, 477, 484 and 501) might have an impact on the binding affinity to hACE2. L452R was highly specific to d (B. In silico structure prediction of S protein variants and classification. The cryogenic electron microscopy (cyro-EM) structures of S protein for the prototype and a (B.1.1.7) and b (B.1.351) variants are available in the Protein Data Bank (PDB) (27,32,33). Utilizing SWISS-MODEL protein homology modeling (34), we predicted structures of .200 variants of interest that were represented by at least 500 different S protein sequences to assess their structural characteristics with reference to the prototype. As S protein is a homotrimeric protein, we superposed monomer using SWISS-MODEL with the closed conformation of S protein (prototype) (PDB entry 6ZGI), generating a predicted structure that was 92% identical to the known structure (Fig. S5). We then superposed the predicted S protein structure of a monomutant containing a D614G mutation. Dual mutants containing D614G and other amino acid changes, including those under positive selection or observed in VOCs (L5F, L18F, S98F, W152L/C, E154K, L222V, and A262S in S1-NTD; N439K, L452Q/R, S477N, L478R/K, E484K/Q, N501Y, and A570D within S1-RBD; Q677H/P and P681H/R in S1-CTD; T716I, S982A, and D1118H in S2) as well as commonly deleted residues (aa 69 to 70, 144, and 241 to 243) were further simulated for structure prediction. From our prediction models, none of the  Table 1). Other amino acid changes are visualized in gray. The dotted lines in red show S1-NTD and S1-RBM regions of SARS-CoV-2 S gene, of which our amino acid sites that bind to hACE2 were under positive selection (452, 477, 484, and 501). The maximum likelihood tree on the left was constructed using RAxML based on the concatenated nucleotide sequence alignments of 12 open reading frames (ORFs). The phylogenetic positions of the four VOCs (a, b, g, and d ) are highlighted in dark red. The numbers of sequenced isolates (in log 10 ) and amino acid sites of each S gene variant are shown on the right. Mutations observed in at least 50% of sequenced genomes are in italic and bold type.
c Containing at least 100 sequenced genomes, except for the four VOCs.
Boon et al.
® monomutants and deletion mutants affected the S protein structure (.93% identity to the prototype). However, the S-D614G mutation reduced protein structure identity to the prototype by half (51%), altering its S1-RBD conformation primarily (Fig. 5A). When incorporating D614G in monomutants, we were intrigued to find that mutation of T478I 1 D614G or E484K 1 D614G resulted in mutants with protein structure like the S-D614G mutant, while other amino acid changes at the same residues, such as T478R/ K 1 D614G or E484Q 1 D614G, were similar to the prototype (Fig. 5B). Next, we predicted the S protein structure of native variants, including the four VOCs (a, b, g, and d ), using the same approach. Based on our prediction models, we categorized the variants into 4 groups (Fig. 6A), including group 1 of .92% identity to the prototype (Fig. 6Bi), group 2 of S1-NTD unaligned (Fig. 6Bii), group 3 of S1-NTD and S1-RBD unaligned (Fig. 6Biii), and group 4 of S1-RBD unaligned (Fig. 6Biv). S protein that did not carry D614G mutation was categorized into group 1, while the b (B.1.351) variant that carried additional L18F mutation showed both its S1-NTD and S1-RBD unaligned to the prototype (group 3). As expected, the majority of VOCs carrying mutations within the S1-RBD belonged to group 4.
Impact of amino acid variations in S protein to hACE2 binding and its antigenicity. We next utilized in silico approaches to assess the impact of S protein conformation divergence on their functions, particularly their hACE2 binding affinity and antigenicity.
(i) hACE2 binding affinity. We predicted the binding affinity of S protein variants to hACE2 using pyDock (35). Our results showed that the binding energy of hACE2 with the prototype (NC_045512) and S-D614G mutant was 245.88 kcal/mol and 247.06 kcal/mol, respectively, while all the variants in groups 1 to 4 bound stronger to hACE2 than the prototype. Notably, the S protein binding affinity of the b (B.  (Fig. 6A). In addition, our results showed that b (B.1.351) and a (B.1.1.7) variants bound stronger than S-D614G to hACE2, which is consistent with previous reports based on both in silico and in vitro studies (36,37).
(ii) Antigenicity of S protein variants. As S protein is known to elicit host immune response, we used immunoinformatics tools to predict the antigenicity of the S protein prototype and variants. We used Immune Epitope Database (IEDB) analysis resources to predict the ability of S protein sampled by MHC class I molecules to be presented as antigens. BepiPrep-2.0 (38) was used to predict the probability of S protein being recognized by B cells to produce S-specific immunoglobulins. Our results showed that S protein variants in groups 1, 2, 3, and a (B.1.1.7) in group 4 had a greater, while b (B.1.351), g (P.1), and d (B.1.617) variants in group 4 had a lower, T cell immunogenicity than the prototype (Fig. 6A). For B cell epitope probability, S proteins featured in groups 1, 2, and 3 scored higher, except for the B.1 variant lineage (EPI_ISL_984896, group 2), which scored similar or higher than the prototype and D614G mutant. S proteins featured in group 4 had a lower B cell epitope probability than the prototype and D614G mutant. These findings indicated that mutation harbored within the VOCs conferred a lower T and B cell immunogenicity of S proteins, allowing the virus to circumvent host immune surveillance.
Prediction on potential amino acid alterations that confer lower immunogenicity. As SARS-CoV-2 continues to mutate, we attempted to predict amino acid mutations that could emerge in SARS-CoV-2 variants. Based on dN/dS ratio and key amino acids involved in immune evasion, we focused on L5, L18, S98, and A222 (S1-NTD), N439, L452, S477, E484, and N501 (S1-RBD), and Q677 and P681 (S1-CTD). We changed each of the nucleotides corresponding to these genetic codes and predicted the potential amino acid mutation that could arise, as shown in Table 1 and Table S1. Subsequently, using the prototype as a template (NC_045512), we artificially replaced the respective native S protein variants with the predicted amino acids. As D614G mutation achieved a stable mutation, we incorporated the artificial and D614G mutations and analyzed the impact of these amino acid changes to (i) S protein structure and (ii) T and B cell antigenicity. The protein model of SARS-CoV-2 S protein prototype (NC_045512 in blue) was generated using SWISS-MODEL. The monomer of S protein contains S1 and S2 subunits, with the S1 domain further classified into N-terminal domain (S1-NTD), receptor binding domain (S1-RBD), C-terminal domain (S1-CTD) (colored magenta), and S2 domain. The protease cleavage site is located between S1 and S2. When superposing protein models of S-D614G mutant (mutated amino acid residue 614, indicated by a black arrow) onto the prototype, the aligned regions of the two models are indicated in light gray, while the regions where these two proteins are not aligning are indicated in blue (prototype) and red (S-D614G mutant). Note the D614G mutant is 50.53% identical to the prototype. (B) Protein models of S protein containing T478I/R/K or E484K/Q mutations together with D614G double mutations were superposed onto the prototype. The position of the respective amino acid residues mutated is indicated by black arrows. Note the T478R/K 1 D614G and E484Q 1 D614G double mutants are .93% identitical to the prototype, while the S1-RBD domain of T478I 1 D614G and E484K 1 D614G double mutants were not aligned with the prototype.
(i) Impact of artificial amino acid change on S protein structure. Our S protein structure models revealed that all artificial D614G dual mutations containing amino acid changes in S1-NTD, S1-RBD, S1-CTD, or S2 shared a structure that resembled the S-D614G mutant. As mentioned, an exception was observed with T478R/K 1 D614G or E484Q 1 D614G, in which these mutants resembled the prototype (Fig. 5B).
(ii) Prediction of artificial amino acid change to T and B cell recognition. The prototype had an overall T cell immunogenicity and B cell epitope probability score of 0.436 and 0.470, respectively, while the overall T and B cell immunogenicity scores of the D614G mutant were 0.387 and 0.470, respectively. Among the amino acid mutations found in circulating SARS-CoV-2 variants, E484Q/K, P681H/R/L/S, P936Y/N/H, D950H/A/N, and D1118H/Y had a lower T cell immunogenicity than the prototype and S-D614G mutant (Table S1). Among the artificial mutations, A222P/S, N439S, L452P, T478P/S/N, E484L/A/G/V/D, N501S/K, Q677K, P681A/T, T716S/P/K, P936Q, D950Y/Q, and D1118Q resulted in S protein variants with a lower T cell immunogenicity than the prototype and D614G mutant ( Table 2).
To understand the likeliness of artificial amino acid in altering B cell epitope probability, we assessed the effect of amino acid alterations on the overall B cell epitope probability of the entire S protein. S1 that contained L5V/H/P/R, L18V/I/H/R/P, S98/A/P/ Y/C, A222P/T/S/G/D, K417E/Q/R/G/M, N439Y/I/S/T, T478A/S, E484G/V/D, N501I/E/D/K, T716S/P/A/R/K, D936R/A/Q, and D950R/Y/V/G/Q, S982T/P/L, and D1118R/G/A/Q artificial mutations resulted in S proteins with a lower B cell recognition ability than the respective native amino acids ( Table 2, Table S1). Altogether, our findings indicate that mutations on A222, N439, L452, T478, E484, N501, Q677, T716, D936, D950, and D1118 impact virus T and B cell immunogenicity. Notably, we predicted that the presence of N493S, T478S, E484D, N501K, T716K, and D950Q mutations in SARS-CoV-2 variants could confer lower immunogenicity than the native variants. the S prototype. S protein in group 1 did not carry the D614G mutation, while S protein of groups 2, 3, and 4 contained D614G mutation. Also shown are the percent identities of S protein variants to the prototype, predicted binding energy to human angiotensin-converting enzyme 2 (hACE2), major histocompatibility complex class I T cell immunogenicity, and B cell epitope probability of the respective SARS-CoV-2 S protein variant lineages. (B) Protein models of SARS-CoV-2 variants superposed onto the prototype (NC_045512). These proteins were classified into (i) group 1, with the protein sharing .92% structure similarity to the prototype; (ii) group 2, with the S1-NTD of variants unaligned to the prototype; (iii) group 3, with the whole S1 region of S protein variants unaligned to the prototype; and (iv) group 4, with S1-RBD unaligned to the prototype.
Predict the impact of artificial amino acid mutations on S protein functions of SARS-CoV-2 variants. We assessed the impact of artificial mutations on S protein functions of the 4 VOCs (a, b, g, and d ). We focused on artificial mutations in S1-RBD (N439S, L452P, S477G, T478S, E484D, and N501K) that resulted in S protein conferring a lower antigenicity (Table 2) and predicted their protein conformation, hACE2 binding affinity, and antigenicity. These results were summarized in Table 3.
(i) S protein conformation. We superposed S protein models of a (B. 1.1.7), b (B.1.351), g (P.1), and d (B.1.617) containing artificial amino acid mutations against the prototype and S-D614G mutant. In general, there was no change in the conformation of S proteins, except for b (B.1.351) variants represented by EPI_ISL_700450 (group 3) and EPI_ISL_712079 (group 4) (Fig. 6A and B). Strikingly, when N439S or T478S was simulated, or E484K mutation was replaced with E484D in the amino acid sequence of EPI_ISL_700450, its S protein structure featured in group 3 switched to group 4 (Table 3). On the contrary, adding N439S and L452P or mutations into the existing EPI_ISL_712079 S protein led to a switch of structure featured in group 4 to group 3 (Table S2).
(ii) hACE2 binding affinity. Artificially mutated S proteins of the SARS-CoV-2 variant lineages displayed a different impact on hACE2 binding affinity. For example, insertion of N439S, L452P, T478S, E484D, or N501K mutations into the existing a (B.1.1.7) S protein conferred a higher binding affinity to hACE2. Similarly, S proteins of b (B.1.351) variants with N439S or T478S insertion, or alteration of N501K to N501Y, resulted in a higher binding affinity to hACE2. Meanwhile, incorporation of N439S artificial mutation into the existing mutations of g (P.1) and d (B.1.617) also resulted in increased binding affinity to hACE2 than their respective natural mutants.
(iii) T and B cell antigenicity. Since the artificial mutations cluster within the S1-RBD, we further examined the B cell epitope probability of the S1-RBD among the variants. The presence of N439S, T478S, and N501K artificial mutations in all four VOCs led to a lower T cell immunogenicity than the respective natural variants. On the other hand, the presence of S477G in a (B.1.1.7), N439S, T478S, and N501K in b (B.1.351), T478S, E484D, and N501K in g (P.1), and E484D in d (B.1.617) resulted in S protein variants with a lower B cell epitope probability.
Collectively, our prediction models indicate that three yet-to-be-identified hypothetical mutations (N439S, T478S, and N501K) alter S protein structure coupled with increased binding affinity to hACE2 and lower immunogenicity. These mutations may confer additional adaptive and survival benefits to the current circulating SARS-CoV-2 variants.

DISCUSSION
With the ramping up in sequencing capacity and interest, an enormous number of novel variants of SARS-CoV-2 were identified. Mutation occurs partly due to natural selection (39), and the mutation rate reflects a response to the selection (40). The mutation rate of SARS-CoV-2 is higher than that of the mild beta-CoVs, such as hCoV-OC43 (8), but lower than that of RNA viruses in general (41).
In this study, we focused on mutations in the S protein, particularly the key amino acids under positive selection pressure. After stringent filtering, we identified 693 amino acid variations in S protein. Among these amino acid mutations, S-D614G is the mutation that emerged in January 2020 (20,23) that allows SARS-CoV-2 to bind to cell receptors, predominantly hACE2, and enter host cells more easily (25). Since the pandemic was declared, strict travel restrictions have been imposed in most countries. Despite this, S-D614G continued to be detected in different countries, indicating that SARS-CoV-2 S-D614G mutant was the parental virus seeding across continents and then continued to mutate independently, giving rise to SARS-CoV-2 variants containing additional signature amino acid mutations.
The amino acid mutation may confer protein structural and functional alteration. The structural change may depend upon the change in the group of charge, polarity, and hydrophobicity of the amino acid and whether or not the amino acid alteration occurs within the functional domain of the protein. For S protein, our prediction models and others (28) showed that single-amino-acid mutation alone can alter the conformation of S1-RBD. Amino acid change of T478I (polar to nonpolar residue) and E484K (negative-to positive-charge residue) did not contribute to the major shift of the S-D614G structure. However, T478R/K (polar to positive-charge residue) and E484Q (negative to polar residue) resulted in a shift of S-D614G to a structure resembling that of the prototype, while amino acid change can lead to a change in S protein antigenicity and binding affinity to hACE2. Of note, the nonsynonymous substitution of frequently mutated amino acids influences S protein conformation. For instance, the B.1.351 variants with an additional L18F mutation resulted in altered S1-NTD and S1-RBD conformation with increased binding affinity to hACE2.
N501Y has also been shown to confer increased infectivity. Even though both the B.1.1.7 and B.1.351 variants harbor the N501Y mutation, they display a different degree of infectivity and antibody neutralization ability. The B.1.351 variant possesses greater infectivity and better resistance to antibody neutralization (42). In addition, other amino acids are also important to viral pathogenicity. SARS-CoV-2 K417T/E484K dual mutants resist postvaccination serum neutralization to an extent similar to that of the B.1.351 variant (42), while N439K mutation arose independently in different countries, facilitating viral attachment to hACE2 and resistance to antibody neutralization (43). Aside from the known mutations, we predicted that if SARS-CoV-2 acquires N439S, T478S, or N501K mutations, the virus will attain a chance to infect host cells even more efficiently and with reduced antigenicity.
Mutation via amino acid deletion may affect virus fitness and is known to influence a wide variety of biological phenomena. However, this may depend on whether the deletion occurs in the vicinity of a functional domain. From our predictive models, deletion of amino acids 69 to 70 and 144, which are signatures of B.1.1.7 variants, may enhance the infectivity of the virus; however, this may not affect its immunogenicity, while deletion of amino acids 241 to 243, located relatively closer to S1-RBD, found in B.1.351, renders lower immunogenicity and higher binding affinity to hACE2.
However, there are several shortfalls of our in silico prediction models. First, S protein models were generated based on the protein structure deposited in the PDB, and amino acid mutations predicted have not been reported before. The precise protein structure of novel mutants might be inaccurate. Second, the docking model utilized a rigid protein structure to predict binding energy, which differs from the natural status of activated S protein when encountering hACE2 or antagonist antibodies (44). Finally, despite the immunoinformatics tools being useful to predict the antigenic epitope of a protein, they are far from perfect. Therefore, laboratory evidence is paramount to further elucidate the functionality and antigenicity of S protein variants. Nonetheless, our findings provided a foundation to further investigate how SARS-CoV-2 exhibits its infectivity and pathogenicity. Despite these deficiencies, our findings are impactful for preparing the potential threat posed by novel S protein mutations in relation to viral pathogenicity. More importantly, our data are of great importance particularly for vaccine development and anti-SARS-CoV-2 drug discovery.

MATERIALS AND METHODS
Metadata and sequence collection. We retrieved 1,294,693 high-quality SARS-CoV-2 complete genome sequences deposited in the GISAID database, with metadata available from 24 December 2019 to 30 April 2021. Among these, 939,591 amino acid sequences encoding the complete S protein were extracted and clustered into 34,134 unique variants using cd-hit (-G 1 -aL 0.0 -c 1 -M 0 -T 0 -d 0 -g 1 -sc 0) (45). For each variant, the isolate with the earliest collection date was selected as the representative sequence. After removing singletons (N = 19,655), 14,479 variants represented by at least 2 identical S amino acid sequences were retained for further analyses.
Detection of positive selection. Four algorithms within the HyPhy distribution (49), including FEL (fixed-effects likelihood), FUBAR (fast, unconstrained Bayesian approximation), MEME (mixed-effects model of evolution), and SLAC (single-likelihood ancestor counting), were used to detect the sites under positive selection. An ML tree inferred from 3,242 S gene variant complete genome nucleotide sequences was preconstructed as the phylogenetic input. A codon site with a dN/dS (v ) value of .1 was regarded as under positive selection when the posterior probabilities inferred from the four algorithms were $0.95 (P # 0.05).
Homology modeling of protein structure, structure comparison, and function prediction. S protein structures were predicted using the SWISS-MODEL (34), an automated homology modeling approach, based on SARS-CoV-2 S variant amino acid sequences available in the Protein Data Bank (PDB) (PDB entries 6ZP2, 6VXX, 7LSS, 7LYQ, 7CWN, 7CN8, and 6LXT). Molecular docking of SARS-CoV-2 S proteins and hACE2 (residues 23 to 63) was performed using pyDock (35) and Chimera UCSF (50). The percent identity of S protein variants and prototype was analyzed using Chimera UCSF. The total binding energy of S proteins to hACE2 was predicted using pyDock. Prediction on major histocompatibility complex (MHC) class I T cell immunogenicity was performed using IEDB analysis resources (51,52), while B cell epitope probability of the S1-RBD or entire S protein was predicted using BepiPrep-2.0 (38).
Data availability. The complete genome sequences of coronaviruses analyzed in this study were downloaded from the Global Initiative on Sharing All Influenza Data (GISAID).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.