Functional evolution of SARS-CoV-2 spike protein: Maintaining wide host spectrum and enhancing infectivity via surface charge of spike protein

The SARS-CoV-2 virus, which causes the COVID-19, is rapidly accumulating mutations to adapt to the hosts. We collected SARS-CoV-2 sequence data from the end of 2019 to January 2023 to analyze for their evolutionary features during the pandemic. We found that most of the SARS-CoV-2 genes are undergoing negative purifying selection, while the spike protein gene (S-gene) is undergoing rapid positive selection. From the original strain to the alpha, delta and omicron variant types, the Ka/Ks of the S-gene increases, while the Ka/Ks within one variant type decreases over time. During the evolution, the codon usage did not evolve towards optimal translation and protein expression. In contrast, only S-gene mutations showed a remarkable trend on accumulating more positive charges. This facilitates the infection via binding human ACE2 for cell entry and binding furin for cleavage. Such a functional evolution emphasizes the survival strategy of SARS-CoV-2, and indicated new druggable target to contain the viral infection. The nearly fully positively-charged interaction surfaces indicated that the infectivity of SARS-CoV-2 virus may approach a limit.


Introduction
The SARS-CoV-2 virus has been causing COVID-19 pandemic for more than three years. During the pandemic, the virus is rapidly accumulating mutations. The three major variants, namely alpha, delta and omicron, caused significant waves of infection worldwide in early 2021, mid 2021 and early 2022. Each infection wave is much more serious than the initial outbreak in Wuhan, China around early 2020, and the infections in each wave increased dramatically [1]. With the mitigation of quarantine policies in many countries, the transmission will be booming, and the transmission between human and animals may become more frequent. Thus, omicron should not be the last variant of SARS-CoV-2. The main evolutionary force of the viral mutations includes infectivity, reproduction efficiency and immune evasion. The earliest infectivity-increasing mutations were observed in China (V367F, W436R and D364Y), where strict isolation policies were deployed; in contrast, the other early mutations in the rest of the world did not show increased affinity of S-RBD and human ACE2 because of relaxed quarantine measures [2,3]. The worldwide massive vaccination was also a driving force of virus evolution. The major variants emerged soon after such massive vaccination, causing immune evasion and significantly reduces the effectiveness of vaccines [4,5]. However, little was known about the mutations on the reproduction efficiency, i.e. the adaptation on the viral protein synthesis.
The epidemiology indicated that the SARS-CoV-2 virus is adapting human for co-existence due to the increasing infectivity and decreasing mortality. However, many studies showed that the SARS-CoV-2 is under strong purifying selection [6,7]. This normally means that the functional mutations are being excluded, which is counteracting the adaptation [8]. Such a contradiction needs more clarification.
In this study, we focused on the translational adaptation of the SARS-CoV-2 virus and the surface properties of the spike protein to provide more insights on the abovementioned questions, and depict the major trend of the SARS-CoV-2 evolution during the evolution.

Sequencing data analysis
The original RNA-seq and Ribo-seq data (Calu3 cells infected with SARS-CoV-2 for 7 h) were downloaded from Gene Expression Omnibus database (GEO) under accession number GSE149973 [9], In brief, all raw sequencing data low quality reads and linker were removed using fastp [10]. For the quantification of SARS-CoV-2 gene expression, FANSe3 program was used to align the clean reads to human refmRNA (UCSC) to remove human reads [11,12], after human reads removal, the remaining reads mapped to the Wuhan Hu-1 (NCBI Accession NC_045512.2. Mapping parameter, Ribo-seq: -S10 -E1 -U1, RNA-seq: -S12 -E2), reads per kilobase per Million mapped reads (RPKM) of each gene were calculated in golang program.

Proteomic and RNC-seq expression data analysis
Proteomic (quantitative technique: DIA, A549 cells were infected with SARS-CoV-2 for 24 h) and RNC-seq (human HBE cell line) expression data were extracted from two independent studies supplementary materials, respectively [13,14].

SARS-CoV-2 genomes data analysis
SARS-CoV-2 genomes for temporal evolution analysis were downloaded from NCBI SARS-CoV-2 database at 2023/01/29 (https:// www.ncbi.nlm.nih.gov/SARS-CoV-2/), with a list of over 2.9 million SARS-CoV-2 genomes, only sequences with length longer than 29500 and having zero N bases will be included in the subsequent analysis. Genomes for CAI and ITE evolution analysis were downloaded from GISAID database (www.epicov.org).

Codon adaptation analysis
CAI (Codon Adaptation Index), ITE (Index of Translation Elongation) and RSCU (Relative Synonymous Codon Usage) were calculated using DAMBE7, CAI is a measure of the host codon adaptation for a gene sequence. ITE is similar to CAI, which incorporates both tRNA-mediated selection and background mutation bias and fits protein production better than CAI [20][21][22].

Electrostatic energy and surface electrostatic potential calculations
The complex structure of the SARS-CoV-2 RBD and ACE2 complex (PDB: 6M0J), SARS-CoV-2 Spike with furin cleavage site structure (PDB: 7FG7) and Human furin structure (PDB: 5MIM) were obtained from the PDB database.
The three-dimensional structures of three variants (alpha, delta and omicron) were generated by in silico mutation using UCSF chimera software v1.15 (Dunbrack backbone-dependent rotamer library method) [23,24], all site mutations selected the most probable isoforms. PyMol was used to calculate the surface electrostatic potential by Adaptive Poisson Boltzmann solver (APBS), the color scale range was set from − 1.0-1.0 kT/Å. We used delphi force webtool (http://compbio.clemson.edu/ delphi-force/) to calculate the electrostatic energy of the RBD and ACE2 binding domain, the required PQR file was generated by delphiPKa (http://compbio.clemson.edu/pka_webserver/) [25,26]. DelphiPKa calculation was used charmm force field, pH and salt concentration were set to 7 and 0.15 M, respectively, other setting uses standard mode, delphiforce calculation uses the Gaussiansmoothed mode, the Grid Resolution uses 2.0 Å, and the salt concentration is set to 0.15 M.

The S-gene is the only gene undergoing stepwise positive selection
We used Ka/Ks to evaluate the selection of the SARS-CoV-2 sequences from the end of 2019 to January 2023. Ka/Ks > 1 means that the gene is undergoing positive selection and drives the adaptation. Ka/Ks < 1 indicates purifying selection, which stabilizes the gene. It is noted that the Ka/Ks of the S-gene was below 1 till February 2021, showing that in the first year of pandemic, S-protein was under purifying selection in general, indicating that the selective pressure was not that significant. However, the Ka/Ks surpassed 1 from February 2021 and is constantly increasing, showing that the Sprotein is undergoing an increasingly stronger positive selection (Fig. 1A). In the era of Omicron pandemic, the Ka/Ks of the S-protein suddenly increased to more than 5. The Ka/Ks analysis of omicron sublineages showed that the Ka/Ks tend to increase over the time: the early omicron BA.1 lineage showed Ka/Ks between 2 and 8; the BA.2 lineage started from Ka/Ks= 4 and increased to 8-9 in early 2023; the BA.4, BA.5 and BQ.1 lineages showed Ka/Ks about 7-8; the latest XBB lineage had Ka/Ks over 10 ( Supplementary Fig. S1B). This echoed much faster evolution of omicron variants than the earlier variants.
The M protein, which encodes the most abundant membrane protein, was undergoing purifying selection until the omicron variant. In the omicron variant, the Ka/Ks of M gene was approximately 1.1 (Fig. 1A), indicating that it was undergoing neutral selection. In contrast, the other important and highly expressed genes, such as the N-gene as structural genes, the orf1ab gene for viral replication, the orf3a and orf7a genes that mediates cell inflation, were undergoing purifying selection, because their Ka/Ks were almost constantly below 1. The Ka/Ks of N-gene was transiently above 1 between February and September 2021 and then dropped back below 1, suggesting that the mutations generated in this period were not the primary factor to maintain pandemic. These results showed that the S-gene was the only SARS-CoV-2 gene that underwent positive selection. Such positive selection may be driven by the massive vaccination worldwide and preexisting antibody due to natural infection, because most COVID-19 vaccines include or produce Sprotein.
Interestingly, the stepwise elevation of the positive selection coincides temporally with the major SARS-CoV-2 variation types (alpha, delta, omicron. Fig. 1A). However, within one variant type, the Ka/Ks of S-gene has a long term continuous upward or downward trend (Fig. 1B). This indicated that the mutations were accumulated stepwise.

SARS-CoV-2 codon usage did not evolve towards more efficient translation in human
Adaptation to the host translation system is needed for efficient viral protein production and replication of the virus. Such adaptation is largely represented by similar codon usage in the host cells. Using the CAI (codon adaptation index) and ITE (index of translation elongation) to represent the codon adaptation, we showed that the SARS-CoV-2 genes (demonstrated using the original strain Wuhan Hu-1) with higher CAI and ITE were expressed in a significantly higher level (Fig. 2A). This trend was also visible in Ribo-seq data (Fig. 2B), but CAI and ITE do not correlate to mRNA expression level at all (Fig. 2C). These results showed that the protein expression preference towards the codon adaptation happened indeed in the translation process of SARS-CoV-2. However, SARS-CoV-2 CAI and ITE were significantly lower than that of the human cells and other well-known human viruses, respectively, while the human viruses have comparable CAI and ITE against the human cells (Fig. 2D). Compared to the other coronaviruses that can infect human (SARS-CoV, MERS-CoV, HCoV-229E, HCoV-HKU1, HCoV-NL63 and HCoV-OC43), the CAI and ITE of the SARS-CoV-2 is comparable to the SARS-CoV, and slightly lower than the other coronaviruses but without statistical significance (P > 0.05). Note that the other coronaviruses have been shown to infect multiple species other than human. Coincidently, the CAI and ITE of these coronaviruses were significantly lower than that of the human cells and other well-known human viruses (Fig. 2D). Here, we used the genomic data to calculate the CAI and ITE for all virus, and we used the RNC-seq (sequencing of the translating mRNAs) data of the human lung cell line A549 to reflect the actual codon demand in human lung cells, which was the primary infection host of SARS-CoV-2. These results showed that the SARS-CoV-2 were not fully adapted to human in terms of codon usage.
To assess the evolutional trend of the codon adaptation of SARS-CoV-2 during the pandemic, we calculated the CAI and ITE of the typical strains of the wild-type (the original strain in Wuhan) and the major variants alpha, delta and omicron (Fig. 2E). Most genes showed a constant CAI and ITE during the pandemic. Only the orf7b gene showed an increase in CAI and ITE. However, the orf7b was lowly expressed, indicating the limited biological impact of its increase in codon adaptation.
The S-gene, which is the only gene underwent positive selection, showed a decrease in CAI and ITE from the original strain to delta variant, and minimally increased in omicron variant. Such a minimal increase may not cause any visible biological effect. We also calculated the RSCU of the S-gene of all collected virus sequences. Compared to condensed RSCU distribution of human genes, S-genes of all major variants showed a wide scatter of S-gene RSCU values (Fig. 2F), indicating that the optimal codon usage was not a selective force during the pandemic. Translation efficiency is determined by translation initiation and translation elongation, while CAI, ITE and RSCU can only reflect translation elongation efficiency. According to previous research, 5'UTR minimum free energy (MFE) characterizes translation initiation efficiency. We calculated the SARS-CoV-2 5′UTR MFE between end of 2019 to January 2023, it can be seen that there was no continuous upward trend in three years 5′UTR evolution (Fig. 2G), and there was only a brief and small increase before July 2021 and January 2022, indicating that the translation initiation efficiency did not have sign of enhancement.
In sum, the SARS-CoV-2 virus did not mutate towards optimal translation efficiency. We need to investigate the evolutional advantage from other aspects.

S-gene accumulated positive charges during pandemic
Electrostatic force is a major force to guide protein-protein interactions. Compared to van der Waals force, the electrostatic force decays much slower against distance, which facilitates the binding of virus to other proteins in various important processes. We analyzed the changes of charges (at pH=7) and found similar pattern as the Ka/Ks analysis. The S-gene showed a steady increase in positive charge, and this increase was remarkably speeded up since June 2021, which was the time when delta variant was leading the pandemic. The charges of other genes did not apparently change during the pandemic in general (Fig. 3A). The accumulation of the positively charged amino acid residues were mainly found in the RBD and furin   (Fig. 3B). The net charge in RBD domain increased from 1.5 to 4, and the net charge in furin domain increased from 3 to 4. The increase in the charge at the furin cleavage site can theoretically increase the cleavage efficiency, but it has been verified in previous studies that omicron furin has a lower cleavage efficiency, It is related to allosteric modulations of the linoleic acid binding pocket [27]. Overall, these results indicated that these changes may be functionally meaningful.
Indeed, the human ACE2 protein is largely negatively charged in surface, especially in the region that binds SARS-CoV-2 S-RBD domain (Fig. 3C). The increasing positive charge in RBD (Fig. 3D) obviously elevated the electrostatic affinity, confirmed by energy computation on the structure of the RBD-ACE2 binding state (Fig. 3E). This is also supported by other investigations [30]. Similarly, the furin enzyme active site is highly negatively charged. The furin cleavage site in S-gene accumulated positive charges during evolution (Fig. 3F), which should increase the binding efficiency and accelerate the enzymatic cleavage.

Discussion
Our results may provide novel insights on the SARS-CoV-2 evolution. It has been shown that the narrow-spectrum +ssRNA viruses evolve their codon usage matching their hosts' tRNA better than the broad-spectrum viruses. Such adaptation is to optimize their protein expression [31,32]. In our analyses, we showed that the SARS-CoV-2 did not evolve towards optimized codon usage in human lung cells, indicating that it maintains the translation ability to adapt multiple hosts expression system. It has been experimentally shown that the SARS-CoV-2 virus infects deers, hamsters, North American raccoons, striped skunks, white-tailed deer, raccoon dogs, fruit bats, deer mice, domestic European rabbits, bushy-tailed woodrats, tree shrews and multiple non-human primate species [33]. Evidence were also demonstrated that cats, ferrets, fruit bats, hamsters, dogs, deer mice and white-tailed deer can transmit the virus within the species [33][34][35]. Such ability facilitates formation of a natural reservoir of virus. Even with the strictest lockdown policy, the reservoir serves and animals as a refuge of SARS-CoV-2 and may transmit to human again when possible. Indeed, SARS-CoV-2 transmission from mink and pet hamsters to humans has been confirmed [36,37]. Analyses also suggested that the progenitor of omicron variant jumped from human to mice and jumped back to humans and caused the omicron outbreak [38]. Maintaining a broad spectrum of hosts is inferred to facilitate the survival of SARS-CoV-2, especially in the regions deploying "dynamic zero-COVID policy" like China. In other countries where not so strict policies were applied, the massive vaccination also drives the virus to maintain the ability to infect other species to survive. Therefore, it is understandable that the major evolutionary driving force is applied on the infectivity, i.e. the spike protein.
Coincidently, we showed that the S gene is the only gene which underwent the strong positive selection. Other genes exhibit only weak or temporary positive selection. Since the infection is conducted via the binding of S-RBD and human ACE2, two major ways can enhance this affinity: increasing the electrostatic interaction and mutating specific residues to from more stable van der Waals force on the interaction interface. Electrostatic force decreases far slower than the van der Waals force against distance, making it more suitable to serve as guide force of the two molecules. Indeed, van der Waals energy between S protein and human ACE2 barely increased in omicron variant compared to the initial wild type [39]. ACE2 is highly conserved in mammals [40],and has a negatively charged surface in the RBD binding region for multiple species (Fig. S1C). Therefore, the increase of electrostatic interaction is the major factor of the increasing binding affinity during the evolution [41,42].
Increasing the electrostatic interaction is also universal for all species, and thus could potentially beneficial for a broad host spectrum. The occurrence of the same mutation in variants is a sign of convergent evolution. Nevertheless, our omicron sublineage analysis shows that the direction of evolution is deviate from the convergent evolution for the past three years, due to XBB lineage has the lowest net charge among all omicron lineages (−2.5 net charge). The XBB lineage also showed the highest Ka/Ks analysis, XBB has higher Ka/ Ks, which means under stronger positive selection (Fig. S1A-B). The convergence and divergence of the evolution simultaneously occurred in the virus, illustrating the multifaceted and complex evolutionary scenario.
There are other interesting studies showing different conclusions [43], synonymous substitutions in SARS-CoV-2 adapted towards human codon usage patterns over time. To be noted, the global analysis of this study includes both synonymous and non-synonymous mutations in codon and translation adaptation, showing that there is no significant adaptation towards human codon usage in general, indicating that the evolutionary force may not come always from the human host.
With the average reported R 0 = 9.5 [44], the SARS-CoV-2 omicron variant is already among the most infectious viruses in the human history. However, since the accumulation of positive charges in its S-RBD and Furin digestion site is the key aspect of the high infectivity, it can be used as druggable target. Polar reagents which neutralize the charges may counteract the infectivity. In another aspect, the positive charges cover almost the entire interaction surface of the S-RBD and Furin digestion site of the omicron variant, showing that there's only restricted potential to accumulate more positive charges in these interfaces, which indicates that the infectivity of SARS-CoV-2 may approach to a limit.

Data availability statement
The data presented in this study are available on accession number.