Codon usage, phylogeny and binding energy estimation predict the evolution of SARS-CoV-2

In the frames of a One Health strategy, i.e. a strategy should be able to predict susceptibility to infection in both humans and animals, developing a SARS-CoV-2 mutation tracking system is a goal. We observed that the phylogenetic proximity of vertebrate ACE2 receptors does not affect the binding energy for the viral spike protein. However, all viral variants seem to bind ACE2 better in many animals than in humans. Moreover, two observations highlight that the evolution of the virus started at the beginning of 2020 and culminated with the appearance of the variants. First, codon usage analysis shows that the B.1.1.7 (alpha), B.1.351 (beta) and B.1.617.2 (delta) variants, similar in the use of codons, are also similar to a virus sampled in January 2020. Second, the host-specific D614G mutation becomes prevalent starting from March 2020. Overall, we show that SARS-CoV-2 undergoes a process of molecular evolution that begins with the optimization of codons followed by the functional optimization of the spike protein.


Introduction
A strategy to identify the host animal ('reservoir') of SARS-CoV-2 stands on the analysis of how the virus uses codons ('codon usage'). The codon usage pattern of SARS-CoV-2 is similar to that of bat and human SARS-CoVs, although some (membrane protein-coding) genes are more similar to pangolin virus P1E [1], while others (orf1ab, spike, and nucleocapsid) show a codon usage pattern more similar to RaTG13 [1,2]. Spike and membrane protein-coding genes also undergo different evolutionary pressures [1]. Notably, the E gene of SARS-CoV-2 has the highest codon usage bias and appears to be under natural selection [3].
Most of the SARS-CoV-2 high-frequency codons end in A or T [2,4]. In addition, SARS-CoV-2 uses pyrimidine-rich codons more smoothly [4]. Also, SARS-CoV-2 has significantly lower GC compared to humans but, notably, its codon composition strongly correlates with that of some human genes expressed in the lung, suggesting that the GC content is crucial for adaptation to (a) specific host tissue(s) [4,5]. Adaptation to the human host is also supported by dinucleotide analysis: UG and CA dinucleotides are found to be preferred in SARS-CoV-2; in contrast, CG dinucleotide is avoided [6].
Besides similarity in codon usage, the binding energy between Receptor-Binding Domain (RBD) and Angiotensin-Converting Enzyme 2 (ACE2) also deserves attention. Computational methods have been used to estimate binding energy and similarity between ACE2 and RBD, from both human and vertebrates, with a focus on RaTG13 and Pangolin-CoV Spike proteins [7][8][9].
In these frames, we used a docking method whereby the N501Y mutation is predicted to increase ACE2 binding energy by ~ − 20 Kcal/ mol [10]. The prediction is now confirmed by both experimental and epidemiological data. In fact, N501Y influences protein S affinity for the receptor enhancing infectivity of the virus and, possibly, virulence [11,12], and all the most widespread variants in mid-2021 (June 2021), i.e., B.1.1.7 (alpha), B.1.351 (beta) and P.1 (gamma) show the N501Y mutation, despite their distinct geographical origin (United Kingdom, South Africa, and Brazil, respectively). However, at the end of 2021 (November 2021), the most common variant is B.1.617.2 (delta). This variant does not have the amino acid substitution N501Y but exhibits two other substitutions: L452R and T478K. As reported by an in vitro neutralizing assay performed with pseudo-typed viruses, L452R increases resistance to some monoclonal antibodies [13], while in-field observations suggest that administering one dose of the BNT162b2 vaccine produces 5.8 times lower neutralizing activity for variant B.1.617.2 (delta) than for variant B.1.1.7 (alpha). On the other hand, the administration of two doses restores the neutralizing effect [14]. Moreover, the two mutations also increase the binding affinity between ACE2 and RBD [15]. Following this line of reasoning, another mutation functionally emerges in B.1.617.2 (delta) variant, i.e. P681R, which instead of operating on ACE2 receptor binding seems to primarily act on S1/S2 cleavage by furin [16].
In this study, we used codon analysis to compare human SARS-like (including SARS-CoV), and bovine, bat, and other mammalian viruses. (delta) SARS-CoV-2. We used docking to investigate the relationships between phylogeny and the binding energy of RBD and ACE2. Docking simulations, initially performed using the RBD of SARS-CoV-2 (WT, Wuhan strain, S lineage), were extended using the RBD of the variant viruses B.1.1.7 (alpha), B.1.351 (beta), and B.1.617.2 (delta).

Methods
Sequences were downloaded from NCBI or UniProt. We used MOD-ELLER v.10.1 to model ACE2 [21]. ACE2 N-terminal signal sequence was predicted using SignalP v.5 [22] and eliminated from the models. Phylogenetic analysis was performed using MEGA X [23], while we used ClustalO to align sequences [24]. We used HDOCK [25] to run the docking simulations, using ACE2 models ('receptor') and three different types of WT RBD ('ligands') (6 M17, 6LZG, and 6M0J). FireDock [26] was used to calculate the free binding energy (Kcal/mol) defined as Global Energy Score (GES). For each receptor, mean and standard deviation (SD) were calculated.
We used the Sequence Manipulation Suite [27] to count the codons for SARS-CoV-2 viruses (WT and variants), and the Kazusa database (htt ps://www.kazusa.or.jp/codon/) to count the codons of the other viruses. We used PAST to perform multivariate analysis for the codon usage data [28].

ACE2 phylogenetic diversity in vertebrates and affinity to SARS-CoV-2 spike protein
According to the docking results (Fig. 1, Tables S1 and S2), 13 out of 33 structures showed a better binding affinity with ACE2 proteins of vertebrates compared to the value obtained with human ACE2. The simulations included 2 proteins from Pan troglodytes (chimpanzee), 5 out of 6 proteins from Carnivora and 5 out of 9 proteins from Microchiroptera. Primates showed better values, followed by Carnivora (Caniformia and Feliformia) and Microchiroptera (Fig. 1). Domestic herbivorous, Megachiroptera and Reptilia showed values significantly worse than the panel average.
The signal peptides also showed diversity in vertebrates. All sequences, but the reptilian Python bivittatus (Burmese python), exhibited a secretion signal (Fig. S1, Table S3). Two different cleavage sites were identified in many sequences: the cut occurring on S19 or A17 (ref. human ACE2). Fifteen sequences showed S19 as better cut site, while 9 sequences showed A17. All the sequences with a S19 cut site showed a conserved domain 'LSLVAVTAAQS', suggesting that modification of this motif can influence the cut. This domain was present also in 5 out of 15 proteins showing A17 as better site, but these 5 sequences showed a G in position 3, which can also influence the cut. According to this second set of docking simulations ( Fig. 2A, Table S4 Fig. S2 reveal that S. suricatta (meerkat), R. sinicus (Chinese rufous horseshoe bat) share the N-terminal sequence of ACE2 that forms the most important interaction interface between RBD and ACE2. Hence, changes in these interfaces almost certainly result in a reduction in binding energy. As for the data generated with variant B.1.617.2 (delta), ACE2 sequences from Felis catus (cat), Canis lupus familiaris (dog) and C. griseus (Chinese hamster) showed GES values comparable to that of human ACE2, while ACE2 sequences from all the other species showed less negative GES values. Specifically, F. catus (cat), C. lupus familiaris (dog), C. griseus (Chinese hamster) and H. sapiens ACE2 sequences have very similar N-terminal sequences considering the first 120 amino acids (Fig. S2C).

The D614G mutation and its relationship with the reservoirs
The first noteworthy mutation that became prevalent as early as mid-2020 is D614G. This mutation is interesting because an acidic amino acid (D) is replaced by an apolar-achiral amino acid (G). We compared various CoVs of the porcine, bat, human, and SARS origin using BLAST. We also analyzed the databases reported in two studies [17,18] on SARS-CoV-2. Notably, D614G is recorded for SARS-CoV and other human CoVs since 2004 (Fig. 2B) [29]. Furthermore, a basic amino acid was identified (N) in the porcine CoV, while the bat viruses exhibited an acidic amino acid (D or E).

Codon analysis of the B.1.1.7 (alpha), B.1.351 (beta) and B.1.617.2 (delta) variants
The Relative Codon Frequency [RCF (%), File S1] was used to calculate Pearson's correlation index (RCF of human as reference) ( Table S5) Fig. S4 shows a violin plot considering all the databases and all the codons. The diversity of these groups of viruses is also shown by an NM-MDS using the Bray-Curtis index (Fig. 3A). Furthermore, the network plot (Fig. 3C) shows that Civet-CoV closely relates to SARS-CoV, while Bat-CoV RaTG13 closely relates to SARS-CoV-2. Two other network plots and an NM-MDS (Fig. S5) show the relationships between SARS-CoV-2 WT and the three variants B.1.1.7 (alpha), B. 1.351 (beta), and B.1.617.2 (delta). This plot shows that the variants are similar one another and different from the SARS-CoV-2, although a WT virus is more like variants.
Comparing the values obtained by subtracting the RCF values of the B.1.1.7 (alpha), B.1.351 (beta), and B.1.617.2 (delta) viruses from the RCF calculated for the WT virus, it was evident that the RCF changes in a similar way for each codon (Fig. 3B). The Pearson's index calculated for the mean RCF values of the three viruses was >0.99.

Discussion and conclusions
The analysis of the use of codons has the advantage not only of considering both uncoded mutations and mutations that cause a change in the amino acid composition of proteins but also of providing information on the adaptation of the virus to the host. The codon usage bias was related to some specific features during the infection; e.g., it was observed that the infection with SARS-CoV-2 reduces the translation rate of highly expressed host proteins that share the codon usage bias of the virus [30]. Also, codon usage varied rapidly during the pandemic [3] as well as for isolates from distinct geographic regions [2]. Our results demonstrate that codon analysis is a valuable tool for tracking virus adaptation. Likewise, some specific mutations can be used for this purpose. The D614G is present in data referred to humans, for both SARS-CoV-2 and SARS-CoV. The same mutation is absent in bat, bovine and porcine viruses. This confirms D614G as a good marker to track viral evolution. The docking results reveal that carnivore ACE2 binds RBD with good energy. This confirms that some animals, including pets, can be hosts of SARS-CoV-2. Furthermore, the docking analysis for the variant form of the B.1.1.7 (alpha) and B.1.351 (beta) virus shows that the variants exhibit higher binding energy for almost all the animals considered. On the contrary, B.1.617.2 (delta) variant shows comparable binding energy values with ACE2 sequences from human, F. catus (cat), C. lupus familiaris (dog) and C. griseus (Chinese hamster). These vertebrates also show a conserved N-terminal sequence suggesting that B.1.617.2 (delta) is well adapted to human ACE2. The docking data obtained in this study are consistent with the literature. E.g., the Rhinolophus affinis (intermediate horseshoe bat) ACE2 is a functional receptor for SARS-CoV-2 [31].
Taken together, these data show how computational approaches can be useful for analyzing and monitoring the spillover and evolution of viruses of environmental origin by comparatively analyzing sets of virus variants and vertebrates. Codon analysis is more effective in measuring genetic variations that occur over short periods. On the other hand, the docking analysis can provide information on animal hosts. In this respect, this work is a proof-of-concept, which can be useful for the implementation of bioinformatics and functional comparative genomics strategies from a One Health perspective. All the methods here shown are fast, low-cost, and powerful, which implies possible costs savings as a major benefit again from a One Health perspective. Last but not least, our approaches could be used to develop new protocols for rapid screening, screening of environmentally sourced viruses, bioinformatic tools to predict structure-to-function relationships, etc.

Ethical approval
Not required.

Author contributions
PA and MA conceptualized the study. MC performed computational analysis. TV implemented the database. PA, MA, and MC wrote the draft version. PF and MST revised the draft of the manuscript. All authors critically read and revised the manuscript and approved the manuscript for publication.

Declaration of Competing Interest
All authors have no conflict of interest to declare.