Genome-Wide Variation in Betacoronaviruses

ABSTRACT The Severe acute respiratory syndrome coronavirus (SARS-CoV) and SARS-CoV-2 originated in bats and adapted to infect humans. Several SARS-CoV-2 strains have been identified. Genetic variation is fundamental to virus evolution and, in response to selection pressure, is manifested as the emergence of new strains and species adapted to different hosts or with novel pathogenicity. The combination of variation and selection forms a genetic footprint on the genome, consisting of the preferential accumulation of mutations in particular areas. Properties of betacoronaviruses contributing to variation and the emergence of new strains and species are beginning to be elucidated. To better understand their variation, we profiled the accumulation of mutations in all species in the genus Betacoronavirus, including SARS-CoV-2 and two other species that infect humans: SARS-CoV and Middle East respiratory syndrome coronavirus (MERS-CoV). Variation profiles identified both genetically stable and variable areas at homologous locations across species within the genus Betacoronavirus. The S glycoprotein is the most variable part of the genome and is structurally disordered. Other variable parts include proteins 3 and 7 and ORF8, which participate in replication and suppression of antiviral defense. In contrast, replication proteins in ORF1b are the least variable. Collectively, our results show that variation and structural disorder in the S glycoprotein is a general feature of all members of the genus Betacoronavirus, including SARS-CoV-2. These findings highlight the potential for the continual emergence of new species and strains with novel biological properties and indicate that the S glycoprotein has a critical role in host adaptation. IMPORTANCE Natural infection with SARS-CoV-2 and vaccines triggers the formation of antibodies against the S glycoprotein, which are detected by antibody-based diagnostic tests. Our analysis showed that variation in the S glycoprotein is a general feature of all species in the genus Betacoronavirus, including three species that infect humans: SARS-CoV, SARS-CoV-2, and MERS-CoV. The variable nature of the S glycoprotein provides an explanation for the emergence of SARS-CoV-2, the differentiation of SARS-CoV-2 into strains, and the probability of SARS-CoV-2 repeated infections in people. Variation of the S glycoprotein also has important implications for the reliability of SARS-CoV-2 antibody-based diagnostic tests and the design and deployment of vaccines and antiviral drugs. These findings indicate that adjustments to vaccine design and deployment and to antibody-based diagnostic tests are necessary to account for S glycoprotein variation.

a For each virus species, one annotated accession describing the full genome was used as a reference. Only accessions describing complete genomes were used for nucleotide variation analyses. Accessions were downloaded from NCBI on 6 April 2020. For SARS-CoV-2, additional accessions were downloaded on 13 May 2020.
genetic variation are RNA recombination and nucleotide insertions, deletions, and substitutions introduced during RNA replication (17). Genetic variation combined with selection pressure imposed by genetically diverse hosts favors the accumulation of mutations that support the emergence of new virus strains and species (18)(19)(20). These general principles of virus evolution explain several features of b-CoVs, including SARS-CoV-2. Both SARS-CoV and SARS-CoV-2 likely emerged through RNA recombination of two species infecting bats. The recombinant progeny then adapted to infect humans (3). Within a year of the initial description, several SARS-CoV-2 strains have been detected that mainly accumulate mutations in the S glycoprotein (16,21). Temporal and spatial genetic relationships show accumulation of mutations in the genome, allowing SARS-CoV-2 isolates to be differentiated (22,23). In the United States, a variant carrying the Q677P substitution in the S glycoprotein is now abundant in the southwest, and new mutations within this variant have created sublineages (24). Furthermore, variants infecting the same individual have been detected (20,(25)(26)(27). The S glycoprotein is the common target for neutralizing antibodies developed in response to natural infection or vaccines (28)(29)(30)(31). Neutralizing antibodies are formed against the prefusion conformation of the entire S glycoprotein. In contrast, nonneutralizing antibodies are formed against the S2 subunit (8,32). Antibodies against the S glycoprotein are used as markers in diagnostic assays (28,29,32). Accordingly, the emergence of new variants with mutations in the S glycoprotein has the potential to compromise the efficacy of vaccines and the immunity mediated by natural infection (20,27,(33)(34)(35)(36). Conversely, antibodies developed through natural infection or vaccines may impose selection pressure on b-CoVs (20,36). For these and other reasons, it is imperative to understand the biological properties of b-CoVs that contribute to the emergence of new strains and species.
Virus variation, evolution, and adaptation to diverse hosts are mediated by genetic determinants in the viral genome and selection pressure imposed by the host (37)(38)(39)(40). Accordingly, characterization of genomic variation is fundamental to our understanding of b-CoV evolution and host adaptation. We hypothesized that genetic determinants of variation in b-CoVs are conserved across species, including SARS-CoV-2. In this study, we profiled the genomic variation in all species in the genus Betacoronavirus. Genome-wide nucleotide variation analyses combined with amino acid variation analyses revealed that variation patterns are conserved across b-CoVs, including the presence of variable areas at homologous locations. The most variable parts are the S glycoprotein, followed by open reading frame 8, accessory proteins 3 and 7, and the N protein. Genome-wide distribution of mutations of all b-CoVs provides an explanation for the emergence of new b-CoV species, such as SARS-CoV-2, and for the emergence of strains. These findings and published results (16,20,24,41,42) predict how and where SARS-CoV-2 will accumulate mutations and differentiate into new biological strains. SARS-CoV-2 will likely evolve as it adapts to genetically diverse human populations (20,43) and possibly to selection constraints imposed by vaccines, antiviral drugs, and antibodies developed against natural infections (20,30,36). Our results underscore the potential for the continual emergence of new b-CoV species and strains with novel biological properties.

RESULTS
S glycoprotein is more variable than the rest of the genome. SARS-CoV-2 strains identified to date (16,21) differ in the accumulation of nonsynonymous substitutions in the entire genome (Fig. 1A). However, nonsynonymous substitutions preferentially accumulate in the S glycoprotein and in the N protein. In all other parts of the genome, mutations accumulate to a frequency that is equal to or less than that expected randomly (Fig. 1B). New mutations continue to arise and diversify SARS-CoV-2 into variants (20,24). However, recurrent mutations mapped along the SARS-CoV-2 genome preferentially accumulate in the S glycoprotein (22,23).
To further test preferential accumulation of mutations in the S glycoprotein, we measured single-nucleotide polymorphisms (SNPs) in the entire SARS-CoV-2 genome and separately in the S glycoprotein, normalized to their respective length. SARS-CoV was included in the analysis as it was closely related to SARS-CoV-2. Nucleotide accessions were added in increments of 50 or 10 for SARS-CoV-2 and SARS-CoV, respectively. In both species, the S glycoprotein accumulated more polymorphic sites than the rest of the genome proportionately ( Fig. 2A)  in the S glycoprotein was similar to or higher than that of the rest of the genome in SARS-CoV and SARS-CoV-2 over time (Fig. 2B) and in all hosts (civets, humans, mice, and Vero cells) (Fig. 2C). In SARS-CoV-2, the S glycoprotein represents approximately 17% of the genome (Fig. 1B). However, it accumulated at least 50% of the mutations (Fig. 1B). Accordingly, the S glycoprotein accumulated mutations at a frequency that is at least 3-fold higher than would be expected randomly. The frequency was even higher in accessions derived from nonhuman hosts and in SARS-CoV (Fig. 2C). These results show that the SARS-CoV-2 S glycoprotein is the most variable part of the genome.
Differentiation of SARS-CoV-2 based on the S glycoprotein. The variable nature of the S glycoprotein suggests it is a major contributor to SARS-CoV-2 evolution and diversification. To further test this model, using accessions from the United States, we generated a phylogenetic tree using the S glycoprotein and two time periods. Based on nucleotide (Fig. 3A) or amino acid sequence (Fig. 3B), accessions from January 2020 clustered separately from accessions from July and August 2020. With respect to the reference sequence Wuhan-Hu-1 (NC_045512.2), U.S. accessions from January accumulated two nucleotide substitutions (Fig. 3C) and one amino acid (H49Y) substitution that maps to the N-terminal domain of the S1 subunit (Fig. 3D). These mutations were not present on accessions from July and August 2020. Instead, eight new nucleotide ( Fig. 3C) and two new amino acid mutations were detected, a Q494L substitution in the receptor binding motif and an E1202V substitution in the transmembrane motif ( Fig. 3D). In both time periods, all accessions contained the D614G mutation (Fig. 3D). Consistent with recent observations (16,24), these results support the model that  mutations in the S glycoprotein mediate the emergence of new strains and that the S glycoprotein is a major contributor to SARS-CoV-2 evolution. Genome-wide variation in SARS-CoV-2. Although nonsynonymous substitutions preferentially accumulate in the S glycoprotein, other areas of the genome also accumulate mutations (Fig. 1). To characterize SARS-CoV-2 genome variation, we compared accessions early and later in the pandemic. Using a 50-nt window, variation in SARS-CoV-2 was measured for the 106 nucleotide accessions available on NCBI on 23 March 2020, the 2,315 nucleotide accessions available on 13 May 2020, and the 2,299 amino acid accessions available on 14 May 2020. No variable areas in the genome were detected in accessions representing the early part of the pandemic (Fig. 4A). However, by May 2020, variation was detected in several areas (Fig. 4B). SNPs were higher than the average of the genome in the S glycoprotein, ORF8, and at the N-terminal part of ORF 1a (Fig. 4B). A single-amino-acid polymorphism (SAP) analysis showed variation in the S glycoprotein maps to the N-terminal domain, the receptor binding domain, the fusion peptide-proximal region, the heptad repeat 2, and the transmembrane domain (Fig. 4C). These results are consistent with the detection of recurrent deletions that map to the N-terminal domain in immunocompromised patients (20). An order/disorder analysis showed that, in SARS-CoV-2, the S glycoprotein has intrinsically disordered areas in the receptor binding domain, C-terminal domain 2, and the fusion peptide proximal region (Fig. 4D). Intrinsically disordered regions often interact with multiple molecular partners, are highly plastic, and show high evolutionary rates (44). Consistent with these results, major mutations (D614G and Q677P) that render the virus more transmissible and pathogenic to humans (24,34,45) map near the hypervariable region in the disordered C-terminal domain 2 in S1 (Fig. 4C). The S1/S2 cleavage site is located within a variable region (Fig. 4D). In the U.S. accession subset analyzed here (Fig. 3), no variation was detected in the S1/S2 cleavage site. However, in the larger data set, mutations were detected at the S1/S2 cleavage site (Fig. 4C). Because variation in the S1/S2 cleavage site contributes to cellular tropisms and pathogenesis (13), these results suggest that the S1/S2 cleavage site tolerates mutations and can contribute to SARS-CoV-2 diversification.
Collectively, these observations support the model that the S glycoprotein is variable and mutationally robust and contains intrinsically disordered areas.
Genome-wide variation in betacoronaviruses. The variation pattern described above could be a property exclusive to SARS-CoV-2, to a subset of species, or a general property of b-CoVs. To distinguish the difference, we profiled the genome variation in all members of the genus Betacoronavirus. To ensure statistical power (46), the analyses described here were based on species with three or more accessions. At least three accessions describing complete genomes were available for 19 of the 21 species in the genus Betacoronavirus. The length of the genomes ranged from 29,855 to 31,190 nt (Table 1).
We measured nucleotide variation in all members of the genus Betacoronavirus (Table 1). SNPs and nucleotide diversity, estimated in a 50-nt window, showed that b-CoVs in general and species in the subgenus Sarbecovirus in particular are highly variable (Fig. 5A). Due to its high variability (47), and as a point of comparison, we estimated HIV-1 variation using the same method. The most diversity was observed in Rousettus bat coronavirus HKU9, other species infecting bats, and MERS-CoV (Fig. 5A). In these species, more than 25% of the nucleotides in the genome were polymorphic. Genomic variation is not a function of the number of accessions, because similar results were observed using the nucleotide diversity index (Pi), which normalizes for the number of accessions (48) (Fig. 5A).

C S protein variation
The genome of b-CoVs consists of 11 to 14 open reading frames (ORFs). Coding regions for accessory and structural proteins located 39 of the S proteins are not in synteny (1,50). After normalizing to their length, the most variable parts of the genome were the S glycoprotein, followed by the E protein, protein 7, the M protein, and the N protein (Fig. 5B). The lowest variation was detected in open reading frame 1b (Fig. 5B), which codes for nonstructural proteins that mediate virus replication: RNA-dependent RNA polymerase, RNA helicase, exonuclease, endoribonuclease, and methyltransferase. Within the subgenus Sarbecovirus, the most variable part of the genome was the S glycoprotein, followed by ORF8, accessory proteins 3a, 7a, and 7b, and N protein. The lowest variation was detected in open reading frame 1b (Fig. 5C). Genome-wide maps illustrate the distribution of nucleotide variation (Fig. 7). In the S glycoprotein, ORF8, and 3a, nucleotide polymorphisms were higher than the average of the genome. In the two species infecting bats, variation was also detected in nsp1 (inhibits host gene expression) (51), nsp2 (inhibits cell signaling) (52), and nsp3 (papainlike protease) (53). The lowest variation was detected in the 39 half of ORF1a and in ORF1b (Fig. 7A). This pattern was observed in all species in the subgenus Sarbecovirus, including SARS-CoV (Fig. 7A).
In the subgenera Merbecoviruses (Fig. 7B), Nobecovirus (Fig. 8A), and Embecovirus (Fig. 8B), the S glycoprotein is the most variable part of the genome. Other areas of hypervariation include nsp1, nsp2, and the nsp3 protease, and the lowest variation was detected in the 39 half of ORF1a and in ORF1b (Fig. 8).
Collectively, genome-wide variation described above show that, in all members of the genus Betacoronavirus, the S glycoprotein is the most variable part of the genome, and replication proteins in ORF1b are the least variable (Fig. 5, 7, and 8).
Betacoronavirus differentiation into strains. The S glycoprotein mediates receptor recognition and membrane fusion during viral entry into the cells (8)(9)(10)(11). Consistent with this role, ACE2 receptor binding is a determinant of host range in sarbecoviruses (11,(54)(55)(56)(57). For viruses in the subfamily Torovirinae within the family Coronaviridae,  receptor binding is a determinant of host range (58). In the b-CoV genome, as described above, the S glycoprotein is the most variable (Fig. 5B). These observations predict that coronaviruses differentiate into strains based on selection pressure from the host. To test this hypothesis, we generated a phylogeny based on the S glycoprotein for SARS-CoV, MERS-CoV, and closely related species infecting bats. In SARS-CoV (Fig. 9A) and MERS-CoV (Fig. 9B), accessions formed clusters that correlated with the country of origin and host species. Accessions representing bat SARS coronavirus (Fig. 9C) and bat SARS-like coronavirus (Fig. 9D) originated exclusively from China, and accessions clustered according to the host.
These results support the model that, in b-CoVs, variation in the S glycoprotein separates isolates into strains and may reflect the effect of selection imposed by the host. Consistent with this model, reinfection in humans has been confirmed (25,27). In Brazil, a case of reinfection occurred with a new strain containing an E484K mutation in the S protein (27).
Amino acid variation in the S glycoprotein. Amino acid sequence in the S glycoprotein is variable in all species of the subgenus Sarbecovirus (Fig. 10). Variation localizes to subunit S1, particularly to the receptor binding domain, which is predicted to be intrinsically disordered for bat-SARS and bat-SARS-like coronaviruses (Fig. 10). Intrinsically disordered proteins mediate functional diversity and interactions with   4000  8000  16000  20000  12000  28000  24000  1  30000  2000  6000  10000  14000  18000  22000  multiple partners (59,60). These observations support the model that, in b-CoVs, the S glycoprotein is mutationally robust and contains disordered areas.

DISCUSSION
Host and viral factors contribute to virus evolution (38,61). The starting material is the introduction of mutations (nucleotide substitutions, insertions, or deletions) in the genome through RNA-dependent RNA polymerase errors during replication, RNA recombination, and reassortment (in segmented viruses) (39,61). While they may occur randomly, selection separates beneficial from detrimental and neutral mutations. Selection is imposed by the host, the environment, and their interaction. Mutations  400  800  1000  600  1200  1  100  300  500  700  900  1300  1100   200  400  800  1000  600  1200  1  100  300  500  700  900  1300  that provide a beneficial advantage are more likely to be fixed in the genome (37). Under this scenario, the distributions of mutations in the viral genome are not random. Instead, mutations accumulate to higher than random frequencies in areas of the genome that contribute to fitness by enhancing stability, transmission, replication efficiency, escape from immunity, suppression of immunity responses, or a combination (37)(38)(39). The collective results of these effects may be evident through biological properties such as host adaptation, pathogenicity, or others. The emergence of new species, such as SARS-CoV-2, and the rapid emergence of new SARS-CoV-2 strains are an indication that b-CoVs evolve quickly and have a high capacity to switch hosts and to adapt to new hosts. Results described here show that b-CoVs are more variable than polioviruses (49). In some species, variation is close to that observed for HIV-1 ( Fig. 5 and 6). The wide host range observed across species suggests that, in b-CoVs, genomic variation is related to the genetic diversity of the host.
While mutations accumulate in the entire genome, they are not randomly or equally distributed. Instead, preferential accumulation of mutations in the S glycoprotein is a general feature of all members of the genus Betacoronavirus (Fig. 5B). This was particularly evident in SARS-CoV-2. Strains identified to date (16,21) and isolates from early and middle 2020 can be distinguished based on the S glycoprotein sequence alone (Fig. 3).
In HIV-1, glycoproteins gp120 and gp41 are the most variable in the genome (Fig. 6). Both b-CoV S glycoprotein and HIV gp120 and gp41 are envelope proteins and mediate viral entry. The S glycoprotein binds to the ACE2 receptor (55), while gp120 binds to the CD4 receptor (62). Both the S glycoprotein and gp120 induced the formation of neutralizing antibodies. These features suggest several mechanisms driving diversifying selection in envelope glycoproteins: cellular receptors, entry cofactors, and antibodies.
Within the variety of coronavirus hosts, the cellular receptors, entry cofactors, and cellular proteases that process the S1/S2 cleavage site and immunity responses are likely diverse (12,63). Results described here show that, for b-CoVs, the S glycoprotein is variable and mutationally robust and contains intrinsically disordered areas (Fig. 4, 5B, and 9). Disordered proteins allow functionality with a diverse set of interaction partners (44). These observations are consistent with a model in which host diversity pushes diversifying selection in the S glycoprotein. Mutational and structural robustness in the S glycoprotein provide a selection advantage, are major contributors to b-CoV evolution, and may lead to the emergence of new strains and species.
In the reference sequence Wuhan-Hu-1 (NC_045512.2), the S glycoprotein contains residues that are compatible, but not optimal, for binding human receptor ACE2 (55). Accordingly, SARS-CoV-2 has the potential to accumulate mutations for more efficient entry into human cells and to escape from neutralizing antibodies. Consistent with this model, SARS-CoV-2 strains detected to date mainly differ in the S glycoprotein (Fig. 1A) (16,21). The D614G and Q677P mutations make the virus more transmissible and more pathogenic to humans (24,45) and have been detected in several parts of the world (33,64). Amino acids 614 and 677 are near hypervariable C-terminal domain 2 in subunit S1 (Fig. 4C). Furthermore, the D614G mutation and others in the receptor binding domain reduce affinity to monoclonal antibody CR3022 (64). In Mexico (65) and Wisconsin (Fig. 3), the H49Y mutation in the S protein was the most frequent and appears to have independent origins, suggesting convergent evolution.
In humans, the strength of the immune responses is not uniform, as indicated by immunocompromised patients (2,20), and immunity-driven selection in SARS-CoV-2 has been documented (66). The human population is genetically diverse enough to select for variants in SARS-CoV and MERS-CoV (Fig. 9), and there is genetic variability in human leukocyte antigen genes that affect susceptibility to SARS-CoV-2 and the severity of the disease (67). Thus, it is likely that SARS-CoV-2 will continue to accumulate mutations for efficient transmission and genome replication and differentiate into biological strains as the virus faces selection pressure from genetically distinct human populations or immunocompromised individuals. For example, a 27-amino-acid deletion was detected in protein 7 in Arizona (68), new mutations formed sublineages in the southwest (24), and SARS-CoV-2 populations were grouped into clades in Mexico (65).
Our analysis identified proteins 3 and 7 and ORF8 as variable in SARS-CoV and bat SARS (Fig. 5C). The 3b protein is an inhibitor of the interferon response, and a variant with a longer 3b protein induces more severe symptoms and has enhanced ability to suppress induction of type I interferon (69). The 7a protein antagonizes antiviral factor BST-2 to enhance virion release (70). ORF8 is a cofactor of the RNA-dependent RNA polymerase and an inhibitor of the type I interferon response (71)(72)(73). Further, a 29-nt deletion in ORF8 attenuated SARS-CoV, and mutations or deletions in SARS-CoV-2 ORF8 caused attenuation. Although no difference was detected in vivo, higher replication was detected in vitro than in the wild-type virus (66,74). ORF8 is different between SARS-CoV and SARS-CoV-2 and does not include functional motifs (50). Because ORF8 induces a robust antibody response, deletions may reflect immunity-driven selection (66). Due to their biological role in replication, suppression of antiviral defense, and hypervariable nature (Fig. 5C), proteins 3 and 7 and ORF8 are likely contributors to pathogenicity and host adaptation in sarbecoviruses.
Vaccines against SARS-CoV-2 induce neutralizing antibodies against the prefusion conformation of the S glycoprotein (8,75,76). However, nonneutralizing antibodies against subunit S2 are also developed (8). Variation within the S2 subunit is among the highest in the genome (Fig. 4). Nonneutralizing antibodies may provide a mechanism for the virus to escape from the immune response (77). Thus, variation in the S glycoprotein provides b-CoVs a mechanism to escape the immune response and an important selection advantage.
Vaccines and antiviral drugs might function as selection agents (36). In an infected individual, new variants are generated (26) and may be selected to escape neutralizing antibodies, which were developed against natural infection or triggered by a vaccine. Indeed, in immunocompromised patients, recurrent deletions in the N-terminal domain were detected, and these deletions mediate escape from neutralizing antibodies (20).
Factors contributing to b-CoV evolution include intrinsic properties of the S glycoprotein (mutationally robust and intrinsically disordered), natural genetic diversity in their hosts, and diversity in the strength of the immune response. Similarly, several factors contribute to SARS-CoV-2 differentiation into strains, including natural genetic diversity in the human population, diversity in the strength of the immune response, and, possibly, selection imposed by vaccines. This represents a challenge for vaccine development and deployment, because vaccines may only be efficient against closely related strains, ineffective against diverse strains, and fail to prevent reinfection.

MATERIALS AND METHODS
All computational analyses were done on high-performance computing nodes. Custom scripts are available upon request.
Genomic RNA sequences. All available genomic sequences for the genus Betacoronavirus were obtained from NCBI (http://www.ncbi.nlm.nih.gov/) on 6 April 2020 using customized scripts based on Entrez Programming Utilities (E-utilities; https://eutils.ncbi.nlm.nih.gov/entrez/eutils/). For SARS-CoV-2, nucleotide accessions were redownloaded on 13 May 2020. For HIV-1, 100 random full-length sequences were obtained from NCBI to provide a representative sample of HIV-1 variation on 19 February 2021. Only accessions with at least 95% of the reference genome length were retained. For each species, the reference accession describing a complete genome was identified (Table 1).
Genomic and amino acid variation. For each species, nucleotide and amino acid variation analyses were conducted either on the entire genome or the spike S protein. Both were estimated in a 50-nt window. Nucleotide substitutions on the genome were measured based on nucleotide diversity (48) and genomic variation (40). Nucleotide diversity was calculated using TASSEL (https://www.maizegenetics .net/tassel) (78). Amino acid substitutions were measured based on SAPs (40). SNPs or SAPs were identified and mapped using SNP-sites version 2.4.1 (https://github.com/sanger-pathogens/snp-sites) (79) and VCFtools (80). The average and 99% confidence interval (P , 0.01) were estimated and plotted for each species. For variation per ORF, only ORFs present in at least 25% of the b-CoVs were counted.
Disorder of the S protein. Order/disorder was estimated using the Multilayered Fusion-based Disorder predictor (MFDp) with a false-positive rate of 5% (82). For each species, the amino acid sequence of the reference accession was used. Ordered and disordered areas are below and above the 0.5 threshold, respectively.
Annotated phylogram of U.S. species. All U.S. January sequences from the 23 March 2020 download were included in the early (January) time period. Three random sequences were chosen from each state with accessions from the late (July-August) time period to ensure a representative sample. Neighbor-joining phylogenetic trees were created using MAFFT version 7.4 (https://mafft.cbrc.jp/ alignment/software/) with bootstrap values of 100. Accessions were aligned and mutations were identified using Geneious version 8.0 (https://www.geneious.com).
Data availability. All accession numbers used in this study were downloaded from GenBank (Table 1).