1 Introduction

This article follows up a recent report that discussed the proteins of severe acute respiratory syndrome coronavirus-2 (SARS CoV-2) [1], the cause of coronavirus disease-2019 (COVID-19). Although the virus first emerged in December 2019, the effects of the pandemic are still increasingly evident. With the promise of a new vaccine against SARS CoV-2 in the coming months, there is hope that we may soon return to our normal lives. However, this return is threatened by the recent emergence of new SARS CoV-2 strains that are more transmissible than the original strain [2]. This article discusses the chemistry of the various nonstructural proteins (NSPs) and the spike protein found in SARS CoV-2, to serve as a resource for understanding the virus that caused this global pandemic. Besides the structural proteins (S protein, E protein, M protein, and N protein, or ORF2, ORF4, ORF5, and ORF9) and the accessory proteins, there are specific enzymes expressed by the ORF1ab gene of SARS CoV-2 that catalyze essential reactions. Furthermore, Part 2 of this manuscript discusses the research progress towards understanding the spike protein, the basis of the recent vaccines against SARS CoV-2 [3,4,5]. The discussion of recent investigations of the spike protein provides insight into how the frequently occurring strains of SARS CoV-2 are emerging. Possible explanations for the increased prevalence of strains with changes in the spike protein sequence are described. The protein structures presented in the text were rendered using UCSF Chimera software (version 1.12) [6]. In general, the figures with superimposed structures were generated using the default parameters in the Chimera software: the Needleman–Wunsch alignment algorithm with a BLOSUM-62 matrix.
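For readers unfamiliar with the alignment step behind the superimposed figures, the following is a minimal sketch of Needleman–Wunsch global alignment. It is not Chimera's implementation: the match, mismatch, and gap values are illustrative assumptions that stand in for the BLOSUM-62 matrix and the gap penalties Chimera applies, and the function name and test sequences are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not the BLOSUM-62 matrix.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

In a structure-superposition workflow, the residue pairs from such an alignment are what get fitted in three dimensions; a substitution matrix such as BLOSUM-62 would replace the flat match and mismatch scores used here.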

2 Discussion

2.1 Part 1: The Enzymes That are Nonstructural Proteins (NSPs) from the ORF1ab Gene

The ORF1ab gene of SARS CoV-2 is expressed as a polypeptide that is cleaved into 16 nonstructural proteins [1]. From the ORF1ab gene, SARS CoV-2 has two protease enzymes, NSP3 (papain-like protease) and NSP5 (3C-like protease); an RNA polymerase that copies viral RNA, NSP12; a 5′-RNA triphosphatase, NSP13; a guanosine N7-methyltransferase, NSP14; an endoribonuclease, NSP15; and a 2′-O-ribose-methyltransferase, NSP16. These enzymes can generally be classified into subgroups: (i) proteases (NSP3 and NSP5); (ii) enzymes involved in the 5′-capping of viral RNA, a post-transcriptional modification that allows the viral RNA to escape the host innate immune system (NSP13, NSP14, NSP15, and NSP16); (iii) RNA replication (NSP12); and (iv) other activities, such as posttranslational modification of proteins (ADP ribose phosphatase activity of NSP3) and exoribonuclease/endoribonuclease activity (NSP14/NSP15).

2.1.1 NSP3 (Papain-Like Protease)

NSP3 (papain-like protease) is a multifunctional protein with 1,945 amino acid residues. The papain-like domain catalyzes cleavage of the peptide bonds between (i) NSP1 and NSP2, (ii) NSP2 and NSP3 [7], and (iii) NSP3 and NSP4 (Table 1) [8]. This enzyme cleaves at the consensus sequence LXGG [8]. NSP3 from SARS CoV-2 has been crystallized and biochemically characterized [9]. The catalytic triad of SARS CoV-2 NSP3 consists of residues D286, H272, and C111. The authors of that study expressed the PLpro-Ubl domain of NSP3 (amino acid residues 746–1060) [9]. SARS CoV-2 NSP3 has been shown to preferentially cleave the ubiquitin-like interferon-stimulated gene 15 (ISG15) protein. This cleavage of ISG15 from interferon regulatory factor 3 (IRF3) weakens the type I interferon response. Another group reported the crystal structure (PDB ID: 7CMD), which is shown in Fig. 1 [10]. Many efforts are under way to study the crystal structure of NSP3 and to design effective inhibitors of this protease; for instance, Michael acceptors have been designed to form covalent thioether bonds with the active-site cysteine residue [11]. Table 1 shows the sequences that NSP3 cleaves. The reaction catalyzed by NSP3 is shown in Scheme 1.
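The LXGG consensus can be located in a sequence with a short pattern search. The sketch below is illustrative only: the function name and the toy sequence are hypothetical, and it assumes that cleavage occurs immediately after the second glycine of the motif.

```python
import re

# LXGG consensus: leucine, any residue, glycine, glycine.
# The lookahead lets the scan report overlapping candidates as well.
LXGG_PATTERN = re.compile(r"(?=(L.GG))")


def find_lxgg_sites(sequence):
    """Return (motif, cut_index) pairs for every LXGG match.

    cut_index is the 0-based index of the first residue after the motif,
    assuming cleavage directly after the second glycine.
    """
    return [(m.group(1), m.start() + 4) for m in LXGG_PATTERN.finditer(sequence)]


# Hypothetical toy sequence, not a real polyprotein fragment.
toy_sequence = "MESLKGGAPTKVLNGGAYTR"
for motif, cut in find_lxgg_sites(toy_sequence):
    # Show the motif with the residues on either side of the assumed cut.
    print(motif, toy_sequence[:cut], "|", toy_sequence[cut:])
```

A motif match alone does not establish that a site is cleaved; in the text, the cleaved sites are those listed in Table 1.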

Fig. 1

The crystal structure of the papain-like protease domain of NSP3 (PDB ID: 7CMD). The zoomed-in region shows the catalytic triad of NSP3 (aspartate-286, histidine-272, and cysteine-111)

Scheme 1

The chemical reaction catalyzed by NSP3 (papain like protease or PLpro) and NSP5 (3C-like protease or 3CLpro). Also see Ref. [105] for more information. NSP3 has a catalytic triad (cysteine–histidine–aspartate) while NSP5 has a catalytic dyad (cysteine–histidine) [106]. For NSP3 the key residues are: D286, H272, and C111, and for NSP5, the active site residues are: H41 and C145

In addition to its protease activity, NSP3 has domains that confer other activities. For instance, there is an ADP ribose phosphatase domain, whose crystal structure has been elucidated (PDB ID: 6W02) [12] and is shown in Fig. 2. In SARS CoV, asparagine-41 is essential for ADP ribose-1ʺ-phosphate phosphatase activity [7]. This ADP deribosylating activity is related to evading the host’s immune system [7].

Fig. 2

Crystal structure of the ADP ribose phosphatase domain of NSP3 (PDB ID: 6W02) [12]. The red spheres are water molecules (Color figure online)

2.1.2 NSP5 (3C-Like Protease)

NSP5, which has 306 amino acids, cleaves the ORF1ab polyprotein at 11 distinct sites after excising itself from the polyprotein [13]. The active site of NSP5 (3C-like main protease) has a catalytic dyad of cysteine-145 and histidine-41. The crystal structure of NSP5 has been reported (PDB ID: 6Y2E) [14] and is shown in Fig. 3. Other structures with inhibitors bound to NSP5 [15, 16] have been reported [17]. Figure 3 also shows the structure of NSP5 with the inhibitor GC376 bound (PDB ID: 6WTT) [18]. Table 2 shows the sequences that are cleaved by NSP5.

Table 1 The catalytic triad of NSP3 is known to cleave sites between NSP1–NSP2, NSP2–NSP3, and NSP3–NSP4
Fig. 3

a Crystal structure of NSP5 (PDB ID: 6Y2E). b Zoomed-in view of the catalytic dyad (cysteine-145 and histidine-41) on the right (PDB ID: 6Y2E) [14]. c Crystal structure of NSP5 bound to the inhibitor GC-376 (PDB ID: 6WTT) [18]. d Zoomed-in view of the catalytic dyad with the inhibitor bound to the cysteine residue (C145). e Superimposed structures of (a) and (c) (the apo protein is in red). f Zoomed-in view of the superimposed structures. The distances between the histidine and the sulfur of the cysteine in the apo protein and inhibitor-bound forms are 3.6 and 4.0 angstroms, respectively

2.1.3 NSP12—RNA Polymerase

NSP12, with 932 amino acids, is the RNA polymerase that copies viral RNA. The structure of NSP12 has been reported [19]. The structure of NSP12 complexed with an RNA template and NSP8 is shown in Fig. 4. Interestingly, NSP8 appears to stabilize the RNA template through positively charged residues that interact with the negatively charged phosphate backbone (Fig. 5, expanded view of Fig. 4). Remdesivir is the current antiviral drug used to treat SARS CoV-2 infection. It is a prodrug that is metabolized to its active form, which NSP12 incorporates into the primer strand to stall replication [20]. In inhibition assays of the SARS CoV-2 RdRp complex with remdesivir triphosphate, the investigators used a 100 nM concentration of remdesivir to show inhibition of RNA polymerase activity [20]. For comparison, remdesivir triphosphate showed inhibition of the RNA-dependent RNA polymerase activity of the Ebola virus at concentrations of 33 μM [21]. In Vero E6 cells, remdesivir blocked SARS CoV-2 infection with a half-maximal effective concentration (EC50) of 0.77 μM [22]. Remdesivir triphosphate inhibits Ebola virus replication in HMVEC/TERT (human microvascular endothelial) cells with an EC50 of 0.06 μM [23]. Scheme 2 shows the generic mechanism by which RNA polymerase replicates the viral genome.
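To make the meaning of an EC50 concrete, the sketch below evaluates a simple Hill-type dose-response curve. This is an illustrative model, not the analysis used in the cited studies: the Hill coefficient of 1 and the function name are assumptions, and only the EC50 of 0.77 μM comes from the text.

```python
def fractional_effect(concentration_um, ec50_um, hill_coefficient=1.0):
    """Fraction of the maximal effect predicted by the Hill equation.

    By definition the result is 0.5 when concentration_um equals ec50_um.
    The default Hill coefficient of 1 is an assumption for illustration.
    """
    if concentration_um < 0 or ec50_um <= 0:
        raise ValueError("concentration must be >= 0 and EC50 must be > 0")
    c = concentration_um ** hill_coefficient
    k = ec50_um ** hill_coefficient
    return c / (k + c)


# EC50 reported in the text for remdesivir in Vero E6 cells.
EC50_VERO_E6_UM = 0.77

for conc in (0.077, 0.77, 7.7):
    effect = fractional_effect(conc, EC50_VERO_E6_UM)
    print(f"{conc:>6.3f} uM -> {effect:.2f} of maximal effect")
```

Under this model, a concentration tenfold above the EC50 gives roughly 90% of the maximal effect, and a concentration tenfold below gives roughly 10%.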

Table 2 Sites of cleavage of NSP5—the 3C-like protease [97]
Fig. 4

Structure of the RNA-dependent RNA polymerase (RdRp) complex: NSP12 RNA polymerase (red) from SARS CoV-2 (PDB ID: 6YYT) [19]. The green proteins are the two NSP8 proteins (NSP8 and NSP8′) that are believed to interact and stabilize the RNA. NSP7 is shown in blue (Color figure online)

Fig. 5

There are positively charged amino acid residues on NSP8 and NSP8′ (K37, K36, K40, K46, R51, R57, K58, K61) that stabilize the negatively charged phosphate groups in the RNA template (PDB ID: 6YYT) [19]

Scheme 2

RNA polymerase reaction mechanism for incorporating a new ribonucleoside triphosphate (rNTP) into the primer strand

A structure of NSP12 with remdesivir in the active site is available (PDB ID: 7BV2) [24]. This structure is a complex of NSP12 with NSP7 and NSP8 (Fig. 6; see Fig. 7 for an expanded view of the active site of NSP12 with remdesivir bound).

Fig. 6

a Structure of NSP12 (red), also called: RNA-dependent RNA polymerase (RdRp) in complex with NSP7 (green) and NSP8 (blue) (PDB ID: 7BV2). b NSP12 alone with the RNA template (NSP7 and NSP8 are hidden for clarity). The different domains of NSP12 [98]—Nidovirus RdRp-associated nucleotidyltransferase (NiRAN): 51–249 (red), Interface: 250–365 (green), Fingers: 366–581 and 621–679 (grey), Palm: 582–620 and 680–815 (blue), and Thumb: 816–932 (cyan) (Color figure online)

Fig. 7

The active site of NSP12 with remdesivir incorporated (PDB ID: 7BV2). The red sphere by C222 is a water molecule (Color figure online)

The mode of action of remdesivir is worth discussing, as more nucleoside-based antiviral drugs could be developed based on this drug and favipiravir [25, 26]. Remdesivir is a prodrug: it is metabolized into its active form after it enters the cell (Fig. 8). The enzymes cathepsin A (CatA) and carboxylesterase 1 (CES1) convert remdesivir to its alanine metabolite, which is then hydrolyzed by histidine triad nucleotide binding protein 1 (HINT1) to the monophosphate [27]. Finally, kinases convert the monophosphate to remdesivir triphosphate (also referred to as GS441326 or RDV-TP) [20], the substrate that the RNA polymerase (NSP12) complex incorporates into the primer strand [28].

Fig. 8

The metabolism of remdesivir into its triphosphate metabolite, the substrate of NSP12. Also shown is the structure of adenosine for comparison. The structure of favipiravir, another antiviral prodrug is shown

After remdesivir triphosphate (RTP) is incorporated into the RNA primer, inhibition of the NSP12 complex occurs through chain termination, as shown in Fig. 9. Remdesivir takes the place of adenosine and is incorporated opposite uridine in the template strand. Serine-861 of NSP12 is suspected to clash sterically with the C1′-nitrile moiety of the remdesivir unit, but only after three additional nucleotides have been incorporated. This steric interaction was determined from a model [20]. Moreover, although there is no experimental evidence of nucleophilic addition (i.e. covalent adduction) of the serine hydroxy group (S861) to the carbon of the nitrile moiety, this remains a reasonable possibility, as does an electrostatic interaction between the O–H of the serine residue and the lone pair on the terminal nitrogen of the nitrile.
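The stalling behaviour described above can be sketched as a toy simulation. This is a deliberately simplified model: the template is written in the order the polymerase reads it, "R" is a hypothetical placeholder symbol for an incorporated remdesivir unit, and the function name and example template are illustrative assumptions. Only the rule "three more nucleotides, then stall" is taken from the text.

```python
# Watson-Crick pairing for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def extend_primer(template, remdesivir_at=None, extra_after=3):
    """Copy `template` one base at a time and return the product strand.

    If `remdesivir_at` is given, remdesivir ("R") is placed opposite the
    uridine at that template index, and synthesis stalls after
    `extra_after` further nucleotides have been added.
    """
    product = []
    last_allowed = None  # last template index copied before stalling
    for i, base in enumerate(template):
        if last_allowed is not None and i > last_allowed:
            break  # polymerase has stalled
        if i == remdesivir_at:
            if base != "U":
                raise ValueError("remdesivir is incorporated opposite uridine")
            product.append("R")
            last_allowed = i + extra_after
        else:
            product.append(COMPLEMENT[base])
    return "".join(product)


# Hypothetical template, read in the direction of synthesis.
template = "GCUAGCAUG"
print(extend_primer(template))                   # full-length copy
print(extend_primer(template, remdesivir_at=2))  # stalls 3 nt after "R"
```

With the toy template, the uninhibited product covers all nine template positions, while the product containing "R" stops after six, leaving the remaining template bases uncopied.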

Fig. 9

How remdesivir incorporation into the RNA primer inhibits RNA-dependent RNA polymerase activity (NSP12–NSP7–NSP8 complex) through chain termination. After incorporation of remdesivir into the primer strand, the RNA polymerase complex incorporates three more nucleotides before stalling. A hypothetical sequence for the template is shown above to illustrate that three NTPs are incorporated after remdesivir incorporation into the primer while the fourth NTP is not incorporated [20]

In the clinic, remdesivir was superior to placebo in treating adults who were hospitalized with COVID-19. These patients received 200 mg of remdesivir on day 1 followed by 100 mg each day for up to 9 additional days [29].

2.1.4 NSP13—Helicase and RNA 5′-Triphosphatase

NSP13, with 601 amino acids, has multiple enzymatic activities: helicase, RNA 5′-triphosphatase [30], and NTPase [31].

2.1.4.1 RNA Capping

RNA capping and methylation is a process that post-transcriptionally modifies viral RNA to help it evade recognition by the host’s innate immune system [32]. Capping also ensures binding to the host ribosome for translation of the viral proteins. Figure 10 shows the general chemical structure of the 5′-cap modification of RNA. In coronaviruses, this process involves four steps [33].

Fig. 10

The structure of the 5′-cap of RNA, processed by the viral proteins of SARS CoV-2. The 5′-cap of viral RNA prevents recognition by the host innate immune system and promotes translation by the ribosome

(i) RNA triphosphatase activity of NSP13, which removes the 5′-γ-phosphate group of the mRNA.

(ii) Guanylyltransferase activity, which transfers a GMP group onto the remaining 5′-diphosphate end (the enzyme that transfers this GMP group is still unknown).

(iii) N7-methyltransferase activity of NSP14, which methylates the N7-nitrogen of the guanosine at the 5′-end (making the “cap-0” structure, 7MeGpppN).

(iv) 2′-O-methyltransferase activity of NSP16.
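The four steps above can be written as an ordered pipeline acting on a shorthand for the 5′-end of the RNA. This is only a bookkeeping sketch: the variable and function names are hypothetical, and the shorthand strings other than 7MeGpppN (in particular "7MeGpppNm" for the 2′-O-methylated product) are common notation assumed here rather than taken from the text.

```python
# Each step: (activity, enzyme, substrate 5'-end, product 5'-end).
# enzyme is None where the text states that the enzyme is still unknown.
CAPPING_STEPS = [
    ("RNA 5'-triphosphatase", "NSP13", "pppN", "ppN"),
    ("guanylyltransferase", None, "ppN", "GpppN"),
    ("N7-methyltransferase", "NSP14", "GpppN", "7MeGpppN"),
    ("2'-O-methyltransferase", "NSP16", "7MeGpppN", "7MeGpppNm"),
]


def apply_capping(five_prime_end="pppN"):
    """Apply the capping steps in order and return every intermediate."""
    history = [five_prime_end]
    for activity, enzyme, substrate, product in CAPPING_STEPS:
        # Each activity acts only on the product of the previous step.
        if history[-1] != substrate:
            raise ValueError(f"{activity} expects {substrate}, got {history[-1]}")
        history.append(product)
    return history


def unassigned_activities():
    """Activities for which no viral enzyme has been assigned."""
    return [activity for activity, enzyme, _, _ in CAPPING_STEPS if enzyme is None]


print(" -> ".join(apply_capping()))
print("Enzyme unknown for:", unassigned_activities())
```

Writing the steps as substrate and product pairs makes the ordering constraint explicit: each activity can only act once the previous one has produced its substrate.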

The RNA 5′-triphosphatase activity is important for initiating the RNA capping process. This activity of NSP13 is regioselective for hydrolyzing the γ-phosphate group of the 5′-terminus of the viral RNA (Scheme 3). The subsequent guanylyl transferase step (step (ii) in the 5′-capping process) introduces a guanosine-monophosphate (GMP) group at this resulting diphosphate end. However, the enzyme that catalyzes the GMP incorporation reaction is unknown. An interesting note is that a different protein, baculovirus LEF-4 (late expression factor-4) protein, is known to possess multiple activities including RNA 5′-triphosphatase, nucleoside triphosphatase, and guanylyltransferase activities [34].

Scheme 3

The reaction catalyzed by NSP13 through its RNA 5′-triphosphatase activity, which initiates the 5′-capping of mRNA. The amino acid residues K288 and D374 are proposed to promote departure of the terminal phosphate and to deprotonate the hydrolyzing water molecule, respectively. Support for this hypothesis comes from the structure analysis in Fig. 11 (PDB ID: 6XEZ and 6YJT)

In terms of the 5′-triphosphatase activity for NSP13 [30], the key residues in the active site have been identified in SARS CoV-1 [35]. This active site was suggested to be the same site for NTPase activity as well. The key amino acid residues in the active site were determined to be K288, S289, D374, Q404, and R567 [35]. When any of these amino acid residues were changed to alanine residues, the activity was shut down [35]. These amino acid residues are conserved in SARS CoV-2. Based on the sequence alignment between the NSP13 of SARS CoV-1 and SARS CoV-2, only one amino acid out of 601 is different (position-570 is an isoleucine in SARS CoV-1 and a valine in SARS CoV-2) [1]. Scheme 3 shows a proposed mechanism of how NSP13 may regioselectively hydrolyze the γ-phosphate group of its substrate with the help of key amino acid residues (K288 and D374), which is supported by the structure shown in Fig. 11.
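The degree of conservation implied by a single difference in 601 residues can be checked with a short calculation. The sketch below uses placeholder sequences: only the length (601) and the differing position (570, isoleucine versus valine) come from the text, and the function name and filler residues are illustrative assumptions.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity of two pre-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Placeholder sequences: "A" is filler, not the real NSP13 sequence.
# Position 570 (1-based) is index 569 (0-based).
nsp13_cov1 = "A" * 569 + "I" + "A" * 31
nsp13_cov2 = "A" * 569 + "V" + "A" * 31

assert len(nsp13_cov1) == len(nsp13_cov2) == 601
print(f"{percent_identity(nsp13_cov1, nsp13_cov2):.2f}% identity")
```

One mismatch in 601 positions corresponds to 600/601, or about 99.83% identity.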

Fig. 11

a Structural superposition between NSP13 of SARS CoV-2 and SARS CoV-1 (PDB ID: 6XEZ and 6YJT). Under the Matchmaker option in Chimera software, the reference chain was set to chain F (green) of 6XEZ (NSP13 complex of SARS CoV-2 PDB ID), and the chain to match was set to chain A (red) of 6YJT (PDB ID for NSP13 of SARS CoV-1, apo protein). b Focused view of the 5′-triphosphatase active site. c A different angle of SARS CoV-2 NSP13 (green, alone, PDB ID: 6XEZ) for clarity. d Expanded view of the active site of SARS CoV-2 NSP13 (green)—an AlF3 molecule is shown, which mimics the terminal monophosphate. The green spheres in b and d are Mg2+ ions (they are identical) (Color figure online)

The structure of NSP13 for SARS CoV-2 has been reported as a complex with NSP12–NSP7–NSP8 (PDB ID: 6XEZ) [36]. To focus on the 5′-triphosphatase active site of NSP13, the NSP13–NSP12–NSP7–NSP8 complex (PDB ID: 6XEZ) was taken and only NSP13 is shown in Fig. 11 (i.e. NSP12, NSP7, and NSP8 are hidden). This structure (6XEZ) contains an ADP moiety bound to the active site. A structure of the apo form of NSP13 of SARS CoV-1 is also available (PDB ID: 6YJT). Using the NSP13 structure from SARS CoV-1 and the knowledge of its active site residues (K288, S289, D374, Q404, and R567), the two structures were superimposed to locate the active site of NSP13 for SARS CoV-2.

Figure 12 shows the structure of NSP13 of SARS CoV-2 (PDB ID: 6XEZ) as a complex with RNA polymerase, which is relevant for its helicase activity [36]. This structure was used to show the RNA 5′-triphosphatase active site of NSP13 in Fig. 11. The figure that follows (Fig. 13) shows NSP13 in complex with the RNA-dependent RNA polymerase complex (RdRp complex: NSP12–NSP7–NSP8) with a strand of RNA embedded in the NSP13′ unit (PDB ID: 7XCM) [37], which presumably corresponds to the RNA template that the helicase is “unwinding” for replication to occur.

Fig. 12

The structure of NSP13 (grey) in complex with NSP7 (blue), NSP8 (green), and NSP12 (red) bound to an RNA template (PDB ID: 6XEZ). (There are two NSP13 units (NSP13 and NSP13′), two NSP8 units (NSP8 and NSP8′), one NSP12 unit, and one NSP7 unit) (Color figure online)

Fig. 13

Structure of NSP13 (grey, helicase) in complex with NSP12 (red), NSP7 (blue), NSP8 (green), and RNA (PDB ID: 7XCM) [37]. (There are two NSP13 units (NSP13 and NSP13′), two NSP8 units (NSP8 and NSP8′), one NSP12 unit, and one NSP7 unit). NSP13′ has part of the RNA template bound, which shows the helicase activity of this protein (Color figure online)

2.1.5 NSP14—Guanosine N7-Methyltransferase and Exoribonuclease

NSP14, which comprises 527 amino acids, has been shown to have two activities: guanosine N7-methyltransferase activity and exoribonuclease activity. In the former, the methyl group of S-adenosylmethionine is transferred to the N7 position of the terminal guanosine [33]. Scheme 4 shows how NSP14 methylates the N7 position of the guanosine at the 5′-cap of RNA.

Scheme 4

The reaction catalyzed by NSP14 involving the N7-methylation of the guanosine residue of the 5′-cap of viral RNA. The methylating substrate is S-adenosylmethionine (SAM), which converts to S-adenosylhomocysteine (SAH)

Although a structure of SARS CoV-2 NSP14 has not been reported, the structure of NSP14 in complex with NSP10 for SARS CoV-1 has been reported (PDB ID: 5C8T) [38]. Figure 14 shows the crystal structure of NSP14 for SARS CoV-1.

Fig. 14

Structure of NSP14 for SARS CoV-1 in complex with NSP10 (PDB ID: 5C8T). NSP14 is in tan (right) and NSP10 is in red (left). An S-adenosylmethionine (SAM) ligand (green) is shown in the complex (circled). The green sphere is a magnesium (II) ion coordinated to residues D90 and E191 of NSP14 (Color figure online)

NSP14 is also reported to have exoribonuclease activity. Inactivating the exoribonuclease (ExoN) activity has been shown to be lethal for SARS CoV-2 [39].

2.1.6 NSP15—Endoribonuclease

NSP15 (also called EndoU) with 346 amino acids is the endoribonuclease enzyme that cleaves the 5′-polyuridine motif of negative sense viral RNA [40]. The polyuridine tail arises from the polyA-templated processing that occurs for messenger RNA (mRNA) [41]. Histidine residues are hypothesized to be involved in the catalysis of the active site of endoribonucleases [42]. Scheme 5 shows the endoribonuclease activity of NSP15.

Scheme 5

Endoribonuclease activity of NSP15

The structure of NSP15 has been reported [43]. The key active site residues of NSP15 are: His235, His250, Lys290, and Thr341 [43]. Fig. 15 shows the structure of NSP15 (PDB ID: 6VWW). Moreover, a structure of NSP15 with uridine-3′,5′-diphosphate is available (PDB ID: 7K1O) (Figs. 15c, d).

Fig. 15

a Structure of apo NSP15 (green, PDB ID: 6VWW) [43]. b Structural alignment of the NSP15 apo form (green, PDB ID: 6VWW) and the form bound to uridine diphosphate (red, PDB ID: 7K1O) [99]. c Structure of NSP15 from SARS CoV-2 bound to uridine diphosphate (red, PDB ID: 7K1O). d Expanded view of the active site of NSP15 with uridine diphosphate bound (PDB ID: 7K1O). The green spheres in a and b are water molecules (Color figure online)

2.1.7 NSP16—2′-O-Ribose-methyltransferase

The final 5′-capping enzyme that this review covers is NSP16 containing 298 amino acids. This enzyme methylates the 2′-position of the ribose of the first transcribed nucleotide with S-adenosylmethionine [44]. The structure of NSP16 has been reported (PDB ID: 6YZ1) and the structure is shown in Fig. 16 [45]. The reaction catalyzed by NSP16 is shown in Scheme 6.

Fig. 16

a Structure of the NSP10–NSP16 complex with sinefungin bound (PDB ID: 6YZ1). NSP16 is in green. NSP10 is in tan. The structure of sinefungin is shown in the top left. b Expanded view of the active site of NSP16 (green) with sinefungin (red) bound. The red spheres are water molecules (Color figure online)

Scheme 6

The reaction catalyzed by NSP16. NSP16 transfers the methyl from S-adenosylmethionine to the 2′-O position in the 5′-cap of viral RNA

2.2 Part 2: The Spike Protein of SARS CoV-2

The spike protein is a trimeric glycoprotein expressed by ORF2 in the viral genome. It has recently been suggested that a SARS CoV-2 strain carrying the D614G variant of the spike protein has spread more widely than the original strain with the aspartate residue at position 614 [46]. The 614G variant is not associated with increased severity of infection, but it has been suggested to have increased infectivity relative to the 614D variant [47]. To explain the enhanced viral loads of the D614G mutant strain [46], a thorough structural analysis of the spike protein was performed.

2.3 Structural Comparisons of the SARS CoV-2 Spike Protein with Other Known Coronaviruses that Infect Humans (SARS CoV-1, HCoV-229E, MERS CoV, HCoV-OC43, HCoV-HKU1, HCoV-NL63)

Figures 17, 18, 19, 20, 21, and 22 show the individual primary sequence alignments between the SARS CoV-2 spike protein and the six spike proteins from other human coronaviruses: (i) SARS CoV-1 (β-coronavirus), (ii) human coronavirus 229E or HCoV-229E (α-coronavirus), (iii) Middle East respiratory syndrome coronavirus or MERS CoV (β-coronavirus), (iv) human coronavirus OC43 or HCoV-OC43 (β-coronavirus) [48], (v) human coronavirus HKU1 [49] or HCoV-HKU1 (lineage A β-coronavirus), and (vi) human coronavirus NL63 or HCoV-NL63 (α-coronavirus). All sequence identities and similarities shown in Table 3 were determined using LALIGN software (see Supporting Information for alignments) [50]. SARS CoV-1 and HCoV-NL63 [51] are known to interact with the angiotensin-converting enzyme 2 (ACE2) receptor, while HCoV-229E [52] and MERS CoV [53] interact with the aminopeptidase N (APN) and dipeptidyl peptidase 4 (DPP4) receptors, respectively. In fact, it has been suggested that DPP4 can also act as a receptor for the spike protein of SARS CoV-2 [53]. Moreover, structural alignments of the available cryo-EM structures of the spike proteins from the Protein Data Bank (PDB) were performed to gain insight into the similarities and differences between some of these viruses (virus/PDB ID: SARS CoV-2/7JJJ [54], SARS CoV-1/5XLR [55], HCoV-229E/6U7H [52], MERS CoV/5X5U [56], HCoV-OC43/6OHW [52], HCoV-HKU1/5I08 [57], HCoV-NL63/5SZS [58]).

Fig. 17

a Primary sequence alignment of SARS CoV-2 (GenBank: BCA87361.1) and SARS CoV-1 (GenBank: AAP13441.1). b The structural comparison of the spike proteins from SARS CoV-2 (red, PDB ID: 7JJJ) and SARS CoV-1 (light blue, PDB ID: 5XLR) [55]. AAP13441.1 (now obsolete but previously used: NP_828851.1 [1], where position S577A). c Rotated view (Color figure online)

Fig. 18

a Primary sequence alignment of SARS CoV-2 (GenBank: BCA87361.1) and HCoV-229E (GenBank: QOP39313.1). b The structural comparison of the spike proteins from SARS CoV-2 (red, PDB ID: 7JJJ) and HCoV-229E (PDB ID: 6U7H) [52]. c Rotated view (Color figure online)

Fig. 19

a Primary sequence alignment of SARS CoV-2 (GenBank: BCA87361.1) and MERS-CoV (GenBank: ASU91305.1). b The structural comparison of the spike proteins from SARS CoV-2 (red, PDB ID: 7JJJ) and MERS CoV (PDB ID: 5X5U, open RBD conformation) [56]. c Rotated view of (b) (Color figure online)

Fig. 20

a Primary sequence alignment of SARS CoV-2 (GenBank: BCA87361.1) and HCoV OC43 (GenBank: AAA03055.1). b The structural comparison of the spike proteins from SARS CoV-2 (red, PDB ID: 7JJJ) and HCoV OC43 (PDB ID: 6OHW) [52]. c Rotated view of (b) (Color figure online)

Fig. 21

a Primary sequence alignment of SARS CoV-2 (GenBank: BCA87361.1) and HCoV HKU1 (GenBank: ADN03339.1). b The structural comparison of the spike proteins from SARS CoV-2 (red, PDB ID: 7JJJ) and HCoV HKU1 (PDB ID: 5I08) [57]. c Rotated view (Color figure online)

Fig. 22

a Primary sequence alignment of SARS CoV-2 (GenBank: BCA87361.1) and HCoV NL63 (GenBank: AGT51394.1). b The structural comparison of the spike proteins from SARS CoV-2 (red, PDB ID: 7JJJ) and HCoV NL63 (PDB ID: 5SZS) [58]. c Rotated view of (b) (Color figure online)

Table 3 Summary of sequence identities and similarities between SARS CoV-2 (GenBank ID: BCA87361.1) and other human coronaviruses [50]. AA overlap: amino acid overlap

2.4 A Structural Analysis of SARS CoV-2 Spike Protein

The spike protein exists as a trimer. Figures 23 and 24 show the three protomers that come together to form the trimer of the spike protein (PDB ID: 7JJJ) [54]. Figure 24 shows the different domain organizations of the spike protein (receptor binding domain, S1 unit, and S2 unit). The S1-S2 site is where the protease, furin, cleaves. The S1 unit binds to the ACE2 receptor [59] and the S2 unit mediates fusion of the viral and cellular membranes [60]. A more detailed discussion is presented in the next section.

Fig. 23

Structure of the spike protein trimer from SARS CoV-2 (PDB ID: 7JJJ). Each protomer is a different color (i.e. red, green, or blue) (Color figure online)

Fig. 24

Structure of the spike protein (PDB ID: 7JJJ, chain a: red, chain b: blue, chain c: green) and highlighted are the different receptor binding domains (RBDs, position 319–541) for each protomer (green, blue, and red). a Side view (top) and rotated view (bottom) of the RBD of the spike protein. b The S1 fragment (position 14–685) of the spike protein is highlighted for each protomer (red, blue, and green) where the protease, furin, cleaves (side view, top) (rotated view, bottom), and c S2 fragment (686–1273) of the spike protein (side view, top) (rotated view, bottom) (Color figure online)

2.5 Spike Protein Role in Viral Entry into the Host Cell

With many of the conformations of the spike protein elucidated by cryo-electron microscopy [61, 62], a better understanding of the dynamic process of viral entry has been gained. Initially, the receptor binding domain (RBD) opens (step (i) in Fig. 28) so that it can readily bind to the ACE2 protein on the host cell (step (ii)). The resulting complex binds two more ACE2 proteins to give the spike protein bound to three ACE2 proteins (steps (iv) and (v)). Finally, the spike protein complex is cleaved by furin and TMPRSS2 to release the ACE2-S1 fragments (step (vi)): furin cleaves at the S1/S2 site (see Fig. 25, position 685–686) and TMPRSS2 cleaves at the S2′ site (see Fig. 25, position 816) [63, 64]. After cleavage, the S2 domain of the spike protein remains and is now primed for viral entry into the host cell [63].

Fig. 25

The sequence of the spike protein of SARS CoV-2. The 22 asparagine (N) residues that undergo glycosylation are highlighted. S1 subunit: 14–685 [100], S2 subunit: 686–1273, S2′ cleavage [61] site: 816. NTD N-terminal domain, RBD receptor binding domain, FP fusion peptide, HR1 heptapeptide repeat (or heptad repeat) sequence 1, HR2 heptad repeat 2, TM transmembrane domain, CT cytoplasm domain [101]. “*” indicates glycosylation sites. “!” Marks the location of the D614G variant (cyan) [47, 74]. “#” Marks the locations of the 501Y.V2 variant in grey ([i] NTD region: L18F, D80A, D215G, R246I, [ii] RBD region: K417N, E484K, N501Y, and [iii] A701V) [80]. “^” Marks the locations of the 501Y.V1 variant in red (H69, V70, Y144, (N501Y), A570D, P681H, T761I, S982A, D1118H–N501Y is already marked in grey with “#” for the 501Y.V2 variant) (Color figure online)

2.6 Glycosylated Residues on the Spike Protein

The spike protein has 22 distinct glycosylation sites [65] on each protomer: N17, N61, N74, N122, N149, N165, N234, N282, N331, N343, N603, N616, N657, N709, N717, N801, N1074, N1098, N1134, N1158, N1173, N1194 (positions 1158, 1173, and 1194 are not resolved in the structure, PDB ID: 7JJJ). The glycans play an important role in protein folding and in evading the host immune system. The sequence of the spike protein was formatted using ProtParam and is shown in Fig. 25 [66]. Furthermore, the sites of glycosylation are highlighted in yellow in one of the protomers in Fig. 26 (PDB ID: 7JJJ). Understanding the sites of glycosylation is important because the spike protein produced by the vaccine and the spike protein from the virus have been shown to have different glycosylation patterns [67]. In addition to the asparagine residues, O-linked glycosylation of the spike protein has also been observed at residues S325 and T323, as determined by mass spectrometry [68].
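As a quick way to see where the listed N-linked sites fall, the sketch below assigns each position to a subunit using the residue ranges given in the figure captions (S1: 14–685, S2: 686–1273, RBD: 319–541). The variable and function names are hypothetical, and the ranges are treated as 1-based and inclusive.

```python
# N-linked glycosylation sites per protomer, as listed in the text.
GLYCOSYLATION_SITES = [
    17, 61, 74, 122, 149, 165, 234, 282, 331, 343, 603,
    616, 657, 709, 717, 801, 1074, 1098, 1134, 1158, 1173, 1194,
]

# Residue ranges from the figure captions (1-based, inclusive).
S1_RANGE = (14, 685)
S2_RANGE = (686, 1273)
RBD_RANGE = (319, 541)  # the RBD lies within S1


def in_range(position, bounds):
    low, high = bounds
    return low <= position <= high


def subunit_of(position):
    """Return the subunit label for a residue position."""
    if in_range(position, S1_RANGE):
        return "S1"
    if in_range(position, S2_RANGE):
        return "S2"
    return "outside S1/S2"


def group_sites(sites):
    """Group residue positions by subunit."""
    grouped = {"S1": [], "S2": [], "outside S1/S2": []}
    for position in sites:
        grouped[subunit_of(position)].append(position)
    return grouped


grouped = group_sites(GLYCOSYLATION_SITES)
rbd_sites = [p for p in GLYCOSYLATION_SITES if in_range(p, RBD_RANGE)]
print("S1:", grouped["S1"])
print("S2:", grouped["S2"])
print("RBD:", rbd_sites)
```

With these ranges, 13 of the 22 sites fall in S1 and 9 in S2, and two of the S1 sites (N331 and N343) lie within the RBD. The same lookup places position 614, the site of the D614G variant, in S1.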

Fig. 26

Structure of one protomer of the spike protein, with the asparagine residues that undergo glycosylation highlighted in yellow (PDB ID: 7JJJ) (Color figure online)

2.6.1 Spike Protein Bound to the Angiotensin Converting Enzyme-2 (ACE2)

Angiotensin converting enzyme-2 (ACE2) is the proposed receptor for SARS CoV-2, and a human recombinant soluble ACE2 has recently been shown to block SARS CoV-2 infection in engineered human tissue [69]. A structure of the spike protein with ACE2 bound has also been reported (PDB ID: 7A98) [70]. To compare the ACE2–spike protein complex (PDB ID: 7A98) with the spike protein alone (PDB ID: 7JJJ), a structural overlay is shown in Fig. 27 [60]. Furthermore, the various conformations that the spike protein adopts upon ACE2 binding have been detected using cryo-electron microscopy [70]. The sequential states of the trimer are: (i) closed conformation; (ii) open conformation, unbound to ACE2, in which only one receptor binding domain (RBD) points “up” while the other two RBDs remain “closed” (similar to the MERS CoV spike structure with PDB ID: 5X5U); (iii) one ACE2 bound to the RBD of one protomer; (iv) two ACE2 bound to two protomers at the RBDs; (v) three ACE2 bound to three protomers at the RBDs; and (vi) release of the monomeric S1-ACE2 complex (the S1 unit includes residues 14–685). The S1 unit is first cleaved by the protease furin [59, 64] from the host cell [63, 71]. The serine protease, transmembrane protease serine 2 (TMPRSS2), is also known to prime the spike protein for cell entry by cleaving at the S2′ site (position 816) [61]. The use of a TMPRSS2 inhibitor, camostat mesylate, blocked SARS-2-S-driven entry into Caco-2 and Vero-TMPRSS2 cells (Fig. 28) [72].

Fig. 27 Superimposed structures of (i) the spike protein with ACE2 bound (spike protein is cyan and ACE2 protein is orange, PDB ID: 7A98) and (ii) the unbound spike protein (red, PDB ID: 7JJJ). The open state of the spike protein can be seen when the ACE2 protein (orange) binds at the RBD of the spike protein (cyan). a Side view and b rotated view of (a) (Color figure online)

Fig. 28 Illustration of SARS CoV-2 entry into the host cell via ACE2-B0AT1 (PDB ID: 6M18) [102]. B0AT1 is also called SLC6A19 (solute carrier family 6 member 19, a sodium dependent neutral amino acid transporter). (i) One receptor binding domain (RBD) (or two RBDs, PDB ID: 7A93) [70] of the spike protein moves from the closed conformation (PDB ID: 6VXX) [60] to the open conformation (PDB ID: 6ZGG) [103], (ii) an ACE2 protein binds to the RBD of the spike protein (PDB ID: 7KNE) [104], (iii) a second RBD “opens” up (PDB ID: 7A96) [70], (iv) a second ACE2 protein binds to the second RBD of the spike protein (PDB ID: 7KMZ) [104], (v) a third ACE2 protein binds to the final RBD (PDB ID: 7KNI) [104], (vi) furin and TMPRSS2 cleave at the S1-S2 site and the S2′ site of the spike protein, releasing the ACE2-S1 complex (PDB ID: 7A92) [70] and leaving behind the S2 domain (PDB ID: 6XRA) [61], which is primed for entry into the host cell. The S2 domain trimer shown (PDB ID: 6XRA) covers the spike protein sequence from T912-N1173 and Q1180-L1197. Interestingly, the spike protein with two RBD units in the open conformation has also been observed through cryo-EM (PDB ID: 7A93) [70] (Color figure online)

2.6.2 Spike Protein Bound to Antibodies

Structures of the spike protein bound to antibodies have also been reported. These antibodies bind to the RBD of the spike protein. One study showed the antigen-binding fragment (Fab) of the neutralizing antibody C105 complexed with the receptor binding domain (RBD) of the spike protein (PDB ID: 6XCM) [73]. Another study reported the spike protein complexed with the Fab of the S2A4 neutralizing antibody (PDB ID: 7JVC). Both structures were determined by cryo-electron microscopy (cryo-EM), and their superpositions with the unbound spike protein (PDB ID: 7JJJ) are shown in Fig. 29.

Fig. 29 a Spike protein (cyan) bound to C105 neutralizing antibody Fab fragment (orange) (PDB ID: 6XCM) superimposed with spike protein (red, PDB ID: 7JJJ). b Rotated view of (a). c Spike protein (light blue) bound to S2A4 neutralizing antibody Fab fragment (green) (PDB ID: 7JVC) superimposed with spike protein (red, PDB ID: 7JJJ) (Color figure online)

2.6.3 The D614G Variant of the Spike Protein

From the primary sequence alignment of the spike proteins, the D614 residue is conserved in both SARS CoV-1 and SARS CoV-2. However, a SARS CoV-2 strain containing a glycine residue (G) at position 614 has been suggested to be more widespread globally. A structure of this variant lacking the receptor binding domain (RBD) is available (PDB ID: 6XS6), and it has been suggested that the G614 form of the spike protein adopts an open conformation more frequently than the D614 form [74]. This G614 structure (PDB ID: 6XS6) is superimposed with the D614 version of the spike protein in Fig. 30.

Fig. 30 Structural overlay of the spike protein D614 (red, PDB ID: 7JJJ) and G614 (cyan, PDB ID: 6XS6). Circled in yellow is the location of D614. The cryo-EM structure of the G614 variant has no RBD but is in a slightly more “open” conformation. a is the side view of the spike protein and b is the rotated view (Color figure online)

A careful look at the spike protein structure shows that D614 forms a salt bridge with R634 (2.9 angstroms, cf. Fig. 31b). Furthermore, a lysine residue (K854) located on a separate protomer appears to interact with D614 (7.4 angstroms away). In the G614 variant, these salt bridges are absent, which suggests that the spike protein trimer is held together less tightly when the aspartate (D) is replaced by a glycine (G). This “looser” conformation of the G614 variant could possibly explain how the D614G strain is more infectious than the wild type. Furthermore, in the complex of ACE2 and spike (PDB ID: 7A98), the ionic interaction between D614 and K854 appears to be enhanced, as the measured distance shortens to 4.1 angstroms (Fig. 31d). The residue R634 is not included in the structure of the ACE2-spike complex (PDB ID: 7A98). This lack of an observed interaction between D614 and R634 in the ACE2-spike complex is consistent with a role for the D614 residue in keeping the spike protein more “compact” so that the receptor binding domain (RBD) is less likely to reach out to bind to the ACE2 protein. In contrast, the G614 variant lacks these interhelical salt bridge interactions, so the spike protein is freer to adopt the “open” conformation and interact with the ACE2 receptor.

Fig. 31 a A salt bridge between D614 and R634 within the same protomer (red) of the spike protein is shown. Another amino acid, K854, from a different protomer (green) interacts with D614 (7.4 angstroms away) (PDB ID: 7JJJ). b Zoomed in region of the salt bridge interactions (D614-R634 and D614-K854). D614 and R634 are on the red protomer and K854 is on the green protomer. Each protomer in the trimer is a different color: red, green, blue. c In the ACE2-bound spike protein, K854 is closer to D614 (4.1 angstroms), suggesting a stabilizing role for this salt bridge (PDB ID: 7A98). d Zoomed in image of the salt bridge between D614 and K854 in the spike-ACE2 complex (4.1 angstroms apart) (Color figure online)
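The distances quoted for these interactions are Euclidean distances between atom coordinates in the deposited structures. The sketch below shows the arithmetic with a simple distance cutoff. Everything specific in it is an assumption made for illustration: the coordinates are invented (they are not taken from PDB entries 7JJJ or 7A98), the choice of side-chain atoms (OD2, NH1, NZ) is hypothetical because the text does not state which atoms were measured, and the 4.0 angstrom cutoff is one commonly used convention rather than a value from this article.

```python
import math

# Invented Cartesian coordinates in angstroms, for illustration only;
# they are NOT taken from PDB entries 7JJJ or 7A98.
atoms = {
    "D614_OD2": (10.0, 4.0, 7.0),
    "R634_NH1": (12.0, 6.0, 7.7),
    "K854_NZ": (14.0, 9.0, 10.6),
}

# Assumed cutoff in angstroms; conventions for salt bridges vary.
SALT_BRIDGE_CUTOFF = 4.0


def distance(atom_a: str, atom_b: str) -> float:
    """Euclidean distance in angstroms between two named atoms."""
    return math.dist(atoms[atom_a], atoms[atom_b])


def within_cutoff(atom_a: str, atom_b: str) -> bool:
    """True when the two atoms are close enough to count as a salt bridge here."""
    return distance(atom_a, atom_b) <= SALT_BRIDGE_CUTOFF


for partner in ("R634_NH1", "K854_NZ"):
    d = distance("D614_OD2", partner)
    print(f"D614_OD2 to {partner}: {d:.1f} angstroms, within cutoff: {within_cutoff('D614_OD2', partner)}")
```

With a strict cutoff of this kind, a pair separated by roughly 7 angstroms would be flagged as a longer-range ionic interaction rather than a direct salt bridge, which is why the cutoff is worth stating alongside any measured distance.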

2.6.4 The 501Y.V1 Variant of the Spike Protein

In addition to the strain that contained a D614G change in the spike protein, another SARS CoV-2 strain possessing an N501Y change (asparagine to tyrosine at position 501) was reported in the United Kingdom as 10% more transmissible than the original strain (501N lineage). In fact, an even more transmissible strain, which was 75% more transmissible than the 501N lineage, was reported with the following changes in the spike protein sequence: H69 and V70 deletion, Y144 deletion, N501Y, A570D, P681H, T716I, S982A, and D1118H [75]. The positions changed in this variant are highlighted in Fig. 32. This particular strain was called 501Y.V1 and is also referred to as 501.V1, B.1.1 [76], or 20B (Nextstrain nomenclature [77, 78]) [79]. One can hypothesize that the deletion of the amino acid residues H69 and V70 could potentially truncate the S1 domain, which could in turn cause the spike protein to remain in the open conformation more frequently. This open conformation is more likely to bind to the ACE2 protein on the surface of the host cell. Interestingly, the P681 position is located at the furin cleavage site of the spike protein, and it is not clear whether the change from proline to histidine in this strain affects furin cleavage [59].

Fig. 32 a The spike protein (PDB ID: 7JJJ) with the protomers colored differently (red, green, and blue). The amino acid residues that are changed in the 501Y.V1 variant are highlighted in yellow. b Rotated view. Highlighted in yellow are: H69 and V70 (deletion), Y144 (deletion), N501(Y), A570(D), T716(I), S982(A), and D1118(H); position P681(H) is not included in the structure (Color figure online)

2.6.5 The 501Y.V2 Variant of the Spike Protein

Recently, another strain of SARS CoV-2 termed “501Y.V2”, which contains changes in the amino acid residues of the spike protein, has been reported to be widespread [80]. The label 501Y.V2 is used interchangeably with 501.V2 in the literature. The changes in this variant are located in the N-terminal domain (L18F, D80A, D215G, R246I), in the receptor binding domain (RBD) (K417N, E484K, and N501Y), and at position 701 (A701V). These specific positions are highlighted in Fig. 33. Three of the changed residues (K417N, E484K, and N501Y) lie at the RBD interface with the ACE2 protein. Interestingly, the change at position 501 from asparagine to tyrosine (N501Y) introduces a possible cation–pi interaction between the aromatic tyrosine-501 residue on the spike protein and the positively charged lysine-353 residue on the ACE2 protein (Fig. 34c). The K417 residue is 9.3 angstroms away from K26 in the ACE2 protein, suggesting that converting the K417 residue to an asparagine (K417N) could introduce a hydrogen bond between the carbonyl oxygen of the asparagine (N417) on the spike protein and a proton of the lysine (K26) of ACE2. Furthermore, the E484K change would potentially introduce a salt bridge between the K484 residue and E35 of the ACE2 protein. As shown in Fig. 34b, E484 is 8.8 angstroms away from E35 of the ACE2 protein. Therefore, the change to a positively charged lysine residue at position 484 (K484) should enhance the interaction between the spike protein and the ACE2 protein. Because of this stronger interaction between the RBD of the spike protein and ACE2, the changes in this new SARS CoV-2 strain (501Y.V2) potentially improve the ability of the virus to enter host cells. Although the only change in the spike protein sequence shared by the 501Y.V1 strain and the 501Y.V2 strain is N501Y, it is likely that the other amino acid changes also promote the enhanced infectivity of each strain.
For instance, the E484K change in 501Y.V2 (which 501Y.V1 lacks) introduces a potential salt bridge (K484 of the spike protein with E35 of the ACE2 protein). On the other hand, 501Y.V1 has deletions of H69, V70, and Y144, which could potentially truncate the NTD of the spike protein, forcing the spike protein to adopt the open conformation more frequently. The facts that (i) these emerging SARS CoV-2 strains (501Y.V1, 501Y.V2, and D614G) have changes in the spike protein sequence and (ii) the recent vaccines were developed against the spike protein of the original strain give reason to focus efforts on understanding how these new strains differ from the original strain. It is interesting to note that, in one study, the sera of 20 recipients of the COVID19 mRNA-based vaccine, BNT162b2, successfully neutralized a SARS CoV-2 N501Y spike mutant [81]. However, this particular SARS CoV-2 mutant possessed only the N501Y mutation and not the other changes present in either 501Y.V1 or 501Y.V2. Therefore, further studies focused on the entire set of amino acid variations belonging to these more transmissible strains are necessary to confidently assess the effects of the changes.
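The comparison between the two variants can be sketched in a few lines by parsing each change label into its wild-type residue, position, and replacement, and then intersecting the two sets. The change lists are the ones given in the text; the `del` suffix for deletions (e.g. `H69del`) and the names `Change` and `parse_change` are conventions assumed here for illustration, not notation taken from the cited reports.

```python
import re
from typing import NamedTuple, Optional


class Change(NamedTuple):
    wild_type: str           # one-letter code of the original residue
    position: int            # residue number in the spike protein
    variant: Optional[str]   # replacement residue, or None for a deletion


# Labels such as "N501Y" (substitution) or "H69del" (deletion; assumed notation).
LABEL = re.compile(r"^([A-Z])(\d+)([A-Z]|del)$")


def parse_change(label: str) -> Change:
    """Split a change label into wild-type residue, position, and replacement."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized change label: {label!r}")
    wild_type, position, variant = match.groups()
    return Change(wild_type, int(position), None if variant == "del" else variant)


# Spike protein changes as listed in the text for each variant.
V1_LABELS = ["H69del", "V70del", "Y144del", "N501Y", "A570D",
             "P681H", "T716I", "S982A", "D1118H"]
V2_LABELS = ["L18F", "D80A", "D215G", "R246I", "K417N",
             "E484K", "N501Y", "A701V"]

v1 = {parse_change(label) for label in V1_LABELS}
v2 = {parse_change(label) for label in V2_LABELS}

shared = v1 & v2
print(shared)  # -> {Change(wild_type='N', position=501, variant='Y')}
```

Comparing parsed tuples rather than raw strings makes it straightforward to ask related questions, such as whether two variants alter the same position with different replacements.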

Fig. 33 a Structure of the spike protein with the amino acid residue changes of the 501Y.V2 variant highlighted: L18F, D80A, D215G, R246I, K417N, E484K, N501Y, and A701V. b Rotated view (Color figure online)

Fig. 34 a Spike protein-ACE2 protein complex (PDB ID: 7A98). The residues changed in the recently reported variant (K417N, E484K, and N501Y) are the focus, with close-up views shown in b and c. b E484 (spike) with K31 (ACE2) (the distance to E35 of ACE2 is also measured). c N501 (spike) with K353 (ACE2)

2.6.6 Selected Updates on SARS CoV-2 Research—23 New SARS CoV-2 Proteins

One notable report recently identified 23 previously unidentified viral open reading frames (ORFs) (Table 4), suggesting that many features of this virus have yet to be explored [82]. The original approach to identifying the genes of SARS CoV-2 was based on comparing its sequence with those of other known betacoronaviruses, especially SARS CoV. These genes were identified by locating the sequential ORFs that begin with the start codon AUG (please also see Supporting Information, section II, for the AUG start codons that were highlighted in the SARS CoV-2 genome) [83]. However, it is likely that some genes were missed in the original characterization for two reasons: (i) some AUG start codons can be embedded within the originally identified ORFs (i.e. overlapping ORFs) and (ii) some start codons differ from the canonical AUG sequence [82] (the start codons in Tables 4 and 5 include CUG, ACG, AUU, AUC, and UUG as well as AUG). Therefore, in order to identify the novel proteins, the researchers performed ribosome profiling experiments [84], in which Vero E6 cells (African green monkey kidney cells) and Calu-3 cells (human lung cancer cells) were infected with SARS CoV-2. After a certain amount of time, the cells were treated with harringtonine or lactimidomycin, which halt ribosomes at initiating codons on the mRNA and provide translation initiation libraries. Alternatively, the cells were treated with cycloheximide, which generates translation elongation libraries (instead of translation initiation libraries). In particular, lactimidomycin binds to the empty E-site of the 80S ribosome, allowing the ribosome to be isolated strictly at the start codon [85]. On the other hand, cycloheximide binding to the ribosome is reversible, and when cycloheximide dissociates, ribosomes continue translation of the mRNA, which enables the trapping of ribosomes downstream of the initiation codon [86].
The ribosome-bound mRNA was isolated and sequenced to reveal the new proteins. The combination of the translation initiation libraries and translation elongation libraries enabled the identification of the new protein sequences. Tables 4 and 5 show the 23 new proteins identified in this study. Among the newly identified proteins were in-frame internal ORFs (iORFs) located within known ORFs (e.g. S.iORF1 (Table 4, entry 6) is an internal ORF located downstream of the AUG start codon of S-ORF), upstream ORFs (uORFs), internal out-of-frame translations, and extended ORFs (such as M.ext, a 13 amino acid extension of ORF-M, cf. Table 4, entry 11). Furthermore, this study reported that virus translation dominates host translation due to increased levels of viral transcripts. Identification of these proteins may be important in the development of new vaccines and in elucidating new biochemical properties of SARS CoV-2.
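To make concrete why a scan that considers only sequential AUG codons can miss internal and non-canonical ORFs, the sketch below searches an RNA string for every start codon named in the text and follows each one in frame to the first stop codon. This is a naive sequence scan under stated assumptions, not the ribosome profiling analysis used in the study [82]: the function `find_orfs`, the `min_codons` parameter, and the toy sequence are hypothetical, and a sequence match alone does not show that translation actually initiates at a site.

```python
from typing import List, Tuple

STOP_CODONS = {"UAA", "UAG", "UGA"}
# Canonical AUG plus the alternative start codons mentioned in the text.
START_CODONS = {"AUG", "CUG", "ACG", "AUU", "AUC", "UUG"}


def find_orfs(rna: str, min_codons: int = 2) -> List[Tuple[int, int, str]]:
    """Return (start, end, start_codon) for each start codon with an in-frame stop.

    Coordinates are 0-based and end-exclusive, and the span includes the stop
    codon. min_codons counts the codons before the stop, including the start.
    """
    rna = rna.upper().replace("T", "U")
    orfs = []
    for start in range(len(rna) - 2):
        codon = rna[start:start + 3]
        if codon not in START_CODONS:
            continue
        # Walk downstream codon by codon in the same reading frame.
        for pos in range(start + 3, len(rna) - 2, 3):
            if rna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, codon))
                break
    return orfs


# Hypothetical toy sequence, NOT taken from the SARS CoV-2 genome.
toy_rna = "GGAUGGCUCUGAAAUAAGG"
print(find_orfs(toy_rna))  # -> [(2, 17, 'AUG'), (8, 17, 'CUG')]
```

In the toy sequence, the CUG-initiated ORF lies in frame inside the AUG-initiated ORF and shares its stop codon, which is the kind of internal ORF that a search restricted to sequential AUG codons would overlook.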

Table 4 The 23 new proteins identified in SARS CoV-2 from the ribosome profiling study [82] of cells infected with SARS CoV-2 (continued to Table 5)
Table 5 The 23 new proteins identified in SARS CoV-2 from the ribosome profiling study [82] of cells infected with SARS CoV-2 (continued from Table 4)

2.6.7 Vaccines

The most promising news has been the success of an RNA-based SARS CoV-2 vaccine, BNT162b2, which has been shown to be 95% effective [4, 87]. A related candidate, BNT162b1, is a lipid nanoparticle formulated RNA vaccine that encodes a trimerized SARS CoV-2 receptor binding domain (RBD) [3]. A second mRNA vaccine (mRNA-1273), which encodes the prefusion spike protein of SARS CoV-2, is also showing effectiveness in producing antibody responses [5, 88]. Another vaccine (ChAdOx1 nCoV-19) is undergoing phase 2/3 clinical trials with promising results [89]. ChAdOx1 is a chimpanzee adenovirus vector, and the researchers have designed the vaccine to deliver the codon-optimized full-length spike protein of SARS CoV-2 [90]. In the clinical trial studies, ChAdOx1 nCoV-19 had an acceptable safety profile and was effective against symptomatic COVID-19 [91]. The new vaccines are becoming available [92] just as other variants of SARS CoV-2 are emerging. These recent strains (D614G and 501Y.V2) possibly have enhanced infectivity and virion stability compared to the originally identified strain [93, 94]. Although it has been suggested that these new mutations in SARS CoV-2 strains do not increase transmissibility [95], this research area remains a high priority worldwide. Investigating the biochemistry of the proteins found in SARS CoV-2 provides a better understanding of the virus, which may help to prevent another global pandemic.

3 Conclusion

A comprehensive review of the current literature on SARS CoV-2 is impossible due to the explosion in publications on this urgent topic. Despite the increase in publications, more experimental work is desperately needed to gain better insights into how SARS CoV-2 caused this global pandemic and to develop new strategies to combat this virus [96]. A catalyst for seeking deeper knowledge of the cause of COVID-19 is the emergence of the recent variant strains that may be more infectious [2] than the original SARS CoV-2 strain. This pandemic has reminded humanity how fragile our existence is and how vital a role science and research play, and will continue to play, in our lives.