A structural view of the SARS-CoV-2 virus and its assembly


 The SARS-CoV-2 pandemic that struck in 2019 has left the world crippled with hundreds of millions of cases and millions of people dead. During this time, we have seen unprecedented support and collaboration amongst scientists to respond to this deadly disease. Advances in the field of structural biology, in particular cryoEM and cryo-electron tomography, have allowed unprecedented structural analysis of SARS-CoV-2. Here, we review the structural work on the SARS-CoV-2 virus and viral components, as well as its cellular assembly process, highlighting some important structural findings that have made significant impact on the protection from and treatment of emerging viral infections.



Introduction
In November 2002, an outbreak of an atypical pneumonia struck the Guangdong province of China [1]. A novel coronavirus, later named SARS-CoV, was identified as the cause of the epidemic, with a case fatality rate of 9.7% [2][3][4][5]. This was followed by the Middle East respiratory syndrome coronavirus (MERS-CoV) outbreak in 2012, with a very high case fatality rate of 34% [5]. In 2019, the world was hit by another coronavirus, SARS-CoV-2.
SARS-CoV-2 has a much lower fatality rate, which increases steeply with age. However, it has a far higher transmission rate than SARS-CoV or MERS-CoV [5]. The response to the SARS-CoV-2 pandemic has shown that vaccines can be created quickly in response to such a global crisis [6]. Structural knowledge of the virus and its viral components is critical for the development of novel treatments and vaccines. Structural biology has informed the development of SARS-CoV-2 vaccines that utilize the spike, as well as the development of potential therapeutics targeting the main protease (M Pro) [7,8,9]. Other protein factors, such as the papain-like protease (PL Pro) and RNA-dependent RNA polymerase (RdRp), also present promising targets for therapeutic treatment [10-12,13]. Following the outbreak, an astounding number of protein structures from the SARS-CoV-2 virus have been determined, with over 1400 atomic models deposited in the RCSB Protein Data Bank (PDB) and 600 electron density maps in the Electron Microscopy Data Bank (EMDB) (as of October 1st, 2021). These structures reveal how the virus infects its host and provide the basis for development of the COVID-19 vaccines and novel therapeutics [14,15,16]. Here we review major structural efforts on the virus, viral components, and its assembly process.

Molecular Architecture of the SARS-CoV-2 virus
Recent breakthroughs in cryo-electron tomography (cryoET) and subtomogram averaging (STA) have allowed for unprecedented structural analysis of molecular complexes in their native state to near-atomic resolution [17][18][19][20][21][22]. Several groups have used cryoET STA to image intact SARS- . The non-structural proteins (Nsps) Nsp1-16 are produced from self-cleavage of the precursor polyproteins Pp1a and Pp1ab by viral proteases [11]. PL Pro (Nsp3) cleaves three sites resulting in free Nsp1-3, while M Pro (Nsp5) is responsible for the remaining 11 cleavage sites [11]. This allows the Nsps to perform their functions in the host cell, ranging from RdRp (Nsp12) and helicase (Nsp13) activity to the generation of double membrane vesicles for viral genome replication, transcription, and RNA transport (Figure 1a).
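The cleavage accounting in this paragraph (three PL Pro sites releasing Nsp1-3, and 11 remaining M Pro sites) can be checked with a short sketch. The ordering of Nsps along Pp1ab used here is an assumption not stated in the text: Nsp11 is taken to be produced only from Pp1a, so Pp1ab is treated as Nsp1-10 followed by Nsp12-16. The function and variable names are illustrative only.

```python
# Assumption (not from the text): Pp1ab carries Nsp1-10 followed by Nsp12-16,
# with Nsp11 produced only from the shorter Pp1a polyprotein.
PP1AB_ORDER = [f"Nsp{i}" for i in list(range(1, 11)) + list(range(12, 17))]


def assign_junctions(order, n_plpro_sites=3):
    """Pair neighbouring Nsps into junctions and assign each to a protease.

    The first `n_plpro_sites` junctions are assigned to PL Pro (releasing
    Nsp1-3); every remaining junction is assigned to M Pro.
    """
    junctions = list(zip(order, order[1:]))
    return {
        "PLpro": junctions[:n_plpro_sites],
        "Mpro": junctions[n_plpro_sites:],
    }


sites = assign_junctions(PP1AB_ORDER)
print(len(sites["PLpro"]), len(sites["Mpro"]))  # 3 11
```

Under this assumed ordering, the 15 mature proteins of Pp1ab are separated by 14 junctions, which matches the 3 + 11 split described above.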

Spike Glycoprotein
The coronavirus surface S glycoprotein is a ~600 kDa trimer, one of the largest known class 1 fusion proteins. Located on the outer envelope of the virion, it plays a critical role in viral infection through recognition of the host cell receptors and by mediating the fusion of the viral and host cell membranes. S has also been shown to elicit a strong immune response, making it the primary target for the recently developed SARS-CoV-2 vaccines needed to stem the COVID-19 pandemic [28][29][30][31]. The SARS-CoV-2 S gene encodes a ~1300 amino acid precursor protein, which is activated through proteolytic cleavage into an amino (N)-terminal S1 subunit (~700 amino acids) and a carboxyl (C)-terminal S2 subunit (~600 amino acids) with a single transmembrane (TM) region anchor (Figure 2a). The S1 and S2 subunits form a heterodimer, which in turn oligomerizes into a trimer, resulting in the formation of the surface spike on the virion (Figure 2b) [32]. In its prefusion form, the S protein appears to have two conformations: the "RBD down" conformation and the "RBD up" conformation (Figure 2d). In its trimeric prefusion form, the "RBD down", "one RBD up", and "two RBD up" conformations have been observed [23,24,32]. During infection, the RBD of SARS-CoV-2 binds to angiotensin converting enzyme 2 (ACE2) on the surface of target cells before undergoing viral uptake and fusion (Figure 2e) [33,34,35-37]. ACE2 has only been shown to bind RBDs in the "up" conformation [33,34]. The RBD has been shown to be an important site for neutralizing antibodies that target S [33,38,39,40,41-51].
The cryoEM structures of most antibodies bound to S have at least one RBD in the "up" conformation, similar to ACE2 (Figure 2e) [33,38,39,52-53]. However, several other antibodies and antibody fragments, such as VH ab8, can bind RBDs in the "down" conformation (Figure 2e) [33,41,45,54]. Other antibodies bind to other regions of S, such as the NTD [55][56][57][58]. While the glycosylation of S is thought to shield it from antibody recognition, some neutralizing antibodies can still bind to glycan-containing epitopes, allowing an immune response [59][60][61]. Previous studies of SARS-CoV found that two proline substitutions at residues 986 and 987 (Figure 2a, c) stabilize S in its prefusion form, which elicits a strong immune response [7,62]. S stabilized with this two-proline mutation has been used in the development of both the Moderna and Pfizer mRNA vaccines [14][15].

J o u r n a l P r e -p r o o f
During infection, the S1 subunit is shed and S2 undergoes a large conformational change compared to its prefusion state (Figure 2c) [32,63]. This structural rearrangement brings the fusion peptide and transmembrane domain together at the same end of the spike molecule, resulting in the insertion of the fusion peptide into the host membrane (Figure 2c) [32,63]. HR1 and CD form an extended, three-

Envelope Protein
Along with M, the coronavirus E protein is one of the major membrane components in SARS-CoV-2. E is a small, 8.5 kDa protein consisting of 75 amino acid residues. In coronaviruses, E is a cation-selective viroporin, forming a channel across the endoplasmic reticulum-Golgi intermediate compartment (ERGIC) membrane. In SARS-CoV, E mediates the budding and release of viruses [64]. Deletions of E have been shown to attenuate the virus, while mutations abolishing channel activity reduce pathogenicity [65]. This makes E a potential target for antiviral drug development as well as a potential vaccine candidate in SARS-CoV-2.
Despite its importance, until recently the E protein structure remained elusive. The structure of the transmembrane domain of SARS-CoV-2 E was determined using solid-state NMR in phospholipid bilayers [64]. This work reveals that the transmembrane domain of E forms a compact and rigid homopentameric helical bundle (Figure 3a). The central portion of the TM domain contains four hydrophobic residues lining the core, narrowing the radius to ~2 Å.
As this would only permit a single file of water molecules and partially dehydrate any ions that move through the pore, this structure may represent a closed state of SARS-CoV-2 E [64]. The  In SARS-CoV-2, the N protein consists of three intrinsically disordered regions: the N-arm, central linker region (LKR), and C-tail, as well as two structural domains: the NTD (Figure 3b, green) and CTD (Figure 3b, orange/blue) [67]. Previous work on SARS-CoV has shown that the NTD serves as the RNA-binding domain, while the CTD functions as a dimerization domain [68].
In SARS-CoV-2, the N protein forms dimers through CTD interactions (Figure 3b) [67,69-71]. N proteins form RNP complexes with the viral genome. These RNPs are thought to be linked to neighboring RNPs in a "beads on a string" manner [72]. CryoET STA has revealed a reverse G-shaped This "beads on a string" mechanism of genome packaging would maintain the high steric flexibility of the vRNPs necessary to allow the unusually large genome to be packaged efficiently within the budding virions [72].

Main Protease and Papain-like Protease
One attractive drug target for SARS-CoV-2 is non-structural protein 5 (Nsp5), the main protease (M Pro). M Pro, a 3C-like protease, is responsible for processing 11 M Pro-specific sites on the two SARS-CoV-2 polyproteins Pp1a and Pp1ab during their maturation into 16 non-structural proteins (Nsp1-16) [73][74]. M Pro cleaves at a Leu-Gln↓(Ser, Ala, Gly) recognition sequence (with ↓ denoting the cleavage site) [8]. As no known human proteases have the same specificity as M Pro, M Pro appears to be an ideal target for therapeutic development [8,16,73]. The structure of M Pro was used to develop a ketoamide inhibitor that targets the active site of SARS-CoV-2 M Pro as the basis of therapeutic development [8]. In addition, the structure of M Pro was determined with N3, a mechanism-based inhibitor, and used for further inhibitor discovery using in silico analysis [16].
A second, papain-like protease (PL Pro), encoded in Nsp3, is the protease responsible for cleaving the remaining three cleavage sites of the polyproteins [10]. The crystal structure of PL Pro has also been determined [11,76], showing that PL Pro contains two domains: a small, N-terminal ubiquitin-like domain, and a catalytic domain with a "thumb-palm-fingers" architecture (Figure 3c). The catalytic active site sits between the thumb and palm domains and contains a canonical cysteine protease catalytic triad, recognizing a Leu-X-Gly-Gly↓(XX) sequence [11]. PL Pro can also recognize the C-terminal sequence of ubiquitin. This makes development of a PL Pro inhibitor more challenging, as care must be taken to ensure any inhibitor does not interfere with host deubiquitinases [73].
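As a minimal illustration of the two recognition sequences quoted above, the sketch below scans a peptide for candidate M Pro (Leu-Gln↓Ser/Ala/Gly) and PL Pro (Leu-X-Gly-Gly↓) sites using regular expressions. The demo peptide is made up for illustration and is not a viral sequence, and the function and pattern names are hypothetical. Real substrate recognition depends on structural context that a simple motif match does not capture.

```python
import re

# Made-up peptide for illustration only; NOT a real viral sequence.
DEMO_SEQUENCE = "MKTAVLQSGFRKLAGGAPTKVLQAEELKGGD"

# Simplified patterns based on the recognition sequences quoted in the text.
# M Pro: Leu-Gln, cut after Gln, when the next residue is Ser, Ala, or Gly.
MPRO_MOTIF = re.compile(r"LQ(?=[SAG])")
# PL Pro: Leu-X-Gly-Gly, cut after the second Gly (X = any residue).
PLPRO_MOTIF = re.compile(r"L.GG")


def candidate_cleavage_positions(sequence, pattern):
    """Return 1-based positions of the residue immediately before each cut."""
    return [match.end() for match in pattern.finditer(sequence)]


print(candidate_cleavage_positions(DEMO_SEQUENCE, MPRO_MOTIF))   # [7, 23]
print(candidate_cleavage_positions(DEMO_SEQUENCE, PLPRO_MOTIF))  # [16, 30]
```

The lookahead in the M Pro pattern keeps the residue after the cut out of the match, so the reported position is always that of the Gln preceding the scissile bond.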

Other non-Structural Proteins
The structures of several other non-structural proteins from SARS-CoV-2 have been reported as potential targets for therapeutics. Nsp1 functions as a host shutoff factor, binding the mRNA entrance channel of ribosome complexes. A cryoEM structure of the Nsp1-ribosome complex shows its C-terminal region forming two α-helices that bind within the entrance channel of the 40S subunit (Figure 3d) [77][78][79]. Helix 1 (residues 153-160) interacts with ribosomal helix 18 through hydrophobic interactions, while helix 2 (residues 166-178) interacts with the phosphate backbone of helix 18 through conserved arginine residues R171 and R175, allowing it to inhibit translation of host mRNA [77][78][79].
While the full-length structure of Nsp2 remains undetermined, the structure of the N-terminal domain of Nsp2, Nsp2(1-276), has recently been solved [80]. This structure reveals Nsp2(1-276) to be a novel zinc finger domain consisting of three zinc fingers: ZnF1, ZnF2, and ZnF3 (Figure 3e) [80]. A large, positively charged region on the surface of Nsp2(1-276) was then shown to be able to bind to dsDNA. Experiments chelating the Zn with EDTA, together with mutagenesis, indicate that Nsp2(1-276) binds DNA through this charged surface, with the zinc fingers not being directly involved [80]. While this provides insight into the potential function of Nsp2(1-276), the role of Nsp2 in SARS-CoV-2 infection remains unknown [80].
Along with PL Pro, Nsp3 contains a macrodomain responsible for removal of ADP-ribose from ADP-ribosylation sites during infection, potentially playing an important role in disrupting host ADP-ribosylation [81][82][83]. As ADP-ribosylation has been linked to innate immune response, this macrodomain may provide an attractive target for drug development. This macrodomain has a baseball glove-like structure, with an ADP-ribose-binding pocket (Figure 3f) [81][82][83]. A structural comparison of the binding site crystallized with a variety of substrates suggests high structural plasticity within the binding site, presenting an opportunity for rational design of small molecule inhibitors [84]. This pocket has been shown to bind GS-441524, a remdesivir metabolite, supporting the hypothesis that the macrodomain represents a promising drug target [84].
Nsp12, the RdRp, is essential for the synthesis of viral RNA and the primary target for RNA analog therapeutics such as remdesivir [12,13,85]. On its own, Nsp12 has low polymerase activity. Upon addition of the Nsp7 and Nsp8 cofactors and formation of a holo-RdRp:RNA complex with scaffold RNA made up of template RNA (t-RNA) and primer RNA (p-RNA), the polymerase activity of Nsp12 is greatly improved [86]. The holo-RdRp:RNA complex consists of a single Nsp7/Nsp8 heterodimer bound to Nsp12, as well as a single Nsp8 at a separate Nsp12 The Nsp9 in its crystal structure forms a dimer, with each monomer containing a unique fold limited to coronaviruses [92]. The structure consists of an enclosed six-stranded β-barrel with outward-projecting loops connecting the β-strands; a projecting N-terminal β-strand and a C-terminal α-helix make up the dimerization interface. Nsp10 from SARS-CoV-2 is a non-classic zinc finger protein, containing two zinc finger motifs. Nsp10 acts as a cofactor, necessary for stimulation of Nsp14 and Nsp16 [93]. Nsp14 is a bifunctional protein, consisting of an N-terminal exoribonuclease domain (ExoN) and a C-terminal guanine-N7-MTase domain involved in capping [94,95]. The overall structure of Nsp10/Nsp14-ExoN consists of the Nsp14-ExoN leaning along the Nsp10 monomer, with peripheral regions of Nsp14-ExoN interacting with most regions of Nsp10 [94]. Nsp10/14 associate with the Nsp13-RTC complex, mediated by an Nsp9-Nsp12 interaction, forming a cap(0)-RTC complex (Figure 3h) [90,95].
This cap(0)-RTC complex can form dimers, positioning the Nsp14 ExoN domain facing the Nsp12 reaction center, revealing a potential mechanism for Nsp14 to exert its proofreading activity [95].
Nsp16 is an S-adenosylmethionine-dependent methyltransferase (SAM-MTase) essential for methylation of the viral RNA cap [96,97]. In the overall structure of Nsp10/Nsp16, an Nsp16 monomer sits on top of an Nsp10 monomer [96].

Open Reading Frame Accessory Proteins
ORF3a from SARS-CoV-2 is a conserved protein across the Sarbecovirus subgenus, which includes SARS-CoV. ORF3a has been implicated in apoptosis and inhibition of autophagy.
ORF3a has been proposed to form an ion channel, the second viroporin encoded in the SARS-CoV-2 genome. However, the function of ORF3a during infection remains unknown. The cryoEM structure of ORF3a was recently determined in lipid nanodiscs, revealing that ORF3a forms a dimeric or tetrameric ion channel [102]. Each ORF3a protomer is composed of a transmembrane domain (TM) of three helices, TM1, TM2, and TM3, that connects to a cytosolic domain (CD) extending into the cytosol (Figure 3k). In the dimeric form, two of these protomers come together to form the ORF3a ion channel. The CD is made up of an eight-stranded β-sheet sandwich, with the inner sheets from each protomer forming a stable hydrophobic core. In its tetrameric form, two ORF3a dimers come together through interactions between the TM3/CD linker regions, as well as β1/β2 of neighbouring dimers.
In ORF3a, the lower half of the TM region contains a polar cavity with a lower tunnel, open to the cytosol, and an upper tunnel, likely open to the membrane. While most ion channels contain a central pore, in the case of ORF3a the extracellular TM region forms a hydrophobic seal [102]. Some ion channels have evolved pathways of external grooves or tunnels on membrane-facing surfaces of the channel. ORF3a contains a distinct membrane-facing hydrophilic groove between TM2 and TM3, connected to the upper tunnel. Mutations in this region alter ion permeability, supporting the hypothesis that these external grooves are involved in ion transport [102]. As deletions of ORF3a have lowered viral titer and mortality in mice, ORF3a may provide a target for novel therapeutic development.
Several other small open reading frame protein structures have also been recently determined.
The crystal structure of the ectodomain of ORF7a has recently been determined, revealing an Ig-like fold consisting of seven β-strands organized into two tightly packed β-sheets stabilized by two disulfide bonds (Figure 3l) [103]. ORF7a has been shown to interact with CD14+ monocytes with high efficiency [103]. While it functions as an immunomodulating factor and triggers an inflammatory response, the mechanism of ORF7a's interaction with CD14+ monocytes remains unknown [103].
ORF8 has been shown to disrupt IFN-I signalling in cells, as well as down-regulate MHC-I. The crystal structure of ORF8 has been determined, revealing a homodimeric Ig-like fold (Figure 3m) [104]. This dimer is linked by an intermolecular disulfide bond. The Ig-like fold is stabilized by two disulfide bonds conserved between ORF7a and ORF8. ORF8 also contains an ORF8-specific region distinct from other Ig-like folds, containing a third, ORF8-specific disulfide bond [104]. While many interactors for ORF8 have been identified, its mechanism of action remains unclear, necessitating more structural work on ORF8 in complex with host factors [104].
The structures of ORF9b provide insight into its mechanism for interfering with the type I interferon immune response by targeting TOM70 [105,106]. TOM70 forms a surface receptor for the translocase of the outer membrane (TOM) complex in mitochondria, playing a key role in relaying antiviral signalling from mitochondrial antiviral signalling (MAVS) to the TANK-binding kinase 1 through recruitment of heat shock protein 90 (Hsp90), ultimately resulting in interferon response [105]. Upon binding to TOM70, ORF9b takes on a helical conformation that binds within a deep pocket of the TOM70 C-terminal domain (CTD) (Figure 3n) [105][106]. This binding appears to stabilize TOM70 and allosterically inhibit recruitment of Hsp90, ultimately suppressing interferon response [105].

Virus assembly in the context of host cell
One of the major advances accompanying cryoET is cryo-focused ion beam scanning electron microscopy (cryoFIB/SEM). CryoFIB/SEM uses a focused ion beam to create 150-250 nm thick cell lamellae, which can then be imaged using cryoET to determine macromolecular complex structures in situ [17]. This method was used for imaging SARS-CoV-2 virions at different stages over the course of infection, allowing high resolution characterization of viral structure and replication (Figure 4a-b) [72,107]. Using cryoFIB/SEM, double membrane vesicles (DMV) in fixed SARS-CoV-2 infected cells were revealed to contain multiple copies of a membrane-spanning pore complex (Figure 4c) thought to be composed of Nsp3 and other unknown proteins [107,108]. Newly synthesized RNA is hypothesized to be transported out of DMVs through the transmembrane portals for subsequent protein production and virus assembly [72,107]. S is transported in its trimeric prefusion form to assembly sites via small transport vesicles that then fuse with single membrane vesicles (SMV) where virus assembly takes place (Figure 4d) [107]. CryoET imaging of early budding events revealed a positively curved membrane decorated with S on the luminal side, and vRNPs on the cytosolic side (Figure 4e) [72]. S clusters with the SMV near the electron-dense areas with encapsidated RNPs, ultimately leading to budding [107]. In SMVs, S trimers show a polarized distribution and are likely mobile, allowing them to redistribute during the budding process [72].

Future perspective
The SARS-CoV-2 pandemic has brought the importance of scientific research to the forefront of the media and public's view. The unprecedented collaboration has given us the tools and insights needed to develop not one, but several vaccines in record time. The two-proline mutation that was identified in previous MERS-CoV and SARS-CoV work to stabilize S in its prefusion conformation [109] has been used in the development of both the Pfizer and Moderna mRNA vaccines, increasing their efficacy [7,14-15]. CryoET has also been shown to be valuable in validating the post-translational processing and glycosylation of S in the ChAdOx1 nCoV-19 vaccine in human cells, providing a useful tool for checking future vaccine efficacy [110].
However, as the virus continues to spread and mutate, further research will be important in determining new vaccine and therapeutic targets, as it is unlikely that we will be able to eradicate SARS-CoV-2 anytime soon. The structure of M Pro has already resulted in the direct design of inhibitors, as well as being used for virtual screening of thousands of compounds for potential inhibitors [8,16]. Pfizer has recently developed a novel M Pro inhibitor, PF-07321332, the first orally administered M Pro inhibitor to begin clinical trials [111]. Structures of S from variants could provide the basis for future vaccine development and provide insight into how variants evade immune response [112]. Structural work on the RdRp bound by nucleoside analogs presents it as a promising target for novel therapeutic development. Most recently, the inhibitor molnupiravir, which causes an error catastrophe during replication, has been approved for use in the UK (as of 4th November 2021) [113].

Other proteins, such as the Nsp3 macrodomain, also provide promising candidates for therapeutic development. As the Nsp3 macrodomain has been shown to bind a remdesivir metabolite, it is a prime therapeutic target [84]. E also provides an interesting target, as deletions of E or mutations abolishing channel activity have shown promise in SARS-CoV [65]. Small molecules designed to target the acidic or polar residues at the N-terminus could provide an effective therapeutic approach [64].
Structural work on PL Pro also shows promise towards the development of novel inhibitors [10][11].
Several proteins and complexes from SARS-CoV-2 still elude structure determination. The structure of the major structural protein M has yet to be determined, leaving many questions as to how it interacts with S and what role it plays in viral genome packaging. While the structure of the N-terminal domain of Nsp2 has been determined, a structure of the full-length protein has yet to be published. Additional non-structural proteins that have not had their structures determined include Nsp4, Nsp6, and Nsp11. Another complex of interest that eludes structure determination is the membrane pore complex in DMVs. Amongst the accessory proteins, ORF3b, ORF7, ORF7b, ORF9c, and ORF10 have yet to be described structurally. In addition, while structures of ORF7a and ORF8 have been determined, their mechanisms of action within the host remain unknown [103][104]. As structural techniques improve, it will become more feasible to address the structures of these elusive proteins and their interactions with host factors. Further work on the molecular architecture of SARS-CoV-2 proteins and their host factor interactions could provide the foundation for new developments in antiviral therapies and vaccines, such as work to further stabilize S in its prefusion conformation [109,114].

Figure 3 (caption, panels l-n only): (l) Crystal structure of ORF7a (pink) (PDB 7CI3) [103]. Disulfide bonds are shown in green. (m) Crystal structure of the ORF8 homodimer (tan and cyan) (PDB 7JTL) [104]. Disulfides conserved with ORF7a (green), the intermolecular disulfide (yellow), and the ORF8-specific disulfide (magenta) are shown. (n) CryoEM structure of ORF9b (red) bound to TOM70 (gold) (PDB 7KDT) [106].

N CTD: PDB 7CE0 [67]