Structures and dynamics of the novel S1/S2 protease cleavage site loop of the SARS-CoV-2 spike glycoprotein


Introduction
Coronaviruses (CoVs) are a large group of enveloped, single-stranded positive-sense RNA viruses that infect humans and a wide range of animals including birds and mammals (Menachery et al., 2017). A new strain of coronavirus known as SARS-CoV-2 (severe acute respiratory syndrome coronavirus-2) or 2019-nCoV was first reported at the end of 2019 in the Chinese city of Wuhan (Hubei province) as a human pathogen (Zhou et al., 2020b; Zhu et al., 2020). The SARS-CoV-2 coronavirus causes fever, a dry cough, breathing difficulties and, in certain cases, pneumonia and severe respiratory syndrome, which can lead to death. This novel and highly infectious coronavirus respiratory illness was named COVID-19 (Chan et al., 2020; Huang et al., 2020) and marks the third emergence in recent years of a coronavirus that can be life-threatening for humans. Previous coronavirus outbreaks include SARS-CoV-1 and the Middle East Respiratory Syndrome (MERS), which appeared in 2002 (World Health Organization, 2003) and 2012 (World Health Organization, 2012), respectively. SARS-CoV-1 disappeared about two years later, whereas MERS continues to affect a small number of people, mainly in the Middle East. SARS-CoV-1, SARS-CoV-2 and MERS coronavirus (MERS-CoV) are animal pathogens that crossed the species barrier and infected humans who had direct or indirect contact with infected animals (Lu et al., 2015). Unfortunately, SARS-CoV-2 can also be transmitted from human to human and has thus spread worldwide at an alarming rate. In March 2020, the World Health Organization (WHO) declared the worldwide outbreak of the new coronavirus a pandemic. To date, no approved vaccines or proven therapeutics against coronaviruses infecting humans are available.
Spike glycoproteins form homotrimeric membrane protein complexes that protrude from the viral surface, giving the viral particles a "crown" (corona)-like appearance. In contrast to other viruses, e.g., morbilliviruses such as measles virus and canine distemper virus (Plattet et al., 2016), which possess separate receptor and fusion proteins in the viral membrane, coronavirus spike proteins are composed of two functional domains/subunits: one acting as a receptor (S1) and the other as a fusion subunit (S2) (Fig. 1a).
S proteins can be divided into several conserved domains and motifs (Fig. 1a). The N-terminal S1 domain contains the receptor-binding domain (RBD), which recognizes the angiotensin-converting enzyme 2 as a receptor on host cells in SARS-CoV-1 (Li et al., 2003) and SARS-CoV-2 (Hoffmann et al., 2020b; Walls et al., 2020; Wrapp et al., 2020). The C-terminal S2 domain is responsible for mediating membrane fusion between the virus and the host cell, and includes the fusion peptide (FP), two heptad repeats (HR1 and HR2), the transmembrane domain (TM) and other domains (Fig. 1a).

Structure of the SARS-CoV-2 ectodomain S protein
At the beginning of 2020, structures of the SARS-CoV-2 S protein ectodomain trimer were published (Wrapp et al., 2020), providing valuable information on the complex architecture. It should be noted that the recombinant SARS-CoV-2 proteins were designed in a prefusion-stabilized conformation, e.g., with an abrogated S1/S2 protease cleavage site. Cryo-electron microscopy (cryo-EM) structures of the SARS-CoV-2 S protein ectodomain are available in the closed state, where all RBDs are tightly packed together, and in the partially open state, with one open and two closed RBDs in the trimer (Wrapp et al., 2020) (Fig. 1b and c). Recently, the groups of McLellan and Veesler (McCallum et al., 2020) presented additional engineered versions and structures of the SARS-CoV-2 S protein ectodomain stabilized with two open, one closed and all closed RBDs. The structure of SARS-CoV-2 S protein resembles that of SARS-CoV-1 (Walls et al., 2020; Wrapp et al., 2020). One difference consists in the packing of the RBDs in their closed conformations, i.e., the RBDs in SARS-CoV-1 are tightly packed against the N-terminal domain (NTD), while the S protein in SARS-CoV-2 is angled and closer to the central cavity of the trimer. Structures of SARS-CoV-2 resolved 48 and 44 of the 66 N-linked glycosylations per trimer.
Fig. 1. Structure of the SARS-CoV-2 S protein and comparison of S1/S2 protease cleavage site loops of closely related coronaviruses. a Schematic representation of the SARS-CoV-2 spike glycoprotein primary structure. The different domains are colored and defined as follows: SS, signal sequence; NTD, N-terminal domain; RBD, receptor-binding domain; S1/S2, S1/S2 protease cleavage site; S2′, S2′ protease cleavage site; FP, fusion peptide; HR1, heptad repeat (HR) 1; CH, central helix; CD, connector domain; HR2, stalk domain containing HR2; TM, transmembrane domain; CT, cysteine-rich cytoplasmic domain. Arrowheads mark protease cleavage sites S1/S2 and S2′. Prefusion state structures of the SARS-CoV-2 S protein ectodomain with: b all RBDs closed (PDB ID code: 6VXX), and c one open and two closed RBDs (PDB ID code: 6VYB). Domains of the spike glycoproteins for which no structural information is available are represented schematically. Glycans are not displayed. The domains are colored according to the color code given in panel a. Amino acid (aa) sequence alignment of selected CoV beta-hairpins containing S1/S2 protease cleavage site loops (d), and SARS-CoV-2 and MERS-CoV inner loop regions (e). The novel insertion identified in the SARS-CoV-2 loop is highlighted with amino acid residues in bold and overlined. The aa residues of the two conserved beta-strands, which enclose the loop containing the S1/S2 priming site, are indicated by β characters. The sequence conservation is represented with characters, i.e., positions that have a fully conserved residue (*), and conservation between groups of strongly (:) and weakly similar properties (·). Color coding of aa residues is according to their physicochemical properties: small and hydrophobic (including aromatic except tyrosine) are in red, acidic in blue, basic in magenta and others in green. Protease cleavage sites are indicated (↓). Sequence alignment was performed with Clustal Omega (Sievers et al., 2011) using the aa sequences of SARS-CoV-2 (GenBank: QHD43416.1), SARS-CoV-1 (GenBank: AYV99817.1), RaTG13-CoV (GenBank: QHR63300.2), Pangolin-CoV (GenBank: QIA48632.1) and MERS-CoV (GenBank: AFS88936.1).

The extended S1/S2 protease cleavage site of SARS-CoV-2 spike glycoprotein
Coronavirus spike glycoproteins are cleaved posttranslationally by host cell proteases into S1 and S2 domains, which remain bound via non-covalent interactions (Bosch et al., 2003). This processing step is known as priming and is essential for viral entry. In contrast to SARS-CoV-1, which contains a monobasic S1/S2 protease cleavage site that is processed upon entry into target cells, SARS-CoV-2 has an extended tribasic priming site including a pair of basic residues (Fig. 1d). Recent evidence from in vitro experiments indicates that this tribasic site is recognized and cleaved not only by furin (Hoffmann et al., 2020a; Jaimes et al., 2020) but also by additional proteases (Jaimes et al., 2020). This tribasic protease cleavage site, hereinafter referred to as the furin cleavage site, contains an insertion of 12 nucleotides encoding the aa sequence 681-PRRA-684 (Fig. 1d). The spike protein is thus already cleaved by furin or other proteases during biogenesis, differentiating this new virus from SARS-CoV-1 and other related CoVs. In coronaviruses, a second cleavage site called S2′ is localized upstream of the fusion peptide (Madu et al., 2009; Millet and Whittaker, 2015) (Fig. 1a). For full activation of the S protein and viral entry, cleavage at S1/S2 and S2′ is expected. Here again, SARS-CoV-2 is markedly different. Unlike other CoVs, which exhibit a monobasic S2′ cleavage site (R↓S), SARS-CoV-2 and closely related bat CoVs display a dibasic cleavage site (KR↓SF) (Coutard et al., 2020). Interestingly, CoVs presenting monobasic cleavage sites appear to be less pathogenic to humans (Coutard et al., 2020).
Betacoronaviruses are divided into four lineages denoted as: A, B, C and D. The SARS-CoV-1 and the new SARS-CoV-2 are both part of lineage B, and the MERS-CoV is part of lineage C. With the exception of the recently emerged SARS-CoV-2, multibasic S1/S2 protease cleavage sites are totally absent in lineage B. Only the S protein from the MERS-CoV (lineage C) also contains a related dibasic cleavage site (Fig. 1e). Proteolytic cleavage at the S1/S2 site is essential for viral entry. Blocking of this event could reduce or inhibit viral entry. We therefore carried out an extensive investigation into the structures and dynamics of the loop containing this novel S1/S2 protease cleavage site (Fig. 1d).

Molecular dynamics simulations
Two 10-µs molecular dynamics (MD) simulations of the SARS-CoV-2 S protein under physiological conditions (aqueous solution, 310 K and 1 atm) were carried out by D. E. Shaw Research on their Anton2 supercomputers. These simulations are freely available and can be downloaded from the D. E. Shaw Research website (Shaw Research, 2020). The closed (6VXX, simulation 11021566) and the partially open (6VYB, simulation 11021571) structures were used as initial models. Missing loops were added and the structures were fully glycosylated. The final systems were solvated in an aqueous buffer and neutralized using NaCl ions at a concentration of 150 mM. The molecular dynamics simulations were carried out using the Amber force field (ff99SB-ILDN for the protein and general force field for the glycans) and the trajectory was saved every 1.2 ns.

Structural analyses
A principal component analysis (PCA) of the molecular dynamics simulations was performed using ProDy (Bakan et al., 2011). The k-means clustering was implemented with the scikit-learn package (Pedregosa et al., 2011) and the optimal number of clusters was determined using the silhouette method (Rousseeuw, 1987).
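The analysis described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' script: scikit-learn's PCA is swapped in for ProDy, the input is synthetic data standing in for aligned and flattened loop coordinates, and the function name `cluster_conformations` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def cluster_conformations(coords, n_components=3, k_range=range(2, 11), seed=0):
    """Project conformations onto principal components, then choose the
    number of k-means clusters with the highest mean silhouette score."""
    pca = PCA(n_components=n_components)
    projected = pca.fit_transform(coords)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(projected)
        score = silhouette_score(projected, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    # Fraction of total variance retained by the selected components
    explained = float(pca.explained_variance_ratio_.sum())
    return best_k, best_labels, explained


# Synthetic stand-in (assumption): 4 well-separated groups of 50 "frames",
# each frame a flattened vector of 30 coordinates.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 30))
coords = np.vstack([c + rng.normal(scale=0.5, size=(50, 30)) for c in centers])

best_k, labels, explained = cluster_conformations(coords)
print(best_k, round(explained, 3))
```

With real trajectories, `coords` would hold one row per saved frame after superposition onto a common reference, and the retained variance and cluster count would depend on the data.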

Rosetta loop modeling
The missing loop containing the tribasic protease cleavage site was modeled using the remodel procedure in Rosetta. The procedure generated 600 loop conformations starting from the closed structure (6VXX). The model with the lowest energy was then further refined by running 600 instances of the kinematic closure (KIC) protocol in Rosetta (Mandell et al., 2009). The structures were clustered using the cluster application in Rosetta with a 2 Å radius and sorted by energy. We considered the top 10 clusters and removed the singletons.
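The post-processing step (cluster within a radius, rank by energy, keep the top clusters, drop singletons) can be illustrated with a simplified greedy scheme. This sketch is not the algorithm of the Rosetta cluster application; the function names, the toy energies and the toy pairwise distance matrix are all assumptions made for illustration.

```python
def radius_cluster(energies, dist, radius=2.0):
    """Greedy clustering: the lowest-energy unassigned model seeds a cluster
    and collects every unassigned model within `radius` of it."""
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    assigned = set()
    clusters = []
    for seed in order:
        if seed in assigned:
            continue
        members = [j for j in order
                   if j not in assigned and dist[seed][j] <= radius]
        assigned.update(members)
        clusters.append(members)
    return clusters  # ordered by the energy of each cluster's seed


def top_clusters_without_singletons(clusters, n_top=10):
    """Keep the first `n_top` clusters, then discard one-member clusters."""
    return [c for c in clusters[:n_top] if len(c) > 1]


# Toy data (assumption): five models with energies and pairwise RMSDs in Å.
energies = [-5.0, -4.0, -3.0, -2.0, -1.0]
dist = [
    [0.0, 1.0, 5.0, 5.0, 5.0],
    [1.0, 0.0, 5.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 5.0, 5.0],
    [5.0, 5.0, 5.0, 0.0, 1.5],
    [5.0, 5.0, 5.0, 1.5, 0.0],
]

clusters = radius_cluster(energies, dist, radius=2.0)
kept = top_clusters_without_singletons(clusters, n_top=10)
print(clusters, kept)
```

Removing singletons after selecting the top clusters is what allows fewer than ten clusters to remain for analysis.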

Structures and dynamics of the S1/S2 protease cleavage site loop in SARS-CoV-2 S protein
In CoVs, two conserved beta-strands form an anti-parallel beta-sheet connected by a loop, which contains the S1/S2 protease cleavage site (Fig. 1d). In order to gain insights into the structures and dynamics of the loop containing the novel multibasic furin cleavage site of the SARS-CoV-2 spike glycoprotein, we analyzed two multi-microsecond molecular dynamics (MD) simulations made freely available by D. E. Shaw Research (Shaw Research, 2020). These simulations were initiated from the closed (6VXX) and partially open (6VYB) structures with modeled missing loops. A principal component analysis of the conformations sampled by the beta-hairpin containing the furin cleavage site (I670 to T696) shows that the loop samples several distinct conformations (Fig. 2a). A system is considered ergodic if the time average equals the ensemble average. Despite the significant simulation time, i.e., multi-microsecond MD simulations, the conformations sampled by each protomer are distinct (Fig. 2), thus indicating that ergodicity was not achieved.
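The ergodicity criterion stated above can be written out explicitly. In this sketch, A is any observable of the loop conformation x, ρ(x) is the equilibrium distribution and T is the simulation length; these symbols are introduced here for illustration and are not taken from the original analysis.

```latex
\lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} A\bigl(x(t)\bigr)\,\mathrm{d}t
  \;=\; \langle A \rangle
  \;=\; \int A(x)\,\rho(x)\,\mathrm{d}x
```

Because the three protomers of the homotrimer share the same sequence, a converged trajectory would be expected to yield the same distribution of loop conformations for each protomer, which is why protomer-specific distributions point to incomplete sampling.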
In order to isolate representative conformations, a k-means clustering was carried out in the eigenspace defined by the first three principal components, which account for 81% of the total variance. The optimal number of clusters (k = 8) was determined using the silhouette method (Rousseeuw, 1987). During the MD simulations, the loop appears largely unstructured and samples several conformations extending outwards (Fig. 2b, clusters 4-7), making them potentially accessible for proteolytic cleavage. However, in three clusters, the loop also folds back towards the protein (Fig. 2b, clusters 1-3), causing the furin cleavage site to be less accessible. Finally, in one of the clusters of the closed structure (cluster 8), the loop points towards the apex and interacts extensively with the neighboring N-glycans (N61 and N603). The interactions remain stable throughout the simulations. A principal component analysis of conformations sampled by the beta-hairpin containing the furin cleavage site and the glycan rings indicates two comparable interaction modes (Fig. 3). The structures from cluster c1 were all sampled during the first 1 μs of simulation, thus corresponding to the initial equilibration of the interactions. We therefore focused the glycan analysis on cluster c2. Two arginine residues (R683 and R685) dominate the interactions with the glycans and form a persistent network of hydrogen bonds with several glycan moieties (Fig. 3b and c). The backbone of V687 also interacted with the N61 β-mannose in about 30% of the conformations. These glycans could thus play an important role in regulating the accessibility of the furin cleavage site. Taken together, these observations indicate a complex interplay between the dynamics of the novel multibasic S1/S2 protease cleavage site loop and neighboring glycans.
Since ergodicity was not achieved during the MD simulations, we used the ab initio modeling procedure of Rosetta, a powerful protein modeling software, to more extensively sample the conformations of the loop containing the S1/S2 cleavage site. No noticeable energy gap was observed between the different ab initio models; thus they were first clustered. Singletons were removed from the ten lowest-energy clusters, yielding up to eight clusters for the analysis (Fig. 4). A helical structure is observed for several of the low-energy structures in most clusters and is formed in the vicinity of the furin cleavage site (Fig. 4). Such conformations, however, were never sampled during the MD simulations. The conformations from the MD simulation most similar to the ab initio models superposed with an average RMSD of 2.5 ± 0.2 Å. With the exception of one model, all the ab initio models (Fig. 4) were closest to MD conformations that belonged to cluster 3 (Fig. 2). Only the MD conformation closest to model iv belonged to cluster 2. The presence of a helix could also influence the binding of the protease and thus cleavage of the loop containing the novel multibasic S1/S2 cleavage site.

Analysis of amino acid residues in the SARS-CoV-2 spike glycoprotein S1/S2 cleavage site containing loop, and of their potential structural and functional roles
From a structural point of view, the SARS-CoV-2 S protein proline residue (P681; Figs. 1d and 3c) in the insertion is eye-catching because of the special and unique structural properties of this proteinogenic amino (imino) acid. The MERS-CoV S protein is one of the few other CoV spike proteins that also contain a proline residue at the corresponding position in the S1/S2 protease cleavage site (Fig. 1e). A search of the database FurinDB (Tian et al., 2011) (http://www.nuolan.net/substrates.html), which includes experimentally verified furin cleavage sites, shows that a proline residue at position P5, i.e., the 5th residue prior to the furin cleavage site, is rare and appears in only 5 out of 132 sequences (three mammalian and two viral sequences). Since proline is unable to adopt several main chain conformations in proteins, it imposes strong conformational restraints on the peptide chain. It is therefore often found in turns, which force the peptide chain to change direction and separate secondary structures. This is supported by the ab initio modeling, where the proline is found at the N-terminus of short helices in several models (Fig. 4). Finding this proline in the insertion, just before the basic amino acid residues that define the SARS-CoV-2 S protein furin cleavage site, is interesting, since it nicely separates the cleavage site from other structural elements, which might better expose it to the proteases. Recently, Andersen et al. (Andersen et al., 2020) proposed that the presence of the proline residue in the insertion would result in the addition of O-linked glycans at flanking positions S673, T678 and S686. In the recent structures of the S protein ectodomain (Wrapp et al., 2020), only S673 could be modeled into the density map from cryo-EM.
The authors of the published structures did not model a glycan at position S673 and no additional density is visible near S673 when inspecting the density maps (https://www.ebi.ac.uk/pdbe/emdb/; EMD-21452, EMD-21457 and EMD-21375). Glycans on the surface of viral proteins often mask immunodominant epitopes, thus protecting them from the host's immune system. However, glycosylation of residues flanking the furin cleavage site does not appear to be beneficial, since this would prevent the full maturation of the S protein by shielding the cleavage site from the proteases. In addition, and more importantly, the recent mapping of O-glycosylation in SARS-CoV-2 spike protein by high resolution LC-MS/MS does not report O-glycosylation at positions S673, T678 and S686 (Shajahan et al., 2020).
T. Lemmin, et al. Journal of Structural Biology: X 4 (2020) 100038
The presence of alanine (A684) at position P2, i.e., the 2nd residue prior to the furin cleavage site, is also unusual in a furin cleavage site and appears in only 5 out of 132 sequences in the FurinDB (Tian et al., 2011). This position (P2) is predominantly occupied by Arg or Lys, and it has been shown that a basic residue at P2 greatly enhances processing efficiency (Shiryaev et al., 2013; Thomas, 2002). Thus, the alanine at P2 in SARS-CoV-2 S protein is expected to decrease the furin cleavage efficiency compared to sites containing basic amino acids at P2. However, this reduction in cleavage efficiency might be largely compensated by the presence of a total of three basic residues (i.e., a relatively high number in SARS-CoV-2 compared to other CoVs, see Fig. 1d) at the S1/S2 protease cleavage site loop.
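The P-position numbering used in this section counts residues backwards from the cleavage site, with P1 being the residue immediately before it. The short sketch below maps the residues named in the text onto these labels, assuming R685 occupies P1 (consistent with P681 at P5 and A684 at P2); the helper `p_position` is hypothetical and introduced only for illustration.

```python
def p_position(residue_number, p1_residue=685):
    """Label a residue N-terminal to the cleavage site as P1, P2, ...
    counting backwards from the residue just before the scissile bond."""
    offset = p1_residue - residue_number + 1
    if offset < 1:
        raise ValueError("residue lies C-terminal to the cleavage site")
    return f"P{offset}"


# Residues 681-685 as given in the text: the 681-PRRA-684 insertion plus R685.
site = {681: "P", 682: "R", 683: "R", 684: "A", 685: "R"}
labels = {number: p_position(number) for number in site}

# Share of FurinDB entries (5 of 132) with proline at P5 or alanine at P2.
furin_db_percent = round(100 * 5 / 132, 1)

for number, aa in site.items():
    print(f"{aa}{number}: {labels[number]}")
print(f"{furin_db_percent}% of FurinDB sequences")
```

In this numbering the three basic residues of the site sit at P4, P3 and P1, with the non-basic alanine at P2.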

Potential function of the furin cleavage site in SARS-CoV-2 S protein
From a functional point of view, the insertion of a multibasic protease cleavage site at S1/S2 in SARS-CoV-2 is an important new feature, which may account for its increased virulence. The highly pathogenic avian influenza (AI) viruses are known to have evolved from low-pathogenic AI viruses (Perdue et al., 1997). While low-pathogenic AI viruses contain a single arginine residue, highly pathogenic AI viruses contain multiple basic amino acid (aa) residues at the cleavage site of the surface glycoprotein hemagglutinin (Chen et al., 1998; Ito et al., 2001; Perdue et al., 1997). Incorporation of basic aa at these sites was proposed to have originated by mutation/recombination events in influenza H9 viruses (Lee and Whittaker, 2017), or by polymerase slippage in influenza H5 and possibly H7 viruses (Nao et al., 2017). A site containing a single arginine is cleaved by trypsin-like proteases, whereas multiple basic amino acids are recognized by several cellular proteases including furin (Chen et al., 1998; Nao et al., 2017). Such characteristics may also contribute to understanding the differences in how SARS-CoV-1 (monobasic S1/S2 cleavage site) and SARS-CoV-2 (tribasic S1/S2 cleavage site) infect humans. Here, the 681-PRRA-684 insert (Fig. 1d) may not only confer an advantage in SARS-CoV-2 cell entry, but may consequently facilitate human-to-human transmission and thus the rapid spread of the disease compared to CoVs without a multibasic S1/S2 protease cleavage site. An additional feature that will influence the protease cleavage efficiency at the S1/S2 site of CoVs is the length of the loop containing this site, which is flanked by two conserved beta-strands (Fig. 1d, also Figs. 2 and 4 for structures). In SARS-CoV-2, the loop harboring the S1/S2 cleavage site has a length of 15 aa and is the longest when compared with closely related CoVs, all of which have 11 aa (Fig. 1d).
Recently, a novel bat isolate (RmYN02), which exhibits a nucleotide sequence identity of about 93% with the SARS-CoV-2 genome, was identified (Zhou et al., 2020a). In contrast, the loop containing the S1/S2 protease cleavage site of the RmYN02 S protein is relatively short, containing only 9 aa (Fig. 2H in (Zhou et al., 2020a)) compared to SARS-CoV-2 (15 aa) and closely related CoVs (11 aa) (Fig. 1d).

Conclusion
The novel multibasic S1/S2 protease cleavage site is an important new feature of SARS-CoV-2 and represents an attractive therapeutic target, since viral entry could be reduced or inhibited by blocking the proteolytic cleavage event. Our analyses of molecular dynamics simulations and ab initio modeling showed that the loop containing this cleavage site protrudes from the S protein surface, making it accessible to proteases. The neighboring N-linked glycans might, however, modulate accessibility of the protease cleavage site. The ab initio modeling also indicated that the loop might be moderately structured, forming short helices close to the cleavage site. The impact of the nature, length, structure and dynamics of this loop on protease cleavage efficiency and, ultimately, the overall pathogenicity of CoVs remains, however, an open question that warrants immediate detailed analysis due to the current pandemic crisis.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.