Allosteric Regulation of 3CL Protease of SARS-CoV-2 and SARS-CoV Observed in the Crystal Structure Ensemble


Introduction
During structure-based drug discovery, the three-dimensional structure of a therapeutic target protein is often determined redundantly in the form of a complex with various drug candidates, which reveals their binding modes at the atomic level. [1][2][3][4] Consequently, an abundance of structural information has been accumulated for some target proteins in the Protein Data Bank (PDB) and other databases. [5][6][7][8][9][10] Ligand interactions can alter the structure of a receptor protein in different ways to produce structural variation in the protein. 11,12 When examining protein structures, it becomes clear that ligand binding is not the sole cause of structural variation in crystals: crystal packing and sequence alteration also affect structure. 13,14 Within a set of PDB entries for a protein, different space groups or different crystal packing are frequently found. Crystallographic experiments often use proteins with altered sequences either for mutagenesis experiments or in sample preparation. These factors generate the structural ensemble of crystal structures, which is termed a "crystal structure ensemble." According to the concepts of protein dynamics, conformational selection 15 and linear response theory, 16 any structural changes of a protein, whether they occur naturally or artificially, are reflections of its intrinsic dynamics. Therefore, the crystal structure ensemble can be understood as a sampled subset of the true structural ensemble, although the samples may be limited and biased to some extent because they are not randomly sampled. With respect to the ensemble's reliability, the structural variations in the ensemble are statistically significant observations derived from repeated measurements. For these reasons, the crystal structure ensemble is an important source for the study of protein dynamics. 17
In the present study, we assembled a crystal structure ensemble of 3C-like protease (3CL pro, also known as main protease) from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as well as the highly homologous 3CL pro of SARS-CoV, the etiological agent for SARS in 2002 (Figure 1). 3CL pro is a cysteine protease belonging to the C30 family (coronavirus 3C-like proteases) of the PA clan (serine/cysteine proteases with a chymotrypsin fold) 7; it plays a crucial role in viral replication by cleaving the polyprotein to release functional proteins. [18][19][20] The COVID-19 pandemic caused by SARS-CoV-2 poses an urgent challenge for the development of antiviral agents. [21][22][23] Inhibitors of 3CL pro are potential candidates for antiviral agents. The crystal structures of 3CL pro complexed with drug candidate molecules have been solved intensively and accumulated in the PDB since the first structure was released in February 2020 (PDB:6lu7 24; Supporting data S1 and S2). Here, we also utilize the abundant structure data of 3CL pro from SARS-CoV because of its 96% sequence identity and superimposable structures (Figure 1). The data compiled in this study are of version 10/25/2020 and contain SARS-CoV-2 3CL pro (83 entries; 113 independent chains) and SARS-CoV 3CL pro (101 entries; 145 independent chains) (Supporting data S1 and S2; see Data used in this study in Materials and Methods for details).
3CL pro is known for its highly dynamic nature. The function of 3CL pro is regulated by the structural transition between the active and collapsed states of the catalytic loop (termed the "C-loop" hereafter; also known as the oxyanion binding loop). 25 The activation requires dimerization through the additional C-terminal helical domain (domain III). 26 The ligand binding sites of 3CL pro are located in flexible loops. 7 These dynamical behaviors were investigated in detail based on the crystal structure ensemble. The results for each entry are summarized in Supporting data S1 and S2 for the ligand-bound and ligand-free chains, respectively. First, the overall dynamics are explored to identify the elements of the dynamic structure. Second, the dynamic regulation of the catalytic function in the C-loop is studied. Third, the ligand binding sites and the influence of ligand binding on the protein structure are investigated, and the ligand-induced activation is explained. Finally, the effects of the mutations between the two 3CL pro species on the dimeric structure and catalytic function are examined.

Results and Discussion
Figure 1. Illustrations of SARS-CoV-2 3CL pro (PDB:7bqy, Chain A (green) and Chain B (cyan)) and SARS-CoV 3CL pro (PDB:2hob, Chain A (yellow) and Chain B (orange)), superimposed at the core region of the chymotrypsin fold.
Overall motions in the crystal structure ensemble of 3CL pro
To understand the dynamic regulation of the catalytic function of 3CL pro, the crystal structure ensemble was analyzed according to the following two steps (see Materials and Methods for more details). As the first step, the elements of motion regulating the catalytic function were identified by applying the Motion Tree, 27,28 a hierarchical cluster analysis of the variance of interresidue distances (and therefore superposition-free), to the crystal structure ensemble. In the Motion Tree based on all 258 independent chains, each cluster is defined as a group of residues moving cooperatively, represented by a branch of the Motion Tree; thus, the clusters are termed "moving clusters" (Figure 2(A)). The Motion Tree delineates the dynamics of 3CL pro as a set of moving clusters consisting of four flexible loops and domain III situated on the rigid core of the chymotrypsin fold (Figure 2(B)). The moving clusters are as follows: the "C-loop" (residues 138-143; the catalysis-related loop, also known as the L1 or oxyanion binding loop 29; the target fragment for functional regulation), "E-loop" (residues 166-178; loop starting at E166, L2), "H-loop" (residues 38-64; a mostly helical loop), "Linker" (residues 188-196; a linker to domain III, L3), and "domain III" (residues 200-300). All of these clusters appear to move independently of each other (Table S1). As detailed below, three of the four loops, namely the C-loop, E-loop, and Linker, contain major ligand binding sites. Domain III has been reported to play a crucial role in the dimerization required for the activation 25,26 and is discussed below in relation to mutation effects and the allosteric regulation of catalytic function.
The rigid core consists of the core part of domains I and II (residues 13-17, 33-37, 71-111, 129-137, 154-165, 179-187, and 197-199; chymotrypsin fold) and the dimer interface region (residues 3-12, 18-32, 65-70, 112-128, and 144-153), which includes the N-finger (residues 1-7), an essential structural component for dimerization and activity, 25,[30][31][32] whereas residues 1 and 2 are highly flexible and play an important role in the functional regulation.
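The superposition-free quantity underlying the first step can be illustrated with a minimal sketch. It assumes that the spread of each interresidue distance is measured as a standard deviation over the ensemble and that coordinates are supplied as plain tuples; the function name, the toy coordinates, and the use of the 0.5 Å cutoff on individual residue pairs are illustrative assumptions, and the hierarchical clustering that the Motion Tree 27,28 builds on such values is not reproduced here.

```python
import math
from itertools import combinations


def distance_spread(ensemble):
    """Return {(m, n): standard deviation of the m-n distance over all structures}.

    `ensemble` is a list of structures; each structure is a list of (x, y, z)
    C-alpha coordinates in the same residue order. Only internal distances are
    used, so the structures never need to be superimposed.
    """
    n_res = len(ensemble[0])
    spread = {}
    for m, n in combinations(range(n_res), 2):
        d = [math.dist(s[m], s[n]) for s in ensemble]
        mean = sum(d) / len(d)
        spread[(m, n)] = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return spread


# Hypothetical toy ensemble: residues 0 and 1 keep their distance,
# residue 2 moves by 2 Å between the two structures.
toy = [
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.6, 0.0, 0.0)],
]
spread = distance_spread(toy)

# Assumption: a pair is called "rigid" here if its spread is below 0.5 Å.
rigid_pairs = [pair for pair, sd in spread.items() if sd < 0.5]
```

In this toy case only the pair (0, 1) stays below the cutoff, which mirrors the idea that residues of the rigid core keep constant mutual distances while a moving cluster does not.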
As the second step, the motions of each moving cluster were delineated by calculating the principal components (PCs) of the motions relative to the core region for the four loops and domain III (Figure 2(C)). Domain III was treated in two ways, one as the motion within a protomer (Figure 2(C)) and the other as the motion of the dimer (see Figure S1 and Supporting Text 1 and 2 for details). The motion of domain III within a protomer is a typical domain motion against the core region and is identified as such by DynDom, a domain motion analysis program. 33
C-loop: The switch controlling catalytic activity and the target of regulation
Hydrogen bond between Gly143 and Asn28 regulates the formation of the oxyanion hole: A marker distinguishing active and collapsed states. The conformational change of the C-loop (residues 138-143) between the active and collapsed states switches the catalytic activity of 3CL pro on and off by forming and collapsing the oxyanion hole that stabilizes the catalytic intermediate. 25,34 This activation mechanism of the C-loop conformational change is largely identical to that of zymogen activation in serine proteases, such as chymotrypsinogen (inactive/collapsed) to chymotrypsin (active), 35 although the activation of 3CL pro also requires dimerization. The representative active and collapsed structures are shown in Figure 3(A). The comparison of these structures suggests that the hydrogen bond (HB) between Gly143 and Asn28 is a sensitive signature of the oxyanion hole (hydrogen bonds are defined by LigPlot 36). In the active state, HB 143-28 together with HB 145-28 correctly arranges the main-chain NH groups of Gly143 and Cys145 to form the oxyanion hole. When the C-loop is collapsed or Gly143 moves downward as shown in the right panel of Figure 3(A) (PC1(C) in Figure 2(C) is drawn in the opposite direction, as the activation process), HB 143-28 is broken but HB 145-28 remains intact. 
Consequently, Gly143 is separated from Cys145 to break down the oxyanion hole. As shown below, the formation of HB 143-28 is highly correlated with the C-loop conformation. In contrast, HB 145-28 exists in almost all chains (257 of 258 chains), with the exception of the mutant N28A (PDB:3fzd), 37 as both Cys145 and Asn28 belong to the stable dimer interface region. The importance of Asn28 is confirmed by the mutant N28A, in which the activity is completely lost and the affinity for dimerization is largely impaired. 37 Therefore, in the present study, we define the conformational states of the C-loop by HB 143-28: the active state occurs when HB 143-28 exists, whereas the collapsed state occurs when HB 143-28 is absent. The roles of the two hydrogen bonds, HB 143-28 and HB 145-28, may correspond to those of the salt bridge between Asp194 and Ile16 (existing only in chymotrypsin; Ile16 has become the N terminus due to the autolysis) and the hydrogen bond between Ser195 and Gly43 (existing in both chymotrypsin and chymotrypsinogen). 35 In fact, any hydrogen bond listed in Table 1 (explained below), particularly HB 140A-1B formed along with dimerization, can be assigned to correspond to HB 194-16 in chymotrypsin. Figure 3(A) also shows that the catalytic dyad, Cys145 and His41, maintains the close arrangement in both states. Indeed, the distance between the two residues fluctuates by only a small extent (see the distance histogram in Figure S3) because His41 is located at the less mobile hinge region of the H-loop.
Figure 2. (A) The Motion Tree, 27,28 calculated for the crystal structure ensemble consisting of all 258 chains of SARS-CoV 3CL pro and SARS-CoV-2 3CL pro. The variances D_mn were symmetrized for two chains, or D_mn = D_MN (residues m and n in a protomer correspond to residues M and N in the other protomer, respectively), to treat the homodimer of 3CL pro; thus, two sets of equivalent clusters corresponding to each protomer exist in the tree (see Methods). A taller branch with a high Motion Tree (MT) score, the scale of the abscissa, indicates a cluster moving to a greater extent. The amplitudes of motions follow the order: domain III > C-loop > E-loop ~ H-loop ~ Linker. The clusters over the threshold (a thin broken vertical line; MT score = 0.5 Å) are defined as the moving clusters. Nevertheless, domain III is treated as a single cluster for simplicity. Based on the Motion Tree, we defined five moving clusters (the C-loop, E-loop, H-loop, Linker, and domain III). The N-finger (residues 3-7) is shown to be a part of the core of the dimer interface region. Highly mobile Ser1 and Gly2 and the C-terminal residues 301-306 are not included in the analysis because their coordinates are often missing. Loop 280-287 belongs to domain III (see discussion in "Influences of T285A mutation . . ."). (B) The moving clusters of SARS-CoV-2 3CL pro (PDB: 6lu7 as a representative). The color scheme for the moving clusters defined by the Motion Tree (A) is as follows: the C-loop, red; E-loop, green; H-loop, magenta; Linker, cyan; and domain III, bright orange. The core regions consist of the core part of domains I and II (white) and the dimer interface region (black). The C-terminal part of the N-finger (residues 3-7; blue) is within the dimer interface region, whereas residues 1 and 2 (yellow) are highly flexible. Notably, each of the N-termini is located near the C-loop of the other protomer. (C) The first principal components (PCs) of the moving clusters within a protomer are illustrated by arrows using the same color scheme used in (B).
Hydrogen bonds between the C-loop and surrounding residues cooperatively regulate catalytic function.
When expanding the focus from the oxyanion hole to the whole C-loop, the conformation is described by various structural characteristics: the continuous principal component PC1 for the C-loop, PC1(C) (Figure 2(C)), and five hydrogen bonds between the C-loop and residues surrounding the C-loop, HB 138-172, HB 139-126, HB 140A-1B (A and B indicate the two protomers), and HB 141-118, along with HB 143-28 (Figure 3(A)). Among these HBs, HB 138-172 and HB 140A-1B are the hydrogen bonds with the moving clusters, i.e., the E-loop and the flexible fragment of the N-finger coupled with dimerization, respectively. These dynamic couplings with the C-loop are discussed below. The formation of these hydrogen bonds in each conformational state of the C-loop is summarized in Table 1; their formation was highly correlated with that of HB 143-28 (as indicated by the rate of agreement at the bottom of Table 1), except for HB 140-1 (explained below). Therefore, they represent almost the same structural information on the C-loop as that of HB 143-28; in other words, the conformational change occurs cooperatively involving these HBs as a structural transition. PC1(C) can be divided into two categories with the threshold value of −1, namely the active state for PC1(C) > −1 and the collapsed state for PC1(C) < −1, for which the threshold value was roughly optimized to give a high correlation. Figure 3(B) shows the distribution of PC1(C), in which the active state has a definite value, whereas the collapsed state shows large variation. Thus, the structural change in the C-loop between the two states can be regarded as an order-disorder transition (see Supporting Text 3 for details).
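The two ways of labeling the C-loop state, and how well they agree, can be sketched as follows. This is a hedged illustration only: it assumes that the "rate of agreement" in Table 1 is the fraction of chains for which two indicators give the same state, the function names are hypothetical, and the five example chains are invented values rather than data from the ensemble.

```python
def state_from_hb(has_hb_143_28):
    """Active if HB 143-28 is present, collapsed otherwise (definition used in the text)."""
    return "active" if has_hb_143_28 else "collapsed"


def state_from_pc1(pc1_c, threshold=-1.0):
    """Active for PC1(C) > -1, collapsed otherwise.

    The text leaves PC1(C) == -1 undefined; this sketch assigns it to "collapsed".
    """
    return "active" if pc1_c > threshold else "collapsed"


def agreement_rate(states_a, states_b):
    """Fraction of chains on which two state assignments agree (assumed definition)."""
    if len(states_a) != len(states_b):
        raise ValueError("both assignments must cover the same chains")
    return sum(a == b for a, b in zip(states_a, states_b)) / len(states_a)


# Hypothetical chains: (HB 143-28 present?, PC1(C) value)
chains = [(True, 0.3), (True, -0.2), (False, -2.5), (False, -0.8), (True, 0.1)]
hb_states = [state_from_hb(hb) for hb, _ in chains]
pc_states = [state_from_pc1(pc) for _, pc in chains]
rate = agreement_rate(hb_states, pc_states)  # four of the five chains agree
```

The fourth chain shows the kind of disagreement that lowers the rate: HB 143-28 is absent, yet PC1(C) lies above the threshold.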
Charged states of the N terminus affect the conformation of the C-loop. A rather poor agreement, 0.81, was found for the interprotomer HB 140-1. It has been reported that HB 140-1 is involved in an important interprotomer interaction coupled with dimerization to stabilize the active state, as well as another interprotomer hydrogen bond between Glu166A and Ser1B that occurs concurrently with HB 140-1 (160 chains have both HB 140-1 and HB 166-1; 21 chains have only HB 140-1; and 5 chains have only HB 166-1). 25,38 Nevertheless, as shown in Table 1, fewer active chains have HB 140-1 than have the other HBs. This can be explained by the uncharged NH group of Ser1 produced by some amino acids appended to Ser1 39 (data are summarized in Supporting data S1 and S2). The uncharged N terminus weakens the interaction with Phe140 O to destabilize the polar contact (83% (=54/65) of chains with the uncharged N terminus do not have HB 140-1) and mimics the state prior to the cleavage of the polyprotein at the N terminus. When only the chains with the innate charged N terminus are counted in the statistics, the agreement with HB 143-28 increases to 0.95, i.e., the same level as the other indices.
Table 1. Statistics of C-loop conformation.
Absence of HB 140-1 has different effects in ligand-free and ligand-bound chains: ligand-induced activation. It is still necessary to clarify why many chains in the active state do not need stabilization via HB 140-1 (Table 1). This situation differs between ligand-free chains and ligand-bound chains. For ligand-free chains, 10 of 13 active/ligand-free chains without HB 140-1 lose two more hydrogen bonds of the three HBs, 138-172, 139-126, and 141-118, on average (Supporting data S2). Thus, the absence of HB 140-1 destabilizes the C-loop to produce a marginally active state. The active/ligand-bound chains without HB 140-1 show a completely different feature, i.e., the other four HBs remain intact and PC1(C) > −1 (Supporting data S1). This can be explained by ligand-induced activation 40,41; the ligand interactions with the C-loop always contain interactions between a moiety mimicking the main-chain carbonyl group of the P1 site and the main-chain NH groups of Gly143 and Cys145, indicating that Gly143 and Cys145 are in the position of the active state. Table 1 shows that the probability of the chains found in the active state is 0.98 (=166/(166 + 4)) in ligand-bound chains, whereas the probability decreases to 0.73 (=64/(64 + 24)) in ligand-free chains. Stabilization by ligand molecules is also confirmed by the collapsed/ligand-bound chains; they are in the collapsed state because they do not have any ligand interaction at the C-loop (Supporting data S1; see the discussion on ligand binding below).
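The probabilities quoted above follow directly from the chain counts given in the text, as the short check below shows. Only the counts themselves come from the source; the variable names are illustrative.

```python
# Chain counts quoted in the text (Table 1): active and collapsed chains.
bound_active, bound_collapsed = 166, 4
free_active, free_collapsed = 64, 24

p_active_bound = bound_active / (bound_active + bound_collapsed)
p_active_free = free_active / (free_active + free_collapsed)

# Chains with an uncharged N terminus that lack HB 140-1 (54 of 65).
frac_uncharged_without_hb = 54 / 65

# The four groups together account for every chain in the ensemble.
total_chains = bound_active + bound_collapsed + free_active + free_collapsed

summary = {
    "P(active | ligand-bound)": round(p_active_bound, 2),
    "P(active | ligand-free)": round(p_active_free, 2),
    "uncharged N terminus without HB 140-1": round(frac_uncharged_without_hb, 2),
}
```

The rounded values reproduce the 0.98, 0.73, and 83% stated in the text, and the four counts sum to the 258 chains of the ensemble.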
Ligand binding and its influence on the structure leading to activation
Ligand binding sites are in the moving clusters. The variety of ligand binding is another subject of the crystal structure ensemble; the dataset contains 167 chains complexed with 92 different ligands (Supporting data S1). Ligand binding was analyzed in terms of the ligand binding sites defined by LigPlot. 36 As shown in Supporting data S3, binding is characterized by whether each residue has polar/nonpolar/covalent interactions with the ligand; 25 residues in 3CL pro are identified as the major binding sites shared by 29-150 chains of the 167 ligand-bound chains. The major binding sites consist of hydrogen bonding residues known as the substrate binding subsites (S1-S6) 25,34 and their neighboring residues forming the nonpolar contacts. Another 16 minor sites (see the caption of Supporting data S3) were found adjacent to the major binding sites and contained only 51 ligand-residue contacts in total. Notably, the major binding sites mostly overlap with the moving clusters defined by the Motion Tree, i.e., the C-loop, E-loop, Linker, and H-loop, and these binding sites are named as the following five "binding clusters": the dimer interface region, H-loop, C-loop, E-loop, and Linker with some minor changes to the assignment (Supporting data S3). This observation indicates that the ligand binding sites, composed of the four moving loops, have a highly dynamic character that contrasts with a druggable rigid binding pocket; thus, it may not be easy to achieve high affinity without a covalent interaction at Cys145 (139 of the 167 ligand-bound chains have covalently bound ligands; Table 2A and Supporting data S3).
Peptide substrates and nonpeptide compounds use essentially the same polar binding sites.
To illustrate the structural details of ligand binding, we classified the ligands into three types: the peptide substrate, peptide-mimic compound, and nonpeptide compound (see the caption of Table 2A for the definitions). As shown in Table 2A, the peptide/peptide-mimic compounds have more binding sites than are found in nonpeptide compounds, which is due to the size difference (average molecular weight: 578 for peptide/peptide-mimic compounds; 322 for nonpeptide compounds). The representative structures of the complexes are shown in Figure S5. The peptide substrates are recognized by hydrogen bonds with the subsites: S1:C-loop (F140, G143, S144, and C145); E-loop (H163 and H164); H-loop (H41); S3:E-loop (E166); S4:Linker (Q189); and S6:Linker (Q192) (Figure S5(A)). The peptide-mimic compounds are also bound at the peptide moieties by the subsites, although the hydrophobic side-chains do not necessarily have a definite orientation (Figure S5(B)). In contrast, the nonpeptide compounds show a large variety of binding poses (Figure S5(C)). However, when attention is focused on the polar interactions, the subsites correctly make hydrogen bonds with the polar atoms of the nonpeptide compounds (Figure S5(D)). Consequently, the variation in binding sites is strictly limited to a small set of the subsites and their adjacent nonpolar residues.
Table S1). However, when the conformations of the E-loop and Linker are classified in terms of the binding cluster (Supporting data S3), a weak but definite binding pose dependence of the conformation is observed. Here, the binding poses are classified as "EL" (binding both the E-loop and Linker), "E" (binding only the E-loop), "w" (weak; binding neither the E-loop nor Linker), and ligand-free (see the caption of Supporting data S3 for the definition). A one-dimensional histogram drawn along the collective variable, PC1(E)-PC1(L), more clearly shows the binding pose dependence (Figure 4(B)). 
In ascending order of PC1(E)-PC1(L), the ligand-free, "w," "E," and "EL" poses appear in the histogram. The histograms of the ligand-free and "w" poses almost overlap because the ligand binding of the "w" pose does not have an interaction with either the E-loop or Linker. However, the four monomeric crystal structures are situated in the histogram at the smallest extreme because the largely skewed position of domain III makes PC1(L) have largely negative values (Supporting data S2). These data indicate that the value of PC1(E)-PC1(L) increases when the binding sites expand more to the E-loop and Linker. The representative structures of these groups show that when the collective variable increases, the E-loop shifts downward to accommodate larger ligands (Figure 4(D)). At the same time, the N-terminal (upper) part of the Linker shifts inward to make more interactions with the ligand, whereas the C-terminal (lower) part that does not participate in binding moves outward. The collective variable PC1(E)-PC1(L) correctly represents these motions in one dimension. As shown in Figure 4(C), the motion along the collective variable of PC1(E) + PC1(L), perpendicular to PC1(E)-PC1(L), is an opening/closing motion of the two loops that exhibits almost no binding pose dependence. This suggests that PC1(E) + PC1(L) represents the intrinsic fluctuations. The representative structures are shown in Figure S7. The majority of the nonpeptide compounds have the binding pose "w" because of their small size (Table 2B).
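The two collective variables are simply the difference and the sum of the loop coordinates, which can be sketched as below. The PC1 values are invented so that the toy data follow the ordering reported in the text; they are not taken from the ensemble, and the function and variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def collective_variables(pc1_e, pc1_l):
    """Return (PC1(E) - PC1(L), PC1(E) + PC1(L)) for one chain."""
    return pc1_e - pc1_l, pc1_e + pc1_l


# Hypothetical chains: (binding pose, PC1(E), PC1(L))
chains = [
    ("free", -1.0, 1.0),
    ("w", -0.8, 0.9),
    ("E", 0.5, 0.2),
    ("EL", 1.5, -1.0),
    ("EL", 1.2, -0.7),
]

diff_by_pose = defaultdict(list)
for pose, pc1_e, pc1_l in chains:
    diff, _total = collective_variables(pc1_e, pc1_l)
    diff_by_pose[pose].append(diff)

# Mean of PC1(E) - PC1(L) for each binding pose, then poses sorted by that mean.
mean_diff = {pose: mean(values) for pose, values in diff_by_pose.items()}
pose_order = sorted(mean_diff, key=mean_diff.get)
```

Sorting the poses by the mean difference gives the sequence ligand-free, "w," "E," "EL" for these toy values, which is the ordering described for Figure 4(B).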

Downward motion of the E-loop caused by ligand binding stabilizes the active state of the C-loop. We also investigated the influence of the E-loop on the C-loop, or on catalytic activity. As shown in Table 1, one of the hydrogen bonds stabilizing the active state of the C-loop, HB 138-172, represents the direct interaction between the C-loop (Gly138) and E-loop (His172). As explained above, the E-loop makes downward motions depending on the size of the bound ligand (the larger the ligand becomes, the more the E-loop shifts downward). Figure 5(A) shows the representative structures with the E-loop situated at the lower position due to ligand binding and with the E-loop at the upper position in the ligand-free chain (PDB: 7brpA and 7bro, respectively); the former structure forms a hydrogen bond between His172 and Gly138, whereas the latter structure does not form this bond. Statistics for the PC1(E) dependence of the hydrogen bond formation are shown in Figure 5(B); a monotonic increase was observed for the probability of formation of HB 138-172 with increasing PC1(E) (the downward motion). Simultaneously, the probability of the C-loop in the active state increases. To avoid the influence of ligand-induced activation, we also calculated the same quantities for ligand-free chains. The difference between the values for all chains and those for the ligand-free chains can be ascribed to the effect of ligand-induced activation, but the increase with PC1(E) is found in both sets of chains. Therefore, the downward motion of the E-loop stabilizes HB 138-172, which then stabilizes the active state of the C-loop. The interrelation among the three features, the bound ligand size, the motion of the E-loop, and the formation of HB 138-172, suggests that the activity of 3CL pro is maximized for large native substrates.
Influences of T285A mutation and allosteric functional regulation through domain III
T285A mutation closes the interface of the domain III dimer.
In the above sections, we did not distinguish between SARS-CoV 3CL pro and SARS-CoV-2 3CL pro in our analysis of the crystal structure ensemble. However, since there are 12 amino acid alterations between the two 3CL pro, the influence of the mutations is discussed in this section. Ten of the twelve mutation sites are mostly located on the surface of the core part of domains I and II (solvent accessibility: ~0.8); therefore, they have no substantial influence on the structure. However, two mutations, T285A and I286L, are located on the interface of the domain III dimer and have substantial effects on structure. Figure 6(A) shows the distributions of the interprotomer Cα distance between Thr(Ala)285A and Thr(Ala)285B; the distance of SARS-CoV-2 3CL pro is much shorter than that of SARS-CoV 3CL pro, as has already been observed in the structure of a triple mutant S284-T285-I286/A of SARS-CoV 3CL pro. 42 In Figure 6(C), the representative configurations of the interface of the domain III dimer are compared; the interprotomer hydrophobic contacts are formed among Ala285A(B), Ala285B(A), and Leu286B(A) in SARS-CoV-2 3CL pro, whereas Thr285 of SARS-CoV 3CL pro is distant from its counterpart. The smaller size of the alanine side-chain relative to that of threonine enables a shorter interprotomer distance. Furthermore, the hydrophobic packing of a pair of the alanine/leucine residues provides greater affinity than a weak hydrogen bond between the two hydroxyl groups of threonine (a hydroxyl group may preferably make a hydrogen bond with water). We now focus on the details of the dynamics that occur along with the change in distance 285-285. As shown in Figure 6(C), Thr(Ala)285 is in a 16-residue-long loop (residues 276-291). However, as illustrated in the Motion Tree (Figure 2(A)), the fragment involving Thr(Ala)285 (residues 280-287) is in a section of the moving cluster that constitutes the rigid core of domain III. 
Thus, this fragment is rigid and does not change the conformation independently of domain III, probably due to its winding shape with intraloop hydrogen bonds. Therefore, the difference in distance is not caused by the internal motion of the loop but rather by the rigid body motion of domain III. The respective distributions of PC2(domain III dimer), a parameter describing the configuration of the domain III dimer (Figure S1), of SARS-CoV 3CL pro and SARS-CoV-2 3CL pro are largely separated, similar to the distributions of distance 285-285 (Figure 6(B)); domain III of SARS-CoV-2 3CL pro has more closed arrangements relative to the open arrangements of SARS-CoV 3CL pro. Indeed, the mode structure of PC2(domain III dimer) agrees well with the direction of motion in Thr(Ala)285 departing from the counterpart of the other protomer (Figure S1). In Figure S8, a clear correlation between PC2(domain III dimer) and distance 285-285 is also shown.
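As a rough illustration of the two quantities discussed here, the sketch below computes an interprotomer Cα distance and a Pearson correlation coefficient. It assumes Cα coordinates are available as a mapping from residue number to (x, y, z); the function names, the coordinates, and the paired values are hypothetical, and the source does not state which correlation measure underlies Figure S8.

```python
import math


def ca_distance(chain_a, chain_b, resid=285):
    """Distance (Å) between the C-alpha atoms of residue `resid` in two protomers.

    Each chain is a dict mapping residue number -> (x, y, z).
    """
    return math.dist(chain_a[resid], chain_b[resid])


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical coordinates for residue 285 in protomers A and B.
protomer_a = {285: (0.0, 0.0, 0.0)}
protomer_b = {285: (3.0, 4.0, 0.0)}
d_285 = ca_distance(protomer_a, protomer_b)

# Hypothetical paired values standing in for PC2(domain III dimer) and distance 285-285.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With real data, one distance per dimer would be paired with the PC2(domain III dimer) value of the same dimer before computing the coefficient.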
SARS-CoV-2 3CL pro has a higher activity than SARS-CoV 3CL pro: Experimental evidence. We investigated whether the difference in the configuration of the domain III dimer, observed between SARS-CoV-2 3CL pro and SARS-CoV 3CL pro (Figure 6(A)), affected the conformation of the C-loop or influenced catalytic activity. First, we analyzed the experimental data. Kinetic experimental data 24,43,44 indicate that SARS-CoV-2 3CL pro has a larger catalytic efficiency than SARS-CoV 3CL pro, although the kinetic parameters differ greatly among the three studies: 3-fold greater efficiency 43 or slightly greater efficiency. 24,44 Furthermore, it was reported that the T285A mutant of SARS-CoV 3CL pro had ~1.4-fold higher activity than the wild-type and that the triple mutant S284-T285-I286/A showed ~3.7-fold higher activity. 45 Given that distance 285-285 is reduced to 6.2 Å in the triple mutant S284-T285-I286/A (for the seven mutant dimers), down from the average of 7.7 Å for all SARS-CoV 3CL pro (Figure 6(A)), it is reasonable to conclude that the motions of domain III influence catalytic activity.
Coupling between domain III and the C-loop: The uncharged N terminus. Based on the analysis of the crystal structure ensemble, we assessed the connection between domain III and the C-loop (Figure 7(A and C)). The probability of finding the C-loop in the active state, as well as the probability of HB 140-1 being formed, decreases with increasing distance 285-285, particularly over 8.5 Å (Figure 7(A)), which clearly indicates the allosteric coupling between domain III and the C-loop. The chains with an uncharged N terminus (which is highly unlikely to form HB 140-1) accumulate in the region of distance 285-285 over 8.5 Å (Figure 7(A)). This observation can be understood as a causal relationship in which the absence of HB 140-1 due to the uncharged N terminus induces the opening motion of the domain III dimer.
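The kind of statistic plotted against distance 285-285 can be sketched as a binned probability. The bin edges below are inferred from the intervals named in the text and in the Figure 7 caption (<5.5, 7.5-8.5, 8.5-9.5, and >9.5 Å, six intervals in total), so the two middle edges are an assumption; the distances and flags are invented example values, and the function name is hypothetical.

```python
from bisect import bisect_right

# Assumed bin edges in Å; 6.5 and 7.5 are inferred from the six-interval layout.
EDGES = [5.5, 6.5, 7.5, 8.5, 9.5]
LABELS = ["<5.5", "5.5-6.5", "6.5-7.5", "7.5-8.5", "8.5-9.5", ">9.5"]

# Chain counts per interval quoted in the Figure 7(A) caption.
FIG7A_COUNTS = [59, 57, 32, 78, 13, 15]


def binned_probability(distances, flags):
    """Fraction of chains with flag == True in each distance interval.

    Returns None for an interval that contains no chain.
    """
    counts = [0] * len(LABELS)
    hits = [0] * len(LABELS)
    for d, flag in zip(distances, flags):
        i = bisect_right(EDGES, d)  # index of the interval containing d
        counts[i] += 1
        hits[i] += bool(flag)
    return {
        label: (hits[i] / counts[i] if counts[i] else None)
        for i, label in enumerate(LABELS)
    }


# Hypothetical chains: distance 285-285 (Å) and whether HB 140-1 is present.
distances = [5.0, 5.2, 6.0, 6.2, 8.0, 9.0, 9.9]
has_hb_140_1 = [True, True, True, False, True, False, False]
prob = binned_probability(distances, has_hb_140_1)
```

The same function applies to any per-chain flag, such as the active C-loop state or HB 214-2; the counts from the caption sum to the 254 chains used for Figure 7(A).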
Here, we propose a possible scenario to explain these observations, using the structures illustrated in Figure 7(C) together with a variety of experimental evidence. The uncharged N terminus tends to break HB 140-1, as discussed above; thus, it destabilizes the active state in the C-loop. However, since neither Phe140 nor Ser1 directly interacts with domain III, it is necessary to identify a factor linking domain III and HB 140-1. As a possible factor, we identified an intraprotomer/interdomain hydrogen bond between Asn214 OD1 and Gly2 N (HB 214-2). This unique interdomain interaction forms and breaks in accordance with the position of domain III or distance 285-285; the opening motion along PC2(domain III dimer) separates Asn214 from Gly2 (Figure S1), and distance 214-2 is well correlated with PC2(domain III dimer) (Figure S9(A)). As shown in Figure 7(A), the probability of formation of HB 214-2 decreases with increasing distance 285-285. In contrast, the other interdomain interactions occurring at the hinge region of the domain motion are stably maintained almost independently of the domain III position (Figure S9(B)). Although these stable interactions do not operate as a switch, they contribute significantly to stabilizing the dimer structure; the mutations at the residues illustrated in Figure S9(B) (Arg4, Ser123, Ser139, Glu290, Arg298, and Gln299) impair dimerization and catalysis, 46,47 and the mutations R298A (PDB:2qcy and 3m3t) and S139A (PDB:3f9e) produce the monomeric crystal structures. The connection between HB 140-1 and HB 214-2 is explained as follows. The absence of HB 140-1 allows Ser1 to move freely and to be separated from Phe140. This conformational change accompanies the shift of Gly2 to destabilize HB 214-2 (the probability of formation of HB 214-2 decreases from the unconditional value of 0.59 (=150/254) for all chains to 0.19
Coupling between domain III and the C-loop: The charged N terminus.
The uncharged N terminus is chemically unchangeable and makes the absence of HB 140-1 independent of the other components such as the configuration of the domain III dimer and HB 214-2. Therefore, the role of HB 140-1 in functional regulation is most evidently observed in chains with an uncharged N terminus (Figure 7(A)). Conversely, for the natively charged N terminus, the occurrence of HB 140-1 is changed reversibly under the influence of other components. However, this complicated situation occurs in the native condition and should also be investigated. Furthermore, the ligand-induced activation is another factor obscuring the influence of domain III on the C-loop because the C-loop conformation is determined by interactions with the ligand molecule. Hence, we used the ligand-free chains with charged N-termini to recalculate the quantities shown in Figure 7(A), although the number of chains was significantly reduced from 254 to 62. Figure 7(B) shows the results of the recalculation (the values for >8.5 Å are not shown because the number of chains in this distance range was not sufficient to calculate statistical quantities). The numbers of chains with HB 140-1 and HB 214-2 were shown to decrease with distance 285-285; these values did not differ greatly from those shown in Figure 7(A). However, the ratio of the number of chains in the active state, or chains with HB 143-28, clearly showed a monotonic decrease with distance 285-285 from 1.0 (distance < 5.5 Å) to 0.69 (7.5-8.5 Å); these data contrast with those in Figure 7(A), in which the values are kept close to unity due to ligand-induced activation. Overall, these data clearly demonstrate that the opening motion of the domain III dimer has a destabilizing effect on the active state of the C-loop through the dissociation of HB 214-2 and HB 140-1. Based on the results presented above, we compared the numbers of formation of the hydrogen bonds for SARS-CoV 3CL pro and SARS-CoV-2 3CL pro using 62 ligand-free chains with a natively charged N terminus (Table 3).

Figure 7. (A) Distributions along Cα distance 285-285: the probability of formation of the active C-loop conformation (thick black curve with circles); the probability of finding a chain with the uncharged N terminus due to some appended amino acids (solid brown curve with diamonds); the probability of finding a chain with HB 140-1 (broken blue curve with squares); the probability of finding a chain with HB 214-2 (dotted blue curve with triangles). The total number of chains at each interval of distance 285-285 is, in ascending order, 59, 57, 32, 78, 13, and 15. (B) As in (A), but for ligand-free chains with natively charged N-termini without an appended amino acid. The intervals of 8.5-9.5 and >9.5 are not presented because the number of chains in these intervals is not sufficient to calculate statistics. The total number of chains at each interval of distance 285-285 is, in ascending order, 18, 18, 6, and 16. (C) Two 3CL pro structures that explain the scenario for the accumulation of chains with uncharged N-termini at large distance 285-285, drawn after superposition at the core region. These structures are PDB:6lu7 (SARS-CoV-2): A chain (green) and B chain (cyan); distance 285-285 = 5.314 Å, and PDB:7kfi (SARS-CoV-2) having A chain (salmon) and B chain (yellow) with transparency; distance = 9.858 Å. Structures are superimposed at the core region of 6lu7A and 7kfiB. Only the key parts are illustrated. The entry 6lu7 has a natively charged N terminus, whereas 7kfi has an appended sequence at the N terminus (Gly(-2), Ala(-1), and Met0) drawn as lines. 6lu7 has both HB 140-1 and HB 214-2 (red broken lines).
We found that the closed configuration of the domain III dimer in SARS-CoV-2 3CL pro results in a probability of formation of HB 214-2 that is 0.54 greater than that for SARS-CoV 3CL pro , as well as a 0.25 increase in the probability of formation of HB 140-1 and a 0.14 increase in the probability of occurrence of the active state. Although the influence is reduced by roughly half at each interaction step connecting the four structural elements (the domain III dimer, HB 214-2, HB 140-1, and the C-loop), our analyses suggest that SARS-CoV-2 3CL pro has slightly increased activity over that of SARS-CoV 3CL pro , which is largely consistent with the experimental data. 24,43,44
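The probabilities discussed in this section (the curves of Figure 7 and the unconditional value 0.59 = 150/254) are ratios of chain counts within intervals of distance 285-285. The following is a minimal sketch of such a binned ratio, not the authors' script. The record format `(distance, has_hb)`, the function name, and the toy data are hypothetical; the intervals <5.5, 7.5-8.5, 8.5-9.5, and >9.5 Å appear in the text, whereas the two intermediate 1 Å intervals are assumed.

```python
from bisect import bisect_right

# Interval edges in angstroms. The bins <5.5, 7.5-8.5, 8.5-9.5 and >9.5 are
# named in the text; the 5.5-6.5 and 6.5-7.5 bins are an assumption.
EDGES = [5.5, 6.5, 7.5, 8.5, 9.5]


def hb_probability_by_bin(chains, edges=EDGES):
    """Return one (n_total, n_with_hb, probability) tuple per distance bin.

    `chains` is an iterable of (distance_285_285, has_hb) pairs
    (a hypothetical record format). Empty bins get probability None.
    """
    totals = [0] * (len(edges) + 1)
    hits = [0] * (len(edges) + 1)
    for distance, has_hb in chains:
        i = bisect_right(edges, distance)  # index of the bin holding `distance`
        totals[i] += 1
        hits[i] += bool(has_hb)
    return [(t, h, h / t if t else None) for t, h in zip(totals, hits)]


# Made-up toy records, for illustration only (not data from the paper).
toy = [(5.3, True), (5.4, True), (6.0, True), (6.1, False), (9.9, False)]
result = hb_probability_by_bin(toy)

# Figures quoted in the text: per-interval chain totals of Figure 7(A)
# and the unconditional probability of HB 214-2.
totals_fig7a = [59, 57, 32, 78, 13, 15]
p_all = 150 / 254
```

The per-interval totals of Figure 7(A) sum to 254, the denominator of the unconditional probability, which gives a quick consistency check on the binning.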

Conclusion
The crystal structure ensemble, consisting of 258 independent chains, successfully describes the structural dynamics of SARS-CoV 3CL pro and SARS-CoV-2 3CL pro and elucidates the allosteric regulation of catalytic function. The structural dynamics is characterized by the motion of the four loops (the C-loop, E-loop, H-loop, and Linker) and domain III on the rigid core. Among the four loops, the C-loop undergoes the order (active)-disorder (collapsed) transition, which is regulated cooperatively by the five hydrogen bonds with the surrounding residues. Three of the loops, the C-loop, E-loop, and Linker, constitute the major ligand binding sites with a limited variety of binding residues including the subsites. Ligand recognition at the main-chain NH groups of Gly143 and Cys145 induces the formation of an oxyanion hole-like structure to produce the active conformation of the C-loop (i.e., ligand-induced activation). Ligand binding also causes ligand-size-dependent conformational changes in the E-loop and Linker, which further stabilize the C-loop through HB 138-172. Mutation T285A from SARS-CoV 3CL pro to SARS-CoV-2 3CL pro significantly closes the interface of the domain III dimer and affects the stability of the C-loop conformation allosterically via HB 140-1 and HB 214-2. Because of this allosteric regulation, the closed arrangement of the domain III dimer in SARS-CoV-2 3CL pro increases the stability of the active state of the C-loop and yields a slightly higher activity than that of SARS-CoV 3CL pro .
For comparison with the present results, the crystal structures of MERS-CoV 3CL pro , the 3CL pro of Middle East respiratory syndrome-related coronavirus (2012), which belongs to the same protease family (C30), were analyzed in the same manner as those of SARS-CoV and SARS-CoV-2 3CL pro . The structural properties were found to be essentially the same as those of SARS-CoV and SARS-CoV-2 3CL pro . The details are summarized in Supporting Text 4.

Materials and Methods
Table 3 note. The probability was calculated as the number of chains having the HB (the number in parentheses) divided by the total number of chains listed in the column "All". The column "active" lists the chains having HB 143-28, as in the above definition.

Analysis of the crystal structure ensemble
As a scheme for the analysis of the crystal structure ensemble, the identification of the overall dynamic structure should precede the PCA, and the PCA is then applied separately to the various moving parts of the protein. This scheme avoids applying the PCA to the whole protein molecule, because the PCA tends to produce a mode structure in which the mode vector has non-zero elements at all atoms considered in the analysis, representing correlation in motion extending to the whole molecule, due to the orthonormality condition in the eigenvalue problem. Therefore, it is difficult to describe a localized motion by the PCA of the whole protein molecule. The Motion Tree, which uses the variance of the residue distances, enables us to identify the moving clusters of any size from a domain level to a residue level without any prior knowledge (see below and Figure 2(A)). Each of the moving clusters thus found is then separately subjected to the PCA (Figure 2(C) and Figure S1). Here, it is important that the PCA does not exclude the translation and rotation motions of the moving clusters; the motion should be defined as a relative motion against the core region of the protein via superimposition onto the core region.
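The scheme of superimposing onto the core and then applying the PCA to a single moving cluster can be illustrated with a short sketch. It is written under stated assumptions and is not the authors' implementation: the superposition uses the standard Kabsch least-squares fit, the first structure serves as the reference, and the function names, index arrays, and synthetic ensemble are all hypothetical. Because the fitted transformation comes from the core atoms only, the rigid-body translation and rotation of the cluster relative to the core remain in the analyzed coordinates.

```python
import numpy as np


def kabsch(mobile, target):
    """Least-squares rotation R and translation t so that mobile @ R.T + t ~ target."""
    mc, tc = mobile.mean(axis=0), target.mean(axis=0)
    H = (mobile - mc).T @ (target - tc)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, tc - R @ mc


def cluster_pca(ensemble, core_idx, cluster_idx):
    """PCA of one moving cluster after superimposing every structure on the core.

    ensemble: array (n_structures, n_atoms, 3); the first structure is used
    as the reference (an assumption made for this sketch).
    Returns (variances, modes, projections).
    """
    ref_core = ensemble[0][core_idx]
    rows = []
    for xyz in ensemble:
        R, t = kabsch(xyz[core_idx], ref_core)         # fit on the core only
        rows.append((xyz[cluster_idx] @ R.T + t).ravel())  # move the cluster with it
    X = np.array(rows)
    X -= X.mean(axis=0)
    # SVD of the centered data matrix is equivalent to diagonalizing the covariance.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    return s**2 / (len(X) - 1), Vt, X @ Vt.T


# Synthetic ensemble: a fixed 10-atom "core" and a 5-atom "cluster" displaced
# along one direction, each structure then placed in a random rigid-body frame.
rng = np.random.default_rng(0)
core = rng.normal(size=(10, 3))
cluster = rng.normal(size=(5, 3)) + [10.0, 0.0, 0.0]
shifts = np.linspace(-1.0, 1.0, 8)
structures = []
for s in shifts:
    xyz = np.vstack([core, cluster + [0.0, s, 0.0]])
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # make it a proper rotation
    structures.append(xyz @ Q.T + rng.normal(size=3))
ensemble = np.array(structures)

variances, modes, proj = cluster_pca(ensemble, np.arange(10), np.arange(10, 15))
```

In this synthetic case the cluster moves along a single direction relative to the core, so essentially all of the variance should fall on the first principal component, regardless of how each structure was placed in space.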

Data used in this study
The crystal structure ensemble was constructed for 3CL pro of SARS-CoV-2 and SARS-CoV based on the PDB data as of 10/25/2020. The following entries were not used in the analysis: the entries from the PanDDA analysis (115 entries), the entry with domain swapping (PDB:3iwm), and the entries containing only domain III (PDB: 2k7x, 2liz, and 3ebn). The compiled data contain 83 entries/113 independent chains for SARS-CoV-2 3CL pro and 101 entries/145 independent chains for SARS-CoV 3CL pro . These are listed in Supporting data S1 and S2 for the ligand-bound and ligand-free entries, respectively. The data released after 10/25/2020 until 7/25/2021 are summarized in Supporting data S4-S6, which correspond to Supporting data S1-S3, respectively (SARS-CoV-2 3CL pro : 154 entries and 226 chains; SARS-CoV 3CL pro : 5 entries and 6 chains). The analyses of Supporting data S4-S6 are summarized in Figure S10.

Motion Tree
We developed a method to define the building blocks moving cooperatively, which we achieved through hierarchical clustering of interresidue distances (for pairwise comparisons) or their variances (for the comparison of many entries) and subsequent construction of a dendrogram, namely the Motion Tree. 27,28 The Motion Tree illustrates, in a hierarchical manner, a pair of clusters at each node that moves reciprocally with the amplitude of the tree height of the node named "Motion Tree (MT) score." Because of the straightforward application to the structure ensemble without the need for a structural superposition procedure, a comprehensive understanding of the structural dynamics of various protein molecules can be achieved. 17,[48][49][50] We compared 258 chains in a Motion Tree using the variance-based scheme. The variance of distance fluctuation, $\{D_{mn}\}$, used as a metric for hierarchical clustering, is calculated as $D_{mn} = \langle \Delta d_{mn}^{2} \rangle^{1/2}$, where $d_{mn}$ is the distance between Cα atoms of residues $m$ and $n$, $\Delta d_{mn}$ is the associated deviation from the mean distance, and $\langle \cdots \rangle$ is the average over the structural ensemble. We did not include highly mobile Ser1 and Gly2, as well as C-terminal residues 301-306, in the analysis because these residues are often in the list of missing residues. Since 3CL pro is in the homodimeric form, $D_{mn}$ and the resulting clusters have to be symmetrical upon the exchange of the two protomers. However, asymmetric dimers exist in the crystal structure ensemble; they are considered to be under the influence of crystal packing or in different states of ligand binding. For the purpose of removing these influences, $D_{mn}$ was symmetrized using the duplicated structures of AB and BA, where AB is the original dimer and BA is the dimer with the protomers exchanged. Because of the symmetrization, two equivalent clusters corresponding to each protomer exist in the Motion Tree.
This symmetrizing operation was also applied to the calculation of the PCs for the domain III dimers.
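The distance-fluctuation metric and its symmetrization over the AB and BA dimers can be sketched as follows. This is a minimal illustration under stated assumptions, not the published Motion Tree code: the function names, the array layout (protomer A residues first, then protomer B), and the random toy ensemble are hypothetical, and the hierarchical clustering step that builds the dendrogram is not reproduced here.

```python
import numpy as np


def distance_matrix(ca):
    """Pairwise C-alpha distances for one structure; ca has shape (n_res, 3)."""
    diff = ca[:, None, :] - ca[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


def fluctuation_matrix(dimers, n_protomer):
    """D_mn = <(delta d_mn)^2>^(1/2) over the ensemble, symmetrized over AB and BA.

    dimers: array (n_structures, 2 * n_protomer, 3), with protomer A rows
    first and protomer B rows second (an assumed layout).
    """
    # Index permutation that exchanges the two protomers (AB -> BA).
    swap = np.r_[n_protomer:2 * n_protomer, 0:n_protomer]
    d = np.array([distance_matrix(x) for x in dimers])
    # Duplicate the ensemble with the protomers exchanged.
    d = np.concatenate([d, d[:, swap][:, :, swap]])
    # Population standard deviation = root-mean-square deviation from the mean distance.
    return d.std(axis=0)


# Toy ensemble: 5 random "dimers" with 4 residues per protomer (illustration only).
rng = np.random.default_rng(1)
toy = rng.normal(size=(5, 8, 3))
D = fluctuation_matrix(toy, n_protomer=4)
swap = np.r_[4:8, 0:4]
```

Since interresidue distances are invariant under rigid-body motion, no superposition is needed before computing the matrix. Appending the BA copies makes the result invariant under protomer exchange, so any clustering built on it yields equivalent clusters for the two protomers.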