Allosteric Hotspots in the Main Protease of SARS-CoV-2

Graphical abstract


Introduction
The global pandemic of COVID-19 (coronavirus disease 2019) is caused by SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), 1-4 a member of the coronavirus family of enveloped, single-stranded ribonucleic acid (RNA) viruses that also includes the virus responsible for the severe acute respiratory syndrome (SARS) epidemic of 2003. 5 Since coronaviruses have been known to infect various animal species and share phylogenetic similarity to pathogenic human coronaviruses, the potential of health emergency events had already been noted. 6 However, their high mutation rate, similarly to other RNA viruses 7 and particularly of the SARS-CoV-2 Spike protein, 8 made the development of long lasting drugs challenging. Developing therapeutics against coronaviruses is of renewed interest due to the ongoing global health emergency.
One of the main approaches for targeting coronaviruses is to inhibit the enzymatic activity of their replication machinery. The main protease (M pro ), also known as 3C-like protease (3CL pro ), is the best characterised drug target owing to its crucial role in viral replication. [9][10][11] The M pro is only functional as a homodimer and the central part of the active (or orthosteric) site is composed of a cysteine-histidine catalytic dyad 12 (see Figure 1 (B)) which is responsible for processing the polyproteins translated from the viral RNA. 13 The M pro of SARS-CoV-2 is very similar to that of SARS-CoV: they share 96 % sequence similarity, and exhibit high structural similarity (the root mean square deviation between Ca positions is only 0.53 A). 12 Indeed, many of the residues that are important for catalytic activity, substrate binding and dimerisation are conserved between both proteases. 14 However, mutations in SARS-CoV-2 M pro (for a full list see Table S1) would indeed modulate distant regions via allosteric signalling. Notably, two of those mutations at the dimer interface (Thr285Ala and Ile286Leu, see Figure 1) have been the object of particular interest: on one hand, it has been suggested that they could be responsible for closer dimer packing in SARS-CoV-2; 12 on the other hand, previous mutational studies on those positions have revealed an impact on catalytic activity in SARS-CoV M pro . 15 Currently, the development of inhibitors for the M pro of SARS-CoV-2 12,[16][17][18] focuses on blocking the active sites to disrupt viral replication, 19 similarly to the strategy followed for the design of other inhibitors for coronavirus proteases. [20][21][22] Targeting the active site enables high affinity of the drug molecules, but can also result in offtarget toxicity through binding to proteins with similar active sites. 23,24 Drug resistance is another major concern, especially when the active site may potentially change owing to mutations. Targeting an allosteric site distal from the main binding site provides an alternative strategy by increasing the range and selectivity of drugs that can fine-tune protein activity, yet circumventing some of the aforementioned disadvantages. For reviews and recent successes of allosteric drug design, see Wenthur et al. and Cimermancic et al. 25,26 This approach also holds additional potential for drug repurposing targeting binding at allosteric sites to reduce time and costs for bringing new drugs to the market, 27 an important consideration in the time-sensitive setting of COVID-19. 28 Encouragingly, following the very recent surge of research around the M pro of SARS-CoV-2, some experimental studies have found drugs and small fragments that bind to sites other than the substrate binding site on this protein, and that might have implications for allosteric regulation. 29,30 Preliminary studies have also simu-lated binding events to distant areas of the protein by using docking and molecular dynamics (MD), [31][32][33] or elastic network models (ENMs). 39,40 Moreover, there have already been indications of allosteric processes mediated by the extra domain in the protease of the old SARS-CoV. [41][42][43]15 We focus here on the allostericity of the SARS-CoV-2 main protease, and specifically whether there are potential allosteric sites strongly connected to the active site that may offer alternative ways to inhibit virus reproduction. The identification of allosteric sites in enzymes remains challenging and is still often done serendipitously. Computational prediction of allosteric sites has become an active field of research for drug design (for reviews see 44,45 ) as it promises to help reduce the laborious and time-consuming process of compound screening. Thermodynamic models, notably the classic Monod-Wyman-Changeux 46 and Koshland-Né methy-Filmer 47 models, offer insights into the conformational changes of protein structure upon ligand binding. However, they do not explain the underpinning molecular signal transmission within the protein, which has been argued as a key structural component of allostery, 48 nor the idea of allosteric signals being propagated over bond paths within the structure. 49 Other studies have used MD simulations, which model the dynamics of proteins at the atomic level, to detect communication pathways in the protein structure that can be exploited for allosteric residue and site identification. 50,51 To alleviate the substantial computational resources required by MD simulations, as well as their inability to explore all the required time and length scales, variations of normal mode analysis (NMA) or ENM are widely employed and have achieved moderate accuracy in allosteric site detection when tested on known allosteric proteins. 52-55 Figure 1. Overview of the SARS-CoV-2 main protease dimer. Atomic coordinates are obtained from the PDB file (PDB ID: 6Y2E). (A) shows the full dimer with the active site residues and residues 285/286 on both monomers shown as spheres. The second monomer is shown with increased transparency to visualise where the monomers interact. Colours are according to domain: Domain I residues 10 to 99 -dark green, domain II residues 100 to 182 -dark blue, domain III residues 198 to 303 -orange, loops in light green. (B) Zoom-in of the active site with histidine 41 and cysteine 145 forming a catalytic dyad which is extended to a triad by a water molecule in close proximity.
The toolbox for allosteric site prediction is continuously growing, and new methods range from statistical mechanical models 56,57 to methods based on graph theory. 58 However, most of them overcome the computational requirements of MD at the cost of resolution by looking at coarse-grained representations of protein structures. 59 To overcome some of these limitations, we have recently introduced a suite of methods for the analysis of high-resolution atomistic protein graphs derived from structural data. The methods are computationally efficient and can span across scales in an unsupervised manner. The graphs have atoms as nodes and retain key physicochemical detail through energy-weighted edges obtained from structural information and interatomic potentials of covalent and weak interactions (hydrogen bonds, electrostatics and hydrophobics) which are known to be important in allosteric signalling. [60][61][62] Here we apply two techniques that take full advantage of this detailed atomistic graph: bond-to-bond propensity (B2Bprop) 63 and Markov Transients (MT). 61 Both methods share a common foundation, namely, a Markov process sourced at atoms or bonds of interest diffusing on the atomistic protein graph is used to explore the structure and reveal important protein regions regarding signal propagation. Yet, each method evaluates different and complementary properties.
B2B-prop quantifies how fluctuations at a given set of bonds (the 'source') get redistributed to any other bond in the protein graph and provides a measure (with statistical significance) of instantaneous connectivity as mediated by the graph structure at stationarity. Unlike most network approaches, B2B-prop is formulated on the edges of the graph and thus makes a direct link between energy and flow through bonds, i.e., physico-chemical interactions. 63 B2B-prop is capable of successfully predicting allosteric sites in a wide range of proteins without any a priori knowledge, other than the active site. 63 Of particular relevance to the obligate homodimeric protease studied here, B2B-prop has been subsequently used to show how allostery and cooperativity are intertwined in multimeric enzymes such as the wellstudied aspartate carbamoyltransferase (ATCase). 64 B2B-prop has been benchmarked against two extensive allosteric protein databases, 65,66 and shown to outperform methods that use simple distance cutoff for interactions or coarse-grained descriptions. 52,54,55 MT provides additional information by shedding light on the catalytic aspects of allostery. MT extract pathways implicated in allosteric regulation by analysing the dynamical transients of propagation from the active site as the diffusion progresses on the atomistic graph. 61 Specifically, MT compute a statistical measure that highlight atoms and residues that are reached significantly fast by fluctuations propagating from the source. Crucially, MT analysis takes into account all possible pathways, not just the shortest or optimal paths -an important feature since allosteric communication is known to involve multiple paths across the protein. 67 MT analysis has been successful in identifying allosteric paths in caspase-1, 61 as well as previously unknown allosteric inhibitor binding sites in p90 ribosomal s6 kinase 4 (RSK4) which complemented and helped guide experimental and clinical studies in drug repurposing for lung cancer. 68 Both B2B-prop and MT share a common foundation (namely, the use of diffusive processes on the atomistic protein structure), yet they evaluate different and complementary properties: B2B-prop finds bonds and residues that accumulate a disproportionate amount of fluctuations injected at the 'source' at stationarity, whereas MT highlights atoms and residues that are reached particularly fast by the propagation of fluctuations from the source. Mathematically, bond-to-bond propensity reveals properties of the stationary distribution of fluctuations, whereas Markov Transients reflects properties of pathways for signal propagation by concentrating on the transient approach to stationarity. Therefore, each method reveals different aspects of the underlying allosteric mechanisms: B2B-prop analysis gives insights into the effects of structural connectivity, whereas MT analysis is better suited to capture the time scales and catalytic effects of the enzyme. As M pro is a catalytic protein, we expected both methods to prove valuable in revealing regions of the protein (hotspots) that can affect different aspects of allosteric regulation (i.e., sites and pathways). In both approaches we employ an energy-weighted atomistic protein graph and their low computational demands make them suitable to be applied to large proteins and complexes while retaining physico-chemical and atomistic detail. Both methods have been recently built into a web server 65 that creates atomistic graphs from PDB structures and analyses them using B2B-prop and MT, thus facilitating their application to user-provided biomolecular structures.
In this paper, we apply these two methodologies in the setting of COVID-19. We analysed the SARS-CoV-2 main protease and obtained bondto-bond propensities for all bonds as well as Markov transient half-times t 1=2 for all atoms in the protein. Our results shed light on the allosteric communication patterns in the M pro dimer, highlighting the role of the dimer interface. We use our methods to show how the subtle structural changes between SARS-CoV and SARS-CoV-2 affect the dimer properties.
By applying a rigorous scoring procedure, we identify four statistically significant hotspots on the protein that are strongly connected to the active site and propose that they hold potential for allosteric regulation of the main protease. Aligning our results to hits from the Diamond Light Source XChem fragment screen, 69 we find molecules that could be a first starting point for allosteric drug design. The inhibitory effect of some of these molecules has been proven by mass spectrometry based assays. 29 By providing guidance for allosteric drug design we hope to help drug targeting efforts to combat COVID-19.

Results
Exploiting bond-to-bond propensity to provide insights into the M pro dimer at atomistic resolution We analysed the resolved apo structure of the main protease M pro of SARS-CoV-2 (PDB ID: 6Y2E). 12 Although we concentrate our exposition on 6Y2E, we have also repeated our analysis for another structure (PDB ID: 7JP1) with the same results, see SI Table S7, Figure S1 and S2.
In its active form, the protease forms a homodimer, and each monomer has three domains ( Figure 1(A)). The active site in each monomer forms a catalytic dyad, which is expanded to a triad by the presence of a water molecule 12 (Figure 1(B)).
Our analysis starts with the PDB file, from which we construct an atomistic graph that includes both strong (covalent) bonds and weak interactions (hydrogen bonds, electrostatic and hydrophobic interactions) as well as structural water molecules, which are known to be catalytically important (see Methods and Figure 5).
We then employ B2B-prop and MT to characterise the propagation of perturbations emanating from the source residues across the atomistic protein graph. To quantify these effects, we use quantile regression to score all bonds and atoms, and consequently all residues. This allows us to identify statistically significant 'hotspots', i.e., regions of the protein that are affected more strongly (using B2B-prop) or reached more quickly (using MT) by perturbations emanating from the source (see Methods).
We use these techniques in two ways: firstly, in a forward step, we source perturbations at the two active sites of the dimer and identify hotspots in the rest of the protein, which we mark as putative allosteric sites; secondly, in a reverse step, we source perturbations at the obtained hotspots and analyse the pattern of propagation back to the active sites and to other regions of interest in the dimer (e.g., the dimer interface).
We start with an exploration of the structure of the M pro of SARS-CoV-2 using bond-to-bond propensity analysis. Figure 2 shows the forward step of B2B-prop using the active sites in the homodimer (specifically, the catalytically active residues histidine 41 and cysteine 145 in both monomers) as sources. The propensity of every residue in the protein is regressed against their distance to the active site using quantile regression ( Figure 2(C)), and the resulting quantile score (QS) of every residue is shown on the protein structure using a colour map ( Figure 2(A-B)). Quantile regression allows us to rank all residues in the protein. The list of residues with QSs above 95 % (a total of 40 residues) is given in Table S2.
This initial B2B-prop analysis reveals two main areas of interest: a hot region at the back of the monomer opposite to the active site (see Figure 2 (A)), which is a main focus of our study in the sections below; and a hot region overlapping with part of the dimer interface (see Figure 2(B)).
Protease dimerisation is under influence of mutated residues. Given that M pro is an obligate dimer, the detection of a hot region at the dimer interface points at the importance of cooperative effects 64 and we first explore further some of its features.
The hot region at the dimer interface contains four residues that form salt bridges between the two monomers (serine 1 and arginine 4 from one monomer connect to histidine 172 and glutamine 290 from the other monomer). These bonds have been found experimentally to be essential for dimer formation 70,42 in the original SARS-CoV, an obligate homodimer.
Furthermore, a comparison of the M pro of SARS-CoV-2 and SARS-CoV (see Table S1) shows that there are two mutated residues at the dimer interface: threonine 285 and isoleucine 286 (SARS-CoV protease) mutated to alanine 285 and leucine 286 (SARS-CoV-2 protease). These two residues have been shown to lead to closer dimer packing in the M pro of both SARS-CoV and SARS-CoV-2. 15,12 To further clarify the interactions between the dimer halves ( Figure 1(A)), and how the dimer connectivity is changed in the SARS-CoV-2 protease, we carried out a comparative analysis of the apo structures of the M pro of SARS-CoV-2 (PDB ID: 6Y2E 12 ) and SARS-CoV (PDB ID: 2DUC 71 ). Specifically, we ran bond-to-bond propensity analysis sourced from the two mutated residues (285 and 286) in both structures. Table 1 shows the top 20 residues by B2B-prop QS (sourced from 285/286) for both structures. We find that 16 out of the top 20 residues are at the dimer interface for 6Y2E compared to 8/20 for 2DUC. Hence, although we find strong connectivity towards dimer interface residues in both structures, there is an increased connectivity in the SARS-CoV-2 protease, including to important residues such as serine 1 and arginine 4. This can be attributed to a closer dimer packing due to the two smaller side chains of 285/286 in the new protease. 12 A mutational study showed experimentally that closer dimer packing led to increased activity in SARS-CoV, 15

Identification and scoring of putative allosteric sites
Predicted sites using bond-to-bond propensity. Based on our B2B-prop analysis we detected two main hot regions on the protease, each of which contains a 'hotspot' or site with contiguous high-scoring residues (see Table S2) which could be targetable for allosteric regulation of the protease ( Figure 3): Site 1 (Figure 3(A) framed in yellow) is located at the back of the monomer with respect to the active site and is formed by 9 residues from domain I and II (full list in Table S3).
Site 2 is located at the dimer interface and contains 6 residues (listed in Table S3) which are located on both monomers (Figure 3(B) framed in pink). Two of these residues, glutamine 290 and arginine 4 of the respective second monomer form a salt bridge which is essential for dimerisation. 42 Sites 1 and 2 have a high average residue quantile score of 0.97 and 0.96, respectively, which is much higher than random as quantified by a statistical comparison to 1000 random sites with 10,000 bootstrap resamples (see Methods) which gives a QS of 0.53 (95 % CI: 0.53-0.54) for a random site of the size of Site 1, and a QS of 0.52 (95 % CI: 0.51-0.53) for a site of the size of site 2.
Next, we investigated the connectivity of the putative allosteric sites using B2B-prop in its reverse step, i.e., we carry out a full B2B-prop analysis using as source all the residues within each of the identified sites and rank all residues using quantile regression. The average residue quantile score of the active site for the reverse B2B-prop sourced at site 1 is 0.64, which is above a randomly sampled site score of 0.47 (95 % CI:0.47-0.48) obtained by our statistical bootstrap. Therefore there is a significant bi-directional coupling between the active site and site 1. However, the same score for site 2 is only 0.49, which is only marginally above a randomly sampled site score of 0.48 (95 % CI:0.47-0.48). As site 2 is located at the dimer interface, this finding is in line with the above described suggestion that the allosteric effect is not conferred by direct signalling from the dimer interface towards the catalytic centre but rather indirectly through strengthening of cooperative effects. Nonetheless, site 2 might provide scope for inhibiting the M pro by disrupting dimer formation, and could help elucidate the link between domain III and the catalytic activity of the M pro .   been effective in predicting relevant sites in catalytic enzymes such as caspase-1 61 and RSK4. 68 We performed a full MT study in the forward step (i.e., source at the active site) including quantile regression of Markov half-times against distance to find residues that are reached significantly faster than expected by perturbations emanating from the active site (see Methods and Figure 4). The QSs of all residues are shown in Figure 4(A) (as a colourmap), and the top-scoring residues (30 in total with QS > 0.95) are listed in Table S2.
This MT analysis led us to the detection of two more putative sites in the SARS-CoV-2 M pro , which are located at the back of the monomer relative to the active site, as shown in Figure 4(C): Site 3 (Figure 4(C), framed in turquoise) is located solely in domain II and consists of 10 residues, as listed in Table S4. One of the residues is a cysteine at position 156 which might provide a suitable anchor point for covalent drug design.
Site 4 (Figure 4(C), framed in orange) has 11 residues (list in Table S4)  Following the same thought process as for sites 1 and 2, we investigated the reverse step by sourcing our computations from the residues in each of sites 3 and 4 and scoring the active site to measure the impact of the putative sites on the catalytic centre.
With source at site 3, the active site has an average residue MT quantile score of 0.66, well above a random site score of 0.53 (95 % CI: 0.52-0.53) thus indicating a significant reciprocal link between site 3 and the active site. For site 4 (as for site 2), on the other hand, the average residue MT quantile score of the active site is 0.52, similar to a randomly sampled site score of 0.50 (95 % CI: 0.50-0.51), hence we do not detect a significant connectivity from this site back to the active site. Judging from previous experience in multimeric proteins 64 this might be due to another structural or dynamic factor not yet uncovered between site 4 and the active site.
Summary of predicted sites and evaluation of small fragment binding. We summarise the information of the predicted sites in several tables. The list of residues is presented in Tables 1, which also include the solvent accessible surface area (SASA) as a coarse indicator of targetability. Table 2 presents the average residue quantile scores for the four sites obtained with the forward step of either B2B-prop or MT compared against the statistical bootstrap. The average residue quantile scores of the active site for the reverse step (i.e., source at the identified sites) is presented in Table S5 with the corresponding statistical bootstrap values.
To provide a first indication of the druggability of the identified sites, we used the experimental results from the Diamond Light Source XChem fragment screen. 69 This screen identified 25 small fragments that bind outside of the active site of and 15 of these bind within 4 A of at least one of the four putative allosteric sites predicted here.
Due to the computational efficiency of our methodologies, we were able to conduct a full analysis of all 15 structures using both our methods. Specifically, we used B2B-prop and MT analyses using the small fragments as sources and scored the connectivity to the active site. The results are presented in Table S6. We found that several fragments have high connectivity to the active site according to one or both of our methods. The fragment deposited with PDB dentifier5RE8 might be of particular interest as it has the highest connectivity to the active site taking both methods together. Moreover, one fragment (PDB ID: 5RFA), which is located at the dimer interface, has been found experimentally to act as a destabiliser of dimerisation and an inhibitor of M pro . 29 Fragment 5RFA is only 5.8 A away from site 2 and overlaps spatially with another fragment (PDB ID: 5RGQ) that is less than 4 A away from site 2.
Taken together, this indicates that disrupting dimerisation provides scope for inhibition of the M pro . Furthermore, in the same experimental study, 29 one of the fragments (PDB ID: 5RGJ) binds within 4 A of site 1 and has been shown to inhibit the proteolytic activity of the M pro . Fragment 5RGJ has a relatively high connectivity to the active site (Table S6).
Since the posting of our original preprint, potential allosteric pockets have been published by other research groups. Several have identified binding pockets at the dimer interface, which partially overlap or are in close proximity to our site 2. 31,32,36,38 This is in line with our conclusion above that disrupting dimerisation could inhibit M pro . Shi et al. 42,43 found regulation of SARS-CoV M pro by its extra domain, and cryptic sites involving the extra domain have been proposed for SARS-CoV-2 M pro . 31,[34][35][36]39 In this regard, B2B-prop highlights five residues, Asp229, Phe230, Tyr239, Lys269 and Leu272 (all with QS > 0.95), which are within the extra domain and align with pockets proposed by others. On the other hand, our sites 1, 3 and 4 all lie within domain I, and only Komatsu et al. 34 have suggested a putative site in domain I, although on the opposite side of the sites 1,3,4 found here. Therefore, our finding of sites 1, 3 and 4 suggests that more attention would be required to examine the potential of domain I to modulate M pro activity.

Discussion
During the global pandemic of COVID-19 that started in January 2020, we have seen an increase of research to develop new drugs against the disease-causing virus SARS-CoV-2. A wide range of approaches from chemistry, structural biology and computational modelling have been used to identify potential protease inhibitors. However, most of these initiatives focus on investigating the active site as a drug target, 12,17 high-throughput docking approaches to the active site, 16 or re-purposing approved drugs 72 and protease inhibitors 73 which bind at the active site.
To increase the targetable space of the SARS-CoV-2 main protease and allow a broader approach to inhibitor discovery, we have provided a full computational analysis of the protease structure which gives insights into allosteric signalling and identifies potential putative target sites. Our methodologies are based on concepts from graph theory and the propagation of perturbations on a protein graph. We have previously demonstrated the applications of B2Bprop and MT analyses for the identification of allosteric sites and communication pathways in a range of biological settings 61,63,64,68 and have been benchmarked on extensive databases. 65,66 Applying B2B-prop to the SARS-CoV-2 M pro revealed a hot region of connectivity at the dimer interface. Since dimerisation is known to be essential for the Table 2 Scoring of the 4 identified putative allosteric sites (source at the active site). Included is a statistical score computed from 1,000 randomly sampled sites (with 10,000 bootstrap resamples) to obtain the 95 % confidence interval (CI). proteolytic activity of the SARS-CoV M pro , 15 we carried out a comparative analysis with the SARS-CoV-2 protease. Important for dimer packing and mutated in SARS-CoV-2 are residues 285 and 286. 12 When sourced from these residues, we find a higher proportion of dimer interface residues within the top 20 scoring residues for SARS-CoV-2, confirming a stronger dimer connectivity as described in literature. 12 Therefore targeting sites at the dimer interface might provide scope for inhibitor development. 74 It is also worth remarking that, beyond the study of the dimer interface, we see a similar overall pattern of hot and cold regions in the SARS-CoV M pro . In particular, we find high overlap for the four identified sites ( Figure S3) which gives us confidence that a potential drug effort on allosteric sites would find applications both in COVID-19 as well as SARS.
Using our approaches we identified four allosteric binding sites on the protease: Sites 1 and 2 were identified using B2B-prop (forward step) and hence have a strong connectivity to the active site at stationarity. Using our reverse step from both sites, we found that site 1 displays reciprocal connectivity to the active site, whereas site 2 (which sits at the dimer interface) is indirectly connected to the active site.
This suggests that site 1 might be a functional site and any perturbation at site 1 would induce a structural change of the protease thereby impacting the active site directly. Our methods measure connectivity through perturbations. Yet proteins can be both inhibited or activated over allosteric mechanisms, and we can not tell from our results alone whether a perturbation would lead to an up-or down-regulation of activity. However, a fragment near site 1 has been shown experimentally to exhibit some inhibitory effect on the M pro by El-baba et al. 29 Notably, site 2, although not directly coupled to the active site as a functional site, is located at the dimer interface (Figure 3(B)) and provides a deep pocket for targeting and potentially disrupting dimer formation. Targeting site 2 could thus result in a conformational change of the protease and inhibition of dimerisation. 29 We identified two further sites using MT. These sites are reached the fastest by a signal sourced from the active site and are both located at the back of each monomer relative to the active site. Using the reverse step, Site 3 is found to display reciprocally fast propagation back to the active site; hence perturbations at site 3 would thus potentially affect the catalytic activity of M pro . Site 3 (Figure 4(C)) contains a cysteine residue (Cys156) which provides an anchor point for covalently binding inhibitors. 75 Similar to site 2, site 4 does not display a reciprocally fast connection to the active site; hence actions exerted at site 4 could affect other parts of the protein which in turn could lead to an altered activity of M pro .
We also include the analysis of structures containing small fragments bound to the structure of SARS-CoV-2 M pro from the Diamond Light Source XChem experimental fragment screen. 69 We analysed 15 PDB structures with fragments that bind in proximity (less than 4 A to the putative sites, and we scored the active site using the fragments as the source. We find that several fragments display strong connectivity to the the active site, as measured by B2B-prop and MT, and some have been investigated in experimental studies 29 where their binding leads to a decrease in protease activity. Moreover, the X-ray screening study by Günther et al. 30 identified an allosteric site which overlaps with our allosteric hotspot 3 around residues 151 to 153. Taken together, these results show the promise of our approach, which might provide a starting point for rational drug design. Our methods provide in depth insights into the global connectivity and have been used to propose putative allosteric sites on the main protease which could be combined with drug repurposing for approved and investigational drugs to target potential allosteric sites. 27,68 Although B2B-prop and MT are based on static structures, they can be readily applied to ensembles of structures obtained from solution nuclear magnetic resonance (NMR) or MD. 63 We hope our methods can help broaden the space of druggable targets in proteins, and aid in the development of effective medications for COVID-19 that can interfere with the main protease of SARS-CoV-2. Additionally, the high mutation rate of SARS-CoV-2 adds a further motivation to study the effect exerted by residue mutations on allostery. The computational efficiency of our methods allows further indepth exploration of mutational analyses, 61,76,77 a direction that we leave for further research.

Methods
Protein Structures. We analysed the X-ray crystal structures of the apo conformations of the SARS-CoV-2 (PDB ID: 6Y2E 12 ) and the SARS-CoV (PDB ID: 2DUC 71 ) main proteases (M pro ). The dimeric structure of the SARS-CoV-2 with PDB ID 6Y2E was constructed using the MakeMultimer.py webserver based on the BIOMT records contained in PDB files. All residues of the M pro proteins that are mutated between the two viruses are listed in Table S1. Both structures contained a water molecule in proximity to the catalytic dyad formed by histidine 41 and cysteine 145. These water molecules were kept while all other solvent molecules were removed. Atom and residue, secondary structural names and numberings are in accordance with the original PDB files. The dimer interface was investigated using the online tool PDBePISA 78 (for a full list of the resulting dimer interface residues see https://doi.org/10.6084/m9.figshare. 12815903). We have checked for the robustness of our results by analysing a second published structure of the SARS-Cov-2 M pro with PDB ID 7JP1. The results of B2B-prop and MT analyses for both wild type structures are highly coincident as shown in Figure S1. No major differences in the spatial patterns of hot and cold regions were observed, especially in relation to the identified hotspots as seen in Figure S2 and Table S7.
Atomistic Graph Construction. Instead of the coarse-grained descriptions typical of most network methods for protein analysis, we use protein data bank (PDB) 79 structure files to derive fully atomistic protein graphs from the threedimensional protein structures. In our graph, the nodes are atoms and the weighted edges represent interactions, both covalent bonds and weak interactions, including hydrophobic, hydrogen bonds and salt bridges (See Figure 5). The edges are weighted with physico-chemical energies obtained from well studied potentials, as discussed below. Details of earlier versions of this approach can be found in Refs. 60,61,63 We summarise briefly the main features below and we note three further improvements in the current version: (i) the stand-alone detection of edges without need of third-party software; (ii) the many-body detection of hydrophobic edges across scales; and (iii) the improved computational efficiency of the code. For further details of the updated atomistic graph construction used in this work see. 65,62 Figure 5 gives an overview of the workflow. We start from atomistic cartesian coordinates of a PDB file. Since X-ray structures do not include hydrogen atoms and NMR structures may not report all of them, we use the software Reduce 80 to add any missing hydrogen atoms. Hydrophobic interactions and hydrogen bonds are identified with a cutoff of 9 A and 0.01 kcal/mol respectively. In addition, hydrogen bonds are also identified based on the angles related to the hybridisation of the donor -acceptor atoms. The edges are weighted by their energies: covalent bond energies from their bond-dissociation energies; 81 hydrogen bonds and salt bridges by the modified Mayo potential; 82,83 hydrophobic interactions by using a hydrophobic potential of mean force. 84 Bond-to-bond propensity.
Bond-to-bond propensity (B2B-prop) analysis was first introduced in Ref. 63 and further discussed in Ref. 64 , hence we only briefly summarise it here. This edge-space measure evaluates the redistribution of perturbations introduced at the source towards every bond in the protein graph at stationarity. The edge-to-edge transfer matrix M was introduced to study non-local edge-coupling and flow redistribution in graphs 85 and an alternative interpretation of M as a Green's function is employed to analyse the atomistic protein graph. The element M ij describes the effect that a perturbation at edge i has on edge j. M is given by Here B is the n Â m incidence matrix for the atomistic protein graph with n nodes and m edges; W = diag(w ij ) is an m Â m diagonal matrix where w ij is the weight of the edge connecting nodes i and j, i.e. the bond energy between those atoms; and L y is the pseudo-inverse of the weighted graph Laplacian matrix L 86 and defines the diffusion dynamics on the energy-weighted graph. 87 To evaluate the effect of perturbations from a group of bonds b 0 (i.e., the source), on bond b of other parts of the protein, we define the bond propensity as: and then calculate the residue propensity of a residue R: Figure 5. Atomistic Graph Construction. We showcase the general procedure here on the main protease of SARS-Cov-2: Atomic coordinates are obtained from the PDB (ID: 6Y2E 12 ) and hydrogens are added by Reduce. 80 Edges are identified and the weights are assigned, as described in the methods section, by taking into account covalent bonds as well as weak interactions: hydrogen bonds, electrostatic interactions and the hydrophobic effect which are colour.ed as indicated. Markov Transients. A complementary, nodebased method, Markov Transients (MT) identifies areas of the protein that are significantly connected to the source, usually a site of interest such as the active site, by computing the speed of signal propagation at the atomistic level. The method was introduced and discussed in detail in Ref. 61 and has successfully identified allosteric hotspots and pathways without a priori knowledge. 61,68 Importantly, it captures all paths that connect the two sites. The contribution of each atom in the communication pathway between the active site and all other sites in a protein or protein complex is measured by the characteristic transient time t 1=2 (or t half ): where t ðiÞ 1=2 is the number of time steps in which the probability of a random walker to be at node i reaches half its stationary value. This provides a measure of the speed by which perturbations originating from the active site diffuse into the rest of the protein by a random walk on the above described atomistic protein graph. To obtain the transient time t 1=2 for each residue, we take the average t 1=2 over all atoms of the respective residue.
Quantile Regression. To determine the significant bonds with high bond-to-bond propensity and atoms with fast transient times t 1=2 at the same geometric distance from the source, we use conditional quantile regression (QR), 88 a robust statistical measure widely used in different areas of science. 89 In contrast to standard least squares regressions, QR provides models for conditional quantile functions which are obtained by solving an optimisation problem (a linear program). This is significant here because it allows us to identify not the "average" atom or bond but those that are outliers from all those found at the same distance from the active site and because we are looking at the tails of highly non-normal distributions.
As the distribution of propensities over distance follows an exponential decay, we use a linear function of the logarithm of propensities when performing QR for B2B-prop, whereas in the case of the t 1=2 of MT, which do not follow a particular parametric dependence on distance, we use cubic splines to retain flexibility. From the estimated quantile regression functions, we then compute the quantile score (QS) for each atom or bond. To obtain residue QSs, we use the minimum distance between each atom of a residue and those of the source. Further details of this approach for B2Bprop analysis can be found in Ref. 63 and for MT analysis in Ref. 76 .
Site scoring with structural bootstrap sampling. To assess the statistical significance of a site of interest, we score the site against 1000 randomly sampled sites of the same size. For this purpose, the average residue QS of the site of interest is calculated. After sampling 1000 random sites on the protein, the average residue QSs of these sites are calculated. By performing a bootstrap with 10,000 resamples with replacement on the random sites average residue QSs, we are able to provide a 95 % confidence interval to assess the statistical significance of the site of interest score in relation to the random site score.
Residues used when scoring the active site (reverse step). To identify the residues that are used as the active site to be scored in the reverse step of both B2B-prop and MT analyses, we proceed as follows. First, we found all structures with non-covalent hits bound in the active site from the XChem fragment screen against the SARS-CoV-2 M pro . 69 There were 22 such structures. These structures were further investigated using PyMOL (v.2.3) 90 to identify residues that have atoms within 4 A of any of the bound fragments. These residues are Thr25, Thr26, His41, Cys44, Thr45, Ser46, Met49, Tyr54, Phe140, Leu141, Asn142, Ser144, Cys145, Met162, His163, His164, Met165, Glu166, Leu167, Pro168, Asp187, Arg188, Gln189, Thr190. This list constitutes the active site as a site of interest in all scoring calculations.
XChem fragment screen hits selection. From the above mentioned XChem fragment screen against the SARS-CoV-2 M pro , 69 25 hits were found at regions other than the active site. The 15 fragments which contain atoms that are within 4 A from any of the putative allosteric site residues we obtained were selected as candidates for further investigation as shown in Table 3.
For each of these fragment-bound structures, we performed bond-to-bond propensity and Markov transient analyses to evaluate the connectivity to the active site. The active site was scored as described above.
Visualisation and Solvent Accessible Surface Area. We use PyMOL (v.2.3) 90 for structure visualisation, and presentation of Markov Transients and bond-to-bond propensity scores on the structure. PyMOL was also used to calculate the residue solvent accessible surface area (SASA) with a rolling probe radius of 1.4 and a sampling density of 2.

Materials & Correspondence
All requests for data and code shall be directed to s.yaliraki@imperial.ac.uk.

DATA AVAILABILITY
Data will be made available on request.

DECLARATION OF COMPETING INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.