Screening of world approved drugs against highly dynamical spike glycoprotein of SARS-CoV-2 using CaverDock and machine learning

Graphical abstract


Introduction
A new coronavirus (SARS-CoV-2) outbreak began in Wuhan in the province of Hubei at the end of 2019. Despite many similarities to the 2002 outbreak of SARS-CoV, the new SARS-CoV-2 outbreak had higher morbidity and mortality. Most infected individuals show no or mild symptoms, but some present general complications such as acute respiratory distress syndrome, pneumonia, and septic shock, potentially leading to the patient's death. [1][2][3][4] Drawing on established knowledge about the original virus, research groups worldwide have focused their efforts on two viral proteins: i) the spike (s)-glycoprotein, with the aim of disrupting its recognition of the membrane-bound angiotensin-converting enzyme 2 (ACE-2); and ii) the main viral protease (Mpro, 3CLpro), 5,6 with the aim of disrupting viral replication by hindering the processing of several polyproteins that are translated from the viral RNA. Another approach for tackling the spread of the new virus builds on work on the original SARS virus, which resulted in the development of a vaccine designed to induce the production of antibodies against the viral sglycoprotein, 7,8 preventing it from recognising and binding to ACE-2. Unfortunately, work on this vaccine was discontinued because it had side effects in animal models that prevented its testing in humans. 9,10 Currently, there are over 300 therapies [11][12][13] in development that are intended to prevent the spread of the virus and end the pandemic (https://covid-19tracker.milkeninstitute.org/). These efforts to create a vaccine or a potent inhibitor that can be used as an a posteriori medical treatment with acceptable side-effects are being undertaken by both private companies and academic institutions. Both viral and host proteins are being targeted. While most efforts are focused on disrupting the viral protease or viral polymerase, the viral genome is also being targeted with the aim of disrupting its replication. In particular, the host enzymes involved in nucleotide synthesis are being studied with the aim of halting the final step in viral genome replication. However, most therapies in development target proteins acting upstream of replication; there are almost 40 preclinical and over 30 clinical trials targeting viral surface proteins including the s-glycoprotein. Several host cell membrane proteins are also being targeted, including CD147 and TMPRSS2 14 and, most importantly, ACE-2. 15,16 When the SARS-CoV-2 enters the body, s-glycoprotein units on the surface of the virus act as "hooks", triggering attachment to a host cell. [17][18][19] The s-glycoprotein is homo-trimer with three domainsthe cytoplasmic tail, the transmembrane region, and most importantly, the ectodomain. 20 The ectodomain is further divided into three areas: the proximal membrane region, the S2 subunit, and at the top, the S1 subunit. The receptor-binding domain is located in the S1 subunit. ACE-2 recognises the S1 subunit, and between 1 and 3 s-glycoprotein monomers can bind to ACE-2 by opening and moving upwards. Before the s-glycoprotein/ACE-2 binding event, the covalent bond between subunits S1 and S2 is primed for cleavage to permit the displacement of the S1 subunit. The viral membrane then fuses with that of the host cell via a series of substantial conformational changes.
Several conformations of the viral s-glycoprotein have been observed by electron microscopy, including both semi-open (PDB ID 6VYB) and closed (PDB ID 6VXX) conformations. 21 The existence of visibly different conformations demonstrates that the viral s-glycoprotein can undergo conformational changes affecting not just its surface but also the gorge within the S1 subunit and the S2 subunit. Because most studies have focused on localised sites such as the active site of the viral Mpro protease or the receptor-binding domain of the s-glycoprotein, we felt that there was a gap in our knowledge about the virus and that several steps along the pathway from infection to propagation remain to be explored. In particular, the long putative tunnel created by the formation of the s-glycoprotein trimer has, to our knowledge, received little study. Studying drug interactions in such long tunnels would be laborious and computationally expensive if using alchemical 22,23 or ligand migration methods. 24,25 However, a long tunnel is a perfect target for study with CaverDock. [26][27][28] CaverDock is an in-house tool that uses Caver, 29 to identify tunnels in protein structures, and an optimised version of the well-established algorithm from AutoDock Vina to calculate possible ligand trajectories along those tunnels and the corresponding binding energies. 30 CaverDock discretises each identified tunnel into a series of discs and models a ligand's passage through the tunnel by constraining one ligand atom to lie within a disc a time, sequentially. The ligand's conformation and binding energies are then calculated using Autodock Vina, with the ligand (aside from the constrained atom) being free to explore the conformational space; the protein is treated as a rigid body. Once the conformation and binding energy have been calculated, the constrained atom is shifted to the next disc and the process is repeated until the ligand has moved through the full length of the tunnel. The tool is continuously maintained and is freely available as both a stand-alone program and a webtool named CaverWeb. 31,32 Since the start of the pandemic, the scientific community has recognized the need for collaboration and sharing of results by pledging to make data publicly available as soon as possible. In this work, we used data from a 10 µs molecular dynamics (MD) simulation of the s-glycoprotein trimer conducted at the D.E. Shaw Institute, 33 from which we extracted the main representative conformations.
We also used the original closed structure of the s-glycoprotein retrieved from the Protein Data Bank, giving a total of six structures to study. 34 Each structure was subjected to virtual screening using every drug in the globally approved drugs subset of the ZINC15 database. 35 This subset contains 4359 unique drugs approved by the US Food and Drug Administration, European Medicines Agency, and other significant authorities.
Binding energies along the s-glycoprotein tunnel were calculated for every drug and all six structures. We then compared the results obtained to identify the best ligands for each tunnel position in each conformation. We also analysed each drug to identify the contacts made with each monomeric unit of the s-glycoprotein trimer. This allowed us to select drugs that were predicted to interact with all three monomers and are thus likely to suppress opening of the S1 subunits and thereby prevent the binding of the s-glycoprotein to ACE-2. Finally, we performed quantitative structure-activity analysis (QSAR) to correlate the binding energies of the drugs with their physicochemical properties and used multivariate statistical methods to select top candidates. We are currently implementing this virtual screening methodology into CaverWeb 31 to allow the community to perform automated calculations against other target proteins using the globally approved drug dataset ( Figure 1).

Construction of the S-glycoprotein ensemble
The cryo-EM structure of the trimeric SARS-CoV-2 spike glycoprotein was obtained from the RCSB Protein Data Bank. 36 The selected structure (PDB ID: 6VXX) corresponds to the closed state of this protein.
To obtain sufficient conformational diversity for our analysis of the s-glycoprotein trimer, we used the results of a 10 µs MD simulation conducted by the D. E. Shaw group, which started from the same cryo-EM structure of s-glycoprotein. This trajectory was clustered using the cpptraj 37 module of AmberTools 16 38 and a distance-based metric defined by the mass-weighted root-mean-square deviation (RMSD) of the backbone atoms of the residues surrounding the gorge of the S1 domain. The RMSD was calculated relative to the starting structure. All residues located within 20 Å of the centreline of the tunnel in the initial s-glycoprotein structure (calculated as described below; 565 in total) were included when calculating this metric. The hierarchical agglomerative clustering algorithm was used with averagelinkage, a minimum distance between clusters (epsilon) cut-off of 2.5, sieve 5, and a minimum of 5 clusters.

Tunnel analysis
Before the tunnel analysis, three residue segments were removed from the MD snapshots (residues 365 to 372, 1333 to 1340, and 2301 to 2308). These segments were loose during the simulation and bind to the mouth of the s-glycoprotein tunnel. The tunnel extending through the s-glycoprotein trimer was characterized using HOLE v2.2.005. 39 The vector for the HOLE calculation was defined by the centre points between the C-alpha atoms of the following residues: LYS 1034 and PRO 986 in all three subunits of the s-glycoprotein structure, and LYS 858, 1826, 2794 and PRO 810, 1778, 2746 in the MD snapshots. A sample rate of 0.9 Å was used, and the end radius was set to 10 Å. We analysed the tunnel radii and cut the segment going through the S1 domain until the first extreme tunnel bottleneck was reached; the distance at which this bottleneck was encountered varied between 60 and 80 Å depending on the structure or snapshot under consideration. The output of the HOLE was converted into the CAVER 3 PDB file format 29 to enable discretization for CaverDock calculations. However, the tunnel predicted by HOLE for the s-gp structure contained disconnections that made it undiscretisable. Therefore, we remodelled this tunnel using CAVER 3.02, starting from C-alpha of Thr A 1009. The probe radius, shell radius, and shell depth were set to 0.7, 20, and 20, respectively. Finally, the selected tunnel parts were discretized into a series of discs using the discretiser tool with default settings. 27

Ligand dataset
The globally approved drug dataset was downloaded from the ZINC database 35 on the 26th of May 2020 in mol2 format. Only the first protonation state of each drug molecule was saved. The SMILES codes for all ligands were collected and stored in CSV files, which were then uploaded to the Mordred 40 web server to obtain the molecular descriptor values needed for the QSAR calculations.

CaverDock calculations
Only the part of the tunnel in the S1 domain was considered in the CaverDock calculations. We discretised the tunnel into a set of discs using the program's default settings. 27 The ligand and receptor files were prepared using MGLtools 1.5.7. 41 The grid box was generated around the relevant part of the tunnel using a script from the CaverDock package. The default drag atom (i.e. the atom closest to the centroid of the molecule) was used. Calculations were run in the inward direction only, in the lower-bound trajectory mode.

Principal Components Analysis
Principal Component Analysis (PCA) 42 was used to facilitate understanding of the data resulting from the CaverDock calculations. The data matrix consisted of 4358 ligands (objects) docked into six different protein states obtained from the CaverDock trajectories. The data for each ligand consisted of its minimum binding energy along the CaverDock trajectory and three percentage values representing the proportion of the trajectory during which the ligand was in contact with one, two, or all three individual units of the s-glycoprotein trimer. The data were autoscaled to unit variance and centred before analysis.

Partial Least Squares Analysis
Partial Least Squares (PLS) analysis 43 was used to explore the relationships between the minimal binding energies of 4358 ligands (objects) docked to six different protein states (dependent variables Y) and 1326 molecular descriptors of individual ligands (independent variables X). 2D and 3D molecular descriptors were calculated using the software tool Mordred 40, which is particularly suitable for our purpose because it can calculate descriptors even for large molecules. PLS reveals the correlation structure among variables X and Y by reweighting variables X with PLS weights and projecting them to a smaller number of new latent variables. Autoscaled and centred data were used in the PLS analysis. The importance of every molecular descriptor in the model was assessed using the variable importance in the projection (VIP) parameter 44 and plots of the PLS variable weights. 44 Internal validation was performed to assess the quality of the developed PLS models 45 by cross-validation and permutation testing. During cross-validation, 43 a portion of the Y data are excluded during model development, and the resulting model is used to predict the missing data. The predictions are then compared to the original data to obtain a Q 2 value. Q 2 provides a more realistic estimate of a model's predictive power than the squared multiple regression coefficient R 2 . In this study, 1/7 of the compounds were deleted during each cross-validation round. During permutation testing, the model was recalculated 999 times by randomly re-ordering the dependent variable y. The statistical package SIMCA-P version 12 (Umetrics, Umeå, Sweden) was used to perform all statistical analyses.

Results and discussion
The cryo-EM structure of the spike glycoprotein We initially analysed the cryogenic electron microscopy (cryo-EM) structure in the closed conformation (PDB ID: 6XVV). This choice was made because our objective was to block the viral infection mechanism by over-stabilizing the closed conformation to suppress the protein's biological activity.
Despite missing some loops on the surface, the cryo-EM structure had a sufficiently high resolution and structural integrity inside the tunnel for virtual screening with CaverDock. Because the goal was to block large conformational changes of the s-glycoprotein trimer, we ranked the best binding drugs based on both their overall binding energies and the extent of their contacts with all three monomeric units. Three distinct clusters of drugs with binding profiles showing clear energy minima were identified, each binding to a different region of the tunnel (Figure 2). The first cluster consisted of drugs binding in the region immediately behind the first bottleneck of the subunit S1 gorge, between 12 Å and 21 Å from the trimer's surface. Since this region is immediately behind the tunnel's second tightest bottleneck, we hypothesise that drugs in this cluster are flexible enough to cross that narrow part of the tunnel and then undergo a conformational change to adopt an optimal binding conformation.
The second and smallest cluster of drugs binds in the middle of the tunnel. Although we consider this group to be a cluster, the binding positions of the drugs at the extremes of the cluster differ by 10 Å: ZINC000004099004 binds 26 Å from the surface, while ZINC000008214470 binds at 36 Å. The final region of the tunnel is also the most populated; 99.5% of the drugs tested in the virtual screen bind most strongly in its deepest third, between 45 Å and 65 Å from the surface. All of the top ten drugs identified in this study ( Figure 2) belong to this final cluster (Electronic Supporting Information -ESI) and have consistently lower binding energies than any drug binding preferentially in the other two regions. In addition, most of the drugs with the lowest binding energies belong to the cluster binding at position 3 (ESI). The profile of the tunnel in this region is narrower than in the other tunnel regions.

The S-glycoprotein dynamical ensemble
The D. E. Shaw research institute studied the dynamical ensemble of the s-glycoprotein by performing a 10 µs MD simulation starting from the closed cryo-EM structure mentioned above (PDB ID: 6VXX). This simulation became stable after 6 µs, as shown by the root-mean-square deviation (RMDS) plot (SI- Figure 1). Due to the s-glycoprotein's high flexibility, the cryo-EM structure lacks several parts of its sequence, causing several segments to be seemingly disconnected from the rest of the structure.
Unfortunately, during the MD simulation, two fragments corresponding to residues 446-454 and 461-468 detached themselves from their correct positions and drifted to different locations within the structure.
These events are responsible for the two spikes seen in the RMSD plots at around 2.1 and 5.2 µs (SI- Figure   2). These unrealistically dynamical fragments, which were originally located on the outer surface of the sglycoprotein, were excluded from all subsequent analyses in this work.
We clustered the MD snapshots based on the RMSD of the gorge residues to obtain diverse but biologically relevant conformations of the s-glycoprotein. The obtained clusters are ranked in terms of their populations. The most populated cluster, s1, dominated almost the entire second half of the trajectory (SI- Figure 1). The mean RMSD of the gorge residues in this cluster was 3.46 ± 0.13 Å, which is close to the average value for the entire simulation (3.66 ± 0.38 Å) (SI- Figure 3). Conversely, the least populated cluster (s7) had RMSD values indicating that it remained close to its starting structure (1.62 ± 0.69 Å). Representative structures of the clusters (SI- Figure 3) and their tunnels (SI- Figures 5 and 6) were also obtained, enabling further analysis (SI- Figure 3).
CaverDock calculations were performed using representative structures of the 5 most populated clusters in the same way as described for the cryo-EM structure ( Figure 2). Each tunnel had a unique profile, but in all cases, the narrowest section was in the deepest region of the tunnel, close to the S2 subunit. The vast majority of the ligands have their lowest binding energies in this region ( Figure 3). The sole exception is the most populated state, s1, for which the majority of the ligands have their lowest binding energies in the middle of the tunnel (Figure 2). The tunnel in this state is slightly wider than in the other states, making it difficult for ligands to form contacts with all three monomers. The tendency for the binding energies of drugs to be lowest immediately before or after a bottleneck was seen for all states.

Principal Component Analysis (PCA)
Multivariate statistical analyses were used to: (i) comprehend the large data sets obtained from the CaverDock calculations, (ii) establish structure-activity relationships, and (iii) select the best potential drug candidates. Two statistically significant models were generated by PCA using the CaverDock results obtained using the set of 4358 ligands and six protein states. The data used in the PCA were the minimum binding energies for each drug along the trajectory and the proportions of the trajectory during which the docked ligand was in contact with one, two, and all three individual subunits of the s-glycoprotein trimer, expressed as percentages.  Inhibitors are shown using all-atom models, coloured by atom type. The first PCA model (PCA-1) used 24 variables: 3 related to the minimum binding energies for each protein state, and 3 quantifying the percentages of the trajectory during which the drug was in contact with 1, 2, and 3 units of the trimeric s-glycoprotein. Ten statistically significant principal components were obtained, collectively explaining 98% of the variation in the data. The second model, PCA-2, was generated using 12 variables representing the energy minima and the percentages of each trajectory during which the drug was in contact with all three monomeric units of the s-glycoprotein trimer for each of the six studied protein states. This model yielded only two principal components that explained 85% and 8% of the variation in the data, respectively. Because it had only two principal components, this model was easier to interpret than the first. The top hits predicted by the two models were very similar, so only the results obtained with the simpler model 2 will be discussed further. By inspecting the distribution of the docked compounds in the 2D space spanned by the first two principal components (Figure 4), the compounds interacting most strongly with all three subunits of the spike protein were identified (ESI).
Such compounds are most likely to modify the conformational behaviour of the s-glycoprotein and thus affect its biological function. The distribution of the 12 variables used to cluster the ligands is shown at the bottom of Figure 4.

Partial Least Squares Analysis (PLS)
A PLS analysis was performed to correlate the minimum binding energies for each ligand from the In this way, the number of variables was reduced from 1326 to 56. A new model generated with these variables, PLS-2, had three principal components, with an R 2 of 0.84 and a Q 2 of 0.84. Validation by permutation testing -scrambling the Y variables while keeping the X-matrix unchanged -indicated that this correlation would be very unlikely to be observed by chance, as expected given the large number of observations on which the model is based.
The observed minimal binding energies were plotted against the corresponding predicted values for the starting structure 6VXX and state s4, for which the worst and best fits were obtained, respectively (SI- Figure 7). VIP values were computed to quantify the relative importance of the chosen molecular descriptors in explaining the differences in the minimum binding energies for all six states (SI- Figure 8).
The most influential variables were FMF (a molecular framework ratio descriptor of the shape of the

Top Ranked Drugs
We obtained a ranking of the best binders from the PCA and selected the top ten for further analysis ( Figure 5). These ligands had consistently low binding energies in all of the studied protein structures and exhibited a high percentage of contacts with all three monomeric units of the sglycoprotein trimer during the CaverDock simulations. Although drugs in clusters S1, S3, and S5 occasionally formed contacts with only one monomer, these cases represented less than 10% of the corresponding trajectory. This ranking reflects our assumption that strong interactions with all three monomers in different states of the trimer will reduce the trimer's capacity for conformational change, which is essential for the biological activity of the spike glycoprotein. We also found that multivariate statistical methods were needed to rank the drugs meaningfully. For example, a simple ranking of the drugs based on their minimum binding energies would not have placed Daclatasvir ( Figure 5) in the top 10 because its binding energies for all six conformations are higher than those of some drugs that were not selected. It was thus clear that interaction with all three monomers was weighted strongly in the ranking of the drugs; for three of the studied protein states, Daclatasvir was observed in contact with two and three subunits of the s-glycoprotein trimer, and in the remaining three states (clusters s1, s3 and s5) it was in contact with two or three subunits for at least 96.4% of the trajectory (SI-Table1).
Among the drugs ranked in the top 10 was a dye for cataract surgery (ZINC000169289767), three drugs currently used as antiviral agents against the hepatitis C virus (ZINC000164760756, ZINC000936069565, ZINC000068204830), an antifungal (ZINC000028639340), a microsomal triglyceride transfer protein inhibitor (ZINC000027990463), a hepatoprotective drug for chronic hepatitis (ZINC000096015174), an agent used to treat squamous cell carcinoma of the head and neck (ZINC000003934128), a vasoconstrictor used to treat migraines (ZINC000003978005), and an agent for treating cerebral and peripheral vascular events that are also used in Alzheimer's studies to inhibit γsecretase (ZINC000003995616).

Figure 5
Top ten inhibitors predicted using CaverDock simulations and machine learning. Drugs are shown in the first column with their names and chemical structures. Binding energies per drug for each protein state (cryoEM 6VXX and MD states S1-S5) are reported in kcal/mol. The bar plots under each binding energy represent the percentage of the corresponding trajectory during which these compounds formed contacts with one monomer (red), two monomers (yellow), and three monomers (green).

Conclusions
Here we describe a computational workflow that was used to perform virtual screening based on It should be noted that the length of the tunnel in the studied s-glycoprotein structures ranges between 57 Å and 77 Å, making it several times longer than typical enzyme tunnels. However, this long tunnel can serve as a good representative of the structural features present in transmembrane proteins.
We used machine learning to identify the most promising drug candidates based on their strength of binding inside the tunnel and their likely ability to prevent the s-glycoprotein trimer from undergoing functionally necessary conformational change. Although we only selected 10 inhibitors here for the sake of brevity, this number could easily be increased. CaverDock is fast enough to analyse an even higher number of snapshots to cover the protein's conformational space more comprehensively or to examine a significantly greater number of ligands. Importantly, this workflow is currently being made available on the CaverWeb tool to enable automated virtual screenings of the ZINC globally approved drugs dataset.
This will enable researchers around the world to perform virtual screening and data analysis in the same way as reported here, in a user-friendly manner. It will also be possible to export the results as comma separated value (CSV) files and/or Pymol sessions to be opened and processed locally by the user. The procedure will be applicable to any protein with an available tertiary structure containing tunnels or channels and should thus find diverse applications in drug design, protein engineering, and metabolic engineering.