Long Time Scale Ensemble Methods in Molecular Dynamics: Ligand–Protein Interactions and Allostery in SARS-CoV-2 Targets

We subject a series of five protein–ligand systems which contain important SARS-CoV-2 targets, 3-chymotrypsin-like protease (3CLPro), papain-like protease, and adenosine ribose phosphatase, to long time scale and adaptive sampling molecular dynamics simulations. By performing ensembles of ten or twelve 10 μs simulations for each system, we accurately and reproducibly determine ligand binding sites, both crystallographically resolved and otherwise, thereby discovering binding sites that can be exploited for drug discovery. We also report robust, ensemble-based observation of conformational changes that occur at the main binding site of 3CLPro due to the presence of another ligand at an allosteric binding site explaining the underlying cascade of events responsible for its inhibitory effect. Using our simulations, we have discovered a novel allosteric mechanism of inhibition for a ligand known to bind only at the substrate binding site. Due to the chaotic nature of molecular dynamics trajectories, regardless of their temporal duration individual trajectories do not allow for accurate or reproducible elucidation of macroscopic expectation values. Unprecedentedly at this time scale, we compare the statistical distribution of protein–ligand contact frequencies for these ten/twelve 10 μs trajectories and find that over 90% of trajectories have significantly different contact frequency distributions. Furthermore, using a direct binding free energy calculation protocol, we determine the ligand binding free energies for each of the identified sites using long time scale simulations. The free energies differ by 0.77 to 7.26 kcal/mol across individual trajectories depending on the binding site and the system. We show that, although this is the standard way such quantities are currently reported at long time scale, individual simulations do not yield reliable free energies. Ensembles of independent trajectories are necessary to overcome the aleatoric uncertainty in order to obtain statistically meaningful and reproducible results. Finally, we compare the application of different free energy methods to these systems and discuss their advantages and disadvantages. Our findings here are generally applicable to all molecular dynamics based applications and not confined to the free energy methods used in this study.

However, the results and implications obtained are very similar for all systems. Therefore, to avoid making the main text very lengthy, we only included a representative figure in the main text (only one system in most cases) displaying the general findings and similar/identical figures for all other systems have been included in this document.

Free Energy Methods: Direct versus ESMACS
When performing "long" simulations, a large number of binding poses are observed and it is hard to identify the most stable one in the absence of available experimental information. In such cases, ESMACS-s is not so useful as resultant ∆G values may vary substantially. As an example, we randomly picked out two different binding poses of 93J sampled within the long trajectories of 3CLPro-93J system at three binding sites (A, B and H1) and performed ESMACS-s calculations using each of them as the starting structures. The differences in ∆G ESM ACS−s values obtained starting from the two different binding poses at each 93J binding site are 11.44 kcal/mol, 4.93 kcal/mol and 6.61 kcal/mol respectively which are quite large and can result in very different rankings. Table S1: Free energies obtained using different protocols. "ESMACS-s" and "ESMACS-l" correspond to free energies obtained using the standard ESMACS protocol (initiated from a chosen conformation) and those using bound conformations extracted from "long" trajectories, respectively. Error bars are included in brackets and denote the standard errors across all replicas that sample a given binding site. All values are in kcal/mol. S-2 Figure S1: Ligand-protein residue contact frequency distribution plots for each "long" replica are shown adjacent to their respective ligand occupancy maps for the 3CLPro-93J system. The ligand-residue contact frequencies correspond to the fraction of frames in which a hydrophobic contact is formed between the ligand and a given protein residue. Occupancy maps of the ligand around the protein represent the isovalue surfaces (wireframe representation) rendered at a fractional occupancy of 0.03 across all frames of the simulation trajectory. In other words, they represent volumes of the simulation box where the ligand is likely to be found with 97% probability, that is in 97% of all trajectory frames.
S-3 Figure S2: Ligand-protein residue contact frequency distribution plots for each "long" replica are shown adjacent to their respective ligand occupancy maps for the 3CLPro-RQN system. The ligand-residue contact frequencies correspond to the fraction of frames in which a hydrophobic contact is formed between the ligand and a given protein residue. Occupancy maps of the ligand around the protein represent the isovalue surfaces (wireframe representation) rendered at a fractional occupancy of 0.03 across all frames of the simulation trajectory. In other words, they represent volumes of the simulation box where the ligand is likely to be found with 97% probability, that is in 97% of all trajectory frames.
S-4 Figure S3: Ligand-protein residue contact frequency distribution plots for each "long" replica are shown adjacent to their respective ligand occupancy maps for the PLPro-GRL system. The ligand-residue contact frequencies correspond to the fraction of frames in which a hydrophobic contact is formed between the ligand and a given protein residue. Occupancy maps of the ligand around the protein represent the isovalue surfaces (wireframe representation) rendered at a fractional occupancy of 0.03 across all frames of the simulation trajectory. In other words, they represent volumes of the simulation box where the ligand is likely to be found with 97% probability, that is in 97% of all trajectory frames.

S-5
Reps 1-6 (top to bottom) Reps 7-12 (top to bottom) Figure S4: Ligand-protein residue contact frequency distribution plots for each "long" replica are shown along with their respective ligand occupancy maps for the 3CLPro-RQN-LZE system. The ligand-residue contact frequencies correspond to the fraction of frames in which a hydrophobic contact is formed between the ligand and a given protein residue.
Occupancy maps of the ligand around the protein represent the isovalue surfaces (wireframe representation) rendered at a fractional occupancy of 0.03 across all frames of the simulation trajectory. In other words, they represent volumes of the simulation box where the ligand is likely to be found with 97% probability, that is in 97% of all trajectory frames. S-6