Automatic and accurate ligand structure determination guided by cryo-electron microscopy maps

Advances in cryo-electron microscopy (cryoEM) and deep-learning guided protein structure prediction have expedited structural studies of protein complexes. However, methods for accurately determining ligand conformations are lacking. In this manuscript, we develop a tool for automatically determining ligand structures guided by medium-resolution cryoEM density. We show this method is robust at predicting ligands in maps as low as 6Å resolution, and is able to correct receptor sidechain errors. Combining this with a measure of placement confidence, and running on all protein/ligand structures in EMDB, we show that 58% of ligands replicate the deposited model, 16% confidently find alternate conformations, 22% have ambiguous density where multiple conformations might be present, and 4% are incorrectly placed. For five cases where our approach finds an alternate conformation with high confidence, high-resolution crystal structures validate our placement. This tool and the resulting analysis should prove critical in using cryoEM to investigate protein-ligand complexes.


INTRODUCTION 24
Recent advancements in both microscope hardware and computational processing have led to 25 cryo-electron microscopy (cryoEM) emerging as a mainstream method for biomolecular 26 structure determination. While in ideal cases cryoEM data approaches atomic resolutions [1,2,27 3], most structures determined by cryoEM are in the 3-5Å resolution range. At these resolutions, 28 model building is time consuming, error prone, and often ambiguous. To assist this process, 29 methods have been developed to automatically build de novo polypeptide chains into EM data 30 [4][5][6][7], and with the advent of AlphaFold 2, high-quality starting models can oftentimes be 31 obtained from sequence information alone [8,9]. While these methods help build protein models 32 into cryoEM density, tools for automatic fitting of small molecule ligands into cryoEM data are configuration, limiting the automation and applicability of these approaches.

49
The protein modeling software Rosetta recently incorporated a new small molecule force field, 50 RosettaGenFF, which accurately models the energetics of arbitrary biomolecules in a manner 51 balanced against Rosetta's protein forcefield [17]. Combining this energy model with a genetic-52 algorithm (GA) optimization method allowing for full receptor sidechain flexibility, 53 GALigandDock, yielded superior performance in ligand docking accuracy compared to other 54 state-of-the-art methods. Here, we leverage the docking power of RosettaGenFF and GA 55 optimization to overcome the challenges of modeling small molecules at near-atomic resolution. 56 We integrated cryoEM density data with the physically realistic force field of RosettaGenFF to An overview of EMERALD is illustrated in Figure 1. GALigandDock places ligands in a protein 65 pocket by iteratively refining a pool of 100 conformations, selecting the best 100 models at each 66 generation using predicted energy. To enable this method to use cryoEM density, two changes 67 were integral: density-guided initial ligand placement and the use of density in model selection at 68 each round. Our initial placement (fully described in Methods) first models density as a pseudo-69 atomic skeleton (Fig. 1b). When generating the initial population of ligands, ligands are placed at 1c) and finally refined in Rosetta to minimize the energy of the models (Fig. 1d). The full 74 protocol generates a structure in 30-120 minutes, depending on the size of the ligand and the 75 cryoEM map.

77
To test RosettaEMERALD, we ran our docking protocol on all ligands with 25 or fewer rotatable 78 torsion angles present in deposited cryoEM structures determined at a minimum of 6Å 79 resolution. This yielded 1053 ligands to be placed. For each model, we ran three independent 80 trajectories, and we analyzed the resulting models using three different criteria: a) agreement of 81 the deposited model to the lowest energy predicted structure; b) density fit and number of 82 protein/ligand hydrogen bonds; and c) convergence of the three trajectories. This last criterion is 83 used to evaluate the confidence in a predicted model.

85
The results of these docking trajectories are summarized in Figure 2. In 58% of the cases, our recapitulated, but the EMERALD model had a worse density fit or fewer hydrogen bonds than 95 the deposited model ("non-match, worse quality", 4%). We found that incorporating EM data in 96 GALigandDock is necessary for recapitulating deposited ligand structures with high success 97 rates (Suppl. Fig. 1). Also, docking success is resilient to changes in overall map resolution

Crystal models confirm alternate conformations for EM data 119
To cross-validate our results -particularly in cases where we found a different solution than the 120 deposited model -we looked for all models with a corresponding high-resolution crystal 121 structure (see Methods). We identified 100 cases where EMERALD converged on a ligand 122 placement and a corresponding high-resolution crystal structure was available. The converged   converges on an ampicillin molecule that is flipped so that its phenyl group is now in a pocket of 160 unassigned density (Fig. 4a, b). While the deposited model places the phenyl group sandwiched 161 between two phenylalanine residues (Fig. 4a), our docked model packs the group near a cluster 162 of hydrophobic residues known to interact with other antibiotics [28] (Fig. 4b). Additionally, 163 nearby arginine, serine, and threonine residues have been suggested to generally coordinate  (Fig. 4c, d). The original structure has the guanidine group making cationic interactions with 176 neighboring phenylalanine, glutamate, and aspartate residues, all shown to be important for drug 177 binding [30-33] (Fig. 4c). The docked model deviates by adopting an orientation for the 178 guanidine group in-plane with the ligand, making more direct interactions with the aspartate and 179 glutamate but losing its cation-π interaction with the phenylalanine residue (Fig. 4d). However,  (Fig. 4g-h), but Rosetta converges on the same model for all 3 replicates, 194 suggesting some level of confidence.

Low-confidence unmatched cases show pseudo-symmetry or weak density 196
While our analysis confidently discovers alternate ligand models, 58% of docked molecules with 197 similar quality to the deposited model have medium or low confidence. We found that small 198 molecules that have pseudo-symmetry or have flexible moieties represent these low-confidence 199 cases because of the challenges they provide from their often noisy and inconclusive density. In 200 some instances, two or more replicates of EMERALD agree on a substructure of the molecule 201 (EMDB: 21844, PDB: 6WLW, dark blue, Fig. 5a, b)[36], but differ in a rotamer of a functional 202 group or a flexible group (light blue, Fig. 5a, b). For other ligands, ambiguous density leads to 203 little agreement among the reference model and low-energy Rosetta models (Fig. 5c, d). The  (Fig. 5d). Altogether, these entries show the difficulty in interpreting cryoEM data at medium to 209 low resolution leading to ambiguous density explanations for a single map, and the limits to 210 automated ligand docking using our protocol.

Cases with worse ligand models show poor initial sampling 212
To learn what improvements could be made to EMERALD in the future, we looked at instances 213 where EMERALD predicts a ligand with worse metrics than the reference model. We found that 214 these cases often had density that is discontinuous or noisy, leading to incorrect skeletonization.  Fig. 4c). Without a complete 217 skeleton, the initial population struggles to find the deposited conformation, placing the head 218 group exposed to solvent (Suppl. Fig. 4b). In this case, if the 2.63Å data is instead truncated at 219 4.0Å resolution, the density becomes more continuous, and the skeleton generated by 220 EMERALD matches the ligand conformation much more closely (Suppl. Fig. 4d)  To demonstrate our protocol's utility in structure determination, we used EMERALD to create a 228 model for linoleic acid bound in a previously undetermined protein structure. Determining this 229 model manually would be an arduous task considering high flexibility of the ligand (Fig. 6a).

230
Despite the difficulty of modeling the suspected ligand, EMERALD predicts a small molecule 231 conformation that fits the density, makes an anchoring electrostatic interaction with a 232 neighboring arginine residue, and introduces little torsional strain throughout the hydrophobic 233 tail (Fig. 6b). This placement is supported by the structure of linoleic acid bound to a related Here, we show a method EMERALD that accurately and automatically models small molecules The method should be generally applicable to most ligands with fewer than 25 rotatable bonds; 246 larger ligands have too large of a search space for this algorithm to effectively sample.

247
Discontinuous or noisy density also proved challenging, though modified map processing to 248 improve density connectivity was shown to rescue at least one of these cases. Our current 249 approach only models a single ligand at a time, which complicates density assignment for Currently, our method requires the modeler to know the identification and approximate binding 255 location of the ligand, a non-trivial task when studying novel protein-ligand complexes. For 256 more utility during model building, our method could expand to recognize potential unmodeled 257 ligand blobs and quickly assess possible ligands to determine identity. As is, however, 258 EMERALD offers an automatic tool for ligand modeling that will prove helpful for the now 259 common scenario of ligand-bound structure determination through cryoEM, and EMERALD 260 will serve as a valuable addition to the toolkit of Rosetta EM modeling methods [5, 42, 43] for 261 model building under one software package.

AVAILABILITY 263
All methods described are available as part of Rosetta, using weekly releases after week X, 2022.