Comparison of Ligand Affinity Ranking Using AutoDock-GPU and MM-GBSA Scores in the D3R Grand Challenge 4

Comparison of Ligand Affinity Ranking Using AutoDock-GPU and MM-GBSA Scores in the D3R Grand Challenge 4. ChemRxiv. Preprint. Molecular docking has been successfully used in computer-aided molecular design projects for the identification of ligand poses within protein binding sites. However, relying on docking scores to rank different ligands with respect to their experimental affinities might not be sufficient. It is believed that the binding scores calculated using molecular mechanics combined with the Poisson-Boltzman surface area (MM-PBSA) or generalized Born surface area (MM-GBSA) can more accurately predict binding affinities. In this perspective, we decided to take part in Stage 2 in the Drug Design Data Resource (D3R) Grand Challenge 4 (GC4) to compare the performance of a quick scoring function, Autodock4, to that of MM-GBSA in predicting the binding affinities of a set of Beta-Amyloid Cleaving Enzyme 1 (BACE-1) ligands. Our results show that re-scoring docking poses using MM-GBSA did not improve the correlation with experimental affinities. We further did a retrospective analysis of the results and found that our MM-GBSA protocol is sensitive to details in the protein-ligand system: i) neutral ligands are more adapted to MM-GBSA calculations than charged ligands, ii) predicted binding affinities depend on the initial conformation of the BACE-1 receptor, iii) Abstract Molecular docking has been successfully used in computer-aided molecular design projects for the identiﬁ-cation of ligand poses within protein binding sites. How-ever, relying on docking scores to rank different ligands with respect to their experimental afﬁnities might not be sufﬁ-cient. It is believed that the binding scores calculated using molecular mechanics combined with the Poisson-Boltzman surface area (MM-PBSA) or generalized Born surface area (MM-GBSA) can more accurately predict binding afﬁnities. In this perspective, we decided to take part in Stage 2 in the Drug Design Data Resource (D3R) Grand Challenge 4 (GC4) to compare the performance of a quick scoring function, Autodock4, to that of MM-GBSA in predicting the binding afﬁnities of a set of β -Amyloid Cleaving Enzyme 1 (BACE-1) ligands. Our results show that re-scoring docking poses using MM-GBSA did not improve the correlation with experimental afﬁnities. We further did a retrospective analysis of the results and found that our MM-GBSA protocol is sensitive to details in the protein-ligand system: i) neutral ligands are more adapted to MM-GBSA calculations than charged ligands, ii) predicted binding afﬁnities depend on the initial conformation of the BACE-1 receptor, iii) protonating the aspartyl dyad of BACE-1 correctly results in more accurate binding pose and afﬁnity predictions.


Introduction
Accurate estimation of protein-ligand interactions and the energetic basis of ligand binding are of great importance for successful structure-based drug discovery projects. Much effort has been devoted to develop computational methods for evaluating the binding of a ligand to a protein of interest and the strength of that binding, including docking and scoring approaches [1], virtual screening [1,2], and physics-based free energy methods [3].
Blind community wide-challenges such as the Drug Design Data Resource (D3R) Grand Challenge [4] and the Statistical Assessment of Modeling of Proteins and Ligands (SAMPL) (https://samplchallenges.github.io/) [5] provide an excellent platform for developers to test and validate their computer-aided drug design methodologies against experimental datasets. Moreover, the unbiased setting of these challenges allows the participants to compare the performance of their workflows and techniques to other computational methods in the field, thus highlighting potential areas of enhancement.
Every year, the D3R Grand Challenge organizers release pharmaceutically relevant datasets of protein-ligand complexes for the evaluation of ligand pose prediction and binding affinity protocols. This year, we decided to bring together our expertise in pose prediction and binding free energy calculations to participate in the D3R Grand Challenge 4 (GC4) to rank a set of ligands based on their predicted affinities towards the β -Amyloid Cleaving Enzyme 1 (BACE-1) involved in Alzheimer s disease [6].
In Stage 1a of D3R GC4, the participants were asked to predict the binding affinities of 154 BACE-1 compounds using any available PDB structure data or binding data. At the end of Stage 1, co-crystal structures of additional 20 ligands were released by the D3R organizers, allowing the participants to use these structures to refine the affinity predictions. In the present work, we describe our participation in Stage 2, where we predicted the binding affinities of the 154 ligands.
Many previous studies have evaluated the ability of molecular mechanics combined with the Poisson-Boltzmann surface area (MM-PBSA) and molecular mechanics combined with generalized Born surface area (MM-GBSA) [7,8] methods to predict ligand binding poses and binding affinities, and compared the accuracy of these predictions to that of docking scores [9][10][11][12][13][14][15][16][17][18]. Kaus et al. [9] have reported that MM-GBSA method performs better than standard scoring functions in ranking binding poses. Similarly, Rastelli et al. [15] have shown that both MM-GBSA and MM-PBSA rescoring methodologies improve the affinity ranking estimated by AutoDock scoring function.
On the other hand, participants in past D3R Grand Challenges have reported poor correlations between their estimated MM-GBSA binding scores and experimental affini-ties [16][17][18]. In fact, MM-GBSA and MM-PBSA are not always accurate methods for drug design projects because the quality of the results appears to depend on the protein-ligand system, the ligand sets, and details used in the method, such as the interior dielectric constant, the continuum-solvation method, the charges, and the entropies [7].
The ambition of this collaborative report was to evaluate only the binding affinity prediction accuracy of MM-GBSA scores relative to AutoDock4 scores. Therefore, starting with the same binding poses generated using AutoDock-GPU, we compared our predicted affinities with the experimental affinities for the two scoring approaches -AutoDock4 scores and MM-GBSA scores. Based on the correlation statistics and performance metrics obtained for both submissions, we found that re-scoring the affinities using MM-GBSA free energy estimates did not improve the correlation with experimental values. During post-analysis, we found that MM-GBSA scores depend on the initial protein conformation, the protonation states of the BACE-1 active site, and the charge of the ligands, and we were able to improve our MM-GBSAbased correlation metrics retrospectively.

AutoDock4 energy score
The AutoDock4 scoring function [19] can be described as a classical (additive) force field supplemented with an implicit solvation model and a ligand entropy term. The force field consists of three terms: a 12-6 Lennard-Jones potential, a Coloumb potential with a distance-dependent dielectric constant, and a 12-10 potential for hydrogen bonds. The implicit solvation model is an additive pairwise potential in which the solvent accessibility of each atom is estimated based on the proximity of surrounding atoms [20]. It is an empirical model that depends both on atom type parameters and the partial charges [19]. It accounts for (de)solvation of both the ligand and the receptor. The ligand entropy term accounts for the loss of conformational freedom of the ligand upon binding, adding a penalty that is linearly dependent on the number of rotatable bonds.
Several approximations are employed in AutoDock, in order to make the calculations suitable for rapidly docking large libraries of compounds. Scores are single point evaluations of the scoring function, thus neglecting the majority of entropic contributions. Bond lengths and bond angles are treated as rigid and non-polar hydrogens are omitted. Rotatable bonds are allowed to rotate freely, neglecting rotamer preferences. Partial charges are based on the empirical method from Gasteiger and Marsili [21], and can be considered "lower quality" in comparison to charge assignment methods based on some form of electronic structure calculation, even via semi-empirical methods such as AM1-BCC Comparison of ligand affinity ranking using AutoDock-GPU and MM-GBSA scores in the D3R Grand Challenge 4 3 [22].

MM-GBSA score calculations
MM-GBSA calculations are a widely used tool for estimating protein-ligand binding free energies [7,15,23,24]. Some propose these as a reasonable compromise between fast and inaccurate methods like scoring functions [25] to computationally rigorous but expensive methods like alchemical free energy calculations [26].
MM-GBSA calculations are usually performed on an set of protein-ligand binding conformations generated using a short MD simulation, which can either be in implicit or explicit solvent. MM-GBSA energy values, which provide an estimate of the free energy of binding (∆ G bind ), are then calculated using end-point estimates given in the equation below: The free energy for each of the complex, receptor and ligand is evaluated using contribution from four different terms where E MM is the molecular mechanical energy in the gasphase consisting of contributions for electrostatic, van der Waals and internal energies, E GB is the polar solvation free energy based on the Generalized-Born implicit solvent model, E SASA is the non-polar solvation term calculated using the solvent accessible surface area (SASA) and T S solute is the product of the absolute temperature T and the solute entropy S solute . The solute entropy term can either be ignored (often done for congeneric series) or approximated. The quasiharmonic approximation [27] and normal mode analysis of the vibration frequencies [28] are most commonly used for estimating the solute entropy term.
Here, MM-GBSA scores are reported in units of kcal/mol and are based on end-point free energy estimates. However, they should not be confused with true binding free energy calculations [26], as they involve several additional approximations. Partly as a result, calculated MM-GBSA values are typically much lower (more negative) than experimental binding free energies. The differences arise from the different approximations used to account for the solvation and configurational entropy of the protein and the ligand. Also, water molecules or cavities in the binding pocket are modeled using bulk (continuum) water, which cannot capture the finer details of water-mediated interactions present in explicit water simulations, and in some cases can produce artifactual water placements. Hence in this work, we will be referring the MM-GBSA values as MM-GBSA scores and not MM-GBSA free energies.

Docking protocol
Docking was performed using AutoDock-GPU, an OpenCL implementation of AutoDock4. The search procedure utilizes three types of ligand motion: translation, rotation (of the entire ligand as a rigid body) and rotation of atoms affected by each rotatable bond. The search algorithm is a genetic algorithm (GA) hybridized with ADADELTA [29], a gradient-based optimizer. In every GA generation, all individuals (i.e. poses) are subjected to 500 ADADELTA iterations. The number of GA runs was 100, and each GA performed 10 million score evaluations, totaling 10 9 score evaluations per docking.
Each ligand was docked to 10 different protein conformations, corresponding to the following PDB IDs (chain): , 4FS4 (A) and 4RCF (A). These structures were the selected representatives of different binding pocket conformations. REDUCE [30] added hydrogen atoms assuming pH 7, while allowing Asn, Gln and His side chains to flip. Then, the standard protocol [31] was used to assign AutoDock4 atom types, partial charges, and determine which bonds are rotatable. After all dockings were performed, the binding pose with the best score, as well as the protein structure it was docked into, constituted the starting structure for the MD simulation and subsequent MM-GBSA calculation.
Macrocycle conformations were explored during the docking by artificially breaking one of the chemical bonds within each macrocycle, generating the corresponding open linear structure which can be modeled as fully flexible during docking. To restore the original bond, and consequently the cyclic structure, a potential is applied to attract the atoms formally bonded to their covalent bond distance. This allows the macrocycle conformation to be fully sampled flexibly during docking.
Nearly all BACE-1 ligands of GC4 share a common substructure that is also found in several PDB structures. This substructure consists of a hydroxyl group attached to a short two-carbon aliphatic chain, followed by an amide. Based on publicly available BACE-1 structures, this substructure has a conserved binding mode, making several hydrogen bonds with the binding pocket. To exploit this information during the docking, we used the ligand in the PDB 4DPF as a template for the position of the common substructure of GC4 ligands. A biasing penalty was applied to three atoms: the hydroxyl oxygen, one of the carbons in the aliphatic chain and the nitrogen in the amide. This penalty increases linearly with the distance from the corresponding atom of the template structure. If this distance is below 1.2Å, the biasing penalty is zero, and the output score corresponds to the original, unaltered AutoDock4.2 scoring function.
To take advantage of further similarities with BACE-1 ligands in the PDB, we used pose filters to discard docked poses that differ from the binding modes observed in 2F3F, 4DPF and 4K8S. These pose filters act on specific chemical motifs that occur both in GC4 ligands and in these reference structures. The chemical motifs were identified manually by visual inspection, and the filtering process was automated using Python scripts and OpenBabel [32,33].
This docking protocol, including the biasing penalty and pose filters, was used to predict the binding poses of the 20 BACE-1 ligands in Stages 1a and 1b of GC4. We describe the performance of variations of this protocol, as well as further details about the methodology in a separate publication [34].

Re-evaluating the binding affinities of the ligands using MM-GBSA scores
We generated a 14ns long MD trajectory for each of the protein-ligand complexes in explicit solvent and then reevaluated the binding scores using end-point MM-GBSA calculations. The MD simulations were performed using the pmemd.cuda module of Amber18 simulation package [35]. We added partial charges to the ligand atoms using the Antechamber program (from Amber 16 package [36]) and AM1-BCC charge model [22]. The simulated system was prepared using tleap (also available in Amber16 package) and used Amberff99sb [37], GAFF version 1.8 [38] and TIP3P water [39] for the protein, ligand and water force field respectively. The protein-ligand complex was placed in a cubic simulation box with 10Å of water surrounding the complex. We next added Na+ and Cl-ions to neutralize the system and to ensure a salt concentration of 0.1M. The protein heavy atom-hydrogen bonds were constrained using SHAKE. The simulation used a time step of 2 fs. Particle mesh Ewald method was used to evaluate long-range electrostatic interactions with 9.0Å cutoff for the real space electrostatics and van der Waals forces.
We first minimized the ligand, water, and the ions for 1000 steps with 25 kcal/mol-Å 2 positional restraints on the protein, followed by another 1000 steps of minimization with the protein restraints reduced to 10 kcal/mol-Å 2 . Next, the system was heated from 10 K to 300 K in NVT ensemble for 140 ps with 10 kcal/mol-Å 2 restraints on the protein-ligand complex. We then successively decreased the restraints on the protein-ligand complex for 20 ps to first 5 and then to 2 kcal/mol-Å 2 , followed by 2 kcal/mol-Å 2 restraint only on the ligand. We used Langevin thermostat with a collision frequency of 2 ps −1 to maintain the temperature of the system. We simulated the system for 14 ns in NPT ensemble for the production run. Isotropic pressure scaling was used to regulate the pressure with a relaxation time of 2 ps. The first four ns was discarded as equilibration.
We saved the positions of the atoms every 100 ps during MD. The final trajectory used for the MM-GBSA calculations consisted of 100 frames that correspond to the last 10 ns of the production trajectory. MM-GBSA scores were calculated using the MMPBSA.py program [40] in Amber16 at a salt concentration of 0.1 nM and using the GBneck2 model [41]. Quasi-harmonic approximation was used to approximate the solute entropy.

Results and Discussions
Our main goal in this work was to see whether re-scoring docked poses with MM-GBSA scores can improve the correlation of predicted binding affinities with experimental values. We used AutoDock4.2 scores and MM-GBSA scores to rank the binding affinities of BACE-1 ligands in Stage 2 in D3R GC4. Our workflow consists of a series of entirely automated steps, starting with a SMILES representation of the ligands up to the MM-GBSA score. For this reason, the results reported herein are representative of automated approaches .   z3uni  urt76  x0qtn  tjny7  rnbjh  jjbys  wia0t  jscxd  jvbjy  3iqgq  ich8p  8dsky  u867k  zcwfq  zanor  utgv6  dxji8  cq7ug  pupet  k250i  ycdh7  6jyjp  5rdda  0cz3i  cqnrg  p280k  yboen  ufr7g  aadjz  zcngp  c6b63  pxhuz  xx4i5  py7r6  dzyxt  i88wa  u7r6y  kzsv5  q6mvt  pngkk  sycgn  n4p8a  p2f22  60tpq  nneeb  hfxsa  b84mj  tf236  b7bv0  xhiji  fgths  0fkx5  mhkzs  The performance of docking and MM-GBSA scores in predicting the affinities of the 154 ligands is assessed by the Kendall's τ and Spearman's ρ rank correlation coefficients. We also report Pearson's r and R 2 . We have compiled all these metrics for our predictions in Table 1. Since all metrics lead to the same conclusions, we base our discussion on Kendall's τ values, which is the first metric reported in the D3R evaluation page.  Fig. 2a, submission cq7ug b Fig. 2b, submission utgv6 Re-scoring docked poses with MM-GBSA did not improve correlation with experimental values. The Kendall's τ between experimental pK d and AutoDock4.2 scores (submission cq7ug) is 0.19 ± 0.06, and that of MM-GBSA scores (submission utgv6) is 0.20 ± 0.06. Thus, the predictive performance of these methods is statistically identical and both of them correlate poorly with the experimental values. Nevertheless, our predictions are statistically better than a random prediction which has an average Kendall tau of zero. In the context of all submissions to Stage 2 of GC4, our predictions ranked in the top third (Fig. 1).
MM-GBSA calculations have more detailed representation of the underlying physics at play than docking scores and literature work from other groups suggested they would be more accurate [9,11,13]. So it was perhaps surprising that there was not any improvement in the affinity estimation with the MM-GBSA rescoring. We also saw that there was no correlation between the AutoDock4 scores and MM-GBSA scores (Kendall's τ equal to -0.06 ± 0.05). These findings prompted us to investigate more about our protocol to check whether any change in the simulation conditions could improve the results.

Ligands with different net charge have different correlation metrics
With the goal of identifying aspects of the MM-GBSA approach that could be improved, we searched for features that are associated with particularly good predictions. Our ligand dataset consisted of both positive and neutral ligands. Previous work by Rastelli et al. [15] has shown that there is a decrease in correlation between predicted and experimental affinities for ligands with different formal charges, which led us to separately analyze ligands with different formal charges (Table 1).
We found that the predicted affinities of ligands modeled in a neutral state (n=18) exhibited better rank correlation (Kendall's τ of 0.44 ± 0.15) with experiment than those for ligands modeled with a +1 charge (Kendall's τ of 0.19 ± 0.07). This suggests that neutral ligands are more amenable to MM-GBSA calculations. Sun et al. [10] have reported ligand binding affinity prediction accuracy degrading with net charge of the ligand. Their Pearson's r degraded from 0.608 ± 0.003 for ligands with net charge zero to 0.564 ± 0.003 for those with net charge one. It is to be noted here, that in our study the sample size for neutral ligands is small (only 18 ligands).

Predictive performance varies with protein conformations
A total of ten protein conformations were considered for docking the ligands. The MM-GBSA calculations were performed using the protein conformation that produced the ligand pose with the best docking score, according to the AutoDock4.2 scoring function. The majority of ligands were simulated in the protein conformations associated with PDBs 4EWO (n=75) and 2WF3 (n=69).
We next decided to investigate whether different protein conformations resulted in different correlation metrics. Table 2 lists the correlation statistics for subsets of ligands docked and simulated in different protein conformations. Fig. 2 plots the predicted MM-GBSA scores against the experimental binding affinities for different protein conformations.
The subset of ligands simulated in 4EWO exhibited poorer rank correlation with experiment (Kendall's τ of 0.15 ± 0.09) for the MM-GBSA scores compared to the other protein structures used -2WF3, 2B8V, 2P4J (Kendall's τ of 0.36 ± 0.07). For AutoDock4 scores, we see similar rank correlation coefficients for different protein conformations. Hence in the succeeding work, we looked into the protein structure 4EWO to see whether we modeled it correctly.

Protonation states affect MM-GBSA scores
We noticed while doing post-analysis of our results that the catalytic aspartyl dyad (Asp32, Asp228) of BACE-1 in the prepared protein structure for 4EWO had both aspartates in the protonated form (i.e., Asp32 H , Asp228 H ). This is a possible source of error, because in the apo state, Asp32 is protonated and Asp228 is de-protonated in the active pH range (3.5-5.5) of BACE-1 [42]. Although the protonation state of the aspartyl dyad changes in the presence of inhibitors [43,44], it is a safer choice to model the aspartyl dyad as Asp32 protonated, and Asp228 de-protonated (Asp32 H , Asp228 − ).
Hence, we decided to recalculate the MM-GBSA scores of the ligands docked and simulated in 4EWO, but with the Asp32 H , Asp228 − protonation state of the 4EWO structure.
The correlation statistics of the 4EWO simulations with the Asp32 H , Asp228 − protonation state are reported in Table 3 and the plot of predicted versus experimental affinities is depicted in Fig. 2c. Also, Fig. S1 shows the distribution of the MM-GBSA scores for the two protonation states. Overall, the MM-GBSA scores were lower for the single protonated aspartyl dyad compared to the double protonated, indicating that the binding pose is more stable for the single protonation state. The Kendall τ improved by about one standard deviation, confirming that the Asp32 H , Asp228 − form is more adequate than modeling both Asp as protonated.
Comparison of ligand affinity ranking using AutoDock-GPU and MM-GBSA scores in the D3R Grand Challenge 4 7 Moreover, we computed the RMSD of the ligands between the initial docked poses and the final states at the end of the MD simulations for the two protonation states of 4EWO: i) Asp32 H , Asp228 H and ii) Asp32 H , Asp228 − . We used Chimera [45] to align the active site of each initial BACE-1-ligand complex to the last frame of the corresponding MD trajectory. The alignment of the active site was done within 5Å of the ligand. Then, we computed the RMSD values with Chimera. Out of 75 ligands, 54 had lower RMSD values when docked and simulated in BACE-1 with the single protonated aspartyl dyad (Asp32 H , Asp228 − ). The RMSD values are reported in the Supplementary Information available on https://github.com/MobleyLab/D3R-2018-AutoDock-MMGBSA/blob/master/RMSD.csv. This result shows that the binding mode of BACE-1 ligands depends on the protonation states of aspartates 32 and 228 and confirms that it is better to model the protonation state of BACE-1 as Asp32 H , Asp228 − for the given ligand dataset. Overall, these findings suggest that in large-scale binding affinity calculations, it is better to use the protonation state of the apo form of the protein.

Simulating in 2WF3 instead of 4EWO improves calculated binding affinities
Since the correlation between predicted and experimental affinities was better for the subset of ligands modeled in the structure 2WF3 than in 4EWO (Table 2), we recalculated the docking and MM-GBSA scores for the ligands modeled in 4EWO using the 2WF3 structure instead. Then, the correlation coefficients on the entire set were computed using the updated scores ( Table 3). The performance of MM-GBSA improved, displaying a Kendall τ equal to 0.30 ± 0.05, while the performance of docking remained constant (Kendall τ equal to 0.21 ± 0.06). This shows that MM-GBSA is potentially better than docking, in agreement with the more accurate physical description, but the results are sensitive to modeling choices, such as protein conformation and protonation state, making it difficult to achieve optimal predictive performance in a prospective context.
The largest difference between 2WF3 and 4EWO is in the 'flap' region (Fig. 3), which interacts extensively with BACE-1 ligands. Interestingly, docking scores are generally lower when docking to 4EWO (Fig. 2a), while MM-GBSA scores are lower when simulated in 2WF3 (Fig. 2b and S2). The exact reasons behind this opposite trend are unknown, but we speculate that none of the 10 protein conformations used for docking was "good enough" to accommodate the 75 ligands that displayed lower (i.e. better) docking scores in 4EWO. Arguably, the docking score was better in 4EWO because the flap is slightly more open, thereby accommodating ligands that could not fit as well in 2WF3. On the other hand, we argue that 2WF3 is more representative of the actual protein-ligand complex, resulting in lower (i.e. better) MM-GBSA scores on average.
Overall, these arguments highlight the sensitivity of docking to receptor conformation, and motivate the inclusion of flexibility into the receptor [46], not only for the improvement of docking methodology by itself, but also for providing better docked poses for free energy calculations. In this study, we looked at two different parameters for the MM-GBSA calculations, namely protonation state and protein conformation during our retrospective analysis. Both of these parameters turned out to have non-trivial impact on the calculated binding affinities. However, they are many more MM-GBSA calculation parameters which can possibly affect the binding affinities which were not explored in the current work.
We performed the MM-GBSA calculations using the GB-neck2 model in this work, which has not, to our knowledge, been benchmarked against older GBSA models (GB-HCT, GB-OBC1 and GB-OBC2 [47,48]) for affinity prediction. Protein targets are sensitive to GBSA models [10], hence using older GBSA models might improve the binding affinities. Another option is to perform MM-PBSA calculations instead, which are physically more accurate and computationally more expensive.
We used the popular 'single trajectory protocol' [49] in this work, which involves simulating only the protein-ligand complex and re-purposing the bound conformations of the protein and ligand for the unbound calculations. It is assumed that the protein and the ligand sample similar conformations in the bound and unbound states which might not be always valid. We do not expect to see a big difference in predicted affinities when using the alternative 'multiple  Fig. 2d trajectory protocol'. However, it may be worth trying this in the future for the macrocycles in the current dataset since the macrocycles do not have much flexibility in the binding pocket due to their size, and might possibly sample other conformations when simulated in pure solvent.
Other parameters which could be investigated in this context are different entropic approximations, ligand charge models and the solute or interior dielectric constant [14]. Lastly, another avenue worth exploring from a cost-cutting point of view is to use single-point minimized structures for the MM-GBSA calculations, similar to single-point docking score calculations. There are studies present in the literature [15,24] which have shown similar correlations between MM-GBSA scores obtained using single-point structures and those using ensembles of MD generated structures. However, recently published literature suggests that most studies still use multiple conformational snapshots from MD for the MM-GBSA calculations [50][51][52].

Conclusions
There are two components in structure-based affinity prediction challenges -pose prediction and binding score evaluation. Both of these contribute to the accuracy of predicted binding affinities. In order to evaluate only the affinity prediction capability of different scoring methods, the starting protein-ligand binding poses need to be the same, which is rarely the case in D3R Grand Challenges where each participating group has its own pose prediction and binding score evaluation methodology.
To better separate pose prediction from scoring, our two groups decided to collaborate and combine our different areas of expertise in this study -docking and free energy calculations and participate in the structure-based binding affinity prediction challenge for the target BACE-1 in D3R GC4. We used AutoDock-GPU for docking the ligands, employing the AutoDock4 scoring function, and then calculated MM-GBSA binding affinities. MM-GBSA binding energy calculations did not improve the predictive performance with respect to AutoDock4 scores. In a retrospective analysis, we identified two modeling aspects that were detrimental to the quality of MM-GBSA scores, namely the choice of protein conformation and the protonation state of residues in the binding pocket. While it is clear that MM-GBSA can make better predictions than docking scores, making the optimal modeling choices is a non-trivial task that requires knowledge of the system under study and thus is likely difficult in a prospective setting.
Thus, our assessment of MM-GBSA performance roughly agrees with that of several previous research groups which participated in previous D3R Grand Challenges [16][17][18] specifically, we find that MM-GBSA does not perform par-ticularly well at binding affinity estimation. These resultsnot just our own -highlight the importance of blind challenges for the community to evaluate method performance, particularly for drug design projects.  download file view on ChemRxiv D3R-AutoDock-MMGBSA.pdf (0.97 MiB)