Implicit ligand theory for relative binding free energies: II. An estimator based on control variates

Implicit ligand theory describes the relationship between the noncovalent binding free energy and the binding free energy between a ligand and multiple rigid receptor conformations. We have previously shown that if the receptor conformations are sampled from or reweighed to a holo ensemble, the binding free energy relative to the ligand that defines the ensemble can be calculated. Here, we apply a variance reduction technique known as control variates to derive a new statistical estimator for the relative binding free energy. In applications to a data set of 6 reference ligands and 18 test ligands, statistically significant differences between the estimators are not observed for most systems. However, in cases where such differences are observed, the new estimator is more accurate, precise, and converges more quickly. Performance improvements are most consistent where there is a clear correlation, with a correlation coefficient greater than 0.3, between the control variate and the statistic being averaged.


Introduction
Noncovalent binding between small organic molecules and biological macromolecules is a ubiquitous process in biology and is critical to the mechanism of most drugs. Thus, there has been significant interest in developing computational methods to predict binding free energies, which quantify the strength of these interactions [1][2][3][4][5][6][7].
Alchemical pathway methods [10] are a class of theoretically rigorous but computationally expensive binding free energy methods. These methods involve sampling conformations of receptor-ligand complexes from a series of thermodynamic states along a possibly nonphysical pathway. They can be used to obtain either absolute binding free energies between a receptor and ligand ( • DG RL , as defined in equation (1)) or relative binding free energies between a receptor and two different ligands ( • DDG RL , as defined in equation (2)). In • DG RL calculations, the pathway may involve decoupling or physically separating the ligand from the receptor [11][12][13][14]. In • DDG RL calculations, the ligand is transformed from one molecule to another [15,16]. Alchemical pathway methods have accurately predicted binding free energies for many systems including protein-ligand [9,[17][18][19][20][21][22] and protein-protein complexes [23,24]. However, computational expense continues to restrict more widespread application of the methods.
Since 2012, we have been developing binding free energy methods based on implicit ligand theory (ILT) [25]-a statistical mechanics framework for calculating absolute or relative binding free energies using a presampled set of rigid receptor conformations-that can quickly calculate binding free energies for a large set of ligands binding to the same receptor [26][27][28]. Computing • DG RL involves drawing from or reweighing receptor configurations to the apo ensemble-the receptor alone, in the absence of ligand [25]. In a • DDG RL calculation, receptor configurations are drawn from a holo ensemble-where the receptor is bound to a ligand [29]. After receptor configurations are selected, the next step of an ILT-based free energy calculation is to compute the Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. binding potential of mean force (BPMF)-the • DG RL between a flexible ligand and a rigid receptor-for the receptor configurations. The standard absolute • DG RL , computed using the 'apo estimator', is an exponential average of the BPMF over the apo ensemble.
• DDG RL of a ligand relative to the reference ligand defining the holo ensemble, computed with the 'holo estimator' (equation (6)), is an exponential average of the BPMF over the reference holo ensemble.
ILT has several advantages over alternative binding free energy methods based on a flexible receptor. The main advantage of ILT is that once conformations of the receptor are thoroughly sampled, they can be used to calculate the binding free energies of many ligands. In contrast, in alchemical pathway calculations with a flexible receptor, receptor conformational sampling needs to be performed for every receptor-ligand complex. Another advantage of ILT is that BPMF calculations are much faster and more scalable than binding free energies to flexible receptors [28]. Protein-ligand interaction energies may be pre-computed on a grid and interpolated, reducing the number of nonbonded force calculations and essentially eliminating the relationship between receptor size and computational speed. Finally, it is worth noting that unlike most methods for • DDG RL estimation, the holo estimator does not require one ligand to be transformed into another. Therefore it can be employed even for cases where the target ligand significantly differs from the reference one.
Regarding receptor conformational sampling, the optimal choice of apo or holo ensemble depends on the extent to which ligand binding induces a consistent conformational change in the receptor. Both the apo and holo estimators are exponential averages that are dominated by the lowest BPMFs. The receptor configurations with the lowest BPMFs are those that are prevalent in the holo ensemble of interest. In contrast, BPMFs for receptor configurations that poorly accommodate a ligand may be approximated as infinite without a significant effect on the binding free energy estimate. Thus, converged • DG RL estimates with ILT require sampling receptor configurations that are prevalent in the holo ensemble for the ligand of interest. If ligand binding induces a significant conformational change, then simulating the apo ensemble is unlikely to yield such conformations. In this case, relevant configurations may be obtained by simulating one or more holo ensembles with different ligands [26]. In principle, enhanced and biased sampling techniques may also be used for receptor configurational sampling.
Here, we describe a variation of the holo estimator that is based on exploiting the BPMFs between a ligand and its own holo ensemble. These BPMFs may be used to estimate the relative binding free energy between a ligand and itself, the 'self' relative binding free energy, To improve the holo estimator, we apply a statistical variance reduction technique known as control variates. In the control variates technique, the bias in an estimate of a known expectation value is used to correct the bias in the estimate of an unknown expectation value. The expectation values are of quantities that can be calculated from the same samples. The method works best when the quantities are highly correlated and therefore bias in one estimate is related to bias in the other.
The structure of this paper is as follows: first, we introduce notation and review the holo estimator. We then develop a variation of the estimator based on control variates. We then apply the newly optimized estimator to • DG RL calculations for T4 lysozyme and 24 ligands. Finally, we compare the accuracy of the unoptimized and optimized estimators with the results from alchemical pathway calculations.

Theory
The absolute standard binding free energy of a ligand L noncovalently bound to receptor R to form a complex RL is given by, where k B is Boltzmann's constant and T is the temperature in Kelvin. C R , C L and C RL are equilibrium concentrations of the receptor, ligand and complex, respectively. • C is the standard concentration, 1 M. In this work, we focus on • DDG RL , which is defined as the difference in binding free energies between a target ligand L and a reference ligand L o , 2.1. ILT estimator for relative binding free energies Previously, we reported that • DDG RL can be written as an expectation value [29], RLo is the unnormalized probability density of the receptor-ligand complex in the bound state. It includes: ( ) x I o , a function based on the ligand external coordinates x o that specifies whether the ligand is in the binding site; ( ) x J o , the Jacobian for transforming the ligand external coordinates from Cartesian into the coordinate system used for x ; o and ( )  r X , the potential energy of species the BPMF-the binding free energy of the flexible target ligand L to a rigid receptor conformation. It also includes ( ) o , the interaction energy between the reference ligand L o and the receptor R when they form a bound complex r RL o .
If configurations of the complex are drawn from a different statistical distribution than the bound state, then importance sampling may be used to estimate where the summation is over conformations of the receptor.ˆ( ) B r R is an estimate of the BPMF. Whenˆ( ) B r R is the BPMF of the reference ligand itself, we obtain an estimate of the 'self' relative binding free energy, ⎧ We will refer to this expression as estimator A. Although estimator A ensures that the 'self' relative binding free energy estimate is zero, it may not be statistically optimal.

ILT estimator based on control variates
To apply the control variates technique, we need a property with an exactly known expectation value. In the special case that the ligand is the reference ligand, then the left hand side of equation (5) is unity. Rearrangement and combination of the expectation values yields ( ) For notational convenience, we also define a function based on the key observable in equation (5), where C is an arbitrary constant. In the standard control variates method, C is optimized to minimized the variance of the sample mean of ( ) m r RL o . In this paper, we select C to minimize the variance of a new estimator for relative binding free energy, which is given by,ˆ( The bar over the variable name denotes the sample mean, such that N is the number of sampled receptor configurations. To optimize C, we use error propagation formulae to obtain the variance ofˆ• DDG RL B , with respect to C, set its derivative to zero, and solve for C. The variance of the estimator is given by, The derivative can be set to zero, resulting in, Substituting these identities into equation (17), rearranging, and solving for C gives, We will refer to equation (13) with C given by equation (18) as estimator B. Estimator B is the key theoretical result of the paper. In the remainder of the paper, we will compare the accuracy and precision of estimator A (equation (9)) and estimator B (equation (13)).

Methods
All of the quantities needed to calculate binding free energies using estimators A (equation (9)) and B (equation (13)) and to assess their quality were obtained in previous studies. For the reader's convenience, we describe some key details in this section.
To benchmark the accuracy of the two estimators, we used absolute binding free energies obtained via an alchemical pathway method. The absolute binding free energy • DG RL between T4 lysozyme and 24 small organic molecules was previously calculated [26] using YANK, a software package developed by the Chodera group [12]. In these calculations, the AMBER ff14 force field [30] was used for T4 lysozyme and the generalized AMBER force field [31] with Bondi radii [32] and AM1BCC partial charges [33,34] for ligands. Solvent was treated with the OBC2 generalized Born/surface area implicit solvent model [35]. Calculations were repeated 3 times for each system. Standard deviations were less than 0.5 kcal/mol for 22 systems and less than 1 kcal/mol for all 24 systems [26].
We considered one reference set and two test sets: test set I to benchmark consistency with YANK and test set II to benchmark against experimental results. The reference set consisted of methylpyrrole, benzene, p-xylene, phenol, n-hexylbenzene and DL-camphor. Test set I consisted of 18 remaining ligands from YANK calculations. Test set II included 18 ligands having experimental binding free energies reported in references [36][37][38]. As seven ligands were part of both test sets, we considered a total of 29 ligands.
ILT-based binding free energies were also computed using quantities obtained in previous studies. In previous work, 96 receptor snapshots were selected from every alchemical state of YANK calculations with the reference ligands. The BPMF between each ligand and the selected receptor configurations was calculated [26] using our software package AlGDock [28,39]. The same force field and implicit solvent model were used in the YANK and AlGDock calculations. Interaction energies ( ) Y r RL o were calcuated using the OpenMM 6.3.1 library [40,41]. Since receptor snapshots selected for AlGDock calculations were drawn from all alchemical states, not just the holo state, we needed to reweigh them to the holo state. To do this, we used the multistate Bennett Acceptance Ratio (MBAR) [42] to calculate the weights ( ) w r RL o in the holo ensemble. Estimator A (equation (9)) and B (equation (13)) were used to calculate relative binding free energies. Relative binding free energies were converted to absolute binding free energies by adding them to the YANK binding free energy of the appropriate reference ligand.
The accuracy of the two estimators was compared via several metrics. The root mean square error (RMSE) and Pearson's R were computed with respect to YANK results. We also compared the estimators using the difference in absolute deviation from YANK,   (9)) and B (equation (13) Bootstrapping was used to estimate statistical errors. For relative binding free energies, random sets of samples were drawn with replacement from the set of 96 receptor snapshots. The standard deviation of free energy estimates from 100 such sets was reported as the uncertainty. To estimate error bars for RMSE and Pearson's R values, the set of 18 test ligands was resampled with replacement 100 times.

Estimator B improves consistency with YANK
Based on binding free energies of all test ligands in test set I relative to all reference ligands, estimator B is more consistent with YANK than estimator A ( figure 1 and table 1). The RMSE of free energies with respect to YANK is lower with estimator B (3.11±0.21 kcal/mol) than with estimator A (3.76±0.29 kcal/mol). The difference in RMSE is larger than the estimated error, suggesting that the improvement is statistically significant. While the Pearson's R of estimator B (0.72±0.05) is also slightly higher than that of estimator A (0.65±0.05), the improvement is not large compared to the estimated error.
Although estimator B has a better performance overall, differences in quality metrics are only statistically significant for binding free energies relative to two reference ligands (figures 2 and 3 and table 1). For three out of six reference ligands (methylpyrrole, benzene and p-xylene) estimator B yields slightly higher RMSEs compared to estimator A, but the differences are smaller than the estimated error. For another three systems (phenol, DLcamphor and n-hexylbenzene) estimator B is more consistent with YANK than A. For phenol, the improvement is slight and is unlikely to be statistically significant. However, for DL-camphor and n-hexylbenzene, which are the two most difficult cases yielding the largest RMSEs, estimator B shows a significant reduction in RMSEs. For n-hexylbenzene, the RMSE is dramatically reduced from 7.65±0.33 kcal/mol to 5.82±0.35 kcal/mol. On the other hand, differences between the RMSE of the two estimators are smaller than estimated error (table 1).

Estimator B reduces errors with respect to experiment
Overall, estimator B reproduces experimental results more accurately than estimator A ( figure 4 and table 2). The RMSE with respect to experiment of estimator B (2.77±0.21 kcal/mol) is lower than the RMSE of estimator A (3.53±0.28 kcal/mol). The difference in RMSE is statistically significant because it is much larger than the estimated errors. The Pearson's R of estimator B (0.32±0.07) is slightly higher than that of estimator A (0.25±0.08), but the difference is small compared to the estimated errors. When considering each reference ligand separately (figures 5, 6 and table 2), the RMSE of estimator B with respect to experiment is significantly lower than the RMSE of estimator A in three of six cases. In particular, it significantly improves the most difficult case, based on the reference ligand n-hexylbenzene. For two reference ligands, methylpyrole and p-xylene, estimator A has a somewhat lower RMSE than estimator B. The two estimators performs equally well in the case of the reference ligand benzene. Due to large estimated errors, differences in the Pearson's R between the two estimators are not statistically significant.

Estimator B converges faster for the most difficult case
For all reference ligands except for n-hexylbenzene, the rate at which the binding free energy estimators converge is comparable ( figure 7). Convergence was assessed by the rate at which the RMSE with respect to YANK is reduced as the number of receptor snapshots is increased. Convergence behavior is very similar for five out six reference ligands: methylpyrrole, benzene, p-xylene, phenol and DL-camphor; the curves look rather similar and level off starting at almost the same number of receptor snapshots. For p-xylene and DL-camphor, estimator B appears to converge more closely than estimator A, but differences are well within error bars. On the other hand, estimator B shows significantly faster convergence in the RMSE for the hardest system, n-hexylbenzene. With estimator A, increasing the number of snapshots actually increases the RMSE with respect to YANK.

Overall, estimator B is more precise than A
In addition to being more accurate and converging more quickly than estimator A, estimator is also more precise. In the majority of systems, 63%, the error estimated for estimator B is smaller than for estimator A ( figure 8)

Conclusions
Using the control variates technique, we have developed a new statistical estimator for using ILT to calculate binding free energies relative to a reference ligand that defines the holo ensemble. We have applied the new estimator to previously-collected simulation data for the binding of 6 reference ligands and 18 test ligands to T4 lysozyme. In most cases, differences in the performance of the new and old estimator are not statistically significant. In the cases where statistically significant differences are observed, the new estimator outperforms the old one; it is more accurate, precise, and convergences more quickly. As anticipated for the control variates technique, performance improvements are greatest in cases where there is a strong correlation between the statistic ( ) h r RL o and control variate ( ) g r RL o . We recommend using the estimator over the previous one. Python code for the new estimator is included in the SI and on github at https://github.com/nguyentrunghai/RelBfe_control_variates. Data analyzed in this study has been deposited in the github repository.