One Size Does Not Fit All: The Limits of Structure-Based Models in Drug Discovery

A major goal in computational chemistry has been to discover the set of rules that can accurately predict the binding affinity of any protein-drug complex, using only a single snapshot of its three-dimensional structure. Despite the continual development of structure-based models, predictive accuracy remains low, and the fundamental factors that inhibit the inference of all-encompassing rules have yet to be fully explored. Using statistical learning theory and information theory, here we prove that even the very best generalized structure-based model is inherently limited in its accuracy, and that protein-specific models are always likely to be better. Our results refute the prevailing assumption that large data sets and advanced machine learning techniques will yield accurate, universally applicable models. We anticipate that these results will aid the development of more robust virtual screening strategies and scoring function error estimations.


Summary
Structure-based models are an integral part of the drug discovery process. Although empirical models, such as scoring functions, have been developed over the last three decades,1 their performance remains poor.2 By utilizing statistical learning theory3 and information theory,4 we show that even the most idealized structure-based model is significantly limited in its accuracy.
In Section 2, we provide additional background on using the cross entropy as a loss function by showing its relationship to the mean squared error. In Section 3, we prove the three main inequalities that underlie the fundamental sources of error for empirical scoring functions. The first (Section 3.1) is an upper bound on scoring function bias, showing that arbitrarily large errors are possible if the protein-ligand complexes in the training and test sets are sampled in different ways. The second and third inequalities show that while the scoring function that is optimal over a diverse range of protein-ligand complexes has the lowest bias of any model (Section 3.2), its minimum error is, on average, larger than the minimum error of a protein-targeted scoring function (Section 3.3). The inability of a generalized scoring function to achieve the lowest error on protein-specific data sets has important implications for virtual screens. In Sections 4 and 5, we detail the scoring function we used to test our theoretical predictions, list the complexes used in the single-protein data sets (Table S1), compare our diverse model to other popular models (Table S2) and show the minimum and maximum errors of the free parameter space surveyed in Figure 3 of the main text (Figure S1 and Table S3).
2 The cross entropy of normal distributions

2.1 Relationship to mean squared error

Our analysis utilizes the cross entropy to quantify the error in scoring function regression. Similar information theoretic approaches to statistical learning can be found in references 5 and 6. The cross entropy C(Y|X) for binding affinities y given bound structures x is given by

C(Y|X) = -\iint p(x, y) \ln q(y|x) \, dx \, dy,

where p(x, y) is the true PDF of the data set of binding affinities and structures, and q(y|x) is our model PDF of the conditional relationship between x and y. The optimal scoring function for a data set sampled from p(x, y) is encoded in the conditional distribution p(y|x). For example, a well known result in statistical learning theory3 is that the scoring function with the minimum mean squared error (MMSE), denoted f_p(x), is given by the conditional average

f_p(x) = \int y \, p(y|x) \, dy.

If the model q(y|x) is a normal distribution with variance \sigma_q^2 and conditional average f_q(x), so that

q(y|x) = \frac{1}{\sqrt{2\pi\sigma_q^2}} \exp\!\left[-\frac{(y - f_q(x))^2}{2\sigma_q^2}\right],

the cross entropy becomes

C(Y|X) = \frac{1}{2}\ln(2\pi\sigma_q^2) + \frac{1}{2\sigma_q^2} \iint p(x, y) \, (y - f_q(x))^2 \, dx \, dy,

which is equivalent to the mean squared error (MSE) of the scoring function f_q(x) when applied to data sampled from p(x, y).
The integral is the MSE of the scoring function f_q(x). Note that the variance \sigma_q^2 does not depend on f_q(x). Therefore, the scoring function that minimizes the cross entropy also minimizes the MSE for normally distributed errors. In a similar manner, one can show that if q(y|x) is chosen to be a Laplace distribution, the scoring function that minimizes the cross entropy also minimizes the mean absolute error.3
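The equivalence between minimizing the cross entropy of a normal model and minimizing the MSE can be checked with a quick Monte Carlo sketch. The data-generating process, the model f_q and the value of sigma_q below are all hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x + Gaussian noise; f_q is an imperfect model
# of the conditional mean.
x = rng.normal(size=200_000)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

def f_q(x):
    return 1.8 * x

sigma_q = 0.7  # model standard deviation (need not match the truth)

# Monte Carlo estimate of the cross entropy C(Y|X) = -E[ln q(y|x)]
# for a normal model q(y|x) with mean f_q(x) and variance sigma_q**2.
log_q = -0.5 * np.log(2 * np.pi * sigma_q**2) - (y - f_q(x))**2 / (2 * sigma_q**2)
cross_entropy = -log_q.mean()

# Closed form: 0.5*ln(2*pi*sigma_q^2) + MSE / (2*sigma_q^2).
mse = np.mean((y - f_q(x))**2)
closed_form = 0.5 * np.log(2 * np.pi * sigma_q**2) + mse / (2 * sigma_q**2)

assert np.isclose(cross_entropy, closed_form)
```

Because sigma_q enters only through an additive constant and a positive scale factor, varying f_q to reduce the cross entropy is the same as varying it to reduce the MSE.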

Relationship to scoring function regret
If one applies a scoring function that is optimal for a data set of structures and affinities sampled from p_\beta(x, y) to a data set sampled from a different distribution p_\alpha(x, y), then the bias incurred by this model is given by the average relative entropy

\int p_\alpha(x) \, D\big(p_\alpha(y|x) \,\|\, p_\beta(y|x)\big) \, dx.

This bias represents a fundamental source of error for scoring functions that are trained on and applied to data sets with different sampling probabilities. In the main text (in the section "The optimization of scoring functions"), a scenario was considered in which the conditional distributions p_\alpha(y|x) and p_\beta(y|x) were treated as normal distributions to help conceptualize this form of error. This section gives more background to that example and proves Equation 13 from the main text. The following analysis is treated more generally in reference 7. Let p_\alpha(y|x) and p_\beta(y|x) be normal distributions with variances \sigma_\alpha^2 and \sigma_\beta^2 and optimal scoring functions f_\alpha(x) and f_\beta(x), respectively. We investigate the case in which the variances are similar in magnitude and set \sigma_\beta = \sigma_\alpha + \epsilon, where |\epsilon| \ll |\sigma_\alpha|. The relative entropy becomes

\int p_\alpha(x) \, D\big(p_\alpha(y|x) \,\|\, p_\beta(y|x)\big) \, dx = \int p_\alpha(x) \left[ \ln\frac{\sigma_\beta}{\sigma_\alpha} + \frac{\sigma_\alpha^2 + (f_\alpha(x) - f_\beta(x))^2}{2\sigma_\beta^2} - \frac{1}{2} \right] dx \approx \frac{1}{2\sigma_\alpha^2} \big(\mathrm{MSE}_{\beta,\alpha} - \mathrm{MMSE}_\alpha\big), (S4)

where the approximation follows from the relative size of \epsilon to \sigma_\alpha, \mathrm{MMSE}_\alpha is the minimum mean squared error achievable on p_\alpha(x, y), and \mathrm{MSE}_{\beta,\alpha} is the mean squared error of f_\beta(x) applied to p_\alpha(x, y). Since the relative entropy is always non-negative,4 Equation S4 means that the MSE of a modeled scoring function is always at least as high as that of the true scoring function. The difference between \mathrm{MSE}_{\beta,\alpha} and \mathrm{MMSE}_\alpha is the regret of the model for normally distributed errors.7 As well as being proportional to the regret, the relative entropy is also proportional to the mean squared distance between the true and estimated scoring functions. To show this, we start from Equation S4 and use the decomposition \mathrm{MSE}_{\beta,\alpha} = \mathrm{MMSE}_\alpha + \int p_\alpha(x) \, (f_\alpha(x) - f_\beta(x))^2 \, dx, so that we have

\int p_\alpha(x) \, D\big(p_\alpha(y|x) \,\|\, p_\beta(y|x)\big) \, dx \approx \frac{1}{2\sigma_\alpha^2} \int p_\alpha(x) \, \big(f_\alpha(x) - f_\beta(x)\big)^2 \, dx,

as stated in main text Equation 13.
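The first-order relationship between the averaged relative entropy and the regret (Equation S4) can be checked numerically for normal conditionals. The scoring functions, variances and the size of epsilon below are hypothetical, chosen only to illustrate the small-epsilon regime:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)

# Hypothetical optimal scoring functions for two sampling distributions.
f_alpha = lambda t: 1.0 * t
f_beta = lambda t: 1.1 * t   # slightly different conditional mean

sigma_a = 1.0
eps = 0.01                   # |eps| << sigma_a
sigma_b = sigma_a + eps

# Exact conditional KL divergence D(p_alpha(y|x) || p_beta(y|x)),
# averaged over x, for normal conditionals.
delta2 = np.mean((f_alpha(x) - f_beta(x))**2)
kl = np.log(sigma_b / sigma_a) + (sigma_a**2 + delta2) / (2 * sigma_b**2) - 0.5

# First-order approximation: (MSE_{beta,alpha} - MMSE_alpha) / (2 sigma_a^2).
mmse_a = sigma_a**2          # error of the true scoring function f_alpha
mse_ba = mmse_a + delta2     # error of f_beta applied to p_alpha data
approx = (mse_ba - mmse_a) / (2 * sigma_a**2)

assert abs(kl - approx) < 1e-3   # agreement to first order in eps
```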
A graphical interpretation of this inspired the schematic diagram shown in Figure 1 of the main text, in which different data sets have their own optimal models, embedded in a space that preserves the mean squared distance between them. While the above expression is symmetric with respect to the exchange of p_\alpha(y|x) and p_\beta(y|x), this is only true for special cases; the relative entropy is not symmetric in general.4

3 Proofs of main inequalities

Proof of Equation 9
Before proving Equation 9 from the main text, two inequalities must be stated. First, for non-negative numbers k_1, k_2, ..., k_n and l_1, l_2, ..., l_n, the log-sum inequality is given by

\sum_{i=1}^{n} k_i \ln\frac{k_i}{l_i} \ge \left(\sum_{i=1}^{n} k_i\right) \ln\frac{\sum_{i=1}^{n} k_i}{\sum_{i=1}^{n} l_i},

with equality if and only if k_i/l_i is equal to a constant value.4 For the second inequality, we utilize the chain rule for relative entropy,4 given by

D\big(p(x, y) \,\|\, q(x, y)\big) = D\big(p(x) \,\|\, q(x)\big) + D\big(p(y|x) \,\|\, q(y|x)\big).

Equation 9 follows by applying these two inequalities, where the last line of the derivation follows from the discrete definition of relative entropy using natural logarithms.4
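A quick numerical check of the log-sum inequality, including its equality condition, can be made with arbitrary non-negative sequences (the values below are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Random non-negative sequences k_i and l_i.
k = rng.random(10)
l = rng.random(10)

lhs = np.sum(k * np.log(k / l))
rhs = np.sum(k) * np.log(np.sum(k) / np.sum(l))
assert lhs >= rhs          # log-sum inequality

# Equality holds when k_i / l_i is a constant (here k_i = c * l_i).
c = 3.0
lhs_eq = np.sum(c * l * np.log(c))
rhs_eq = np.sum(c * l) * np.log(c)
assert np.isclose(lhs_eq, rhs_eq)
```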

Proof of Equation 11
To prove Equation 11, we utilize a standard information theoretic manipulation4 in the second line. Starting from the left hand side of Equation 11, the last line of the derivation follows from the fact that D\big(p_\omega(y|x) \,\|\, q(y|x)\big) \ge 0.

Proof of Equation 12
To prove Equation 12, we first show that the average cross entropy of the average density p_\omega(y|x) = \sum_{i=1}^{N} \omega_i p_i(y|x) on each protein-specific density, denoted C_i(Y|X), is equal to the conditional entropy h_\omega(Y|X) of the average density:

\sum_{i=1}^{N} \omega_i C_i(Y|X) = h_\omega(Y|X), (S10)

where \omega_i is the probability of sampling a complex of protein i. As in Equation 2 of the main text, we also have

C_i(Y|X) = h_i(Y|X) + D\big(p_i(y|x) \,\|\, p_\omega(y|x)\big). (S11)

The inequality stated in Equation 12 arises by considering the second term in Equation S11: the average relative entropy between a protein-specific density and the generalized density. As the relative entropy is non-negative, the average can be zero if and only if D\big(p_i(y|x) \,\|\, p_\omega(y|x)\big) = 0 for all i = 1, 2, ..., N. However, each p_i(y|x) is distinct. This means that at most one i out of N can have D\big(p_i(y|x) \,\|\, p_\omega(y|x)\big) = 0; the rest must have relative entropies greater than zero, so

\sum_{i=1}^{N} \omega_i D\big(p_i(y|x) \,\|\, p_\omega(y|x)\big) > 0. (S12)

Combining Equations S10, S11 and S12, we find

h_\omega(Y|X) > \sum_{i=1}^{N} \omega_i h_i(Y|X),

as stated in the main text.
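Both steps of this argument can be illustrated with a discrete toy example. The three "protein-specific" distributions and the sampling weights below are invented purely for illustration:

```python
import numpy as np

# Three hypothetical protein-specific discrete conditionals p_i(y),
# with the dependence on x suppressed for simplicity.
p = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
w = np.array([0.5, 0.3, 0.2])   # sampling weights over proteins

p_omega = w @ p                  # the generalized (average) density

def kl(a, b):
    """Discrete relative entropy D(a || b) in nats."""
    return np.sum(a * np.log(a / b))

# Average cross entropy of p_omega on each protein-specific density
# equals the entropy of p_omega (the analogue of Equation S10).
avg_cross = -np.sum(w[:, None] * p * np.log(p_omega))
h_omega = -np.sum(p_omega * np.log(p_omega))
assert np.isclose(avg_cross, h_omega)

# Since the p_i are distinct, the average relative entropy is strictly
# positive: the generalized density cannot match every protein-specific one.
avg_kl = np.sum(w * np.array([kl(pi, p_omega) for pi in p]))
assert avg_kl > 0
```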

Scoring function development
In this section, we outline the scoring function we used to test our analytical predictions. As is typical, we approximated the binding free energy, denoted \Delta G, as a linear combination of protein-ligand interaction terms, so that

\Delta G = \sum_{k=1}^{6} a_k U_k,

where a_1 to a_6 are regression coefficients and the U_k are the protein-ligand interaction potentials described below. Unless otherwise stated, the distance cutoff for atomic interactions was 6 Å. The mathematical notation used in this section is distinct from the notation used previously.
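Once the interaction potentials are computed for each complex, fitting the coefficients is ordinary least squares. The following is a minimal sketch with an entirely synthetic descriptor matrix and invented coefficients, not the paper's actual data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical descriptor matrix: one row per complex, one column per
# interaction potential (vdW, electrostatic, H-bond, torsion,
# lipophilic, hydrophilic).
n_complexes = 500
U = rng.normal(size=(n_complexes, 6))

# Synthetic "measured" affinities from known coefficients plus noise.
a_true = np.array([-1.2, -0.4, -0.8, 0.3, -0.5, 0.2])
dG = U @ a_true + rng.normal(scale=0.1, size=n_complexes)

# Ordinary least squares fit of the six regression coefficients.
a_fit, *_ = np.linalg.lstsq(U, dG, rcond=None)

assert np.allclose(a_fit, a_true, atol=0.05)
```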
We modeled van der Waals interactions with the Morse potential

U_M = \sum_{ij} \left[\left(1 - e^{-\alpha (r_{ij} - R_{ij})}\right)^2 - 1\right],

where r_{ij} is the distance between atoms i and j, R_{ij} is the sum of their van der Waals radii and \alpha is a parameter that sets the width of the potential. The van der Waals radii of the atoms were taken from reference 10. While previous empirical models have used a number of Gaussian terms to measure shape complementarity,9 we sought to simplify regression by using a single shape-based term. Although it was designed for bonded interactions, the Morse potential was chosen purely for its functional shape, which is similar to that of the Lennard-Jones 6-12 potential. Accordingly, we set \alpha = 6/R_{ij} to reproduce the scale of this interaction. However, unlike the Lennard-Jones 6-12 potential, U_M tends to zero at a faster rate with increasing r, thereby rewarding larger atomic distances less, and has a softer hard-core repulsion. For an atomic pair, the above potential has a minimum of -1, indicating a 'good' fit between the two atoms.
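The stated properties of the per-pair Morse term, a minimum of -1 at the contact distance and a fast decay toward zero, can be verified numerically. The radius sum R used here is a hypothetical value for illustration:

```python
import numpy as np

def u_morse(r, R, alpha=None):
    """Per-pair Morse-style shape term: (1 - exp(-alpha*(r - R)))**2 - 1."""
    if alpha is None:
        alpha = 6.0 / R   # width chosen to mimic the Lennard-Jones 6-12 scale
    return (1.0 - np.exp(-alpha * (r - R)))**2 - 1.0

R = 3.4                   # hypothetical sum of van der Waals radii (Angstrom)
r = np.linspace(2.5, 6.0, 500)
u = u_morse(r, R)

assert np.isclose(u_morse(R, R), -1.0)   # minimum of -1 at r = R
assert u.min() >= -1.0                    # -1 bounds the potential from below
assert u_morse(6.0, R) > -0.1             # decays toward zero at long range
```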
Using the partial charges computed by AutoDockTools (ADT),10 we calculated the electrostatic potential energy U_E between pairs of polar atoms:

U_E = \sum_{ij} \frac{q_i q_j}{r_{ij}},

where q_i and q_j are the partial charges on atoms i and j. Formal charges were used for metal ions. We used the same hydrogen bond potential as X-Score and AutoDock Vina.9,11 For a particular pair of atoms, the hydrogen bond strength B_{ij} is given by

B_{ij} = \begin{cases} 1 & r_{ij} - R^X_{ij} \le -0.7 \text{ Å} \\ -(r_{ij} - R^X_{ij})/0.7 & -0.7 \text{ Å} < r_{ij} - R^X_{ij} < 0 \\ 0 & r_{ij} - R^X_{ij} \ge 0, \end{cases}

where R^X_{ij} is the sum of the X-Score van der Waals radii of the atoms. The total hydrogen bond score U_B is then given by

U_B = \sum_{ij} B_{ij}. (S17)

Following the same convention and parameterization as X-Score, metal interactions were also scored using U_B. As in AutoDock 4's scoring function,12 we used a torsional score U_T given by

U_T = N,

where N is the number of rotatable bonds in the ligand. The rotatable bonds were first identified using ADT and then adapted using our own scripts so that bonds between aromatic rings and carboxylic acids or diaminomethanes (such as in benzamidine) were not counted as rotatable. Bonds between aromatic rings were treated as rotatable, which is the default in ADT.
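The hydrogen bond strength described above is a linear ramp of the X-Score/AutoDock Vina form. The sketch below assumes the Vina parameterization (full strength below a surface distance of -0.7 Å, zero above 0 Å); the radius sum R_x is a hypothetical value:

```python
import numpy as np

def b_ij(r, R_x):
    """Hydrogen bond strength as a linear ramp in the surface distance
    d = r - R_x: 1 for d <= -0.7 A, 0 for d >= 0 A, linear in between
    (assumed Vina-style parameterization)."""
    d = r - R_x
    return np.clip(-d / 0.7, 0.0, 1.0)

R_x = 3.2   # hypothetical sum of X-Score van der Waals radii (Angstrom)

assert b_ij(2.4, R_x) == 1.0             # d = -0.8 A: full-strength bond
assert b_ij(3.2, R_x) == 0.0             # d =  0.0 A: no contribution
assert np.isclose(b_ij(2.85, R_x), 0.5)  # d = -0.35 A: halfway up the ramp
```

The total score U_B is then just the sum of b_ij over the relevant donor-acceptor pairs.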
Heuristic lipophilic and hydrophilic interaction terms, developed originally for the classification of water molecules in binding sites,13 were also used as descriptors for protein-ligand interactions. Their inclusion in the scoring function descriptor sets is due to their success in water molecule classification and was partly inspired by the HINT force field,14 which uses calculated atomic logP data to account for a ligand's chemical environment when bound to the protein.
The lipophilic interaction term is simply a sum of carbon contacts weighted by their distance, and is given by

U_L = \sum_{ij} l_i l_j \, e^{-r_{ij}/d_0},

where l is equal to one for carbon atoms and equal to zero for all other elements. The constant d_0 = 1 Å sets the interaction range and ensures the non-dimensionality of the exponent. Similarly, the hydrophilic term is given by

U_H = \sum_{ij} p_i p_j \, e^{-r_{ij}/d_0},

where p is the hydration propensity of the atom. For protein atoms, these were taken from the study by Kuhn et al.15 For ligand atoms, the average values of the protein hydration propensities for each element were used. Halogens were given the same hydration propensity as the average carbon value, and metal ions were given the same value as the highest protein atom propensity, due to their comparatively high solvation free energies.

Table S1: Complexes used in the single-protein data sets.

Label: B (HIV1 Protease)
Structures: 1GNM, 1GNN, 1GNO, 1A30, 1A9M, 1AAQ, 1AJV, 1AJX, 1B6J, 1B6K, 1B6L, 1B6M, 1BDQ, 1BV7, 1BV9, 1BWA, 1BWB, 1C70, 1D4K, 1D4L, 1D4Y, 1DIF, 1DMP, 1G2K, 1G35, 1HBV, 1HIH, 1HII, 1HOS, 1HPO, 1HPS, 1HPV, 1HPX, 1HSH, 1HVH, 1HVI, 1HVJ, 1HVK, 1HVL, 1HVR, 1HVS, 1HWR, 1HXB, 1HXW, 1IIQ, 1IZH, 1IZI, 1LZQ, 1MES, 1MET, 1MRW, 1MRX, 1MSM, 1MSN, 1MTR, 1NH0, 1ODY, 1OHR, 1PRO, 1QBR, 1QBS, 1QBT, 1QBU, 1SBG, 1SDT, 1SDU, 1SDV, 1SGU, 1SH9, 1T7J, 1W5V, 1W5W, 1W5X, 1W5Y, 1Z1H, 1Z1R, 1ZP8, 1ZPA, 1ZSF, 1ZSR, 2AOC, 2AOD, 2AOE, 2AQU, 2AVM, 2AVO

Unlike the rest of the single-protein data sets, the bracketed complexes are not present in the 2011 edition of the refined PDBbind set, so for consistency they were omitted from all other analyses.

Figure S1: Grid of heatmaps showing the mean absolute error for our own scoring function when applied to data sets A-I (see Table 1). Each heatmap shows the absolute error of the model on the data set as the two free parameters of the scoring function are varied (x- and y-axes). This figure supplements Figure 3 of the main text, which shows the relative errors on each data set.

Table S3: Minimum and maximum errors of the free parameter space surveyed in Figure S1 and Figure 3 from the main text. These values determine the color ranges used in each grid.