A Small Step Toward Generalizability: Training a Machine Learning Scoring Function for Structure-Based Virtual Screening

Over the past few years, many machine learning-based scoring functions for predicting the binding of small molecules to proteins have been developed. Their objective is to approximate the function that takes two molecules as input and outputs the energy of their interaction. Only a scoring function that accounts for the interatomic interactions involved in binding can accurately predict binding affinity for unseen molecules. However, many scoring functions make predictions based on data set biases rather than an understanding of the physics of binding. These scoring functions perform well when tested on targets similar to those in the training set but fail to generalize to dissimilar targets. To test what a machine learning-based scoring function has learned, input attribution, a technique for determining which features are important to a model when it makes a prediction on a particular data point, can be applied. If a model successfully learns something beyond data set biases, attribution should give insight into the important binding interactions that are taking place. We built a machine learning-based scoring function designed to avoid the influence of bias via thorough training and test data set filtering, and show that it achieves performance on the Comparative Assessment of Scoring Functions, 2016 (CASF-2016) benchmark comparable to that of other leading methods. We then use the CASF-2016 test set to perform attribution and find that the bonds identified as important by PointVS, unlike those extracted from other scoring functions, have a high correlation with those found by a distance-based interaction profiler. We then show that attribution can be used to extract important binding pharmacophores from a given protein target when supplied with a number of bound structures. We use this information to perform fragment elaboration and see improvements in docking scores compared to using structural information from a traditional, data-based approach.
This not only provides definitive proof that the scoring function has learned to identify some important binding interactions but also constitutes the first deep learning-based method for extracting structural information from a target for molecule design.


Equivariance and Invariance
A function ψ : X → Y is invariant with respect to a group of transformations G : X → X if:

ψ(g(x)) = ψ(x)   ∀g ∈ G, ∀x ∈ X   (1)

ψ is equivariant with respect to G if, for each g ∈ G, there is a transformation T_g : Y → Y such that:

ψ(g(x)) = T_g(ψ(x))   ∀g ∈ G, ∀x ∈ X   (2)

In other words, if ψ is equivariant under G, applying elements of G to x applies an analogous transformation to the output in Y. The order in which the transformations are applied does not matter to the end result.
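As a concrete numeric illustration (our own sketch, not part of the original method), a function of interatomic distances is invariant under rotations, while the centroid of a point cloud is equivariant:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # five points in 3D space

# random rotation matrix via QR decomposition, with det forced to +1
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1

def psi_inv(pts):
    # invariant function: sum of all pairwise distances
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    return d.sum()

def psi_equiv(pts):
    # equivariant function: the centroid rotates with the input
    return pts.mean(axis=0)

assert np.isclose(psi_inv(x @ q.T), psi_inv(x))             # psi(g(x)) = psi(x)
assert np.allclose(psi_equiv(x @ q.T), psi_equiv(x) @ q.T)  # psi(g(x)) = T_g(psi(x))
```

Here T_g is simply the rotation itself, applied in the output space.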
While CNNs offer invariance to elements of the group G = T(n), the set of all translations in n-dimensional space, they lack invariance to the group of all roto-translations, SE(n). This means that rotating the protein-ligand input in 3D space may produce different predictions, which is undesirable behavior. The usual way to combat this is to augment the training data by assigning random rotations to the inputs, but this does not achieve true rotational invariance. For a more thorough introduction to symmetry groups and equivariance, see references. [1][2][3][4] Group-equivariant neural networks are a new class of deep learning models which offer invariance to various symmetry groups. They do this by using a series of equivariant layers followed by one or more invariant layers. For CNNs, these include Group Equivariant CNNs 1 and steerable CNNs. 5 CNNs still suffer from a discretized input space, as well as an extremely large input size and parameter count (a 48 × 48 × 48 grid with 22 channels is more than 2.4 × 10⁶ floating point numbers). If we view molecules as graphs, however, we can apply group-equivariant GNNs to the problem of pose prediction.
For these reasons, we use an architecture based on E(n)-Equivariant Graph Neural Network (EGNN) layers 2 for pose selection. The E(n) symmetry group contains all elements from SE(n), as well as the group of all reflections; EGNN layers are also permutation-equivariant, meaning that the network is invariant to the order of the input vertices.

Model
A graph is a natural way of representing a molecule, and attribution to individual inputs (namely atoms) is an intuitive process when using graph neural networks (GNNs). As described above, we use an architecture based on E(n)-Equivariant Graph Neural Network (EGNN) layers. 2
PointVS is a lightweight model built from E(n)-equivariant graph neural network layers, consisting of an initial projection which takes the number of node features from 12 to 32, then 48 EGNN layers, followed by global average pooling of the final node embeddings. This is followed by a sigmoid layer which gives a label y ∈ [0, 1] during the first stage of training on pose prediction. During fine-tuning on affinity data, this final layer is replaced by a randomly initialized fully connected layer and ReLU activation, which outputs y ∈ (ℝ⁺)³. This architecture includes residual connections for node features to allow for easier gradient flow and a richer combination of early and late representations. It also uses a shallow neural network as an attention mechanism, 6 which learns to score network edges (in this case representing atomic interactions) by their importance.
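The message-passing step of an EGNN layer can be sketched as follows. This is a minimal illustration in the spirit of the Satorras et al. EGNN update, not PointVS's actual implementation; the feature sizes, MLPs, and weights are placeholders. The final assertions check the E(3) property: rotating and translating the input coordinates rotates and translates the output coordinates while leaving the node features unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
F, H = 8, 16  # illustrative feature/hidden sizes, not PointVS's real dimensions

def mlp(params, v):
    # tiny two-layer perceptron with SiLU activation
    w1, b1, w2, b2 = params
    a = v @ w1 + b1
    a = a / (1.0 + np.exp(-a))
    return a @ w2 + b2

def init(n_in, n_out):
    return (0.1 * rng.normal(size=(n_in, H)), np.zeros(H),
            0.1 * rng.normal(size=(H, n_out)), np.zeros(n_out))

phi_e = init(2 * F + 1, H)  # edge/message MLP
phi_x = init(H, 1)          # coordinate-weight MLP
phi_h = init(F + H, F)      # node-update MLP

def egnn_layer(h, x):
    n = len(h)
    m_sum = np.zeros((n, H))
    x_out = x.copy()
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)  # rotation-invariant edge input
            m = mlp(phi_e, np.concatenate([h[i], h[j], [d2]]))
            m_sum[i] += m
            # coordinates move along the difference vector, scaled by a learned weight
            x_out[i] += (x[i] - x[j]) * mlp(phi_x, m)[0] / (n - 1)
    # residual connection on the node features
    h_out = h + mlp(phi_h, np.concatenate([h, m_sum], axis=1))
    return h_out, x_out

# E(3)-equivariance check (QR gives an orthogonal matrix; reflections are fine too)
h = rng.normal(size=(6, F))
x = rng.normal(size=(6, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3)
h1, x1 = egnn_layer(h, x)
h2, x2 = egnn_layer(h, x @ q.T + t)
assert np.allclose(h1, h2)              # features are invariant
assert np.allclose(x2, x1 @ q.T + t)    # coordinates are equivariant
```

Because the only geometric input to the message MLP is the squared distance, the feature stream is invariant by construction, which is what allows a pooled scalar prediction to be rotation-independent.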

Training
All models were trained for 10 epochs with a batch size of 32, using the Adam optimizer and cosine warm restarts with a maximum learning rate of 8 × 10⁻⁴ and a weight decay of 10⁻⁴. For classification models, the selected model is the one with the best Top-1 value on the test set; for regression models, the selected model is the one with the highest Pearson's Correlation Coefficient (PCC) for affinity on the PDBBind Core Set.
For regression models with pretraining, once the classification task is complete, the best model is selected and the final linear layer with sigmoid is replaced by a multi-target regression head, where the target is either pKd, pKi, or pIC50. The labels for the two types of affinity data not present for a given training example are ignored when calculating the loss, so those weights are unaffected.
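The masked multi-target loss can be sketched as below; the values are invented, and the masking scheme (NaN marking unmeasured affinity types) is our assumption of one reasonable implementation:

```python
import numpy as np

# predictions for the three targets (pKd, pKi, pIC50) for a batch of 4 complexes
pred = np.array([[6.1, 5.9, 6.3],
                 [4.2, 4.0, 4.5],
                 [7.8, 7.5, 7.9],
                 [5.0, 5.2, 5.1]])

# labels: NaN marks affinity types not measured for that complex
label = np.array([[6.0, np.nan, np.nan],
                  [np.nan, 4.1, np.nan],
                  [np.nan, np.nan, 8.0],
                  [5.3, np.nan, np.nan]])

mask = ~np.isnan(label)                        # only measured targets count
sq_err = (pred - np.where(mask, label, 0.0)) ** 2
loss = (sq_err * mask).sum() / mask.sum()      # mean over labelled entries only
```

Because the masked entries contribute zero to the loss, no gradient flows to the output weights of the unmeasured affinity types, which is the behavior described above.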

Datasets
The Redocked Set
Training for pose classification uses the same training data as the Redocked model outlined by McNutt et al. 7 In previous work by the same authors, 8 ligands from the Pocketome v17.12 9 were redocked into their cognate receptors using AutoDock Vina, and an RMSD was calculated between the redocked poses and the crystal pose. Sub-2Å poses were labeled as active poses, with the rest labeled as inactive. Further negative examples were generated adversarially: a gnina CNN trained on the original redocked data set was used to optimize docked poses. Optimized poses that were scored highly (> 0.9) by the trained CNN but were more than 2Å RMSD from the crystal pose were added as inactive poses. In total, there are 52,385 active and 734,417 inactive poses in the version of Redocked2020 used here, which we have shortened to Redocked.
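The labeling scheme can be sketched as below; the pose records are invented, and `cnn_score` stands in for the gnina CNN's output on an optimized pose:

```python
# hypothetical pose records: (rmsd_to_crystal_in_angstrom, cnn_score)
poses = [(0.8, 0.95), (1.9, 0.40), (2.5, 0.95), (3.1, 0.30), (1.2, 0.88)]

actives, inactives = [], []
for rmsd, cnn_score in poses:
    if rmsd < 2.0:               # sub-2 Å poses are labelled active
        actives.append((rmsd, cnn_score))
    else:                        # everything else is labelled inactive
        inactives.append((rmsd, cnn_score))

# adversarial negatives: optimized poses scored highly (> 0.9) by the CNN
# but still more than 2 Å from the crystal pose are added as extra inactives
hard_negatives = [p for p in inactives if p[1] > 0.9]
```

In this toy batch, one pose is both far from the crystal pose and confidently scored, so it becomes a hard negative.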

PDBBind Refined: gninaSet_Pose Set
The PDBBind refined set v.2019 10 is a carefully selected set of high-quality protein-ligand complexes, complete with binding affinities. In their gnina 1.0 paper, 8 the authors took the 4,852 unique protein-ligand complexes, removing solvents and any other molecules besides the protein and the ligand, as well as removing structures whose ligands do not have a mass between 150 and 1,000 Da. This filtered, preprocessed version of the refined set is hosted by the authors of gnina 1.0 8 at https://bits.csb.pitt.edu/files/gnina1.0_paper. Gnina was used with default settings, except for --num_modes=50, --min_rmsd_filter=0.0, and the use of the standard Vina scoring function in refinement and ranking rather than a CNN.
Of the 4,260 filtered structures, the 441 which are not used in training the 'redocked' gnina model are selected as the test set for consistency with that work; we call this set gninaSet_Pose. Up to 50 poses are reranked by our GNN (some of which may be very similar), after which the standard 1.0Å RMSD filter is applied to obtain a final pose ranking.
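One common reading of such a filter, and the assumption made in this sketch, is a greedy pass that discards any pose within 1.0Å RMSD of a better-scoring pose that has already been kept:

```python
import numpy as np

def rmsd(a, b):
    # RMSD between two conformations of the same atoms (no alignment)
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

def rerank_and_filter(poses, scores, cutoff=1.0):
    """Greedy diversity filter: keep a pose only if it is more than
    `cutoff` Å RMSD from every already-kept, better-scoring pose."""
    order = np.argsort(scores)[::-1]  # best score first
    kept = []
    for idx in order:
        if all(rmsd(poses[idx], poses[k]) > cutoff for k in kept):
            kept.append(idx)
    return kept

# four synthetic 10-atom poses: the first two are near-duplicates
rng = np.random.default_rng(2)
base = rng.normal(size=(10, 3))
poses = [base + rng.normal(scale=s, size=(10, 3)) for s in (0.01, 0.02, 2.0, 2.5)]
scores = [0.9, 0.8, 0.7, 0.6]
# the filter keeps only one of the two near-duplicates
```

With these scales, `rerank_and_filter(poses, scores)` drops pose 1 as a near-duplicate of the higher-scoring pose 0.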

PDBBind General and Core Sets
The PDBBind v.2020 General set contains protein-ligand complexes and binding affinities curated from the PDB. 11 The affinities are in the form of pKd, pKi, or pIC50, and this was the training set for affinity regression. The PDBBind Core set, also known as the CASF-2016 test set for affinity regression, comprises 285 high-quality crystal structures, as well as their binding affinities, 12 and is a subset of the General set. This is the test set for regression (scoring power).

Assessing Attribution Methods
As described in the paper, we use three methods for attribution: atom masking, bond masking, and edge attention analysis. We showed that, of atom masking with gnina, 8 bond masking with InteractionGraphNet (IGN), 13 and edge attention analysis with PointVS, the last extracted important protein atoms in closest agreement with PLIP, the protein-ligand interaction profiler. 14 To determine to what extent this could be attributed to the method used for attribution, we also carried out atom masking and bond masking on PointVS, the results of which are given in Table 1.

Table 1. Mean rank correlation calculated between the scores of the top five (ρ5) and top ten (ρ10) ranked protein atoms by PointVS, gnina, and InteractionGraphNet (IGN) and the distance between them and the nearest polar ligand atom. The means were calculated over a subset of 20 randomly selected bound structures from the PDBBind Core set (see SI for PDB IDs).
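The rank correlations ρ5 and ρ10 can be computed as Spearman correlations between attribution scores and atom-ligand distances. The sketch below uses invented scores and distances and a minimal tie-free rank implementation:

```python
import numpy as np

def rankdata(v):
    # ranks 0..n-1 with ties broken by position (sufficient for this sketch)
    order = np.argsort(v)
    ranks = np.empty(len(v))
    ranks[order] = np.arange(len(v))
    return ranks

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the ranks
    ra, rb = rankdata(a), rankdata(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

# hypothetical attribution scores for the top five ranked protein atoms,
# and each atom's distance to the nearest polar ligand atom
scores = np.array([0.91, 0.80, 0.64, 0.55, 0.41])
dist = np.array([2.9, 3.1, 3.4, 4.8, 6.0])

# a strongly negative rho_5 means the highest-scored atoms sit closest to
# polar ligand atoms, i.e. they are plausible interaction partners
rho5 = spearman(scores, dist)
```

In this idealized example the ordering is perfectly anti-correlated, so ρ5 = −1.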

Hotspots API: Calculation and Processing of Hotspots
For a given protein target, the API calculates acceptor, donor, and apolar atomic propensity maps using SuperStar. 15 This tool defines a grid covering the protein with points spaced 0.5Å apart and then, using data from the Cambridge Structural Database (CSD), 16 assigns a propensity for the three atomic probe types (acceptor, donor, and apolar) to each point. These scores are assigned based on the principle that if a favorable interaction between two groups occurs at a certain distance and angle, it will occur more frequently in structures stored in the CSD, and thus be assigned a higher propensity score. These grid point scores are then weighted according to how buried in the protein they are, so that more buried sites are favored.
Further scores are then calculated with small chemical probes consisting of aromatic rings with an acceptor, donor, or apolar atom in the substituent position. The three weighted propensity grids are sampled by their corresponding probe: a given probe is translated to all grid points with scores above a threshold of 15, and randomly rotated 3000 times around the center of the substituent atom. For each pose, the probe atom scores are read from their corresponding grids, and the probe score is found as the geometric mean of the scores of the probe's atoms.
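The probe scoring step reduces to a geometric mean over the probe's atom scores, as in this sketch (the grid scores are invented):

```python
import numpy as np

def probe_score(atom_scores):
    # a probe pose is scored as the geometric mean of its atoms' grid scores
    atom_scores = np.asarray(atom_scores, dtype=float)
    if np.any(atom_scores <= 0):
        return 0.0  # a zero-scored atom zeroes the whole pose
    return float(np.exp(np.log(atom_scores).mean()))

# hypothetical grid scores read off for a 7-atom probe in one random pose;
# all points sampled are above the grid-point threshold of 15
pose_scores = [18.0, 22.5, 15.2, 30.1, 19.9, 16.4, 25.0]
score = probe_score(pose_scores)
```

Using the geometric rather than arithmetic mean penalizes poses in which any single probe atom lands on a poorly scored grid point.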
Once complete, the sampled probe scores are assigned to an output grid.
To process the donor and acceptor grids, we follow the protocol outlined in Hadfield et al. 17 A greedy clustering algorithm is employed. A cluster is initialized as a random single point, and all points less than 1Å away that are not part of another cluster are added. Then, for each point added, this process is repeated, so all remaining unclustered points within 1Å are added. This process continues until there are no unclustered points within 1Å of the points forming the cluster. Once a cluster is finalized, a new cluster is defined by selecting another single unclustered point. This process is repeated until all points have been assigned to a cluster. The centroids of these clusters are defined as the mean position of the points constituting them. Clusters comprising fewer than eight points are removed, as are those with centroids outside of an apolar region (this is assessed using the apolar grid). The remaining clusters are then allocated a score equal to the number of points comprising them, and output alongside their scores as the acceptor and donor hotspot maps for a given protein target.
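The greedy clustering protocol can be sketched as follows; the radius, minimum cluster size, and size-based scoring follow the description above, while the test points are invented:

```python
import numpy as np

def greedy_cluster(points, radius=1.0, min_size=8):
    """Greedy clustering sketch: grow a cluster from a seed point by
    repeatedly absorbing unclustered points within `radius` of any member."""
    points = np.asarray(points, dtype=float)
    unassigned = set(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        members = [seed]
        frontier = [seed]
        while frontier:
            nxt = []
            for i in frontier:
                close = [j for j in unassigned
                         if np.linalg.norm(points[i] - points[j]) < radius]
                for j in close:
                    unassigned.remove(j)
                nxt.extend(close)
            members.extend(nxt)
            frontier = nxt
        if len(members) >= min_size:  # small clusters are discarded
            centroid = points[members].mean(axis=0)
            clusters.append((centroid, len(members)))  # score = cluster size
    return clusters

rng = np.random.default_rng(4)
blob = rng.normal(scale=0.1, size=(10, 3))             # one tight blob of 10 points
stragglers = rng.normal(scale=0.1, size=(3, 3)) + 50   # 3 isolated, far-away points
clusters = greedy_cluster(np.vstack([blob, stragglers]))
```

Here the three stragglers form a cluster of size 3, which falls below the eight-point threshold and is discarded, leaving a single scored cluster. (The apolar-region check on centroids is omitted, since it needs the apolar grid.)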

Mpro: Hotspot Map Dependency on Fragment Screen Size
We randomly generate sets of bound structures of Mpro of various sizes, and perform attribution on them to obtain PointVS hotspots. The results from this are shown in Fig 4. As another assessment of the consistency of PointVS hotspots, we assert that the top five hotspots found when PointVS is given all of the Mpro fragment screen structures in Table 2 can be considered the 'ground truth' top five hotspots. We then carry out attribution on various sizes of randomly sampled sets of bound structures, as above, and record how many of the top five hotspots for a given random sample coincide with the ground truth top five hotspots. The results from this are shown in Fig 1. We see that increasing the size of the randomly selected set of bound structures results in increased agreement with the ground truth hotspots. However, we also note that this effect drops off when the size of the sets of bound structures reaches 30, implying that after this point, increasing the size of a fragment screen would have little benefit on the quality of the top five hotspots produced by PointVS.

Table 2. Fragalysis identification codes of the structures supplied to PointVS to obtain hotspots: x10329_0A, x11041_0A, DLS-EU0844_0A, x12719_0A, x10178_0A, x11417_0A, DLS-EU0481_0A, x10396_0A, x12300_0A, x10638_0A, DLS-X0969_0A, z7vh8_0A, x10324_0A, x10334_0A, DLS-EU0241_0A, x11368_0A, P0884_0A, x10392_0A, DLS-X0962_0B, P2007_0A, x11346_0A, x11708_0A, DLS-X0742_0A, x11231_0A, x2649_0A, x11318_0A, DLS-EU0172_0A, x10906_0A, x11641_0A, x10019_0A, DLS-X0918_0A, x3080_0A, x10800_0A, x11428_0A, DLS-X1018_0A, x11560_0A, x10996_0A, x10733_0A, DLS-X0962_0A, x11013_0A, x10417_0A, x11475_0A, DLS-X0900_0A, x10566_0A, x10247_0A, x10478_0A, DLS-X0685_0A, x10900_0A, x11354_0A, DLS-X0905_0A, x10423_0A, x2562_0A, x12321_0A, DLS-X0910_1A, x11557_0A, x11801_0A, x11025_0A, DLS-X0933_0A, x10645_0A, x10237_0A, x10834_0A, DLS-X0837_0A, x10626_0A, x11011_0A, x11507_0A, DLS-EU0293_0A, x10236_0A, x10535_0A, x10976_0A, DLS-EU0424_0A, x11186_0A, x10610_0A, DLS-EU0046_0A.
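The consistency metric used here, the overlap between the top five hotspots from a random subset and the 'ground truth' top five, reduces to a set intersection; the hotspot identifiers below are purely illustrative:

```python
# ground-truth top five hotspots (from the full fragment screen) versus the
# top five recovered from one random subset of bound structures
ground_truth = {"D1", "A1", "D2", "A2", "D3"}
from_subset = {"D1", "A1", "A3", "D2", "D4"}

# number of ground-truth hotspots recovered by the subset (0 to 5)
overlap = len(ground_truth & from_subset)
```

Averaging this overlap over many random subsets of each size gives the curve described above.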
This is a promising sign of the applicability of the PointVS hotspots, as Mpro is very unusual in having such a large number of fragment structures available.

Fragment Elaboration: Numbers of Elaborations Generated
Hotspots for the targets Mac1 and NSP14 are tested on 35 and 30 fragments, respectively. The number of fragments successfully elaborated for each hotspot and the mean number of elaborations per fragment are shown in Tables 3a and b.

FGFR1 Case Study
To demonstrate the potential of the PointVS hotspot maps for real-world fragment-to-lead campaigns, we present a case study on a well-studied set of drug targets: the fibroblast growth factor receptors. Aberrant activation of these receptors is associated with a wide variety of cancers, making them promising druggable targets. In 2019, a collaboration between Astex and Newcastle University resulted in erdafitinib: 22 a drug that has since been FDA approved, making it the third fragment-derived drug to reach this stage.
In the process of designing erdafitinib, the collaborators relied heavily on a fragment screen performed on the target in 2008. A fragment found in this screen, shown in Fig 2, was found to have high potency and was used as a starting point in the design process. In Patani et al., 23 two hydrogen bonds are found to form between erdafitinib and the protein. We investigated whether we could identify the protein atoms forming these bonds as important using either the PointVS or the API hotspots.
The API placed one donor hotspot 1.5Å away from the location of one of the erdafitinib atoms that forms a hydrogen bond with a protein atom. However, it did not identify the location of the second hydrogen-bonding ligand atom as important.
To obtain the PointVS hotspots, we performed attribution on 18 of the bound structures of FGFR1 available in the PDB. 11 These bound structures contain compound-like molecules rather than fragments, so we filtered them to ensure that none had a Tanimoto similarity of greater than 0.7 to erdafitinib. This resulted in us selecting the structures corresponding to the following PDB codes: 4NKS, 4NK9, 1AGW, 3C4F, 4NKA, 4V01, 4V05, 5B7V, 5A46, 4UWB, 3RHX, 1FGI, 5A4C, 4UWC, 5VND, 6P69.
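The similarity filter can be sketched with a bit-set Tanimoto coefficient; the fingerprints here are invented stand-ins for whatever fingerprint type was actually used:

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto similarity between two fingerprints given as sets of "on" bits
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

# hypothetical fingerprints: the reference drug and three screen ligands
reference = {1, 4, 7, 9, 12, 15, 20}
ligands = {
    "lig_a": {1, 4, 7, 9, 12, 15, 21},  # near-duplicate of the reference
    "lig_b": {2, 5, 7, 11, 18},
    "lig_c": {1, 4, 8, 13, 22, 30},
}

# keep only structures at or below the 0.7 similarity cutoff to the reference
kept = [name for name, fp in ligands.items() if tanimoto(reference, fp) <= 0.7]
```

The near-duplicate ligand scores 6/8 = 0.75 and is excluded, mirroring the filtering of compounds too similar to erdafitinib.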
We found that PointVS correctly identified both of the protein atoms that form hydrogen bonds with ligand atoms as important, ranking them as the third and sixth most important atoms for binding.
To provide an example of how knowledge of these two hotspots could be used, we supplied one of them to STRIFE to demonstrate that molecules similar to erdafitinib could be generated. An example of an elaboration that recovered the same binding mode as erdafitinib is shown in the accompanying figure.

Table 3. The number of fragments that were successfully elaborated on using PointVS and API hotspots for Mac1 (a) and NSP14 (b), and the mean number of elaborations generated per fragment.

(b)
Hotspot  PointVS: fragments  PointVS: mean elaborations  API: fragments  API: mean elaborations
1        14                  194                         2               237
2        3                   196                         23              218
3        24                  215                         24              195
4        23                  186                         14              154
5        14                  183                         0               0
6        23                  186                         24              195
7        17                  194                         0               0
8        5                   196                         4               116
9        1                   89                          27              199
10       6                   205                         13              181