From Proteins to Ligands: Decoding Deep Learning Methods for Binding Affinity Prediction

Accurate in silico prediction of protein–ligand binding affinity is important in the early stages of drug discovery. Deep learning-based methods exist but have yet to overtake more conventional methods such as giga-docking, largely due to their lack of generalizability. To improve generalizability, we need to understand what these models learn from input protein and ligand data. We systematically investigated a sequence-based deep learning framework to assess the impact of protein and ligand encodings on predicting binding affinities for commonly used kinase data sets. The role of proteins is studied using convolutional neural network-based encodings obtained from sequences and graph neural network-based encodings enriched with structural information from contact maps. Ligand-based encodings are generated from graph neural networks. We test different ligand perturbations by randomizing node and edge properties. For proteins, we make use of 3 different protein contact generation methods (AlphaFold2, Pconsc4, and ESM-1b) and compare these with a random control. Our investigation shows that protein encodings do not substantially impact the binding predictions, with no statistically significant difference in binding affinity for KIBA in the investigated metrics (concordance index, Pearson's R, Spearman's rank correlation, and RMSE). Significant differences are seen for ligand encodings with random ligands and random ligand node properties, suggesting a much greater reliance on ligand data for the learning tasks. Using different ways to combine protein and ligand encodings did not show a significant change in performance.


Supporting Information Available
Figure S1:
Modules for studying the impact of 1D protein encodings and ligand graphs. The neural network layers in both modules are shown along with their input and output channel sizes, i.e., (input channel size, output channel size). The GCN module and prediction module architecture are the same as in Figure S2. 1D protein representations (input channel size 1785 for KLIFS and 1280 for ESM) and ligand graphs are passed through the CNN module and GCN module, respectively. For the proteins, features are first extracted by three 1D CNN layers with a stride of 1 and increasing kernel sizes [4, 8, 12]. The output of the 1D CNN layers is then passed through the pooling layer, and the pooled output is passed through two fully connected (FC) layers. The flattened 128-dimensional feature vector is given as input to the first FC layer (FC 1) in the CNN module, which has 1024 neurons and outputs a 1 × 1024 feature vector. A dropout layer with a dropout rate of 0.2 is added after the FC layer. The output of the FC 1 layer is passed to the FC 2 layer with 128 neurons. We obtain 128-dimensional protein and ligand encodings, which are then combined to form a 256-dimensional embedding. We implement the 1D CNN layers with torch.nn.Conv1d(), the pooling layer with torch.nn.AdaptiveMaxPool1d(), the FC and output layers with torch.nn.Linear(), and the dropout layer with the torch.nn.Dropout() class.

Figure S2:
Graph convolutional network (GCN) and binding affinity prediction modules for studying the impact of protein and ligand graphs. The neural network layers in both modules are shown along with their input and output channel sizes, i.e., (input channel size, output channel size). Protein and ligand graphs are passed through the GCN module, where features are first extracted by 3 GCN layers and then passed through the pooling layer to obtain a single column vector of dimension g (g is 216 and 312 for protein and ligand graphs, respectively). These pooled vectors are then passed through two fully connected (FC) layers. The flattened feature vector of dimension g is given as input to the first FC layer (FC 1), which has 1024 neurons and outputs a 1 × 1024 feature vector. A dropout layer with a dropout rate of 0.2 is added after the FC layer to avoid overfitting and co-adaptation issues. The output of the FC 1 layer is passed to the FC 2 layer with 128 neurons. We obtain 128-dimensional protein and ligand encodings, which are then combined to form a 256-dimensional embedding. We pass the 256-dimensional embedding into the prediction module with two FC layers and one output layer. The FC 3 and FC 4 layers have 1024 and 512 neurons, respectively, and a dropout layer is added after each of them. The output layer is similar to an FC layer with one neuron; it takes the 512-dimensional embedding and predicts the binding affinity score. We implemented the GCN layers with torch_geometric.nn.GCNConv(), the pooling layer with torch_geometric.nn.global_mean_pool(), the FC and output layers with torch.nn.Linear(), and the dropout layer with the torch.nn.Dropout() class. For studying the element-wise product, we take an element-wise product of the 128-dimensional protein and ligand encodings to obtain a 128-dimensional vector, which is fed to the FC 3 layer instead of the 256-dimensional vector used in the case of concatenation. Similarly, for the combined concatenation and element-wise product vector, the FC 3 layer receives an input vector of 384 dimensions.
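As a concrete illustration of the three ways the protein and ligand encodings are combined before the prediction module, here is a minimal plain-Python sketch (dummy values, no deep learning framework; the actual models operate on PyTorch tensors):

```python
# Sketch of the three combination schemes for the 128-d protein and
# 128-d ligand encodings. Values are dummy numbers for illustration.

def concatenate(p, l):
    """Concatenation: yields a 256-dimensional input for the FC 3 layer."""
    return p + l

def elementwise_product(p, l):
    """Element-wise (Hadamard) product: yields a 128-dimensional input."""
    return [pi * li for pi, li in zip(p, l)]

def concat_plus_product(p, l):
    """Concatenation plus element-wise product: 384 dimensions in total."""
    return concatenate(p, l) + elementwise_product(p, l)

protein = [0.1] * 128  # stand-in for the 128-d protein encoding
ligand = [0.2] * 128   # stand-in for the 128-d ligand encoding

print(len(concatenate(protein, ligand)))          # 256
print(len(elementwise_product(protein, ligand)))  # 128
print(len(concat_plus_product(protein, ligand)))  # 384
```

The FC 3 layer's input channel size simply tracks which of these three vectors is fed in (256, 128, or 384).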

Node features in a protein graph
A Position-Specific Scoring Matrix (PSSM) provides the per-residue evolution patterns in the sequence profile [1]. By first counting the instances of each residue at each position, a position frequency matrix M_PFM is generated:

M_PFM(k, j) = Σ_{i=1}^{Z} I(S_{i,j} = k),

where S is a set of Z aligned sequences for a protein sequence of length L_p, k belongs to the set of residue symbols, i = 1, ..., Z, j = 1, ..., L_p, and I(x) is an indicator function that is 1 when the condition x is satisfied and 0 otherwise. A position probability matrix M_PPM is then computed from the M_PFM matrix as

M_PPM(k, j) = (M_PFM(k, j) + c) / (Z + c),

where c is a pseudocount that is empirically set to 0.8, similar to [2], to avoid matrix entries with a value of 0 [3]. The M_PPM matrix is then utilized to compute the 21 PSSM features.
For computing PSSM features, we need an aligned protein sequence in PSICOV [4] format.
We used the aligned protein sequences in PSICOV format provided by Jiang et al. [2]. Table S1 below contains information about the rest of the node features.
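The counting and smoothing steps can be sketched in plain Python on a toy alignment. This is an illustration only; the pseudocount normalization shown here (c added to each count, Z + c in the denominator) is one common convention, and the paper's exact form may differ:

```python
# Toy sketch of PFM/PPM construction from a set of aligned sequences.

def position_frequency_matrix(alignment, alphabet):
    """Count occurrences of each residue symbol at each alignment column."""
    length = len(alignment[0])
    return {
        k: [sum(seq[j] == k for seq in alignment) for j in range(length)]
        for k in alphabet
    }

def position_probability_matrix(pfm, num_sequences, c=0.8):
    """Smooth the raw counts with pseudocount c to avoid zero entries."""
    return {
        k: [(count + c) / (num_sequences + c) for count in counts]
        for k, counts in pfm.items()
    }

alignment = ["ACDA", "ACDC", "AADA"]  # Z = 3 toy sequences, L_p = 4
alphabet = "ACD"                      # reduced alphabet for the example

pfm = position_frequency_matrix(alignment, alphabet)
ppm = position_probability_matrix(pfm, len(alignment))

print(pfm["A"])               # [3, 1, 0, 2]
print(round(ppm["A"][0], 3))  # (3 + 0.8) / (3 + 0.8) = 1.0
```

With the pseudocount, even residues never observed at a position receive a small nonzero probability (here 0.8 / 3.8 ≈ 0.21), which keeps downstream log-odds computations finite.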

Ligand randomizations
In the point randomization process, we enumerate the presence of certain atoms (such as Cl, F, Br, and (=O)) within the string and selectively modify up to four atoms. This can involve substituting one halogen atom with another or removing a (=O) atom. In cases where none of the enumerated atoms exist, a Cl atom is appended at the beginning of the SMILES string. The appending of chlorine was influenced by its prevalent incorporation in drug-like molecules due to its effects on lipophilicity, electronic distribution, and steric hindrance, which impact binding affinity and pharmacokinetics [5]. While chlorine's capability to expand its octet is noteworthy [5], its frequent representation in medicinal chemistry primarily drove its selection. This variation helps us ascertain whether small changes in ligand structure can influence binding affinity prediction and identify whether the model accurately captures these structural changes.

Contact map evaluation
The Matthews Correlation Coefficient (MCC) [7] is a widely used metric to assess the quality of binary classifiers, including those used in protein contact prediction. In this context, MCC measures the agreement between predicted and true contact maps, which are binary matrices indicating the presence or absence of contact between pairs of residues in a protein. MCC can be calculated from a confusion matrix that summarizes the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) obtained by comparing the predicted contact map to the true contact map:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (4)

MCC values range from −1 to 1, with 1 indicating perfect agreement, 0 indicating random prediction, and −1 indicating complete disagreement between predicted and true contact maps. One advantage of using MCC on imbalanced datasets is that it takes into account both true positives and true negatives, as well as false positives and false negatives, providing a balanced assessment of classifier performance. This is particularly important when the positive and negative classes are not balanced in the dataset, as metrics such as accuracy can then be misleading. In the context of contact map prediction, the true contacts (positive class) are typically much rarer than the non-contacts (negative class), resulting in an imbalanced dataset.

Figure S3:
Figure S3: Mean squared error (MSE) curves on validation data during the training of DL models using various protein and ligand encodings. A, C: MSE curves for protein encodings on the KIBA and Davis datasets. B, D: MSE curves for ligand encodings on the KIBA and Davis datasets.
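As a concrete illustration, the MCC of Eq. (4) can be computed from two binary contact maps in a few lines of plain Python (a sketch for clarity, not the evaluation code used in the study):

```python
from math import sqrt

def mcc(true_map, pred_map):
    """Matthews Correlation Coefficient between two binary contact maps,
    given as flat lists of 0/1 entries (flatten nested maps before calling)."""
    tp = sum(1 for t, p in zip(true_map, pred_map) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true_map, pred_map) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true_map, pred_map) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_map, pred_map) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally MCC is defined as 0 when any confusion-matrix margin is 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0

true_contacts = [1, 1, 0, 0, 1, 0, 0, 0]  # sparse positives, as in real maps
predicted     = [1, 0, 0, 0, 1, 0, 0, 1]

print(round(mcc(true_contacts, true_contacts), 3))  # 1.0 (perfect agreement)
print(round(mcc(true_contacts, predicted), 3))      # 0.467
```

Note how a naive accuracy on the second example would be 6/8 = 0.75 despite the classifier missing a third of the rare true contacts, which is exactly the imbalance problem MCC guards against.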

Figure S4 :
Figure S4: Visual illustration of contact maps obtained from various kinases present in the KIBA and Davis datasets, compared to the contact maps obtained from PDB structures of the kinases. True contacts, false contacts, and lost contacts are highlighted.

Figure S5 :
Figure S5: Correlation between binding affinity predictions given by the graph-DL model with different protein encoding methods. A, B, and C: KIBA; D, E, and F: Davis. The protein encoding based on a random contact map is also strongly correlated with the Pconsc4 method. These scatter plots show that the BA predictions are not impacted by the protein encodings obtained from the various contact map methods and behave the same way on both datasets.

Figure S6 :
Figure S6: Experimental vs. predicted binding affinities of DL model trained using four different 2D encodings generated from contact maps on A: KIBA and B: Davis datasets.

Figure S7 :
Figure S7: Euclidean distance between the ESM embeddings of proteins in the Davis and KIBA datasets. These embeddings capture 95% of the variance among the kinases in the two datasets, and the heatmap shows that the Euclidean distances between the ESM embeddings of the kinases are considerable, indicating that the kinases are well separated in Euclidean space. A: ESM embeddings from 188 KIBA proteins. B: ESM embeddings from 334 Davis proteins.

Figure S8 :
Figure S8: Experimental vs. predicted binding affinities of DL models trained using various perturbations of ligand graph encodings on A: KIBA and B: Davis datasets.

Figure S9 :
Figure S9: Comparison of contact maps obtained from different X-ray structures of A: AAK1 and B: TYK2 kinases. For AAK1, we used 5LQ4 as the reference PDB structure and compared it with the 5TE0 and 4WSQ structures. Similarly, for TYK2, we used 3LXP as the reference PDB structure and compared it with the 4GIH and 4GVJ structures. We also report MCC values comparing each PDB structure with the reference.

Table S1:
Residue node features in a protein graph

Algorithm: Point Randomization of Ligands
Input: SMILES string S_i
Initialization: Get the counts of Cl, F, Br, C, and (=O) atoms from the input S_i and set them to #Cl, #F, #Br, #C, and #(=O), respectively.
N ← #Cl + #F + #Br + #(=O);
if N > 0 and N ≤ 4 then
    Randomly select n, the number of changes to make, from the range 1 to N;
    while n ≠ 0 do
        Randomly select an atom from Cl, F, Br, and (=O) that is present in S_i;
        if a halogen (e.g., Cl) is selected then
            Replace the selected halogen in S_i with one of the other halogens (F or Br), chosen at random, to obtain an updated SMILES string S_i;
            #Cl ← #Cl − 1; /* update the halogen count */
            n ← n − 1;
            continue
        end
        if (=O) is selected then
            Remove one (=O) from S_i to obtain an updated SMILES string S_i;
            #(=O) ← #(=O) − 1;
            n ← n − 1;
        end
    end
end
Output: updated SMILES string S_o

KIBA score [6]: The kinase inhibitor bioactivity (KIBA) score is a continuous binding affinity value developed by Tang et al. [6], integrating the biological activity of kinase inhibitors from K_i, IC_50, and K_d into a single bioactivity score [6]. A lower KIBA score denotes higher binding affinity. The KIBA score can be defined based on K_d or K_i, or their average, where H_i and H_d are the parameters that determine the weights of IC_50 in the model-based adjustments for K_i and K_d.
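The point-randomization procedure described above can be sketched in a few lines of Python. This is a simplified illustration operating on the raw SMILES text (it makes a single change and does not validate chemistry; a real implementation would likely sanitize the result with a toolkit such as RDKit, and the study's actual code may differ):

```python
import random

HALOGENS = ["Cl", "F", "Br"]

def point_randomize(smiles, rng=random):
    """Simplified sketch of one point-randomization step: swap one halogen
    for another, remove one (=O) group, or prepend Cl if no target atoms
    are present in the SMILES string."""
    present = [h for h in HALOGENS if h in smiles]
    if "(=O)" in smiles:
        present.append("(=O)")
    if not present:
        # No enumerated atoms: append a chlorine at the beginning.
        return "Cl" + smiles
    target = rng.choice(present)
    if target == "(=O)":
        return smiles.replace("(=O)", "", 1)
    replacement = rng.choice([h for h in HALOGENS if h != target])
    return smiles.replace(target, replacement, 1)

rng = random.Random(0)
print(point_randomize("CCN", rng))       # no target atoms -> "ClCCN"
print(point_randomize("CC(=O)Cl", rng))  # swaps the Cl or drops the (=O)
```

The key property being probed is that each perturbed SMILES differs from the original by only a small, chemically plausible change, so large swings in the predicted affinity would indicate the model is sensitive to ligand structure.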

Table S2 :
Overview of various experiments performed in our study