Drug-target binding affinity prediction method based on a deep graph neural network

Abstract: The development of new drugs is a long and costly process. Computer-aided drug design reduces development costs and computationally shortens the new-drug development cycle, and within it DTA (drug-target binding affinity) prediction is a key step for screening out potential drugs. With the development of deep learning, various types of deep learning models have achieved notable performance across a wide range of fields. However, most related studies focus on extracting the sequence features of molecules while ignoring valuable structural information: they employ sequence data that represent only the elemental composition of molecules, without considering the molecular graphs that encode structural information. In this paper, we use graph neural networks to predict DTA from graph representations of drugs and proteins, and we achieve competitive performance on two benchmark datasets, Davis and KIBA. In particular, an MSE of 0.227 and a CI of 0.895 were obtained on Davis, and an MSE of 0.127 and a CI of 0.903 were obtained on KIBA.


Introduction
With the rapid development of machine learning in recent years, artificial intelligence has been applied to many fields, often with valuable results [1][2][3]. Computer-aided drug design is one of these areas and is of great interest. The high cost and time-consuming nature of drug development make the research of new drugs extremely difficult [4]. Drug-target affinity (DTA) prediction is an important subtask that helps to shorten the time-consuming candidate-screening phase of drug development [5,6].
There are three main types of computational approaches for DTA prediction: molecular docking [7], traditional machine learning [8][9][10], and deep learning [11][12][13]. Molecular docking uses protein structure to explore the main binding modes of ligands when binding to proteins [14], but it requires crystallized protein structures that are difficult to obtain, which indirectly limits its final performance [15]. Traditional machine learning, on the other hand, trains models on one-dimensional sequence representations of drugs and proteins; however, these models represent drugs as strings. Such representations can reflect the atomic composition of molecules and the chemical bonds between atoms, but they cannot retain the structural information of molecules [16]. The structure of a molecule, in turn, determines its chemical properties, so discarding it may impair both the predictive power of the model and the functional relevance of the learned latent space. Deep learning models, which have been widely adopted in recent years, also perform well in the DTA prediction task, and learning molecular representations with deep models is gradually becoming more common because it can capture hidden information that is difficult to model from human experience.
One of the models that may be best suited to the DTA prediction task is the graph neural network (GNN). A GNN can directly process graph data, which preserves structural information, and this direction has already seen research progress. GraphDTA [17] introduces GNNs into the DTA prediction task by constructing a drug molecule graph and extracting drug features from the graph data, while proteins, as organic macromolecules, can still only have their features extracted with a CNN. DGraphDTA [18] builds graph data for protein molecules based on protein structure prediction, which allows both kinds of molecules to be represented as graphs and enables GNN feature extraction for both. However, it compares only two models, GCN [19] and GAT [20], and therefore cannot fully evaluate the performance of GNNs in the DTA prediction task.

Datasets
The training data generally consist of drug and protein molecules together with the corresponding drug-target affinity values. This research uses the Davis [21] and KIBA (Kinase Inhibitor BioActivity) datasets for the main experiments; the two datasets are typically used as benchmarks. The Davis dataset completely covers the binding affinities, measured as Kd values (kinase dissociation constants), between all 68 drugs and 442 targets it includes. The KIBA dataset integrates IC50, Ki (inhibition constant) and Kd (dissociation constant) measurements into a single bioactivity score, forming a bioactivity matrix of 2111 kinase inhibitors and 229 human kinases. The numbers of drug and protein entities in the two datasets are shown in Table 1.
The PDBBind [22,23] dataset is a comprehensive database of binding affinities for drug-target 3D structures in the Protein Data Bank (PDB), and it provides 3D coordinates of target proteins and ligand structures. We use the general set of PDBBind for training and its refined set as the test set. To keep training stable, we only select samples from the general set whose protein sequences are shorter than 1000 amino acids. The general set is divided into a training set and a validation set at a ratio of 4:1, giving 8391 training samples, 1680 validation samples and 3940 test samples. To reduce random error, we trained and tested the model for performance evaluation using the 5-fold training/test splits of the benchmark datasets, and we introduce various methods and metrics for comparison. Drug molecules are represented in the sequence format SMILES (Simplified Molecular Input Line Entry System) [24], a chemical notation language that uses element symbols for atoms and symbols for the chemical bonds between them to represent molecules.
However, because sequence data do not retain the structural information of molecules, performance may suffer if the original SMILES string is used directly; graph data, by comparison, can store richer structural features. Therefore, in this method, a molecular graph is constructed from the sequence to meet the requirements of the model. Specifically, the molecular graph is built from the drug SMILES string, with atoms as nodes and bonds as edges. To ensure that a node's own features are fully considered during graph convolution, self-loops are also added during graph construction to improve the feature representation of the drug molecules. Protein molecules are processed into graph data in a similar way; the raw data in the datasets are likewise sequence strings. However, the approach used for drugs does not transfer directly, because proteins are biological macromolecules: if atoms were used as nodes and chemical bonds as edges, the protein graphs would be too large for the model. Owing to progress in protein structure prediction, it is feasible to use predicted structural information to approximate the real 3D structure of proteins. In this paper, the Pconsc4 [25] method is used to predict the protein contact map [26], which serves as an approximation of the protein structure. This approach treats protein residues as nodes, and an edge is created when the Euclidean distance between the Cβ atoms (Cα atoms for glycine) of a residue pair is below a specified threshold [26]. Because the graph is constructed with residues as nodes, the node features are chosen around the residues, which exhibit different properties due to their R groups; these features include polarity, charge, aromaticity, etc. In this paper, we use the PSSM [27] to represent the features of the residue nodes. The selected node features of drug and protein molecules and the detailed preprocessing steps are the same as those in DGraphDTA [18].
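The distance-threshold edge rule described above can be illustrated in a few lines of numpy. This is our own sketch, not the paper's code: the function name and the 8 Å threshold are illustrative assumptions (the actual edges follow the Pconsc4-predicted contact map), and the input is an array of Cβ coordinates.

```python
import numpy as np

def contact_map_edges(coords, threshold=8.0):
    """Build residue-graph edges from C-beta coordinates.

    coords: (n_residues, 3) array of C-beta positions (C-alpha for glycine).
    An undirected edge (i, j) is created when the Euclidean distance between
    residues i and j is below `threshold`; self-loops are added so that each
    node's own features survive the aggregation, as in the drug graphs.
    """
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement
    dist = np.sqrt((diff ** 2).sum(-1))              # pairwise distances
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if dist[i, j] < threshold]
    edges += [(i, i) for i in range(n)]              # self-loops
    return edges
```

For example, three residues placed 5 Å apart along a line connect only consecutive residues (plus self-loops), since the end-to-end distance of 10 Å exceeds the threshold.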

Method
Almost every drug development study focuses on how to handle drug and target molecules, preprocessing the data with other algorithms into special forms that fit the chosen approach [28]. However, the models in these studies are relatively simple: DeepDTA uses only three CNN layers to extract molecular features, and DGraphDTA employs two GNN layers and two fully connected layers to extract node representations. To represent a large amount of knowledge and obtain high accuracy, a model with high expressiveness requires a larger training set; in this respect, a model with more parameters and higher complexity is advantageous. On the other hand, an overly complex model may be difficult to train and may lead to unnecessary resource consumption [29]. There is therefore a need to balance model complexity so that the model has high expressive power while avoiding unnecessary consumption.
CNNs and RNNs generally perform well on Euclidean data, but the structural information of molecular graphs cannot be expressed in Euclidean space, which is why CNN and RNN methods rarely achieve optimal results here. Therefore, most researchers apply graph neural networks to extract features from graph data.
Let G = (V, E) denote a simple, connected, undirected graph with n nodes and m edges. Let A be the adjacency matrix, D the diagonal degree matrix, and I_N an n-order identity matrix. Let Ã = A + I_N denote the adjacency matrix with self-loops added, with corresponding diagonal degree matrix D̃. Let X ∈ R^(n×d) denote the node feature matrix, where the ith row of the matrix is the d-dimensional feature vector of the ith node.
The GNN is a network designed to operate directly on graph data and exploit its structural information, and after years of development many variants are available. The most well-known is the GCN, which is widely used in graph data problems. Each GCN layer implements a convolution operation through Eq (1):

H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l)), (1)

where H^(l) is the node embedding at layer l with H^(0) = X, σ() is a nonlinear activation function, and W^(l) is a learnable parameter matrix. Qimai Li et al. [30] prove that graph convolution is essentially Laplacian smoothing [31]; however, repeated application of Laplacian smoothing may mix the features of vertices from different clusters and make them indistinguishable [32]. In the case of symmetric Laplacian smoothing, node features converge to values proportional to the square root of the vertex degree [33]. This is why a deeper GCN leads to a performance decrease. We find that models using GCNs commonly set the number of GNN layers to 2 or 3, and consequently these models cannot extract high-order neighborhood information. This choice is attributed to the observation that stacking more layers tends to degrade performance, a phenomenon called "oversmoothing" [30]: as the number of network layers and iterations increases, the hidden representations of all nodes within the same connected component converge to the same value, so the final node representations carry no actual meaning at deeper layers. Even residual connections, an effective technique widely used in deep CNNs, merely mitigate the oversmoothing problem in GCNs [19].
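The convergence claim above is easy to check numerically: repeatedly applying the symmetric smoothing operator from Eq (1), with the weights and nonlinearity stripped away, drives every node's feature toward a value proportional to the square root of its degree. A minimal sketch on a toy 4-node path graph (the graph and initial features are our own illustrative choices):

```python
import numpy as np

# A small connected graph: a 4-node path 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)                    # add self-loops
d = A_tilde.sum(axis=1)                    # degrees with self-loops: [2, 3, 3, 2]
S = A_tilde / np.sqrt(np.outer(d, d))      # D^-1/2 (A + I) D^-1/2

H = np.array([[1.0], [0.0], [0.0], [0.0]])  # arbitrary initial node features
for _ in range(100):                        # 100 smoothing steps ("layers")
    H = S @ H

# After many propagations the features collapse: H_i / sqrt(d_i) is
# (numerically) the same constant for every node, so the per-node
# representations no longer distinguish the nodes.
ratio = H.flatten() / np.sqrt(d)
```

Because all four nodes lie in one connected component, the spread of `ratio` shrinks geometrically with depth, which is exactly the oversmoothing behavior that caps practical GCNs at two or three layers.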
The purpose of the GNN here is to extract a graph embedding of the entire molecular graph. However, if only two or three GCN layers are used, each node can only perceive information from its 2-hop or 3-hop neighbors, which is insufficient when information from the whole graph is needed to construct the graph embedding [34]. Thus, a few layers of GCN operations can only be regarded as local information aggregation, and the resulting graph embedding is not accurate enough to represent the graph.
Graph Diffusion Convolution (GDC) [35] replaces the multilayer convolution operation of the GCN with graph diffusion, where the diffusion process is given by the generalized diffusion matrix

S = Σ_{k=0}^{∞} θ_k T^k, (2)

with weighting coefficients θ_k and transition matrix T. The choice of θ_k and T must ensure that Eq (2) converges. One of the main choices of weighting coefficients follows Personalized PageRank [36]:

θ_k = α_PPR (1 − α_PPR)^k, (3)

where α_PPR is the teleport probability; the smaller α_PPR is, the farther information is diffused. Simple Graph Convolution (SGC) [37] argues that the complexity GCNs inherit from neural networks is burdensome and unnecessary for less demanding applications. Thus, SGC reduces this complexity by removing the nonlinearities between GCN layers and collapsing the resulting function into a single linear transformation. Experiments show that the resulting model is comparable to GCNs while having higher computational efficiency and fewer fitted parameters. Simple spectral graph convolution (S2GC) [38] further analyzes the Markov Diffusion Kernel [39] and improves the GCN to obtain the following iterative function:

Ŷ = softmax( (1/K) Σ_{k=1}^{K} ((1 − α) T̃_sym^k X + α X) Θ ), (4)

where T̃_sym = D̃^(-1/2) Ã D̃^(-1/2) is the symmetric transition matrix with self-loops, the parameter α ∈ [0, 1] balances the self-information of each node against its consecutive neighborhoods, and K is the receptive-field size of the model. For example, if K = 4, each node updates its feature representation using information from neighbors up to 4 hops away. This iterative formula was designed by S2GC for the node classification task, in which the GNN extracts node features, fully connected layers then fit the output channels, and the last layer uses softmax as a classifier. Therefore, when S2GC is applied in this paper, the softmax layer is removed and only the part inside the softmax function is retained.
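With the softmax and the learnable weights removed as just described, the S2GC feature propagation reduces to a single precomputable operation per graph. A numpy sketch of that reading (the function name is ours, and the toy scale stands in for the real molecular graphs):

```python
import numpy as np

def s2gc_features(A, X, K=4, alpha=0.05):
    """S2GC propagation without the classifier:
    (1/K) * sum_{k=1..K} ((1-alpha) * T^k X + alpha * X),
    where T = D^-1/2 (A + I) D^-1/2 is the symmetric transition matrix
    with self-loops."""
    n = A.shape[0]
    A_hat = A + np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    T = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    TkX = X.copy()
    out = np.zeros_like(X)
    for _ in range(K):
        TkX = T @ TkX                       # T^k X for k = 1..K
        out += (1 - alpha) * TkX + alpha * X
    return out / K
```

Note the two limiting cases: with alpha = 1 the output is just X (pure self-information), and with K = 1, alpha = 0 it is a single smoothing step T @ X.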
Figure 1 illustrates the flow of the process used in this paper, which consists of four main parts: the raw molecular sequence data, the processed drug molecule graph and the predicted protein contact map, feature extraction using graph models, and finally DTA prediction.

Experiment setting
In this paper, we replace the GCN layers of the DGraphDTA model with other graph neural networks and evaluate the impact of the GNN variant on the final results. Limited by computational resources, we first compare the parameter counts of various networks and identify some lower-complexity algorithms for the subsequent experiments. The algorithms are built on the PyTorch [40] library, and PyTorch Geometric (PyG) [41] provides portable implementations of a variety of graph algorithms to choose from. The first step is to compare the parameter counts of the candidate graph models, since the amount of data is large and the models should be kept from becoming too complex. Assume the input and output dimensions are m and n, respectively, and let k = m*n; the results of a partial comparison for a single-layer network are shown in Table 2. Table 2. Number of parameters of the single-layer model.

Model | Number of parameters
GCN [19] | k + n
GAT [20] | k + 3n
GraphConv [42] | 2k + n
SAGE [43] | 2k + n
SGC [37] | k + n

From Table 2, we can see that these models differ little in complexity. Other models are far more complex; for example, Molecular Fingerprints (MFConv) [44] has a parameter count of up to 20 times k, and using this kind of algorithm would clearly reduce the training efficiency of the model. Like S2GC, SGC is not stacked into the model when comparing the performance of the various algorithms. As mentioned before, S2GC builds its transition matrix from the adjacency matrix of the graph, and processing this transition matrix consumes considerable time when applied to the DTA prediction task. Therefore, in this experiment we extract the features of drug molecules using S2GC, while the SAGE algorithm is used for feature extraction of protein molecules.
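The counts in Table 2 follow directly from each operator's linear transformations. As small helper functions (our own rendering, assuming a single layer with bias, a single attention head for GAT, and k = m*n as defined above):

```python
# Single-layer parameter counts, with input dim m, output dim n, k = m * n.
def gcn_params(m, n):        # one weight matrix + bias
    return m * n + n

def gat_params(m, n):        # weight matrix + two attention vectors + bias
    return m * n + 3 * n

def graphconv_params(m, n):  # separate self and neighbor transforms + bias
    return 2 * m * n + n

def sage_params(m, n):       # separate self and neighbor transforms + bias
    return 2 * m * n + n

def sgc_params(m, n):        # one linear layer (the propagation is parameter-free)
    return m * n + n
```

For instance, at the protein dimensions used later (m = 78, n = 144), SAGE roughly doubles the weight count of GCN, yet all five operators stay within a small constant factor of k.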
Training the model requires setting several parameters and tuning details of the model. One such detail is the proportion of drug features to protein features entering the final prediction layers, which adjusts the influence of each molecule on the final result. Since drug and protein molecules are encoded with 54- and 78-dimensional features, respectively, and their feature-dimension ratio is approximately 2:3, the ratio of the feature dimensions produced by the graph neural networks for the two molecules is also set to about 2:3. In addition, in the S2GC model we set K, the number of propagation iterations, to 4 and α to 0.05. Since S2GC requires processing the adjacency and transition matrices of each graph, and DTA prediction involves a large number of graphs, model training is time-consuming and cannot be run over many different parameter settings; the various parameters therefore had to be set based on experience. The main hyperparameters used in the model are as follows: a learning rate of 0.001, 2000 training iterations, feature dimensions of 54 and 78 for drug and protein molecules, respectively, and corresponding output dimensions of 112 (16*7) and 144 (16*9) for the two types of molecules in the feature extraction phase, where 112:144 approximates 54:78.
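Collected in one place, the training configuration stated above looks as follows; the key names are our own shorthand, while the values are those reported in the text:

```python
# Hyperparameters reported in the text; key names are illustrative only.
HYPERPARAMS = {
    "learning_rate": 0.001,
    "iterations": 2000,
    "drug_feature_dim": 54,      # per-atom input features
    "protein_feature_dim": 78,   # per-residue input features
    "drug_embed_dim": 112,       # 16 * 7; 112:144 approximates 54:78
    "protein_embed_dim": 144,    # 16 * 9
    "s2gc_K": 4,                 # propagation steps (receptive-field size)
    "s2gc_alpha": 0.05,          # self-information weight
}
```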

Metrics
The same metrics as in the benchmark are implemented to calculate the concordance index (CI) [45] and the mean squared error (MSE) [46]. The CI is essentially an estimate of the probability that the predicted value will be consistent with the actual value; it measures whether the predicted affinity values for two random drug-target pairs are predicted in the same order as the true values and is calculated by Eq (5).
CI = (1/Z) Σ_{α_x > α_y} h(b_x − b_y), (5)

where b_x is the prediction for the larger affinity α_x, b_y is the prediction for the smaller affinity α_y, Z is a normalization constant, and h(x) is the step function

h(x) = 1 if x > 0; 0.5 if x = 0; 0 if x < 0. (6)

MSE measures the difference between predictions and labels:

MSE = (1/n) Σ_{i=1}^{n} (p_i − y_i)^2, (7)

where p_i is the predicted value and y_i is the corresponding actual label; a smaller MSE means the model's predictions are closer to the true values. In addition, another metric, the Pearson correlation coefficient, is used in some articles for performance comparison and is calculated by Eq (8):

Pearson = cov(p, y) / (σ(p) σ(y)), (8)

where cov is the covariance between the predicted values p and the true values y, and σ() is the standard deviation.
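These metrics translate directly into code. A straightforward numpy version of Eqs (5)-(8), written by us for illustration (the CI loop is O(n^2), which is adequate for evaluation-sized test sets):

```python
import numpy as np

def concordance_index(y, p):
    """CI: fraction of comparable pairs (y_i > y_j) whose predictions are
    correctly ordered; prediction ties count as 0.5, per the step function."""
    total, correct = 0, 0.0
    n = len(y)
    for i in range(n):
        for j in range(n):
            if y[i] > y[j]:              # comparable pair; total acts as Z
                total += 1
                if p[i] > p[j]:
                    correct += 1.0
                elif p[i] == p[j]:
                    correct += 0.5
    return correct / total

def mse(y, p):
    y, p = np.asarray(y, float), np.asarray(p, float)
    return float(np.mean((p - y) ** 2))

def pearson(y, p):
    y, p = np.asarray(y, float), np.asarray(p, float)
    # np.cov uses ddof=1, so ddof=1 is passed to np.std for consistency.
    return float(np.cov(p, y)[0, 1] / (np.std(p, ddof=1) * np.std(y, ddof=1)))
```

A perfectly ordered prediction yields CI = 1.0, a perfectly reversed one yields 0.0, and any positive affine transform of the labels gives Pearson = 1.0, which is why CI and Pearson reward ranking and linear agreement while MSE rewards exact values.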

Experimental results
In this study, we introduced S2GC and SAGE to extract drug and protein features, respectively, followed by several fully connected layers to make predictions. Figures 2 and 3 plot the predicted values against the true labels for the samples in the two datasets, together with a straight line of slope 1. The closer the sample points are to this line, the closer the predicted DTA is to the true value, indicating better model performance. The distribution of sample points on these two plots shows that our method predicts DTA values accurately, with the vast majority of predictions differing from the actual labels by 1 or less. We conducted several experiments to predict DTA and report the performance of the models on two metrics, MSE and CI. Tables 3 and 4 show the MSE and CI values on the independent test sets of the two benchmark datasets, respectively. On the Davis dataset, our model has the best performance on both the MSE and CI metrics. GraphDTA, another model based on graph neural networks, also performs well, obtaining an MSE of 0.229, second only to the method in this paper. The remaining models extract features mainly from sequence data; they are weaker than the graph models because they do not handle structural information well.
For the KIBA dataset, our model performs slightly worse than Affinity2Vec, obtaining only the second-best result. However, our model still obtained an MSE of 0.127 and a CI of 0.903, not far from Affinity2Vec's MSE of 0.124 and CI of 0.91. We argue that the main reason is the nature of the dataset: although KIBA is large, it does not contain all binding values, and known DTA values account for only 24.4% of the affinity matrix. Training on this dataset is a form of semi-supervised learning and is better suited to Affinity2Vec's seq2seq [47] and ProtVec [48], which are two unsupervised, data-driven models. The gap may also be due to the more complex method of Affinity2Vec: for the model, it uses a multilayer GRU [49] and ProtVec to extract sequence features and XGBoost [50] for the final DTA prediction; for the features, it applies not only embedding features but also drug-target meta-path score features [51] and their hybrid. This suits the large data volume of the KIBA dataset, which provides enough data for a smooth fit during Affinity2Vec training. In other words, the model complexity of Affinity2Vec is better matched to a large dataset such as KIBA, which also explains why Affinity2Vec is not optimal on the Davis dataset, i.e., the amount of data in Davis is insufficient. On the PDBBind dataset, the method in this paper achieves 1.231, 0.756 and 0.683 on the MSE, CI and Pearson metrics, respectively, an improvement over several other methods; the detailed comparison is shown in Table 5.
As the benchmark model for this paper, DGraphDTA obtains good results on the Davis and KIBA datasets, achieving 0.904, 0.202, and 0.867 for the CI, MSE, and Pearson metrics on Davis, and 0.904, 0.126, and 0.903 on KIBA, respectively. However, DGraphDTA uses only three GCN layers for feature extraction and cannot simply keep increasing the layer count, since stacking more GCN layers leads to the oversmoothing problem and may degrade model performance. The S2GC algorithm used in this paper can be set up to K = 16, which is equivalent in receptive field to 16 GCN layers. However, due to limited computational resources, we could only run experiments with K = 4, whose receptive field differs little from the three-layer GCN of DGraphDTA; this is why the performance of our method on some metrics falls slightly below DGraphDTA rather than rising above it.
In general, GNN-based models commonly outperform other models at similar model complexity, and models that rely only on sequence information indeed struggle to match graph models in feature extraction. Although other methods can stack models and features to obtain better results, graph models are certainly a more suitable class of architectures for the DTA prediction task. In addition, the performance of the same model may vary across datasets: given the variation in molecular size and in DTA coverage between the Davis and KIBA datasets, different model architectures can subsequently be chosen per dataset to match data volume with model complexity.

Conclusions
Predicting the strength of drug-target binding affinity is more informative and more challenging than merely classifying drug-target interactions. DTA prediction builds on molecular graph data for drugs and proteins, and for such graph data it is more appropriate to use graph neural networks than other neural network models to extract features. To address the oversmoothing problem that the widely used GCN suffers from as the model deepens, we compared several networks designed to alleviate oversmoothing, from which we selected S2GC, a model that performs well in the node classification task, to extract molecular features. We also found that the graph neural network-based model outperformed the other algorithms at similar model complexity, so we can continue along this line in subsequent work to explore the performance of graph neural networks in the DTA prediction task.