Drug Target Prediction Using Graph Representation Learning via Substructures Contrast

The prediction of drug–target interactions has always been a key task in the field of drug redirection. However, traditional methods of predicting drug–target interactions are either mediocre or rely heavily on data stacking. In this work, we merged heterogeneous graph information and obtained effective node information and substructure information based on mutual information in graph embeddings. We then learned high-quality representations for downstream tasks and proposed an end-to-end auto-encoder model to complete the task of link prediction. Experimental results show that our method outperforms several state-of-the-art models. The model achieves an area under the receiver operating characteristic curve (AUROC) of 0.959 and an area under the precision–recall curve (AUPR) of 0.848. We found that the mutual information between the substructure and graph-level representations contributes most to the mutual information index in a relatively sparse network, whereas the mutual information between the node-level and graph-level representations contributes most in a relatively dense network.


Introduction
The prediction of drug–target interactions (DTIs) is a key task in the field of drug redirection [1]. Since biochemical experimental assays are extremely costly and time-consuming, efficient methods for identifying new DTIs are essential and valuable. Two main methods for DTI prediction have been studied: molecular docking and machine learning. Molecular docking technology is widely used due to its reasonable accuracy. However, the performance of molecular docking is limited in large-scale simulations because of its computational cost [2]. Compared with traditional molecular docking technology, machine learning methods can test drug and protein data at a large scale in a relatively short time. Several computing strategies have been introduced into machine learning methods to obtain high-quality embeddings for predicting DTIs.
With the progress of deep learning, researchers have been able to develop deep neural network models. Deep learning methods are also widely used in feature mapping [3], classification tasks [4], and disease prediction [5]. Moreover, differentiable representation learning methods can be applied directly to low-level representations to enable interpretable DTI predictions. In particular, graph neural models can effectively combine feature information and structural information to obtain low-dimensional embeddings [6]. Such methods include the random-walk method [7] and the Graph Convolutional Network (GCN) model [8]. Ordinary graph embedding tends to smooth the entire input graph, so substructural information about parts of the input graph is ignored. However, there have been few studies on preserving substructure information in Graph Representation Learning (GRL). A substructure is a set of subgraphs represented by a subset of vertices and edges, which is often able to express the unique semantics and fundamental expressions of the graph. More precisely, neighbor nodes in the graph (such as first-order neighbor nodes) are usually trained to obtain similar embedding representations [9]. However, nodes that are far apart in the graph do not obtain similar representations, even if they are structurally similar. Preserving substructure information can effectively prevent this situation.
Substructures in graphs are generally used to solve three types of problems. The first is accelerating large-scale graph training; Cluster-GCN [10] is an example. The core idea of Cluster-GCN is to apply a clustering algorithm to divide a large graph into multiple clusters, following the principle of fewer connections between clusters and more connections within clusters. This simple method effectively reduces the consumption of memory and computing resources while achieving good prediction accuracy. The second is self-supervised learning. SUBG-CON [11], a self-supervised representation learning method based on subgraph contrast, exploits the strong correlation between a central graph node and its sampled subgraphs to capture regional structure information. The third is denoising in a network. For instance, suppose there are only three nodes around a node; the substructure embedding will select the most representative neighbor node, which eliminates unnecessary confusion from neighbor nodes. We combine substructure embedding with mutual information, i.e., adversarial learning, and apply it to DTI prediction to obtain more accurate embeddings. To a certain extent, the application of substructures can improve the embedding effect of a sparse graph network.
In this paper, we propose an end-to-end network model, called GraphMS, that predicts DTIs from low-level representations. Specifically, the inputs of the model are three heterogeneous matrices, two homogeneous matrices, and two feature matrices. As shown in Figure 1, we propose to guarantee the reliability of the node-level representation by maximizing the mutual information between the node-level and graph-level representations to guide the encoding step. We then propose to preserve substructure information in the graph-level representation by maximizing the mutual information between the graph-level and substructure representations. The high-quality embedded information learned by the model is useful for downstream tasks. Finally, combining interpretable feature embeddings learned from heterogeneous information, we use an auto-encoder model to achieve the task of link prediction.
To summarize, our major contributions include:

1. We apply the substructure embedding to DTI prediction and remove certain noise in the graph network. The subgraph contrast strengthens the correlation between the graph-level representation and the subgraph representation to capture substructure information;
2. We maximize the mutual information between the node-level representation and the graph-level representation. This allows the graph-level representation to contain more information about the nodes themselves and to concentrate more on the representative nodes in the embedded representation;
3. Case studies and comparison experiments also show that our model is effective.

DTI Prediction
In recent years, drug-protein targeting prediction has been widely investigated. The molecular docking method, which takes the 3D structure of a given drug molecule and the target as input, is widely used to predict binding patterns and scores. Although molecular docking can provide visual interpretability, it is time-consuming and is limited by the need to obtain the 3D structure of protein targets [12]. Much effort has been devoted to developing machine learning methods for computational DTI prediction. Wan et al. [13] extracted hidden characteristics of drugs and targets by integrating heterogeneous information and neighbor information. Faulon et al. [14] applied an SVM model to predict DTIs, based on chemical structure and enzyme reactions. Bleakley et al. [2] developed an SVM framework for predicting DTIs, based on a bipartite local model, named BLM. Mei et al. [15] extended this framework by combining BLM with a neighbor-based interaction-profile-inferring (NII) procedure (named BLM-NII), which is able to learn DTI features from neighbors.
As the amount of data on drugs and protein targets has increased, algorithms from the field of deep learning have been used to predict DTIs. Wen et al. [16] developed a deep belief network model, whose input is the fingerprint of the drug and the composition of the protein sequence. Chan et al. [17] used a stacked auto-encoder for representation learning, and developed other machine learning methods to predict DTIs.
Recently, GRL has also been applied as an advanced method for identifying potential DTIs. The purpose of GRL is to encode the structural information into low-dimensional vectors and then quantify the graph. Gao et al. [18] and Duvenaud et al. [19] proposed graph convolutional networks with attention mechanisms to model chemical structures and demonstrated good interpretability. Che et al. [20] developed Att-GCN to predict drugs for both ordinary diseases and COVID-19.
Our work solves the problem of retaining the substructure information of a graph. We also obtain an explanatory DTI prediction from low-dimensional representations.

Graph Neural Networks
In recent years, graph embedding has become a hot research issue in network data analysis and application. DeepWalk [21] was the first network embedding method to use representation learning (deep learning) technology. DeepWalk treats nodes as words and generates short random walks. Random-walk paths are used as sentences to bridge the gap between network embedding and word embedding. Node2Vec [22] is an extension of DeepWalk. It introduces a biased random-walk procedure that combines BFS-style and DFS-style neighborhood exploration. LINE [23] generates context nodes with a breadth-first search strategy; only nodes that are at most two hops away from a given node are considered neighbors. In addition, compared with the hierarchical softmax used in DeepWalk, it uses negative sampling to optimize the Skip-gram model. GCN can capture the global information of a graph and thus represent node features well. However, GCN requires all nodes to participate in training to obtain node embeddings. When a graph has many nodes and a complicated structure, the training cost is very high, and it is difficult to adapt quickly to changes in the graph structure. Graph Attention Network (GAT) [24] uses the attention mechanism to perform a weighted summation over neighbor nodes.
In traditional graph embedding learning, nodes that are adjacent in the input graph are trained to have similar embedded representations. Although these methods keep neighboring nodes close in the embedding space, they still suffer from some limitations. Most notably, they place too much emphasis on proximity similarity, making it difficult to capture the inherent structural information of the graph.

Problem Formulation
GraphMS predicts unknown DTIs through a heterogeneous graph associated with drugs and targets.

Definition 1. (Heterogeneous graph)
A heterogeneous graph is defined as a directed or undirected graph G = (V, E) with a node type set N and a relation type set R, where each node v ∈ V and each edge e ∈ E. Each node v in the node set V belongs to an object type in N, and each edge e in the relation set E belongs to a type in the relation type set R. The node type set N includes four types: drugs, targets, side effects, and diseases. The relation type set R includes protein-disease interaction, drug-protein interaction, drug-disease interaction, drug-drug interaction, protein-protein interaction, drug-structure similarity, and protein-sequence similarity.
In our current model, each node only belongs to a single type and all edges are undirected and non-negatively weighted.

Multiview Information Fusion of Heterogeneous Graph
The edges of the relation types protein-disease interaction, drug-protein interaction, and drug-disease interaction are converted into heterogeneous matrices. The edges of the relation types drug-drug interaction and protein-protein interaction are converted into homogeneous matrices. The edges of the relation types drug-structure similarity and protein-sequence similarity are converted into feature matrices. In particular, although the drug-protein and drug-disease matrices are heterogeneous matrices, they can also be used as part of the feature matrices.
Assume that the feature matrices are represented by X = [X_1, X_2, X_3, ..., X_n], where n is the number of matrices and X_i is the i-th matrix. For the drug representation, the drug adjacency matrix (i.e., the homogeneous matrix) is first added to the identity matrix, and then Laplace decomposition is used to obtain the network matrix of the drug. The protein representation vector is processed through the same steps.
The network matrix is computed as

Ã = D^(-1/2) Â D^(-1/2),

where Â = A + I, I is the identity matrix, and D is the degree matrix with D_ii = ∑_j Â_ij. Following the structure of the Graph Convolutional Network, we apply one GCN layer to encode the nodes in our graph. The node representation of the drug view is expressed as

h = σ(Ã X W_1),

where X is the feature matrix of the drug, W_1 is the trainable weight matrix, Ã is the drug network matrix obtained by Laplace decomposition, σ is a nonlinear activation function, and h is the node representation of the drug view. After that, we sum the resulting node representations of each data-matrix mapping.
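The encoding step above can be sketched as a minimal NumPy example. The ReLU activation and the toy matrices are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization as in the text: A_hat = A + I,
    D_ii = sum_j A_hat_ij, A_tilde = D^(-1/2) A_hat D^(-1/2)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A, X, W):
    """One GCN layer: h = sigma(A_tilde X W); ReLU is an assumed sigma."""
    return np.maximum(0.0, normalize_adj(A) @ X @ W)

# toy example: 3 drugs, 4 input features, embedding size 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.random.rand(3, 4)
W = np.random.rand(4, 2)
h = gcn_layer(A, X, W)   # node-level representations, shape (3, 2)
```

In a multi-view setting, one such layer would be applied per feature matrix X_i and the resulting node representations summed.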

Mutual Information Between Node-Level and Graph-Level Representation
Although the graph is compressed and effectively quantified, the learned node representation should be consistent with the graph-level representation that contains global information. Relevance can be quantified by correlation; thus, the learned node representation should be highly relevant to the graph-level representation. This prompts us to use mutual information as a measure to quantify the relationship between two random variables [25]. High mutual information corresponds to a significant reduction in uncertainty, whereas zero mutual information means that the variables are independent [26].
To ensure the reliability of the node representation, we use mutual information to measure the correlation between the node-level and graph-level representations. Taking the drug representation vector as an example, we calculate the graph-level global representation of the drug view through the aggregation function

s = σ(h_1, h_2, ..., h_n),

where h_i is the i-th row vector of the node representation h, n is the number of row vectors, σ is the max-pooling function, and s is the graph-level global representation.
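The aggregation function described above (element-wise max pooling over the node row vectors) can be sketched as follows; the toy matrix h is an assumption for illustration:

```python
import numpy as np

def readout(h):
    """Graph-level representation s: element-wise max pooling over the
    n row vectors h_i (the aggregation function sigma in the text)."""
    return h.max(axis=0)

# toy node representations: 3 nodes, 2 embedding dimensions
h = np.array([[0.1, 0.9],
              [0.5, 0.2],
              [0.3, 0.4]])
s = readout(h)   # element-wise max over rows: [0.5, 0.9]
```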
We utilize a corruption function to scramble the feature matrix X of the drug by shuffling along the row-wise dimension and generate negative examples, i.e., X → X̃. Similarly, we encode the corrupted feature matrix X̃ according to the above formula. We then apply a bilinear function as the discriminator, namely

D(h_i, s) = σ(h_i^T W_2 s),

where D is the discriminator function, W_2 is the trainable weight matrix, σ is the sigmoid function, h_i^T is the transpose of the node-level representation, and s is the graph-level global representation.
We calculate the cross-entropy loss between the node-level representation h and the graph-level global representation s:

L_node = -(1/(N + M)) ( ∑_{i=1}^{N} log D(h_i, s) + ∑_{j=1}^{M} log(1 - D(h̃_j, s)) ),

where N is the number of positive pairs, M is the number of negative pairs, h_i is the node representation of a positive example pair, and h̃_j is the node representation of a negative example pair. In the process of optimizing this loss function, the mutual information between the node-level representation of the drug view and the graph-level representation is captured.
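The corruption-and-discrimination procedure can be sketched in NumPy. The random toy data stand in for the encodings of the original and row-shuffled feature matrices; the exact loss normalization is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(h, s, W):
    """Bilinear discriminator D(h_i, s) = sigmoid(h_i^T W s), applied row-wise."""
    return sigmoid(h @ W @ s)

def node_graph_mi_loss(h_pos, h_neg, s, W):
    """Cross-entropy over positive pairs (h_i, s) and negative pairs
    (h_j, s), where h_neg encodes the row-shuffled (corrupted) features."""
    eps = 1e-9
    pos = discriminator(h_pos, s, W)   # shape (N,)
    neg = discriminator(h_neg, s, W)   # shape (M,)
    n = len(pos) + len(neg)
    return -(np.log(pos + eps).sum() + np.log(1.0 - neg + eps).sum()) / n

rng = np.random.default_rng(0)
h_pos = rng.random((5, 2))     # stand-in node representations (positives)
h_neg = rng.random((5, 2))     # stand-in encodings of shuffled features (negatives)
s = h_pos.max(axis=0)          # graph-level representation via max pooling
W = rng.random((2, 2))         # trainable discriminator weights W_2
loss = node_graph_mi_loss(h_pos, h_neg, s, W)
```

Minimizing this loss maximizes a lower bound on the mutual information between node-level and graph-level representations, as in Deep Graph Infomax-style training.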

Mutual Information Between Graph-Level Representation and Substructure Representation
A substructure, as a subset of the graph structure, is uniquely representative in distinguishing graphs. Therefore, to quantify the common goal of the graph, the representation of the substructure should be highly relevant to the graph-level representation. That is, maximizing the correlation between the graph-level and substructure representations helps to retain substructure information.
We use the Metis [27] algorithm to extract k subgraphs of the drug-drug relationship matrix. The purpose is to partition the vertices of the graph so that there are more links within subgraphs than between subgraphs. This better captures the clustering structure of the graph and can effectively focus on the sparse structure. Intuitively, each node and its neighbors are usually located in the same subgraph; even after a few hops, the neighbor nodes remain in the same subgraph with high probability. For the k-th subgraph, we obtain the graph-level representation s and generate the substructure representation from the nodes belonging to the substructure.
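The partitioning idea can be illustrated with a toy stand-in. The BFS-based grower below is not the real multilevel Metis algorithm; it only illustrates growing subgraphs so that neighboring nodes tend to land in the same part:

```python
from collections import deque

def bfs_partition(adj, k):
    """Toy stand-in for Metis: grow k parts by breadth-first search from
    seed nodes, so that a node and its neighbors usually share a part.
    (Real Metis is a multilevel k-way partitioner minimizing edge cuts.)"""
    n = len(adj)
    part = [-1] * n
    size = n // k + 1           # rough target size per part
    label = 0
    for seed in range(n):
        if part[seed] != -1:
            continue
        q, count = deque([seed]), 0
        while q and count < size:
            v = q.popleft()
            if part[v] != -1:
                continue
            part[v] = label
            count += 1
            q.extend(u for u in adj[v] if part[u] == -1)
        label = min(label + 1, k - 1)
    return part

# path graph of 6 nodes split into 2 parts
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
parts = bfs_partition(adj, 2)   # contiguous halves: [0, 0, 0, 0, 1, 1]
```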
For the k-th subgraph, we take (s, g_i) as a positive sample and (s, g_j) as a negative sample, where g_j is a subgraph representation whose nodes are randomly selected from other graphs. In detail, for the k-th subgraph representation, our corruption function shuffles the other nodes and selects k nodes randomly. In this way, the graph-level representation becomes closely related to its subgraph representation, while its correlation with random negative subgraphs is weaker.
A neural network is used to maximize the mutual information between the graph-level representation s and the substructure representation g, to ensure a high correlation. Using cross-entropy to calculate the loss function, the mutual information between the graph-level representation and the substructure representation of the drug view is captured by

L_sub = -(1/(N + M)) ( ∑_{i=1}^{N} log D(g_i, s) + ∑_{j=1}^{M} log(1 - D(g_j, s)) ),

where g_i is the substructure representation of the k-th subgraph in a positive example pair, and g_j is the substructure representation of the k-th subgraph in a negative example pair.
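The substructure contrast can be sketched as follows. The mean readout for the subgraph representation and the toy data are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def substructure_mi_loss(g_pos, g_neg, s, W):
    """Contrast the graph-level representation s against the true
    substructure representation g_pos (positive pair) and a corrupted
    one g_neg built from randomly selected nodes (negative pair)."""
    eps = 1e-9
    d_pos = sigmoid(g_pos @ W @ s)
    d_neg = sigmoid(g_neg @ W @ s)
    return -(np.log(d_pos + eps) + np.log(1.0 - d_neg + eps)) / 2.0

rng = np.random.default_rng(1)
h = rng.random((6, 2))                 # node representations
s = h.max(axis=0)                      # graph-level representation
g_pos = h[:3].mean(axis=0)             # readout over one subgraph's nodes
g_neg = h[rng.permutation(6)[:3]].mean(axis=0)   # random nodes as corruption
W = rng.random((2, 2))                 # trainable bilinear weights
loss = substructure_mi_loss(g_pos, g_neg, s, W)
```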

Automatic Decoder for Prediction
According to the final learned drug-embedding and protein-embedding representations, the drug-protein relationship matrix is reconstructed by performing inverse matrix decomposition. We obtain the predicted drug-protein matrix, which is compared with the known drug-protein relationship matrix. We then integrate the reconstruction loss of the predicted drug-protein matrix with the cross-entropy loss functions of the mutual information terms, perform a gradient update, and optimize the combined loss function. Reconstructing the final drug-protein matrix, we obtain

M = σ((U W_3)(V W_4)^T),

where G is the original matrix, U is the final learned drug-embedding representation, V is the final learned protein-embedding representation, W_3 and W_4 are the learnable weight matrices, and M is the final predicted drug-protein matrix.
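The reconstruction step can be sketched in NumPy. The bilinear form σ((U W3)(V W4)^T) and the binary cross-entropy reconstruction loss are assumptions consistent with, but not confirmed by, the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(U, V, W3, W4):
    """Predicted drug-protein matrix M from drug embeddings U and
    protein embeddings V with learnable weights W3, W4 (assumed form)."""
    return sigmoid((U @ W3) @ (V @ W4).T)

def reconstruction_loss(G, M):
    """Binary cross-entropy between the known matrix G and prediction M."""
    eps = 1e-9
    return -np.mean(G * np.log(M + eps) + (1 - G) * np.log(1 - M + eps))

rng = np.random.default_rng(2)
U = rng.random((4, 2))     # 4 drugs, embedding size 2
V = rng.random((3, 2))     # 3 proteins, embedding size 2
W3 = rng.random((2, 2))
W4 = rng.random((2, 2))
M = reconstruct(U, V, W3, W4)                 # predicted matrix, shape (4, 3)
G = (rng.random((4, 3)) > 0.5).astype(float)  # toy known interaction matrix
loss = reconstruction_loss(G, M)
```

In training, this reconstruction loss would be summed with the two mutual-information losses before the gradient update.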

Datasets
We used datasets that were compiled in previous studies. These datasets include five individual drug-related and protein-related networks: drug-protein interaction, drug-drug interaction, protein-protein interaction, and drug-disease and protein-disease association networks. We also used two feature networks: the drug-structure similarity network and the protein-sequence similarity network.

Experimental Settings
We applied the Adam gradient descent optimization algorithm, with the initial learning rate set to 0.001, to train the parameters. During training, the parameters were initialized randomly from a uniform distribution U ∼ (−0.08, 0.08). We trained the model for 10 epochs, where each epoch contained 100 steps. The model used a 10-fold cross-validation procedure after each epoch, and the best performance was reported. Performance was measured by the area under the Receiver Operating Characteristic (ROC) curve and the area under the Precision-Recall (PR) curve.

Baselines
We compared our model with four baseline methods. Two of them, NeoDTI and DTINet, are DTI prediction methods; the other two, LightGCN and GAT, are graph embedding methods.
• NeoDTI [28] integrates neighborhood information constructed from different data sources through a large number of information transmission and aggregation operations.
• DTINet [29] aggregates information from heterogeneous data sources and can tolerate large amounts of noise and incompleteness by learning low-dimensional vector representations of drugs and proteins.
• LightGCN [30] simplifies the design of GCN to make it more concise. This model contains only the most essential part of GCN, neighborhood aggregation, for collaborative filtering.
• GAT [24] uses the attention mechanism to weight and sum the features of neighboring nodes. The weight of a neighboring node's features depends entirely on the node features and is independent of the graph structure. In essence, GAT replaces the fixed normalization operation in GCN with a neighbor-feature aggregation function using attention weights.

Comparative Experiment
In the tables, 1:10 means that the ratio of positive to negative samples was 1:10, and 1:all means that all unknown drug-target interaction pairs were considered; in that case, the ratio of positive to negative samples was around 1:500, so the whole network is sparse compared with the 1:10 setting. Single-view means that only part of the information was used; multi-view means that all the information was used. The results are shown in Table 1 and Table 2. The AUPR and AUROC metrics were used to evaluate the performance of the above prediction methods. Among the existing methods, LightGCN showed the best performance. Our model improved AUROC by nearly 3% and AUPR by nearly 5% over NeoDTI. In highly skewed datasets, AUPR is usually more informative than AUROC. Since drug discovery is usually a needle-in-a-haystack problem, the high AUPR score further demonstrates the superior performance of GraphMS compared with other methods.

Multi-view
The experimental results show that the multi-view model is better than the single-view model. This is because the multi-view model integrates more feature information of drug targets; higher-quality features are extracted through the framework, providing a good basis for the subsequent reconstruction of the matrix. The prediction results based on node vectors extracted by the graph convolutional network are better than those of an ordinary auto-encoder, which also indicates that the graph convolutional network has stronger feature expression ability for non-Euclidean data. Compared with GAT, the metrics of LightGCN change little, which suggests that plain graph embedding smooths the input graph. The metrics of our model perform well. On the one hand, the addition of mutual information allows the model to consider the strong correlation between the graph-level representation and its subgraph representation. On the other hand, the subgraph embedding can eliminate certain noise in the network. We visualized the embeddings learned by Graph-M and Graph-S to see whether Graph-S could capture the structure of the interaction network better than Graph-M. Graph-M uses only the mutual information between node-level and graph-level representations, whereas Graph-S uses only the mutual information between substructure and graph-level representations. The learned embeddings are visualized in Figure 2. We observe that there are potentially linked targets near some relatively marginal drug points in the embedding space learned by Graph-S, while the distribution of drug targets in the embedding space learned by Graph-M is more concentrated. We further corroborated this with the AUROC and AUPR metrics.

Ablation experiment
Our method uses both the mutual information between the substructure and graph-level representations and the mutual information between the node-level and graph-level representations. It can be observed that the mutual information between the substructure and graph-level representations contributes more to the total mutual information in a relatively sparse network. In a relatively dense network, the node-level and graph-level representations contribute more to the mutual information (see Figure 3).

Case Study for Interpretability
The network visualization of the top 30 novel DTIs predicted by GraphMS can be found in Figure 4. Ethoxzolamide (EZA) interacts with only two types of drugs, and EZA has a high link probability with DNA methyltransferase 1 (DNMT1). EZA is an FDA-approved diuretic and a human carbonic anhydrase inhibitor. After consulting the relevant medical literature, we found that EZA has the potential to treat duodenal ulcers and may be developed into a new anti-Helicobacter drug [31]. Chronic inflammation is closely related to various human diseases, such as cancer, neurodegenerative diseases, and metabolic diseases [32]. In these diseases, abnormal DNA methylation occurs to some extent, and the enzymatic activity of DNMTs increases. This also suggests a nonzero probability of a link between EZA and DNMT1.

Discussion
This paper merges heterogeneous graph information and obtains effective node information and substructure information based on mutual information in the heterogeneous graph. We apply the subgraph embedding to DTI prediction and remove certain noise in the graph network. We then present an end-to-end auto-encoder model to predict drug-target interactions. The overall experimental evaluation shows that the method is superior to all baselines and performs better in sparse networks, which is essential for drug discovery. In the ablation experiment, the substructure representation is more important in a relatively sparse network, where it can eliminate some unnecessary noise in the network. In addition, our model outputs the top 30 DTI pairs, and we have shown through a case study that our approach can explain the nature of predicted interactions from a biological perspective.
Our work provides solutions for drug redirection. At the same time, it can help medical staff propose new drug candidates for protein targets corresponding to some special diseases. Our work also has some limitations. Because of the large number of training parameters, all nodes of the entire graph participate in the calculation when GCN embeddings are used to compute the graph-level representation, which leads to a long training time for the whole model. Therefore, in future work, we will draw on graph network acceleration algorithms such as Cluster-GCN and newer deep learning approaches such as meta-learning to improve the computational efficiency of our model.

Patents
Part of the work in this manuscript has been applied for China's national invention patents, and has passed the preliminary examination. The patent application number is 202011275141.6.