Gene based message passing for drug repurposing

Summary
The medicinal effect of a drug acts through a series of genes, and the pathological mechanism of a disease is also related to genes with certain biological functions. However, the complex information between a drug or disease and a series of genes is neglected by traditional message passing methods. In this study, we propose a new framework that uses two different strategies for the gene-drug/disease network and the drug-disease network, respectively. We employ a long short-term memory (LSTM) network to extract the flow of messages from a series of genes (a gene path) to a drug or disease. After incorporating the resulting gene path information into the drug-disease network, we use a graph convolutional network (GCN) to predict drug-disease associations. Experimental results show that our method, GeneDR (gene-based drug repurposing), makes better use of the information in gene paths and performs better than the compared methods in predicting drug-disease associations.


INTRODUCTION
Drug discovery is time-consuming, costly, and laborious. Discovering a new drug normally takes 13-15 years and costs more than a billion dollars on average from development to clinical use. 1 Computational methods to identify drug-disease associations have attracted increasing attention in the pharmaceutical industry. In silico drug repurposing can identify new indications for existing approved drugs and suggest drug candidates for wet lab validation. Drug repurposing can narrow down the search space for the existing drugs and is thus an efficient and promising strategy for traditional drug discovery and development.
With the rapid development of deep learning, neural networks have been applied to drug repurposing, that is, to predicting relations between drugs and diseases. Initially, feature-based methods were widely used, which focus on feature extraction by combining multiple biological data related to drugs or diseases, such as DeepDR. 2 These data can be organized as a complex network. Feature extraction methods generally translate the data into vector representations, whereas the topology of the network is usually neglected. In recent years, graph neural networks (GNNs) have frequently been applied to predict drug-disease relations, with each drug or disease modeled as a node. However, the semantic information between drugs and diseases is rather complicated and cannot be fully represented by a simple two-layer heterogeneous network. Some previous studies incorporated gene information into the drug-disease network and applied graph convolutional network (GCN)-based models to perform drug-disease link prediction with moderate success. For instance, Yu et al. 3 and Coskun et al. 4 improved GCN-based drug-disease link prediction by incorporating drug-gene and disease-gene relations to calculate embeddings for drugs and diseases. Li et al. 5 and Meng et al. 6 introduced similarity information to enhance link prediction. Long et al. 7 proposed a framework based on pre-training graph neural networks, named PT-GNN, to integrate gene relation data for link prediction in biomedical networks. PT-GNN uses a GCN-based encoder to refine node features by modeling direct dependencies among nodes in the network. Xuan et al. 8 proposed GFPred, a method based on a graph convolutional auto-encoder and a fully connected auto-encoder with an attention mechanism. GFPred uses a graph convolutional auto-encoder module to calculate topology representations by integrating gene nodes into drug-disease heterogeneous networks.
These GCN-based models adopt the message-passing mechanism to learn node representations that capture both node features and graph topology information. The representation of a node is updated by its direct neighbors in one iteration. As a result, a k-layer GCN model captures the information of the local graph containing the k-hop neighbors of the central node. The pharmacological mechanism of a drug or a disease involves a series of gene nodes, which form gene paths in a heterogeneous graph. The biological functions of a gene path are critical for drug-disease link prediction and also help to interpret prediction results. GCN-based models use multiple layers to aggregate information from distant nodes. However, too many layers may make node representations hard to distinguish from one another (i.e., oversmoothing). Some recent studies have made efforts to capture path information. Flam-Shepherd et al. 9 proposed a graph neural network that uses path embeddings to learn the local substructure of the graph. They concatenated the node and edge representations in a path to form the path embedding. Kawichai et al. 10 constructed a network based on disease, drug, and gene ontology information, and designed meta-paths to calculate representations of drug-disease pairs. Zhou et al. 11 proposed a meta-path-based computational method called NEDD to predict novel associations between drugs and diseases from heterogeneous information, using meta-paths of different lengths to explicitly capture direct relationships or high-order proximity. Instead of paths, subgraph extraction is another strategy for focusing on the local topology of nodes. CoSMIG 12 extracted subgraphs by random walk and improved the message passing method by incorporating edges into node updating.
Besides, hypergraph construction is another strategy to capture high-order information. Feng et al. 13 transformed the graph into a hypergraph by designing hyperedges that connect multiple nodes. This structure allows message passing between sets of nodes connected by hyperedges even though these nodes are not directly connected in the graph. Pang et al. 14 proposed a drug-disease association prediction method that extracts high-order drug-disease association information on a hypergraph using a hypergraph neural network (HGNN). As mentioned above, the pharmacological mechanism of a drug involves a series of genes, since a drug is metabolized by binding to proteins, which are gene products. The bound proteins subsequently affect their related proteins through biological processes. In our heterogeneous graph, we simplified these interactions as edges between gene nodes and drug nodes. The pharmacological mechanism is thus represented as several paths from a drug node to a series of gene nodes. The same goes for the pathological mechanism of a disease. Therefore, gene paths represent the biological functions of their connected drug or disease, which contributes substantially to drug-disease link prediction. Although some previous works have incorporated topology information or paths into node updating, this becomes problematic for longer paths due to over-smoothing.
To tackle this, we propose a framework, GeneDR (Gene-based Drug Repurposing), that performs message passing along these biologically functional series of genes to a drug or disease. In our framework, as shown in Figure 1, message passing along the gene paths to drug/disease nodes is performed by a Long Short-Term Memory (LSTM) network. LSTM is a special kind of recurrent neural network capable of handling long-term dependencies. Subsequently, the resulting gene path information is incorporated into the drug-disease network, and GCN-based message passing is used to predict drug-disease links. Our framework allows drug/disease nodes to aggregate information along gene paths. Experimental results showed that our method performed better than the compared methods in drug-disease link prediction.

Experiment settings
We performed 5-fold cross-validation on two drug-disease (DD) datasets, the statistics of which are shown in Table 1. Drug-disease pairs in the drug-disease dataset were regarded as positive samples, while drug-disease pairs not in the dataset were randomly chosen as negative samples. The ratio of positive to negative samples is 1:1. The maximal length of a gene path was set to 4, and we extracted at most 100 paths for each drug/disease node during one iteration. The hidden size in the LSTM and GCN was set to 128, and the number of GCN layers was 3. The learning rate was 0.001. All code and data are available on GitHub (https://github.com/Wang-yxing/GeneDR).
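The sampling scheme above can be sketched in a few lines of Python. This is a minimal illustration rather than the released GeneDR code: the function names `sample_negatives` and `k_fold_indices`, the fixed seed, and the toy identifiers are assumptions made for the example.

```python
import random


def sample_negatives(positives, drugs, diseases, seed=0):
    """Draw as many unobserved (drug, disease) pairs as there are positives (1:1 ratio)."""
    rng = random.Random(seed)
    positive_set = set(positives)
    if len(drugs) * len(diseases) - len(positive_set) < len(positive_set):
        raise ValueError("not enough unobserved pairs for a 1:1 ratio")
    negatives = set()
    while len(negatives) < len(positive_set):
        pair = (rng.choice(drugs), rng.choice(diseases))
        if pair not in positive_set:  # keep only pairs absent from the dataset
            negatives.add(pair)
    return sorted(negatives)


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


# Toy usage with hypothetical identifiers.
positives = [("d1", "x1"), ("d2", "x2")]
negatives = sample_negatives(positives, ["d1", "d2", "d3"], ["x1", "x2"])
folds = k_fold_indices(len(positives) + len(negatives), k=2)
```

Each fold in turn would serve as the test split while the remaining folds are used for training.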

Comparison results
We compared our proposed GeneDR with several state-of-the-art methods for link prediction on two datasets. Among them, LAGCN and NIMCGCN are GCN-based methods that integrate multiple additional data (e.g., entity similarity networks) as node features. HINGRL utilizes drug structure and disease semantic information as additional features of drug and disease nodes, and calculates topology features after performing random walk on the drug-protein-disease heterogeneous graph. DRWBNCF focuses on integrating the neighborhood interactions of drugs and diseases. It uses localized information in the similarity network and the drug-disease association network. REDDA collected 5 types of entities and 9 types of networks to construct a large heterogeneous network. It uses a topological subnet embedding block to learn node representations. These methods utilize different default data in addition to the link prediction data. To optimize the performance of these methods, we used their default data in our experiment. Note that the comparison was based on the same drug-disease associations. As shown in Table 2, GeneDR performed the best. The result indicates that GeneDR makes better use of gene information.

Ablation study
We also conducted ablation studies to investigate the factors that influence performance, as shown in Table 3. We designed two variants of GeneDR: GeneDR without GMP (w/o GMP) performs message passing as in Figure 2B; GeneDR without LSTM (w/o LSTM) uses GCN instead of LSTM to aggregate gene messages along the path. GeneDR w/o LSTM performed better than GeneDR w/o GMP, suggesting that separating message passing from genes to drugs or diseases from message passing between drugs and diseases contributes to drug-disease link prediction. Both variants were inferior to GeneDR, indicating that LSTM-based message passing makes better use of gene path information, probably by simulating the flow of messages along the gene path.

Case study
To demonstrate the practical ability of GeneDR to identify drug-disease interactions, we conducted case studies supported by literature evidence (see Table 4 for some examples; the full list of predicted drug-disease interactions and the related gene paths is provided on GitHub). Interestingly, we found that some predicted drug-disease links represent not a therapy but a side effect. 6][17] These results suggest that our framework can predict related drugs and diseases but cannot distinguish between therapeutic relations and side-effect relations, which motivates us to take the up- or downregulation between genes in a gene path into consideration in future work.

Conclusion
We propose a new framework to perform message passing along the gene paths to their connected drugs or diseases. Thus, the gene information of paths is aggregated to update the embeddings of the drugs and diseases, which is demonstrated to contribute to the link prediction between drug and disease. Furthermore, we believe that our identified gene paths of the drug and disease will be useful to explain the predicted drug-disease link.

Limitations of the study
As mentioned in Results, we did not introduce relation types between genes in gene paths. Relation types, such as upregulation and downregulation, are very important when distinguishing the specific relation between a drug and a disease. For example, a disease and a drug are probably related when they are associated with the same genes, but the up- or downregulation between them and those genes determines whether the disease is treated by the drug or is a side effect of the drug. In this work, we focus only on whether a relation exists between a drug and a disease, not on the type of that relation. Gene relation types are worth considering in our future work.

STAR★METHODS
Detailed methods are provided in the online version of this paper and include the following:

Gene path
Gene paths for each drug and disease are extracted from the DGD dataset by random walk. 38 In each path, the start node is a drug or disease and the remaining nodes are genes. The length of paths is set to 4, and we extracted at most 100 paths for each drug/disease.
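The extraction step can be illustrated with a short random-walk sketch. It is written under stated assumptions rather than taken from the authors' implementation: `extract_gene_paths` and `gene_neighbors` are hypothetical names, the length limit is interpreted as the number of gene nodes after the start node, and revisiting a node within one path is disallowed.

```python
import random


def extract_gene_paths(start, gene_neighbors, max_len=4, max_paths=100, seed=0):
    """Random-walk from a drug/disease node through gene nodes only.

    gene_neighbors maps a node to the list of gene nodes adjacent to it.
    Returns up to max_paths distinct paths as tuples (start, g1, g2, ...).
    """
    rng = random.Random(seed)
    paths = set()
    # Bound the number of attempts so that sparse neighborhoods still terminate.
    for _ in range(max_paths * 10):
        if len(paths) >= max_paths:
            break
        path = [start]
        while len(path) - 1 < max_len:
            candidates = [g for g in gene_neighbors.get(path[-1], []) if g not in path]
            if not candidates:
                break
            path.append(rng.choice(candidates))
        if len(path) > 1:
            paths.add(tuple(path))
    return sorted(paths)


# Toy graph with hypothetical node names.
toy_graph = {"drugA": ["g1", "g2"], "g1": ["g2", "g3"], "g2": ["g3"], "g3": []}
toy_paths = extract_gene_paths("drugA", toy_graph)
```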

Problem definition
In a graph $G = (V, E)$, $V$ is the set of nodes containing genes $V_g$, diseases $V_d$ and drugs $V_r$, while $E$ is the set of edges among nodes. $P$ denotes the entire set of gene paths, and $P_i$ denotes the set of gene paths that start with a disease or drug node $i$ followed by a series of gene nodes, where $P_i \subset P$ and $i \in \{V_d, V_r\}$.

Traditional message passing
In the traditional message passing method, a node embedding is updated by its directly connected neighbors during each iteration (Equation 1), where $h_i^{(l)}$ is the embedding of node $i$ in the $l$-th layer, $N_i$ is the set of direct neighbors of node $i$, and $h_j$ is the embedding of a direct neighbor. $m_i^{(l)}$ is the message aggregated from the neighbors, which is used to update the node embedding.
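As a concrete illustration of this update, the sketch below runs one round of message passing with mean aggregation and a simple averaging update. Both choices, as well as the name `message_passing_step`, are assumptions made for the example; a GCN layer would use a learned, normalized transformation instead.

```python
def message_passing_step(embeddings, neighbors):
    """One iteration: aggregate neighbor embeddings, then update each node."""
    updated = {}
    for node, h in embeddings.items():
        nbrs = neighbors.get(node, [])
        if not nbrs:
            updated[node] = list(h)  # isolated node keeps its embedding
            continue
        dim = len(h)
        # m_i: mean of the embeddings of the direct neighbors N_i.
        message = [sum(embeddings[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        # h_i: average of the previous embedding and the aggregated message.
        updated[node] = [(h[d] + message[d]) / 2 for d in range(dim)]
    return updated


emb = {"disease": [0.0, 0.0], "drug1": [2.0, 0.0], "drug2": [0.0, 4.0]}
nbr = {"disease": ["drug1", "drug2"], "drug1": ["disease"], "drug2": ["disease"]}
step1 = message_passing_step(emb, nbr)
step2 = message_passing_step(step1, nbr)
```

After the second step, the embedding of `drug1` has a nonzero second component that originated from `drug2`, which shows how k iterations reach k-hop neighbors.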
Figures 2A and 2B show node embedding in traditional message passing under two circumstances. Figure 2A illustrates how the embedding of a drug/disease node is updated by the surrounding drug/disease nodes during alternate iterations in a drug-disease bigraph. Taking the central disease node in Figure 2A as an example, the information of the surrounding drug nodes is aggregated into the central disease node embedding in the first iteration, and in the next iteration, message passing spreads out to more distant nodes. The aggregated nodes are homogeneous at each iteration in the bigraph, which accords with the mechanism of the traditional message passing method. However, the message passing process becomes problematic when gene nodes are added into the graph. As shown in Figure 2B, the central disease node is surrounded by genes and drugs. When using traditional message passing methods, the messages from gene nodes and drug nodes are aggregated together in one iteration. Besides, the gene nodes in a path are separated by several iterations, so their information is not fully used.

Gene based path message passing
Taking gene paths into consideration, we revised the message passing method (Figure 2C). Our proposed message passing framework contains two parts: one is gene-based message passing, which integrates node information along gene paths; the other is drug-disease message passing, which is the same as in Figure 2A. Gene messages are aggregated as below (Equation 3), where $m_i^{(l)}$ is the message aggregated from the set of paths $P_i$ connected with node $i$, and $a_k$ is the trainable weight of the path $p_k$ among the paths in $P_i$, $p_k \in P_i$.
$H^{(l-1)}$ is the node embedding matrix from the last layer. We employ LSTM to perform message passing along the path. The hidden state of the terminal node in the path is regarded as the message vector aggregating all information of this path. In a path from genes to a disease or drug, the hidden state of the drug or disease node can capture the information of all genes in the path. Since each drug or disease is generally connected with more than one path, we introduce path weights that act as an attention mechanism to integrate the connected paths and to distinguish their respective importance.
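One way to write the aggregation described in this section is given below. It is a sketch assembled from the definitions in the text, not necessarily the exact form of the published equations; the notation $\mathrm{LSTM}(H^{(l-1)}, p_k)$ for the final hidden state obtained by running the LSTM over the embeddings of the nodes in path $p_k$ is an assumption.

```latex
m_i^{(l)} = \sum_{p_k \in P_i} a_k \, \mathrm{LSTM}\!\left(H^{(l-1)},\, p_k\right)
```

Here the sum runs over all gene paths connected to the drug or disease node $i$, and each path contributes its terminal hidden state scaled by its trainable weight $a_k$.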

The architecture of GeneDR
As shown in Algorithm 1 and Figure 1, the initial node embeddings, $H_{\mathrm{TransE}}$, are obtained by training TransE on the DGD dataset, which can integrate global information for every node. Gene paths for each drug and disease are also extracted from the DGD dataset by random walk. Gene-based message passing (GMP) is then performed along the paths to the connected drug and disease nodes through LSTM. The resulting embeddings, $H_{\mathrm{GMP}}$, are used to initialize the drug and disease nodes in the drug-disease bigraph.
The information in the drug-disease bigraph is aggregated by GCN-based message passing and is output as $H^{1}_{\mathrm{GCN}}$. To make better use of both sources of information, i.e., the gene paths and the drug-disease bigraph, $H_{\mathrm{GCN}}$ is fed back into the LSTM-based layer and the workflow is run again. Eventually, after two rounds, the drug and disease embeddings from the final $H_{\mathrm{GCN}}$ are concatenated and input into a fully connected layer to output the final link prediction.
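The two-round control flow can be summarized with the following sketch. The callables `gmp` and `gcn` are stand-ins for the LSTM-based and GCN-based message passing modules, and `run_rounds` and `pair_features` are hypothetical names; only the order of operations is meant to reflect the description above.

```python
def run_rounds(h_init, gmp, gcn, rounds=2):
    """Alternate gene-path message passing and bigraph message passing."""
    h = h_init
    for _ in range(rounds):
        h_gmp = gmp(h)  # stand-in for LSTM-based message passing along gene paths
        h = gcn(h_gmp)  # stand-in for GCN-based message passing on the bigraph
    return h


def pair_features(h, drug, disease):
    """Concatenate drug and disease embeddings as input to the final classifier."""
    return list(h[drug]) + list(h[disease])


# Toy stand-ins so that the control flow can be executed end to end.
toy_gmp = lambda h: {k: [x + 1 for x in v] for k, v in h.items()}
toy_gcn = lambda h: {k: [x * 2 for x in v] for k, v in h.items()}
final = run_rounds({"drug": [0.0], "disease": [1.0]}, toy_gmp, toy_gcn)
features = pair_features(final, "drug", "disease")
```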

QUANTIFICATION AND STATISTICAL ANALYSIS
We performed 5-fold cross-validation on two DD datasets. Drug-disease pairs in the drug-disease dataset were regarded as positive samples, while drug-disease pairs not in the dataset were randomly chosen as negative samples. The ratio of positive to negative samples is 1:1. We evaluated model performance using common metrics, including AUPR (area under the precision-recall curve), AUC (area under the receiver operating characteristic curve), F1 score, and recall.
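For reference, three of these metrics can be computed with the standard library alone, as in the sketch below. The function names and the 0.5 decision threshold in the usage lines are assumptions for illustration; AUPR is omitted for brevity, and in practice a library such as scikit-learn would typically be used.

```python
def recall_f1(y_true, y_pred):
    """Recall and F1 score from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(auc(labels, scores), recall_f1(labels, preds))
```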

Figure 1. The architecture of GeneDR
LSTM-based message passing is performed on extracted gene paths to update the connected drugs and diseases. Subsequently, the updated embeddings are passed on to the drug-disease bigraph, in which GCN-based message passing is used to perform the drug-disease link prediction.

Figure 2. Comparison of message passing methods
The yellow, red and blue nodes represent drug, disease and gene, respectively. For convenience, we take disease as the central node for illustration. The black straight lines represent relations between drugs and diseases, while the blue curly lines represent relations between genes and diseases. (A) The traditional message passing on a drug-disease bigraph, where the node embedding is updated by the surrounding homogeneous nodes. (B) The traditional message passing on a drug-gene-disease heterogeneous graph, where the node embedding is updated by the surrounding heterogeneous nodes. (C) Message passing performed separately on the drug-disease bigraph (gray) and on the gene paths (blue).

Table 2. Comparison results for different methods

Table 1. The statistics of datasets
The left part is the original data from Datasets 1 and 2. The right part is the corresponding gene data that we collected from PharmKG and CTD.

STAR★METHODS outline: RESOURCE AVAILABILITY (Lead contact; Materials availability; Data and code availability); METHOD DETAILS (Data overview and data preprocessing; Problem definition; Traditional message passing; Gene based path message passing; The architecture of GeneDR); QUANTIFICATION AND STATISTICAL ANALYSIS

Table 3. Ablation experiment results
a Link prediction on the combination of the DD dataset and the DGD dataset without gene path extraction (GPE).
b Gene path message passing on GCN instead of LSTM.

Table 4. Some examples of the drug-disease prediction results and literature evidence

The initial node embeddings are obtained by training TransE 37 on the DGD datasets separately. TransE is a translation-based model, which represents relations as translations in the embedding space. The basic idea of TransE is to learn entity and relation embeddings for each triple under the condition that the head entity embedding plus the relation embedding approximately equals the tail entity embedding. Therefore, it can integrate global information for every node in the DGD dataset.
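The translation idea behind TransE can be made concrete with a small sketch. The function names, the Euclidean distance, and the margin value are assumptions chosen for illustration; the settings used to train the GeneDR embeddings are not specified here.

```python
import math


def transe_distance(head, relation, tail):
    """Euclidean distance between (head + relation) and tail; small means plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(d_positive, d_negative, margin=1.0):
    """Margin ranking loss: a true triple should be closer than a corrupted one."""
    return max(0.0, margin + d_positive - d_negative)


# Toy two-dimensional embeddings (hypothetical values).
d_true = transe_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
d_corrupt = transe_distance([1.0, 0.0], [0.0, 1.0], [4.0, 5.0])
loss = margin_loss(d_true, d_corrupt)
```

In the toy example the true triple satisfies head + relation = tail exactly, so its distance is zero and the loss vanishes once the corrupted triple is farther away than the margin.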
$\{V_d, V_r, V_g\}$.
Output: Drug-disease link prediction value $v(r, d)$ between drug $r \in V_r$ and disease $d \in V_d$.
Calculate the node embedding $H_{\mathrm{TransE}}$ from $G_2$ by using TransE.
Extract gene paths $P$ for each node in $V_1$ by performing random walk on $G_2$.
for each epoch do
    for round = 2 do
        $H_{\mathrm{GPEMP}} \leftarrow \mathrm{GMP}(H_{\mathrm{TransE}}, P)$ with Equation 4.
        $H_{\mathrm{GCN}} \leftarrow \mathrm{GCN}(H_{\mathrm{GPEMP}}, G_1)$ with Equation 2.