DEMLP: DeepWalk Embedding in MLP for miRNA-Disease Association Prediction

miRNAs significantly affect many biological processes involved in human disease. Biological experiments are costly in both money and time. Given this expense and difficulty, many efficient computational methods have been developed to predict potential miRNA-disease associations, based on a network generated from a miRNA-disease association dataset. However, many challenges remain. First, the association between miRNAs and diseases is intricate, so these methods should consider the influence of the neighborhood of each node in the network. Second, how to measure whether two nodes of the network are associated is also an important problem. In this study, we integrate graph node embedding with a multilayer perceptron and propose a method named DEMLP. First, we construct a miRNA-disease network from the miRNA-disease adjacency matrix (MDA). Then, low-dimensional embedding vectors of the nodes are learned from the miRNA-disease network by DeepWalk. Finally, we use these low-dimensional embedding vectors as input to train the multilayer perceptron. Experiments show that the proposed method, which uses only miRNA-disease association information, can effectively predict miRNA-disease associations. To evaluate the effectiveness of DEMLP on a miRNA-disease network from HMDD v3.2, we apply fivefold cross-validation. DEMLP achieves a ROC-AUC of 0.943 and a PR-AUC of 0.937. Compared with other state-of-the-art methods, our method shows good performance using only the miRNA-disease interaction network.

In recent years, many models based on the known miRNA-disease association network have been developed to predict potential miRNA-disease associations. These methods fall mainly into two categories: score function-based algorithms and machine learning methods.
Some methods based on matrix completion also achieve good results in the prediction of miRNA-disease associations. Jiang et al. (2010) built a model that prioritizes all human microRNAs for diseases of interest [24]. Unlike common local network similarity measures, RWRMDA, developed by Chen et al. (2012), employs a global network similarity measure and adopts random walk to infer potential miRNA-disease associations [25]. Xuan et al. (2015) developed another method that extends the random walk algorithm [26]. Ji et al. (2015) propose a method called SVAEMDA, which treats miRNA-disease association prediction as a semisupervised learning problem and trains a variational autoencoder-based predictor to solve it [27]. Li et al. (2017) developed MCMDA, which predicts miRNA-disease associations by updating the miRNA-disease association adjacency matrix through a matrix completion algorithm [28]. PBMDA integrates known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases. This method constructs a heterogeneous graph consisting of three interlinked subgraphs and further adopts a depth-first search algorithm to infer potential miRNA-disease associations [29]. MDHGI is based on matrix decomposition and heterogeneous graph inference and predicts miRNA-disease associations through predicted association probabilities [30]. The Inductive Matrix Completion for MiRNA-disease Association prediction model (IMCMDA) completes the missing miRNA-disease associations based on the known associations and the integrated miRNA similarity and disease similarity [31].
Combining a neighborhood constraint with matrix completion, Chen et al. (2021) propose a model named NCMCMDA [32].
Many machine learning methods have also been proposed to predict potential miRNA-disease associations. One novel framework constructs a bipartite network to analyze the characteristics of miRNAs that regulate disease genes. This work provides an original perspective for the discovery of genetic disease associations and may contribute to future research on miRNA involvement in disease pathogenesis [33]. Xuan et al. (2013) developed a model named DHMP based on miRNA functional similarity, which is defined by measuring the similarity between genetically related diseases. The similarity of miRNAs is effectively evaluated by measuring the semantic similarity of their associated diseases [34]. Building on support vector machine (SVM) and k-nearest-neighbor (KNN) techniques, RKNNMDA combines SVM and KNN and achieves good performance in the prediction task [35]. Zhao et al. (2019) propose ABMDA, a model based on adaptive boosting, to predict potential miRNA-disease associations [36]. CNNMDA adopts a graph embedding representation learning algorithm and a neural network to predict associations between miRNAs and diseases [37]. Peng et al. (2019) propose a learning-based framework, MDA-CNN, which constructs a three-layer network and uses an autoencoder to capture the essential features and a convolutional neural network to predict the final label [38]. Based on inductive matrix completion and a graph convolutional network, Li et al. (2020) developed a model named NIMCGCN. This method learns node embedding feature representations from a network and then feeds the learned features into a matrix completion model to predict miRNA-disease associations [39].
By integrating matrix factorization with a multilayer perceptron, Liu et al. (2018) propose a method named NCFM [40]. Based on node embedding, pair embedding, and a multilayer perceptron, Liu et al. (2021) propose a method named CEMDA to predict miRNA-disease associations [41].
It is of great importance to discover potential miRNA-disease associations. So far, most methods use known miRNA-disease associations together with similarity information of diseases and miRNAs to predict potential associations. However, with the continuous enrichment of human databases and the continuous improvement of high-throughput technology [42], a large number of databases about miRNA-disease associations are freely available. More efficient prediction methods that use only miRNA-disease association information are urgently needed [40]. Many new related methods in the field of drug target prediction have also attracted our attention [43][44][45][46][47][48][49]. In this paper, we propose an effective method named DEMLP that predicts miRNA-disease associations using only the known miRNA-disease association network. In our method, we use DeepWalk to generate node embeddings from the neighborhood information of each node and the network structure. After that, we use an MLP to calculate the association score of each miRNA-disease pair. In the main experiment, the areas under the receiver operating characteristic (ROC) and precision-recall (PR) curves of our proposed method are 0.943 and 0.937, respectively. Then, we compare our model with four other known state-of-the-art models on the same HMDD v2.0 data, where the ROC-AUC of our proposed method is 0.923, higher than that of any other compared model.

Dataset for miRNA-Disease Association Prediction.
The miRNA-disease association data used in our experiments are obtained from HMDD [50][51][52]. To construct the miRNA-disease interaction network, we download the whole dataset from the HMDD website. Moreover, in the case study, we use data from dbDEMC v3.0 [53] to verify the predictions of our model.

Data Preparation.
According to the HMDD v3.2 website, the database contains 35547 miRNA-disease association entries covering 1206 miRNA genes and 893 diseases from 19280 papers. To construct the miRNA-disease interaction network, we first clean the original dataset by correcting typographical errors and letter capitalization errors (e.g., hsa-mir-200C should be hsa-mir-200c and Hsa-mir-93 should be hsa-mir-93), removing duplicates, and so on. After cleaning, we obtain 1206 miRNAs, 893 diseases, and 18732 miRNA-disease associations. We construct the miRNA-disease network from the miRNA-disease association adjacency matrix (MDA) generated from the cleaned data. The element $MDA_{ij} = 1$ if miRNA $v^m_i$ is associated with disease $v^d_j$; otherwise, $MDA_{ij} = 0$, meaning that there is no known association between miRNA $v^m_i$ and disease $v^d_j$. Here $v$ denotes a vertex of the network. The data cleaning and network construction process is shown in Figure 1.
Journal of Sensors
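The cleaning and matrix construction steps can be illustrated with a minimal sketch. It assumes the raw data are available as (miRNA, disease) name pairs and that cleaning amounts to trimming whitespace, lower-casing miRNA names, and dropping duplicates. The helper names `normalize_mirna` and `build_mda` and the toy records are hypothetical and are not taken from the authors' code or from HMDD.

```python
def normalize_mirna(name):
    """Trim whitespace and lower-case a miRNA name, e.g. 'hsa-mir-200C' -> 'hsa-mir-200c'."""
    return name.strip().lower()


def build_mda(records):
    """Build the binary adjacency matrix MDA from (miRNA, disease) pairs.

    Returns (mirnas, diseases, mda), where mda[i][j] == 1 if and only if
    miRNA i is associated with disease j.
    """
    # Normalise names; the set removes duplicate associations.
    pairs = {(normalize_mirna(m), d.strip()) for m, d in records}
    mirnas = sorted({m for m, _ in pairs})
    diseases = sorted({d for _, d in pairs})
    m_idx = {m: i for i, m in enumerate(mirnas)}
    d_idx = {d: j for j, d in enumerate(diseases)}
    mda = [[0] * len(diseases) for _ in mirnas]
    for m, d in pairs:
        mda[m_idx[m]][d_idx[d]] = 1
    return mirnas, diseases, mda


# Toy records made up for illustration (not actual HMDD rows).
records = [
    ("hsa-mir-200C", "Breast Neoplasms"),
    ("hsa-mir-200c", "Breast Neoplasms"),  # duplicate after normalisation
    ("Hsa-mir-93", "Lung Neoplasms"),
    ("hsa-mir-93", "Breast Neoplasms"),
]
mirnas, diseases, mda = build_mda(records)
```

Rows of `mda` index miRNAs and columns index diseases, matching the $MDA_{ij}$ convention in the text.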

Prediction of the miRNA-Disease Association.
The goal of DEMLP is to predict potential miRNA-disease associations. First, we generate node embeddings with DeepWalk [54]; then, we concatenate each miRNA embedding with each disease embedding to form a new dataset for the multilayer perceptron (MLP) model. To evaluate the effectiveness of DEMLP on the miRNA-disease network from HMDD v3.2, we apply fivefold cross-validation.

DeepWalk on miRNA-Disease Network.
DeepWalk is a method for learning latent representations of the nodes in a network. These latent representations lie in a continuous vector space and can be used effectively by machine learning methods. DeepWalk treats a set of short truncated random walks on our miRNA-disease network as its corpus and the network vertices as its vocabulary. The random walk generator uniformly samples a random vertex $v_i$ as the root of a random walk $W_{v_i} \in \mathbb{R}^{\gamma \times d}$ ($\gamma$ is the number of walks per vertex). Each walk samples uniformly from the neighbors of the last visited vertex until the preset walk length $t$ is reached. In our method, we set the window size $w$ to 5, the walk length $t$ to 10, the number of walks per vertex $\gamma$ to 80, and the embedding size $d$ to 128. We iterate over all nodes in our miRNA-disease network. For each vertex $v_i$ of the network, we generate a random walk with $|W_{v_i}| = t$ and then use it as our corpus to update the vertex representations. We use the Skip-Gram algorithm to adjust these vertex representations according to the objective function in Eq. (1).
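Assuming Eq. (1) follows the objective of the original DeepWalk formulation [54], it can be written for a vertex $v_i$ inside a walk, with window size $w$, as:

```latex
\begin{equation}
\min_{\Phi} \; -\log \Pr\left( \{ v_{i-w}, \ldots, v_{i+w} \} \setminus v_i \;\middle|\; \Phi(v_i) \right)
\end{equation}
```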
where $\Phi: v \in V \to \mathbb{R}^{|V| \times d}$ is a mapping function that represents the latent embedding of each vertex $v$ in the network.
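The walk generation step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the graph is given as an adjacency-list dictionary, the function names `random_walk` and `build_corpus` are hypothetical, and the toy graph is made up. Only the walk length (10) and the number of walks per vertex (80) are taken from the text.

```python
import random


def random_walk(adj, start, walk_length, rng):
    """One truncated random walk: move to a uniformly chosen neighbour until
    the walk holds walk_length vertices (or a vertex without neighbours is hit)."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adj[walk[-1]]
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


def build_corpus(adj, walks_per_vertex, walk_length, seed=0):
    """Root walks_per_vertex walks at every vertex, visiting vertices in shuffled order."""
    rng = random.Random(seed)
    vertices = list(adj)
    corpus = []
    for _ in range(walks_per_vertex):
        rng.shuffle(vertices)
        for v in vertices:
            corpus.append(random_walk(adj, v, walk_length, rng))
    return corpus


# Toy bipartite graph (hypothetical): miRNA nodes m0, m1 and disease nodes d0, d1.
adj = {
    "m0": ["d0"],
    "m1": ["d0", "d1"],
    "d0": ["m0", "m1"],
    "d1": ["m1"],
}
corpus = build_corpus(adj, walks_per_vertex=80, walk_length=10)
```

Each walk plays the role of a sentence. The corpus could then be passed to any Skip-Gram implementation with window size 5 and embedding size 128, for instance gensim's `Word2Vec` in skip-gram mode (`sg=1`); that choice of library is an assumption, since the text does not name one.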

Multilayer Perceptron.
The multilayer perceptron is a classical machine learning model. We concatenate the representations of each miRNA $\Phi(v^m_i)$ and each disease $\Phi(v^d_j)$ to form the dataset of our MLP model:
$$X_{ij} = \left[ \Phi(v^m_i) \,\|\, \Phi(v^d_j) \right] \tag{2}$$
where $v^m_i$ is the vertex index of miRNA $i$, $v^d_j$ is the vertex index of disease $j$, and $X_{ij} \in \mathbb{R}^{1 \times 2d}$ is the concatenated representation of miRNA $i$ and disease $j$. $|X| = |v^m| \times |v^d|$ is the order of the matrix $X$.
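The concatenation of miRNA and disease embeddings can be sketched in a few lines. The function name `pair_features` and the toy embedding values are hypothetical. The sketch uses $d = 3$ for readability, whereas the text sets $d = 128$.

```python
def pair_features(mirna_emb, disease_emb):
    """Concatenate every miRNA embedding with every disease embedding.

    Returns a dict that maps the index pair (i, j) to a list of length 2d.
    """
    return {
        (i, j): list(m_vec) + list(d_vec)
        for i, m_vec in enumerate(mirna_emb)
        for j, d_vec in enumerate(disease_emb)
    }


# Toy embeddings with d = 3; the values are made up for illustration.
mirna_emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
disease_emb = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]]
X = pair_features(mirna_emb, disease_emb)
```

With 2 miRNAs and 3 diseases, `X` holds 2 × 3 = 6 pair vectors, in line with $|X| = |v^m| \times |v^d|$.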
We use a part of the dataset, $X_{\text{train}}$, as the training data of the MLP model. The dimension of the MLP input layer is 128, and the dimension of the MLP hidden layer is 4. We use tanh as the activation function to compute the hidden layer values [55]. The calculation is shown in Eq. (3).
where tanh is the activation function of the hidden layer, and $W \in \mathbb{R}^{d \times 4}$ and $b$ are learnable parameters of our MLP model.
The output layer of our MLP model has a single unit. We choose the sigmoid function as the output layer activation function.
where $\mathrm{score}_{ij}$ represents the correlation score between miRNA vertex $v^m_i$ and disease vertex $v^d_j$.
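Assuming that each layer is a single affine map followed by its activation, Eqs. (3) and (4) can be sketched as follows. The output-layer parameters $w_o$ and $b_o$ are names introduced here for illustration, and $W$ is taken to have as many rows as $X_{ij}$ has columns and 4 columns, so that $h_{ij} \in \mathbb{R}^{1 \times 4}$:

```latex
\begin{align}
h_{ij} &= \tanh\left( X_{ij} W + b \right), \\
\mathrm{score}_{ij} &= \sigma\left( h_{ij} w_o + b_o \right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\end{align}
```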
We use the binary cross-entropy between the target labels and the output scores as the loss function and optimize the parameters with stochastic gradient descent.
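The forward pass and one training step can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the class name `TinyMLP`, the initialization range, the learning rate, and the toy data are all assumptions. Only the tanh hidden layer with 4 units, the single sigmoid output, the binary cross-entropy loss, and the use of stochastic gradient descent come from the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(y, score, eps=1e-12):
    """Binary cross-entropy of one prediction; eps guards against log(0)."""
    return -(y * math.log(score + eps) + (1 - y) * math.log(1 - score + eps))


class TinyMLP:
    """One tanh hidden layer and one sigmoid output unit, trained by per-sample SGD."""

    def __init__(self, n_in, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b = [0.0] * n_hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b_out = 0.0

    def forward(self, x):
        """Return the hidden activations and the output score for one input vector."""
        h = [math.tanh(sum(x[k] * self.W[k][u] for k in range(len(x))) + self.b[u])
             for u in range(len(self.b))]
        score = sigmoid(sum(h[u] * self.w_out[u] for u in range(len(h))) + self.b_out)
        return h, score

    def sgd_step(self, x, y, lr):
        """One SGD update on the binary cross-entropy loss of a single sample."""
        h, score = self.forward(x)
        d_out = score - y  # gradient of the loss w.r.t. the output pre-activation
        for u in range(len(h)):
            # Back-propagate through tanh before w_out[u] is changed.
            d_hidden = d_out * self.w_out[u] * (1.0 - h[u] ** 2)
            self.w_out[u] -= lr * d_out * h[u]
            self.b[u] -= lr * d_hidden
            for k in range(len(x)):
                self.W[k][u] -= lr * d_hidden * x[k]
        self.b_out -= lr * d_out


# Toy, linearly separable data (made up); real inputs would be the pair vectors X_ij.
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
model = TinyMLP(n_in=2)
loss_before = sum(bce(y, model.forward(x)[1]) for x, y in data) / len(data)
for _ in range(200):
    for x, y in data:
        model.sgd_step(x, y, lr=0.2)
loss_after = sum(bce(y, model.forward(x)[1]) for x, y in data) / len(data)
```

In practice a deep learning framework would handle the gradients; the explicit update is written out only to make the loss and optimizer choices concrete.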
The framework of our method is shown in Figure 2.

DEMLP-PLUS.
To further verify the validity of the model for miRNA-disease association prediction, we add the similarity information of miRNAs and diseases separately to the model and observe whether the prediction results improve. The new framework is named DEMLP-PLUS. Following the IMCMDA [31] model, we construct an integrated network after adding the miRNA similarity and the disease similarity, respectively. The integrated network fuses miRNA-miRNA associations, disease-disease associations, and miRNA-disease associations; its construction process is shown in Figure 3.

Experiments and Result
The whole dataset contains a large number of negative samples, so we use undersampling to balance positive and negative samples at a 1 : 1 ratio. First, to verify the validity of the model for miRNA-disease association prediction, we apply fivefold cross-validation to evaluate the effectiveness of our method on the miRNA-disease network from HMDD v3.2. We then compare DEMLP with three baseline models: LineMLP, node2vecMLP, and SDNEMLP. In the training process, we apply fivefold cross-validation to each model and perform 100 iterations to find the optimal parameters with the smallest error. Second, we add the similarity information of miRNAs and diseases separately to the model and observe whether the prediction results improve. In this part, we use HMDD v2.0 as our dataset. Based on the HMDD v2.0 dataset, we compare the performance of the model on a fusion network with miRNA and disease similarity information and on the plain association network. We also compare our model with other known state-of-the-art models on the network built from the HMDD v2.0 miRNA-disease association dataset. Third, we carry out a case study of lung tumors and breast tumors: we examine the miRNAs in HMDD v2.0 for these diseases and use dbDEMC v3.0 to verify the top 20 associations that we predict.
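The 1 : 1 undersampling and the fivefold split can be sketched as follows. The function names `undersample` and `k_fold`, the random seeds, and the toy index pairs are hypothetical. The sketch also assumes that every pair without a known association is treated as a negative candidate, which the text implies but does not state in detail.

```python
import random


def undersample(positives, negatives, seed=0):
    """Keep all positives and draw an equally sized random subset of negatives."""
    rng = random.Random(seed)
    sampled = rng.sample(negatives, len(positives))
    return [(p, 1) for p in positives] + [(n, 0) for n in sampled]


def k_fold(samples, k=5, seed=0):
    """Yield (train, test) splits; every sample appears in exactly one test fold."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


# Toy (miRNA index, disease index) pairs: 10 known associations, 90 unlabeled pairs.
positives = [(i, 0) for i in range(10)]
negatives = [(i, j) for i in range(10) for j in range(1, 10)]
balanced = undersample(positives, negatives)
splits = list(k_fold(balanced, k=5))
```

Each of the five splits trains on 16 samples and tests on 4 in this toy setting; averaging the per-fold metrics gives the cross-validated score.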
Performance Evaluation.
We use ROC-AUC and PR-AUC as indicators to assess the prediction performance. A test example is labeled as positive if the prediction score of the miRNA-disease association is greater than $\theta$ ($\theta$ is a threshold); otherwise, it is labeled as negative [56]. We use TN and TP to represent the numbers of correctly identified negative and positive examples, respectively. FN and FP represent the numbers of positive examples misclassified as negative and negative examples misclassified as positive, respectively.
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$
Table 1 shows the average ROC-AUC and PR-AUC over the fivefold cross-validation of our experiments.
Then, we plot the ROC and PR curves of each method in Figure 3 and the ROC-AUC and PR-AUC bar graph in Figure 4.
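The two indicators can be computed from labels and scores as in the sketch below. It is an illustration under assumptions: ROC-AUC is computed through its rank interpretation (the probability that a positive example outranks a negative one), and PR-AUC is approximated by average precision, which is one common estimate and not necessarily the one used in the paper. The function names and the toy labels and scores are hypothetical.

```python
def confusion(labels, scores, theta):
    """Count TP, FP, TN, FN when scores greater than theta are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s > theta and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s > theta and y == 0)
    tn = sum(1 for y, s in zip(labels, scores) if s <= theta and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s <= theta and y == 1)
    return tp, fp, tn, fn


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise PR-AUC estimate: mean of the precision measured at each positive example."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank
    return total / tp


# Toy labels and scores, made up for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
tp, fp, tn, fn = confusion(labels, scores, theta=0.5)
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

For the toy data, 8 of the 9 positive-negative pairs are ordered correctly, so the ROC-AUC is 8/9.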

DEMLP-PLUS Experiments.
To further verify the validity of the model for miRNA-disease association prediction, we add the similarity information of miRNAs and diseases separately to the model and observe whether the prediction results improve. First, we construct the integrated network using HMDD v2.0; then, we apply fivefold cross-validation to DEMLP and DEMLP-PLUS. Table 2 indicates that adding the similarity information of miRNAs and diseases separately to DEMLP yields only a small improvement. Because the integrated network incurs higher time and space complexity, our model offers better generalization and practical advantages when it uses only the miRNA-disease association network.
To verify the validity of the model for miRNA-disease association prediction, we compare our model with four other known state-of-the-art models (NCFM [40], CGMDA [57], metapath [58], and BNPMDA [59]) on the network built from the HMDD v2.0 miRNA-disease association dataset using fivefold cross-validation. Table 3 indicates that our model performs well in predicting miRNA-disease associations.
The prediction results show that our model is effective at predicting unknown associations between miRNAs and diseases.

Conclusions
Given the expense and difficulty of biological experiments, many efficient computational methods have been developed to predict potential miRNA-disease associations, based on a network generated from a miRNA-disease association dataset. More efficient prediction methods that use only miRNA-disease association information are urgently needed. In this research, we integrate graph node embedding with a multilayer perceptron and propose a method named DEMLP to predict potential miRNA-disease associations. DEMLP can predict miRNA-disease associations effectively using only miRNA-disease association information. By combining random walks with the multilayer perceptron, DEMLP learns node embeddings that are rich in network structure information and gains stronger nonlinear fitting ability. Compared with other models, DEMLP achieves the best result in the task of miRNA-disease association prediction. In the future, we will consider referring to models such as EGES [73] to address the cold-start problem in miRNA-disease association prediction.

Data Availability
The data used to support the findings of the study are available in the public database http://www.cuilab.cn/hmdd.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.