Introduction

MicroRNAs (miRNAs) are a class of small non-coding RNAs with a length of 20–25 nucleotides1. They normally regulate their target messenger RNAs (mRNAs) by base pairing with sites in the 3′ untranslated region (UTR) of the mRNAs2. These small molecules can function as negative regulators of target gene expression at the post-transcriptional level3. With the development of molecular biology, an increasing number of miRNAs have been detected4. To date, the miRBase database has collected 48,860 mature miRNAs from 271 organisms, including more than 1000 human miRNAs5. In addition, researchers have found that miRNAs are involved in many important cellular activities, including diffusion, aging, development and death6,7,8,9.

In recent years, an increasing number of experiments have demonstrated close relationships between miRNAs and diseases10,11,12,13. In particular, miRNAs have become new biomarkers for human cancer, which is important for cancer prevention and treatment14. Therefore, identifying miRNA-disease associations has gradually become a hot topic in biology15. Early traditional biological experiments identified disease-related miRNAs by detecting the expression level of miRNAs during the disease process16. For example, Yohei et al. found that miR-200c could build a molecular link between breast cancer cells and normal cells17. Liu et al. pointed out that many miRNAs are dysregulated in cancer because miRNAs participate in tumorigenesis and function as oncogenes18. Thum et al. reported that miR-21 regulates the expression of the ERK-MAP kinase and thereby affects the structure and function of the heart19. Traditional experiments achieve high accuracy, but they are limited by long experimental time, high cost and low success rate20. To address these issues and predict potential miRNA-disease associations effectively and accurately, an increasing number of researchers have adopted computational models to select the most probable disease-related miRNAs for further biological experiments21.

With the development of biotechnology, several databases have been constructed by collecting these biological data. These datasets make it possible to predict miRNA-disease associations through computational methods20,22,23,24,25. Over the years, most of these methods have been based on the assumption that functionally similar miRNAs tend to be related to semantically similar diseases2,26,27,28. These models can be divided into similarity network-based models and machine learning-based models29. For example, Jiang et al.22 presented a computational model to infer the relationship between miRNAs and diseases based on a hypergeometric distribution. This is an early computational model that fuses multiple sources of information. However, this method built the miRNA-related network from functional similarity, so it is limited by the known relationships between miRNAs. Based on the random walk method, Xuan et al.30 presented MIDP and its extension MIDPE. MIDP constructed the network by combining the information of each node, including similarity, prior information and various ranges of topological structure. This model could effectively reduce noise in the data by restarting the walk. Furthermore, You et al.31 proposed PBMDA, which constructs a heterogeneous graph consisting of three sub-graphs. PBMDA is a path-based depth-first search algorithm that can make full use of the topology information of the heterogeneous network. In particular, new associations between diseases and miRNAs can be prioritized by evaluating path scores. Chen et al.32 proposed EGBMMDA, a computational method based on extreme gradient boosting. This is the first decision tree-based learning method for classifying miRNA-disease relationships. EGBMMDA built a comprehensive feature vector from statistical, graph-theoretical and matrix factorization measures. These studies have continually improved the performance of computational methods and played an important guiding role in traditional biological experiments33. Therefore, computational methods that predict miRNA-disease associations accurately and effectively are urgently needed34.

In this study, based on the assumption that molecules are related to each other in human physiological processes, we developed a structural deep network embedding-based model (SDNE-MDA) for predicting miRNA-disease associations using a molecular association network. The flowchart of SDNE-MDA is shown in Fig. 1. Specifically, we first constructed the molecular association network (MAN)35 by combining multiple types of molecules and the edges between them. Behavior information was then extracted from this heterogeneous network by structural deep network embedding (SDNE)36, which preserves the overall structure of a large network to the greatest extent. Secondly, SDNE-MDA obtained the miRNA attribute information by the chaos game representation (CGR) algorithm and the disease attribute information from disease semantic similarity. After that, we formed the feature descriptor by fusing the behavior information and attribute information of miRNAs and diseases. Finally, these feature descriptors were used to train a CNN classifier to predict miRNA-disease associations. Under five-fold cross validation, SDNE-MDA achieved an AUC of 0.9447 with a prediction accuracy of 87.38%. To further evaluate SDNE-MDA, we compared the proposed model with two alternative feature extraction models and with several classifier models. In addition, we applied SDNE-MDA to three important human diseases: breast cancer, kidney cancer and lymphoma. As a result, 47, 46 and 46 of the top-50 candidate miRNAs, respectively, were confirmed by known databases and recent literature. These experimental results demonstrate that SDNE-MDA is an accurate and effective computational method for predicting potential associations between miRNAs and diseases.

Figure 1 Flowchart of SDNE-MDA to predict potential miRNA-disease associations.

Materials and methods

Benchmark database

The human miRNA-disease association benchmark database HMDD v3.037 was adopted as the data source in this paper. It collects 32,281 confirmed miRNA-disease associations, involving 1102 miRNAs and 850 diseases. After data processing, we chose 16,427 known miRNA-disease associations as positive samples, covering 1023 miRNAs and 850 diseases. Moreover, we defined the adjacency matrix \(AM\) to represent the miRNA-disease associations: when the miRNA \(mi(a)\) has a verified association with the disease \(di(b)\), we set \(AM(mi(a),di(b))=1\); otherwise \(AM(mi(a),di(b))=0\). In addition, we introduced two other independent databases (dbDEMC38 and miR2Disease39) to verify the results of the case studies.
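The construction of \(AM\) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `build_adjacency` and the toy miRNA and disease labels are hypothetical and are not taken from HMDD.

```python
def build_adjacency(mirnas, diseases, known_pairs):
    """Return AM as a list of rows; AM[a][b] = 1 for a verified pair, else 0."""
    mi_index = {name: a for a, name in enumerate(mirnas)}
    di_index = {name: b for b, name in enumerate(diseases)}
    am = [[0] * len(diseases) for _ in mirnas]
    for mi, di in known_pairs:
        am[mi_index[mi]][di_index[di]] = 1
    return am


# Toy labels (hypothetical, for illustration only).
mirnas = ["miR-a", "miR-b", "miR-c"]
diseases = ["disease-x", "disease-y"]
pairs = [("miR-a", "disease-x"), ("miR-c", "disease-y")]
AM = build_adjacency(mirnas, diseases, pairs)
```

Rows index miRNAs and columns index diseases, so the matrix for the benchmark data would have 1023 rows and 850 columns.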

Molecular associations network

In this study, we combined multiple kinds of biological molecular information according to the molecular association network (MAN). The MAN is a heterogeneous information network proposed by Guo et al.40. Currently, this complex network consists of five types of nodes (miRNA, lncRNA, protein, disease and drug) and the associations between them. The MAN provides a new comprehensive view for exploring complex physiological processes and human diseases. The structure of the molecular association network is shown in Fig. 2. We downloaded the molecules and the associations between them from multiple databases. The number of molecules of each type is shown in Table 1, and the associations between them are shown in Table 2.

Figure 2 Structure diagram of molecular association network.

Table 1 The number of different types of nodes in MAN.
Table 2 The number and database of different types of associations in MAN.

Chaos game representation (CGR) algorithm

MiRNA sequences contain a lot of complex information. However, most existing sequence feature extraction algorithms quantify only one of position information and nonlinear information. To measure the similarity of the information contained in miRNA sequences comprehensively, we chose chaos game representation (CGR)50 to quantify both position and nonlinear information, and then calculated miRNA sequence similarity by the Pearson correlation coefficient. Firstly, the nucleotides of a miRNA are mapped to positions in Euclidean space by the following formula:

$${T}_{i}={T}_{i-1}+c*\left({G}_{i}-{T}_{i-1}\right)$$
(1)
$$G_{i} = \left\{ {\begin{array}{*{20}l} {\left( {0,0} \right),} \hfill & {if\;type\;of\;nucleotide\;is\;A} \hfill \\ {\left( {0,1} \right),} \hfill & {if\;type\;of\;nucleotide\;is\;C} \hfill \\ {\left( {1,0} \right),} \hfill & {if\;type\;of\;nucleotide\;is\;U} \hfill \\ {\left( {1,1} \right),} \hfill & {if\;type\;of\;nucleotide\;is\;G} \hfill \\ \end{array} } \right.$$
(2)

where \({T}_{i}\) is the position of the \(i\)th nucleotide, which depends on the position of the previous nucleotide \({T}_{i-1}\) and on the coordinate \({G}_{i}\) assigned to its nucleotide type. In this paper, the contribution parameter \(c\) is set to 0.5 and the starting point \({T}_{0}\) is \((0.5, 0.5)\).
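The mapping in Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: it assumes the usual CGR convention in which each new point moves a fraction \(c\) of the way from \({T}_{i-1}\) toward the corner \({G}_{i}\), which keeps every point inside the unit square. The function name `cgr_points` and the example sequence are hypothetical.

```python
# Corner coordinates G_i for each nucleotide type, as in Eq. (2).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "U": (1.0, 0.0), "G": (1.0, 1.0)}


def cgr_points(sequence, c=0.5, start=(0.5, 0.5)):
    """Return the CGR position T_i of every nucleotide in the sequence."""
    points = []
    x, y = start  # T_0
    for nucleotide in sequence.upper():
        gx, gy = CORNERS[nucleotide]
        # Move a fraction c of the way toward the corner of this nucleotide.
        x += c * (gx - x)
        y += c * (gy - y)
        points.append((x, y))
    return points


points = cgr_points("ACGU")  # hypothetical four-nucleotide example
```

With \(c=0.5\) each point is the midpoint between the previous point and the corner of the current nucleotide, so the position of a point encodes the recent history of the sequence.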

Secondly, we divided the CGR space into 64 subspaces, as shown in Fig. 3. The attribute information of each subspace \({SS}_{i}\) is represented by integrating the position information \({X}_{i}, {Y}_{i}\) and the nonlinear information \({Z}_{i}\) according to the following formulas:

Figure 3 The CGR of hsa-mir-3976 plotted in \(8\times 8\) subspaces and the matrix of its nucleotides with probabilities for chaos game representation.

$$X_{i} = \sum x ,\quad if\;point\;in\;subspace\;SS_{i}$$
(3)
$${Y}_{i}=\sum y, \quad if\;point\;in\;subspace\;{SS}_{i}$$
(4)
$${Z}_{i}=\frac{{num}_{i}-\frac{{\sum }_{t=1}^{64}{num}_{t}}{64}}{\sqrt{\frac{1}{64}{\sum }_{r=1}^{64}{({num}_{r}-\frac{{\sum }_{t=1}^{64}{num}_{t}}{64})}^{2}}}$$
(5)
$${SS}_{i}=\left({X}_{i},{Y}_{i},{Z}_{i}\right), i={1,2},\dots ,64$$
(6)

where \({num}_{i}\) is the number of points in subspace \({SS}_{i}\).
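Equations (3) to (6) can be sketched as follows. This is an illustrative sketch under stated assumptions: the subspaces are indexed row by row on an \(8\times 8\) grid (the ordering is not specified in the text), the function name `subspace_features` is hypothetical, and the four example points are toy values.

```python
import math


def subspace_features(points, grid=8):
    """Return [(X_i, Y_i, Z_i)] for the grid*grid subspaces of the unit square."""
    n = grid * grid
    xs, ys, counts = [0.0] * n, [0.0] * n, [0] * n
    for x, y in points:
        col = min(int(x * grid), grid - 1)  # clamp x == 1.0 into the last column
        row = min(int(y * grid), grid - 1)
        k = row * grid + col                # assumed row-major subspace index
        xs[k] += x                          # Eq. (3): sum of x in the subspace
        ys[k] += y                          # Eq. (4): sum of y in the subspace
        counts[k] += 1
    mean = sum(counts) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in counts) / n)
    # Eq. (5): z-score of the point count; 0.0 if all counts are equal.
    zs = [(v - mean) / std if std > 0 else 0.0 for v in counts]
    return list(zip(xs, ys, zs))


# Toy CGR points (hypothetical values inside the unit square).
toy_points = [(0.25, 0.25), (0.125, 0.625), (0.5625, 0.8125), (0.78125, 0.40625)]
features = subspace_features(toy_points)
```

Each subspace contributes three numbers, so one miRNA is described by \(64\times 3=192\) values in this sketch.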

Finally, the sequence information of each miRNA is represented by the descriptor \(m(i)\), and the sequence similarity \({M}_{sim}(m\left(i\right),m(j))\) is calculated by the Pearson correlation coefficient:

$$m\left(i\right)=({SS}_{1},{SS}_{2},\dots ,{SS}_{64})$$
(7)
$${M}_{sim}\left(m\left(i\right),m\left(j\right)\right)=\frac{Cov(m\left(i\right),m(j))}{{\sigma }_{m\left(i\right)}{\sigma }_{m(j)}}$$
(8)

where \(Cov\) denotes the covariance and \(\sigma\) the standard deviation of a descriptor.
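The similarity of Eq. (8) can be illustrated with a short Python sketch. It assumes the standard Pearson correlation, in which the covariance is normalized by the standard deviations of the two descriptors, and it assumes that each descriptor has been flattened into a plain list of numbers. The function name `pearson` and the toy vectors are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for constant vectors
    return cov / (norm_u * norm_v)


# Toy descriptors (hypothetical values).
same_trend = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The common factor \(1/n\) in the covariance and in the two standard deviations cancels, so it is omitted from the computation.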

Disease semantic similarity

In this study, the directed acyclic graph (DAG)51 of each disease was obtained from the Medical Subject Headings (MeSH)52. In this system, a disease \(d(a)\) is described by \(DAG(d(a)) = (L(d(a)), E(d(a)))\), where \(L(d(a))\) is the node set containing \(d(a)\) and its ancestor nodes, and \(E(d(a))\) is the set of directed edges from ancestor nodes to their child nodes. The contribution of a term \(T\) to the semantic value of \(d(a)\) is defined as:

$$\left\{ {\begin{array}{*{20}l} {D_{d\left( a \right)} \left( T \right) = 1} \hfill & {if\;T = d\left( a \right)} \hfill \\ {D_{d\left( a \right)} \left( T \right) = max\left\{ {\vartheta {*}D_{d\left( a \right)} \left( {T^{\prime}} \right)|T^{\prime} \in children\;of\;T} \right\}} \hfill & {if\;T \ne d\left( a \right)} \hfill \\ \end{array} } \right.$$
(9)

where \(\vartheta\) is the semantic contribution parameter, set to 0.5 as in previous studies. The semantic value \(DV\left(D\right)\) of a disease \(D\) is then calculated by summing over \({A}_{D}\), the node set of its DAG, as follows:

$$DV\left(D\right)={\sum }_{T\in {A}_{D}}{D}_{D}(T)$$
(10)

According to the assumption that two diseases are more similar if they share a larger part of their DAGs, the similarity between diseases \(d(a)\) and \(d(b)\) is obtained as follows:

$$S\left(d(a),d(b)\right)=\frac{\sum_{T\in {A}_{d(a)}\cap {A}_{d(b)}}({D}_{d(a)}\left(T\right)+{D}_{d(b)}(T))}{DV(d(a))+DV(d(b))}$$
(11)
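Equations (9) to (11) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the DAG is given as a hypothetical dictionary `parents` that maps each term to its direct parents, \({A}_{D}\) is taken to be the disease together with all of its ancestors, and the toy terms are not real MeSH entries.

```python
def semantic_contributions(disease, parents, theta=0.5):
    """Return {term: D_disease(term)} for the disease and all of its ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = theta * contrib[node]
            # Keep the maximum contribution over all children, as in Eq. (9).
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(a, b, parents, theta=0.5):
    """Semantic similarity of two diseases following Eqs. (10) and (11)."""
    ca = semantic_contributions(a, parents, theta)
    cb = semantic_contributions(b, parents, theta)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))


# Toy DAG (hypothetical terms): d1 and d2 share the parent p, whose parent is root.
parents = {"d1": ["p"], "d2": ["p"], "p": ["root"]}
sim = semantic_similarity("d1", "d2", parents)
```

In the toy DAG each disease has a semantic value of \(1+0.5+0.25=1.75\), and the shared terms contribute \(1.5\), giving a similarity of \(1.5/3.5\).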

Structural deep network embedding

Since many existing network embedding algorithms cannot preserve the high-order proximity of large-scale networks, this paper adopted structural deep network embedding (SDNE) to extract the behavior information of miRNAs and diseases. Many existing network embedding models are shallow models (e.g. Laplacian Eigenmaps53, Graph Factorization54), which are unable to effectively capture the highly non-linear structure of a network. SDNE is a semi-supervised model for network embedding. In the supervised part, first-order similarity based on the Laplacian matrix is used to preserve local network information. In the unsupervised part, SDNE uses a deep autoencoder to model second-order similarity and preserve global network information. Therefore, the loss function of SDNE consists of two parts: the Laplacian matrix model and the deep autoencoder model.

First-order similarity

To make adjacent nodes of the graph close in the latent space, the loss function of first-order similarity is defined as follows:

$${L}_{1st}={\sum }_{i,j=1}^{n}{s}_{i,j}{\Vert {y}_{i}^{(k)}-{y}_{j}^{(k)}\Vert }_{2}^{2}={\sum }_{i,j=1}^{n}{s}_{i,j}{\Vert {y}_{i}-{y}_{j}\Vert }_{2}^{2}$$
(12)

where \({s}_{i,j}\) is the element of the adjacency matrix of the heterogeneous information network for nodes \(i\) and \(j\), and \({y}_{i}^{(k)}\), abbreviated as \({y}_{i}\), is the representation of node \(i\) at the \(k\)-th layer.
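The first-order loss of Eq. (12) can be evaluated directly from an adjacency matrix and a set of embeddings. The following plain-Python sketch is only illustrative; the function name `first_order_loss` and the toy matrices are hypothetical.

```python
def first_order_loss(S, Y):
    """Sum over i, j of s_ij * ||y_i - y_j||^2, as in Eq. (12)."""
    n = len(S)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if S[i][j]:
                squared_distance = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                total += S[i][j] * squared_distance
    return total


# Toy graph (hypothetical): nodes 0 and 1 are linked, node 2 is isolated.
S = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
Y = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
loss_1st = first_order_loss(S, Y)
```

Only linked pairs contribute, so the distant embedding of the isolated node 2 does not increase the loss, while pulling nodes 0 and 1 together would reduce it.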

Second-order similarity

To capture global structure information, SDNE constructs a deep autoencoder. Any given input \({x}_{i}\) is converted into the latent representation of the \(k\)th layer as:

$${y}_{i}^{\left(1\right)}=\sigma \left({W}^{\left(1\right)}{x}_{i}+{b}^{\left(1\right)}\right)$$
(13)
$${y}_{i}^{\left(k\right)}=\sigma \left({W}^{\left(k\right)}{y}_{i}^{\left(k-1\right)}+{b}^{\left(k\right)}\right), k=2,\dots , K$$
(14)

Here \({W}^{\left(k\right)}\) is the weight matrix and \({b}^{\left(k\right)}\) the bias of the \(k\)th layer. Since the optimization goal of the autoencoder is to minimize the reconstruction error between input and output, the loss function is defined as follows:

$$L={\sum }_{i=1}^{n}{\Vert \widehat{{x}_{i}}-{x}_{i}\Vert }_{2}^{2}$$
(15)

The adjacency matrices are often very sparse, which means zero elements far outnumber non-zero elements. Therefore, the reconstruction errors of non-zero elements are penalized more heavily, and the loss function becomes:

$${L}_{2{\text{nd}}}={\sum }_{i=1}^{n}{\Vert (\widehat{{x}_{i}}-{x}_{i})\odot {b}_{i}\Vert }_{2}^{2}={\Vert (\widehat{X}-X)\odot B\Vert }_{\text{F}}^{2}$$
(16)

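where \(\odot\) is the Hadamard product (element-wise multiplication) and \({b}_{i}\) is the penalty vector of node \(i\), i.e. the \(i\)th row of \(B\). As in the original SDNE formulation36, \({b}_{i,j}=1\) if \({s}_{i,j}=0\), and \({b}_{i,j}=\beta >1\) otherwise.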
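A plain-Python sketch of the weighted reconstruction loss in Eq. (16) is given below. It assumes the penalty scheme of the original SDNE formulation, in which the penalty is \(\beta >1\) for non-zero entries of the adjacency matrix and 1 for zero entries. The function name `second_order_loss`, the value of `beta` and the toy matrices are hypothetical.

```python
def second_order_loss(X, X_hat, beta=5.0):
    """Squared Frobenius norm of (X_hat - X) weighted element-wise by B."""
    total = 0.0
    for x_row, x_hat_row in zip(X, X_hat):
        for x, x_hat in zip(x_row, x_hat_row):
            penalty = beta if x != 0 else 1.0  # assumed entries of B
            total += ((x_hat - x) * penalty) ** 2
    return total


# Toy adjacency matrix and reconstruction (hypothetical values).
X = [[0, 1], [1, 0]]
X_hat = [[0.5, 0.5], [1.0, 0.0]]
loss_2nd = second_order_loss(X, X_hat, beta=2.0)
```

In the toy example the same absolute error of 0.5 costs 0.25 on the zero entry but 1.0 on the non-zero entry, which is how the penalty counteracts sparsity.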

Integrating the first-order and second-order similarity, the final loss function of SDNE is:

$${L}_{mix}={L}_{2nd}+{\upalpha }{L}_{1st}+\upsilon {L}_{reg}={\Vert (\widehat{X}-X)\odot B\Vert }_{\text{F}}^{2}+\alpha {\sum }_{i,j=1}^{n}{s}_{i,j}{\Vert {y}_{i}-{y}_{j}\Vert }_{2}^{2}+\upsilon {L}_{reg}$$
(17)

where \({L}_{reg}\) is a regularization term, \(\alpha\) controls the weight of the first-order loss and \(\upsilon\) controls the weight of the regularization. The regularization term is:

$$L_{reg} = \frac{1}{2}\sum\limits_{k = 1}^{K} {\left( {\Vert {W}^{\left( k \right)}\Vert }_{F}^{2} + {\Vert {\widehat{W}}^{\left( k \right)}\Vert }_{F}^{2} \right)}$$
(18)
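The regularization term of Eq. (18) and the combination in Eq. (17) can be sketched as follows. This is an illustrative sketch: the function names, the toy weight matrices and the values chosen for \(\alpha\) and \(\upsilon\) are hypothetical, and the two loss values passed to `mixed_loss` are placeholders rather than results from the paper.

```python
def frobenius_sq(W):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(w * w for row in W for w in row)


def regularization(encoder_weights, decoder_weights):
    """L_reg of Eq. (18): half the summed squared norms of all layer weights."""
    return 0.5 * sum(
        frobenius_sq(W) + frobenius_sq(W_hat)
        for W, W_hat in zip(encoder_weights, decoder_weights)
    )


def mixed_loss(l_2nd, l_1st, l_reg, alpha, nu):
    """L_mix of Eq. (17)."""
    return l_2nd + alpha * l_1st + nu * l_reg


# Toy single-layer example (hypothetical weights and hyper-parameters).
l_reg = regularization([[[1.0, 2.0], [3.0, 4.0]]], [[[1.0, 0.0], [0.0, 1.0]]])
l_mix = mixed_loss(l_2nd=1.25, l_1st=2.0, l_reg=l_reg, alpha=0.1, nu=0.01)
```

The regularization acts on both the encoder weights \({W}^{(k)}\) and the decoder weights \({\widehat{W}}^{(k)}\), which discourages overfitting of the autoencoder.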

Integration of feature information

In this study, we first obtained the miRNA sequence similarity and disease semantic similarity and converted them into attribute features \({M}_{sim}(i)\) and \({D}_{sim}(j)\) of the same dimension by a stacked autoencoder. The dimension of \({M}_{sim}(i)\) and \({D}_{sim}(j)\) is 64. After that, the behavior features of miRNAs \({M}_{b}(i)\) and diseases \({D}_{b}(j)\) were extracted from the molecular association network by structural deep network embedding. The dimension of \({M}_{b}(i)\) and \({D}_{b}(j)\) is 128. Finally, a complete feature descriptor was constructed for each sample in the HMDD v3.0 database by concatenating the above information. The feature descriptor is a 384-dimensional vector (128 + 64 + 128 + 64):

$$FD\left(i,j\right)=\left[{M}_{b}\left(i\right),{M}_{sim}\left(i\right),{D}_{b}\left(j\right),{D}_{sim}\left(j\right)\right]$$
(19)
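The concatenation in Eq. (19) is simple, but a short sketch makes the dimension bookkeeping explicit. The function name `feature_descriptor` and the zero-filled placeholder vectors are hypothetical.

```python
BEHAVIOR_DIM = 128   # dimension of M_b(i) and D_b(j)
ATTRIBUTE_DIM = 64   # dimension of M_sim(i) and D_sim(j)


def feature_descriptor(m_b, m_sim, d_b, d_sim):
    """Concatenate the four feature vectors in the order of Eq. (19)."""
    if len(m_b) != BEHAVIOR_DIM or len(d_b) != BEHAVIOR_DIM:
        raise ValueError("behavior features must have 128 dimensions")
    if len(m_sim) != ATTRIBUTE_DIM or len(d_sim) != ATTRIBUTE_DIM:
        raise ValueError("attribute features must have 64 dimensions")
    return list(m_b) + list(m_sim) + list(d_b) + list(d_sim)


# Placeholder vectors (hypothetical values) for one miRNA-disease pair.
fd = feature_descriptor([0.0] * 128, [0.0] * 64, [0.0] * 128, [0.0] * 64)
```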

Convolutional neural network algorithm

A convolutional neural network (CNN) is a deep feedforward neural network built on convolution operations. Owing to its representation learning capability and layered structure, a CNN can classify its input in a shift-invariant manner. CNNs have been successfully applied in bioinformatics55. Therefore, in this paper, we adopted a CNN to train on the feature descriptors and predict potential miRNA-disease associations. Specifically, the CNN has a multi-layer structure including an input layer, convolutional layers, a pooling layer, a fully-connected layer and an output layer, as shown in Fig. 4. The input is a matrix of all feature descriptors \(FD\left(i,j\right)\) with size \(26284\times 384\). The two convolutional layers \(C1\) and \(C2\) use 32 filters and 64 filters, respectively, each with a \(3\times 1\) convolution kernel. We adopted max-pooling with a \(2\times 1\) kernel to subsample \(C2\). After convolution and pooling, the fully-connected layer classifies the features and outputs the probability distribution.
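To make the layer sizes concrete, the following sketch traces the shape of one 384-dimensional descriptor through the described layers. It is not the authors' network definition: it assumes stride 1 and no padding for both convolutions and a pooling stride equal to the pooling kernel, none of which is stated in the text, and the function names are hypothetical.

```python
def conv_out(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def pool_out(length, kernel, stride=None):
    """Output length of 1-D max-pooling (stride defaults to the kernel size)."""
    stride = stride or kernel
    return (length - kernel) // stride + 1


def trace_shapes(input_len=384):
    """Return (length, channels) after each layer for a single descriptor."""
    c1 = conv_out(input_len, kernel=3)   # C1: 32 filters, 3x1 kernel
    c2 = conv_out(c1, kernel=3)          # C2: 64 filters, 3x1 kernel
    pooled = pool_out(c2, kernel=2)      # 2x1 max-pooling on C2
    return {
        "C1": (c1, 32),
        "C2": (c2, 64),
        "pool": (pooled, 64),
        "flatten": pooled * 64,          # input size of the fully-connected layer
    }


shapes = trace_shapes()
```

Under these assumptions the fully-connected layer receives 12,160 values per sample; a different padding or stride choice would change this number.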

Figure 4 Structure of the CNN algorithm.

Results and discussion

Performance evaluation

In this experiment, we implemented five-fold cross validation to evaluate the performance of the proposed model on HMDD v3.037. The known miRNA-disease pairs were randomly split into five disjoint subsets. In each round of cross validation, one of the five subsets was used as the test set and the remaining four as the training set. To avoid leakage of test data, we constructed the heterogeneous information network and extracted the behavior information using only the training data. A set of evaluation criteria was used to assess SDNE-MDA, including accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), Matthews correlation coefficient (MCC) and area under the curve (AUC). As a result, the average Acc., Sen., Spec., Prec., MCC and AUC were 87.38%, 87.28%, 87.47%, 87.45%, 74.76% and 0.9447, with standard deviations of 0.44%, 0.93%, 1.01%, 0.82%, 0.88% and 0.0027, respectively, as shown in Table 3. In addition, the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve of SDNE-MDA on HMDD are shown in Fig. 5.
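The threshold-based criteria listed above follow from the confusion matrix of each fold. The sketch below uses the standard definitions of these metrics; the function name `classification_metrics` and the example counts are hypothetical and are not the counts behind Table 3.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return Acc., Sen., Spec., Prec. and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)    # sensitivity (recall)
    spec = tn / (tn + fp)   # specificity
    prec = tp / (tp + fp)   # precision
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Acc": acc, "Sen": sen, "Spec": spec, "Prec": prec, "MCC": mcc}


# Hypothetical counts for one balanced test fold of 200 pairs.
metrics = classification_metrics(tp=87, fp=12, tn=88, fn=13)
```

Because the positive and negative samples are balanced, accuracy is close to the mean of sensitivity and specificity, and an MCC near 0.75 corresponds to an accuracy near 87.5% in this toy example.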

Table 3 Five-fold cross validation results performed by SDNE-MDA on HMDD v3.0.
Figure 5 The ROC and PR curves performed in terms of five-fold cross validation by SDNE-MDA on HMDD v3.0.

Comparison with different feature extraction methods

In this study, the nodes in the network are represented by attribute and behavior information. Both types of information may influence the prediction result, so we compared three feature extraction schemes: SDNE-MDA_AI, which uses only attribute information; SDNE-MDA_BI, which uses only behavior information; and SDNE-MDA, which uses both. The attribute information of other node types has little effect on the prediction of potential miRNA-disease relationships, so to reduce the redundancy of the model we considered only the attribute information of miRNAs and diseases. The detailed comparison results are shown in Table 4. The accuracy of SDNE-MDA is 7.78% and 3.43% higher than that of SDNE-MDA_AI and SDNE-MDA_BI, respectively. In addition, the AUC of the proposed model is 0.0811 and 0.0260 higher than that of SDNE-MDA_AI and SDNE-MDA_BI, respectively. The ROC and PR curves of the three experiments are shown in Fig. 6. These results indicate that integrating the two kinds of information to represent each node achieves better performance.

Table 4 The comparison results between SDNE-MDA_AI model, SDNE-MDA_BI model and SDNE-MDA model based on HMDD database.
Figure 6 ROC and PR curves performed by SDNE-MDA_AI, SDNE-MDA_BI and SDNE-MDA model in terms of five-fold cross validation based on HMDD database.

Comparison with different classifier models

In this study, the CNN was adopted to train and identify potential relationships between miRNAs and diseases. To further evaluate SDNE-MDA, we compared the proposed model with the Bagging, Logistic Regression, Naive Bayes, Adaboost and MLP classifiers. We implemented five-fold cross validation for these classifiers on HMDD v3.0. The proposed model yielded an average AUC of 0.9447 and outperformed Bagging (0.8998), Logistic Regression (0.9270), Naive Bayes (0.8881), Adaboost (0.9226) and MLP (0.9320). The AUC of the CNN is 0.0259 higher than the mean AUC of all five models, and its accuracy is 1.60% higher than that of the second-best method. The detailed results of the comparison between SDNE-MDA and the other classifier models are shown in Table 5, and the ROC curves are shown in Fig. 7. Therefore, among the classifiers tested, the CNN is the most suitable choice for predicting potential miRNA-disease associations with the proposed model.

Table 5 The comparison results between SDNE-MDA and the other four classifier models in terms of five-fold cross validation based on HMDD v3.0 database.
Figure 7 Performance comparison between SDNE-MDA and the other four classifier models based on HMDD v3.0 database.

Comparison with related work

An increasing number of researchers have focused on the prediction of miRNA-disease associations, and a large number of models have been proposed. To further evaluate the predictive performance of our method, SDNE-MDA was compared with six state-of-the-art methods under five-fold cross validation: RWRMDA56, MTDN57, EGBMMDA32, LMTRDA58, DBMDA59 and PBMDA31. Since these methods did not report multiple evaluation criteria, we compared only the AUC under five-fold cross validation on the HMDD database. The detailed results of the comparison are shown in Table 6. The AUC of the proposed method is 0.0399 higher than the average AUC of all methods and 0.0275 higher than that of the second-best method. This is mainly because SDNE-MDA integrates two types of information about miRNAs and diseases and therefore extracts features more comprehensively. Therefore, the proposed model is an effective and reliable computational tool for predicting potential miRNA-disease associations.

Table 6 The comparison results between SDNE-MDA and other related works.

Case studies

To further evaluate the prediction ability of SDNE-MDA, we implemented case studies on three important human diseases (Breast Neoplasms, Kidney Neoplasms and Lymphoma). The known miRNA-disease associations in the HMDD v3.0 database were used as the training set. To avoid overlap between the training data and the prediction list, the test set consisted of the unknown pairs between the three diseases and all possible miRNAs. As a result, 47, 46 and 46 of the top-50 candidate miRNAs for the three diseases, respectively, were confirmed by independent databases. Therefore, SDNE-MDA is a feasible and reliable model for predicting potential relationships between miRNAs and diseases.
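The case-study procedure amounts to ranking the unknown pairs of one disease by predicted score and counting how many of the top candidates appear in an independent validation set. The sketch below illustrates this step only; the function names, miRNA labels, scores and validation pairs are hypothetical.

```python
def top_candidates(scores, known_pairs, disease, k=50):
    """Rank miRNAs for one disease by score, excluding known training pairs."""
    candidates = [
        (mirna, score)
        for (mirna, dis), score in scores.items()
        if dis == disease and (mirna, dis) not in known_pairs
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]


def count_confirmed(ranked, disease, validation_pairs):
    """Count ranked miRNAs whose pair with the disease is independently confirmed."""
    return sum((mirna, disease) in validation_pairs for mirna, _ in ranked)


# Toy prediction scores and evidence (hypothetical values).
scores = {
    ("miR-1", "Lymphoma"): 0.90,
    ("miR-2", "Lymphoma"): 0.80,
    ("miR-3", "Lymphoma"): 0.95,
    ("miR-1", "Breast Neoplasms"): 0.99,
}
known = {("miR-3", "Lymphoma")}        # already in the training set
validation = {("miR-1", "Lymphoma")}   # e.g. found in an independent database
ranked = top_candidates(scores, known, "Lymphoma", k=2)
confirmed = count_confirmed(ranked, "Lymphoma", validation)
```

Excluding the known pairs before ranking mirrors the requirement that the prediction list must not overlap with the training data.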

Breast Neoplasms are the most common neoplasms in women, and the risk of breast cancer is up to 13% in the United States. Although men may also develop breast cancer, 99% of patients are women. There were approximately 276,480 new cases in women and 42,170 deaths from breast cancer in 202060. In the past few years, studies have indicated that the expression level of miRNAs has a strong impact on the growth and division of breast tumor cells61. Therefore, we implemented a case study of Breast Neoplasms-miRNA associations with SDNE-MDA. In the prediction list shown in Table 7, 47 of the top 50 predicted Breast Neoplasms-related miRNAs were verified by independent databases.

Table 7 Prediction of top 50 miRNAs related to Breast Neoplasms based on known miRNA-disease associations in HMDD v3.0 database.

Kidney Neoplasms are cancers with a relatively high incidence in adults60. In the past few years, the morbidity and mortality of kidney neoplasms have been increasing. In the United States in 2020, there were about 73,750 new cases of kidney neoplasms (about 45,520 in men and about 28,230 in women) and about 14,830 deaths from this cancer (9860 men and 4970 women). Recently, an increasing number of studies have indicated that miRNAs are related to kidney neoplasms62. Thus, we took Kidney Neoplasms as a case study for SDNE-MDA and prioritized the candidate miRNAs. In the prediction list shown in Table 8, 46 of the top-50 potential Kidney Neoplasms-related miRNAs were confirmed by independent databases.

Table 8 Prediction of top 50 miRNAs related to Kidney Neoplasms based on known miRNA-disease associations in HMDD v3.0 database.

Lymphoma is one of the most common malignant cancers (~ 4% of all new cancers) in the United States, especially in teenagers60. Lymphoma mainly comprises two types: Hodgkin Lymphoma (HL) and non-Hodgkin Lymphoma (NHL). In 2020, there were an estimated 85,720 new cases of Lymphoma (47,070 in men and 38,650 in women) and 20,910 deaths from HL and NHL (12,030 in men and 8,880 in women). Therefore, we applied SDNE-MDA to prioritize candidate miRNAs for Lymphoma based on HMDD v3.0. As shown in Table 9, 46 of the top 50 predicted Lymphoma candidate miRNAs were verified by independent databases.

Table 9 Prediction of top 50 miRNAs related to Lymphoma based on known miRNA-disease associations in HMDD v3.0 database.

Conclusion

In the past few years, an accumulating number of studies have demonstrated that miRNAs are closely linked with diseases. A variety of biological experiments and computational methods have been devoted to identifying these associations. In this paper, we proposed a structural deep network embedding-based model, SDNE-MDA, to predict miRNA-disease associations. This model constructed a complex network, MAN, by fusing miRNAs, diseases and three related types of molecules (lncRNA, drug and protein) with their relationships. Through this comprehensive heterogeneous information network, potential miRNA-disease associations can be predicted more accurately and efficiently. A CNN is used to train on and classify the potential miRNA-disease associations. Compared with other classifiers and feature extraction models, SDNE-MDA showed outstanding performance. In addition, case studies were implemented on three important human diseases to further validate the performance of SDNE-MDA. As a result, 47, 46 and 46 of the top-50 predicted miRNAs were confirmed by independent databases. These results demonstrate that SDNE-MDA is a reliable computational tool for predicting miRNA-disease associations.