Inferring pseudogene–miRNA associations based on an ensemble learning framework with similarity kernel fusion

Accumulating evidence shows that pseudogenes can function as microRNA (miRNA) sponges and regulate gene expression. Mining potential interactions between pseudogenes and miRNAs will facilitate the clinical diagnosis and treatment of complex diseases. However, identifying these interactions through biological experiments is time-consuming and labor-intensive. In this study, an ensemble learning framework with similarity kernel fusion, named ELPMA, is proposed to predict pseudogene–miRNA associations. First, four pseudogene similarity profiles and five miRNA similarity profiles are measured based on biological and topological properties. Subsequently, the similarity kernel fusion method is used to integrate the similarity profiles. Then, the feature representation for pseudogenes and miRNAs is obtained by combining the pseudogene–pseudogene similarities and the miRNA–miRNA similarities. Lastly, individual learners are trained on each training subset, and soft voting is used to yield the final decision based on the prediction results of the individual learners. k-fold cross validation is implemented to evaluate the prediction performance of ELPMA. In addition, case studies are conducted on three investigated pseudogenes to validate the performance of ELPMA in predicting pseudogene–miRNA interactions. Overall, the experimental results show that ELPMA is a feasible and effective tool for predicting interactions between pseudogenes and miRNAs.


Materials and methods
Gold standard data set. The pseudogene-miRNA associations are obtained from starBase v2.0, in which the very high stringency setting is selected for pseudogene symbols 22 . After screening and removing redundancy, 1570 experimentally supported pseudogene-miRNA associations are retained, covering 318 pseudogenes and 260 miRNAs. In this study, a pseudogene-miRNA adjacency matrix PM is constructed from the validated associations between pseudogenes and miRNAs. If there is an association between pseudogene p(i) and miRNA m(j), PM(i, j) is set to 1; otherwise, it is set to 0.
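The construction of PM can be illustrated with a minimal sketch. The function name `build_adjacency` and the toy identifiers are hypothetical and are not taken from starBase; only the rule "PM(i, j) = 1 for a validated pair, otherwise 0" comes from the text.

```python
import numpy as np

def build_adjacency(pairs, pseudogenes, mirnas):
    """Return the binary matrix PM with PM[i, j] = 1 for each known pair."""
    p_index = {name: i for i, name in enumerate(pseudogenes)}
    m_index = {name: j for j, name in enumerate(mirnas)}
    pm = np.zeros((len(pseudogenes), len(mirnas)), dtype=int)
    for p, m in pairs:
        pm[p_index[p], m_index[m]] = 1
    return pm

# Toy identifiers for illustration only (assumed, not real data).
pseudogenes = ["pg_a", "pg_b", "pg_c"]
mirnas = ["mir_1", "mir_2"]
# A repeated pair simply sets the same entry to 1 again.
pairs = [("pg_a", "mir_1"), ("pg_c", "mir_2"), ("pg_a", "mir_1")]
PM = build_adjacency(pairs, pseudogenes, mirnas)
```

With the real data set, the same construction would give a 318 × 260 matrix containing 1570 ones.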
Expression similarity for pseudogenes. The expression levels of pseudogenes in various cancers and normal tissues are obtained from the dreamBase database 23 and are used as the characteristic information of pseudogenes. Two pseudogenes with a higher correlation score tend to be more similarly expressed. The pseudogene expression similarity is measured by the correlation between the expression profiles of two pseudogenes, where N is the number of properties of the expression profiles, and $x_k$ and $y_k$ denote the expression values of the two pseudogenes in the k-th cancer or normal tissue.
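The exact correlation formula is not recoverable from the text. As a hedged illustration, the sketch below assumes the Pearson correlation coefficient over the N expression values; the function name `expression_similarity` and the toy matrix are hypothetical.

```python
import numpy as np

def expression_similarity(expr):
    """Pairwise Pearson correlation between rows of expr.

    expr has one row per pseudogene and N columns (tissues or cancers).
    Pearson correlation is an assumption; the source only says "correlation score".
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    # A constant profile has zero variance; it is given zero similarity everywhere.
    norms[norms == 0] = 1.0
    unit = centered / norms[:, None]
    return unit @ unit.T

# Toy expression profiles (assumed values).
expr = [[1, 2, 3, 4],
        [2, 4, 6, 8],
        [4, 3, 2, 1]]
S_exp = expression_similarity(expr)
```

Note that Pearson correlation ranges over [-1, 1]; whether and how the original method rescales it to a non-negative similarity is not stated here.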
Function similarity for miRNAs. miRNAs that target more of the same genes tend to be involved in similar biological functions. The interactions between miRNAs and target genes are obtained from miRTarBase 24 and are employed to measure the function similarity for each pair of miRNAs. If two sets of target genes, $G_i$ and $G_j$, are related to miRNA $M_i$ and miRNA $M_j$, respectively, the miRNA function similarity is calculated from the overlap between $G_i$ and $G_j$.
GIP kernel similarity for pseudogenes and miRNAs. The Gaussian interaction profile (GIP) kernel similarity is applied to calculate the similarity among pseudogenes and among miRNAs based on the known pseudogene-miRNA adjacency matrix 25 . The GIP kernel similarity for pseudogenes is calculated from p(i), the interaction profile of pseudogene i, which is a binary vector that encodes the interactions between pseudogene i and all miRNAs, i.e., the i-th row of the gold standard pseudogene-miRNA adjacency matrix PM. The parameter $\gamma_p$ controls the kernel bandwidth, and $n_p$ is the number of pseudogenes. Similarly, the GIP kernel similarity for miRNAs is calculated from m(i), the interaction profile of miRNA i, which is a binary vector that encodes the interactions between miRNA i and each pseudogene, i.e., the i-th column of PM. The parameter $\gamma_m$ controls the kernel bandwidth, and $n_m$ is the number of miRNAs.
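A minimal sketch of the GIP kernel follows, assuming the commonly used form K(i, j) = exp(-γ ‖p(i) − p(j)‖²) with γ = γ′ / ((1/n) Σ ‖p(i)‖²). The default γ′ = 1 and the function name `gip_kernel` are assumptions, not values stated in the text.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    gamma_prime = 1.0 is an assumed default; the bandwidth is normalized by
    the mean squared norm of the interaction profiles.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    gamma = gamma_prime / mean_sq if mean_sq > 0 else gamma_prime
    # Squared Euclidean distances between all pairs of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

# Toy adjacency matrix: rows are pseudogenes, columns are miRNAs (assumed values).
PM = np.array([[1, 0, 1],
               [1, 0, 1],
               [0, 1, 0]])
K_p = gip_kernel(PM)    # pseudogene kernel, built from rows of PM
K_m = gip_kernel(PM.T)  # miRNA kernel, built from columns of PM
```

Passing PM gives the pseudogene kernel and passing its transpose gives the miRNA kernel, mirroring the row and column profiles described above.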
Hamming profile similarity for pseudogenes and miRNAs. For a pair of vectors of the same length, the Hamming profile is the number of positions at which the corresponding values differ. A higher Hamming profile value indicates that the two vectors are less similar. The Hamming profile similarity for pseudogenes is calculated from $IP(p_i)$, the i-th row of the pseudogene-miRNA adjacency matrix PM. Similarly, the Hamming profile similarity for miRNAs is calculated from $IP(m_i)$, the i-th column of PM.
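The text states only that a larger Hamming profile means lower similarity. The sketch below assumes one simple conversion, similarity = 1 − (number of differing positions) / (vector length); this normalization and the function name `hamming_similarity` are assumptions.

```python
import numpy as np

def hamming_similarity(profiles):
    """Pairwise similarity derived from the Hamming distance between rows.

    Assumed form: 1 - (differing positions) / (profile length).
    """
    profiles = np.asarray(profiles)
    # diff[i, j] counts the positions where rows i and j disagree.
    diff = (profiles[:, None, :] != profiles[None, :, :]).sum(axis=2)
    return 1.0 - diff / profiles.shape[1]

# Toy interaction profiles (assumed values).
profiles = [[1, 0, 1, 0],
            [1, 1, 1, 0],
            [0, 1, 0, 1]]
H = hamming_similarity(profiles)
```

Applying the function to PM gives pseudogene similarities, and applying it to the transpose of PM gives miRNA similarities.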
Cosine similarity for pseudogenes and miRNAs. The cosine similarity algorithm has been widely used in collaborative filtering recommendation. Here, based on the known pseudogene-miRNA associations, the cosine similarity of pseudogenes $p_i$ and $p_j$ is computed from their binary association vectors, where r represents the number of pseudogenes. The binary vector $PM(p_i)$ indicates whether there is an association between pseudogene $p_i$ and each miRNA (row i of the PM matrix; an entry is 1 if $p_i$ is related to the corresponding miRNA, otherwise 0). SP_cos($p_i$, $p_j$) represents the cosine similarity between pseudogenes $p_i$ and $p_j$, and SP_cos is the pseudogene cosine similarity matrix. Similarly, the cosine similarity of miRNAs $m_i$ and $m_j$ is computed from $MP(m_i)$ and $MP(m_j)$, where $MP(m_i)$ denotes whether there is an association between miRNA $m_i$ and each pseudogene (the corresponding column of the matrix; an entry is 1 if $m_i$ is related to the corresponding pseudogene, otherwise 0). SM_cos($m_i$, $m_j$) is the cosine similarity between miRNAs $m_i$ and $m_j$, SM_cos is the miRNA cosine similarity matrix, and n is the number of miRNAs.
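As a hedged illustration, the standard cosine similarity between two association vectors u and v is u·v / (‖u‖ ‖v‖). The sketch below applies it to the rows of a toy matrix; the function name and the handling of all-zero profiles (similarity 0) are assumptions.

```python
import numpy as np

def cosine_similarity_matrix(profiles):
    """Pairwise cosine similarity between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1)
    # Rows with no known association get similarity 0 to everything (assumed convention).
    safe = np.where(norms == 0, 1.0, norms)
    unit = profiles / safe[:, None]
    return unit @ unit.T

# Toy adjacency matrix (assumed values): rows are pseudogenes, columns are miRNAs.
PM = np.array([[1, 1, 0],
               [1, 0, 0],
               [0, 0, 0]])
SP_cos = cosine_similarity_matrix(PM)    # pseudogene cosine similarity
SM_cos = cosine_similarity_matrix(PM.T)  # miRNA cosine similarity
```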
Integrated similarity by similarity kernel fusion method. In this study, four kinds of pseudogene similarities and five miRNA similarities are calculated. The integrated pseudogene similarity is obtained by combining the pseudogene expression similarity, GIP kernel similarity, Hamming profile similarity, and cosine similarity. The integrated miRNA similarity is calculated by combining the miRNA function similarity, GIP kernel similarity, Hamming profile similarity, and cosine similarity. Here, the similarity kernel fusion method is used to fuse the four pseudogene similarities and the five miRNA similarities 26 . Let $S_{p,r}$ (r = 1, 2, …, 4) denote the four pseudogene similarities and $S_{m,n}$ (n = 1, 2, …, 5) denote the five miRNA similarities. First, each original kernel for pseudogenes is normalized by Eq. (9).
where $F_{c,m}$ is a sparse kernel that satisfies $\sum_{c_j \in C} F_{c,m}(c_k, c_j) = 1$, and $N_i$ is the set of neighbors of $p_i$, including $p_i$ itself. The four pseudogene similarities can then be computed by Eq. (11).
where $SP^{t+1}_{p,r}$ is the status matrix of the r-th pseudogene similarity kernel after t + 1 iterations, and $SP^{0}_{p,k}$ denotes the initial status of $S_{p,k}$.
After t + 1 steps, the overall kernel for pseudogenes is calculated as Eq. (12).
Finally, a weight matrix $w_p$ is used to remove the noise in the matrix $S_p$.
The fused pseudogene similarity is computed as Eq. (14).
Similarly, the integrated miRNA similarity $S_m^*$ is computed by fusing the five miRNA similarities.
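Only one ingredient of the fusion procedure is illustrated here: a sparse neighbor kernel whose rows sum to 1, as required of the sparse kernel described above. This is a sketch under assumptions, not the authors' implementation: choosing the k most similar entities as neighbors, the value of k, and the function name are all hypothetical, and the iterative update and weighting steps are not reproduced.

```python
import numpy as np

def sparse_neighbor_kernel(sim, k):
    """Keep each row's k most similar entities (plus the entity itself) and
    rescale the kept entries so that every row sums to 1."""
    sim = np.asarray(sim, dtype=float)
    out = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                 # most similar first
        neighbors = sorted(set(order[:k].tolist()) | {i})
        total = sim[i, neighbors].sum()
        if total > 0:
            out[i, neighbors] = sim[i, neighbors] / total
    return out

# Toy similarity kernel (assumed values).
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
F = sparse_neighbor_kernel(S, k=2)
```

Entries outside the neighbor set are zero, so each row of F only propagates similarity information among local neighbors.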
Ensemble learning framework with resampling method. To predict potential pseudogene-miRNA associations, an ensemble learning framework with the similarity kernel fusion method is proposed. Inspired by previous research 27,28 , the ELPMA model is constructed through the following steps: (1) a resampling method is used to obtain multiple different training subsets, which increases the diversity of the individual learners; (2) soft voting is employed to integrate the prediction results of the individual learners and obtain the final prediction. The process of constructing the ensemble learning framework is shown in Fig. 1.

Resampling strategy.
There are 1570 experimentally confirmed pseudogene-miRNA associations as positive samples and 81,110 unconfirmed pseudogene-miRNA pairs as unlabeled samples, so only a small fraction of all pairs is experimentally confirmed. To address the problem caused by the imbalanced dataset, a resampling strategy is employed to build multiple different balanced training subsets, in which the number of negative samples equals the number of positive samples. When a subset is constructed, all positive samples are selected, and the same number of unlabeled samples are randomly selected as negative samples. The negative samples and the positive samples are then combined into a balanced training subset. The positive sample set P and the unlabeled sample set U are defined over the pseudogene-miRNA pairs, where P represents the positive samples and U denotes the unknown pseudogene-miRNA association samples.
In each training subset, the number of selected unlabeled pseudogene-miRNA pairs is the same as the number of positive samples. The set N (N ⊂ U) represents the negative samples selected from U, and the size of N equals the size of P. The set T = P ⋃ N is the training set for a base learner.
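The resampling strategy can be sketched as follows. The function name `balanced_subsets`, the fixed random seed, and the toy matrix are assumptions; sampling without replacement inside each subset is also an assumption, since the text only says the negatives are selected randomly.

```python
import numpy as np

def balanced_subsets(pm, n_subsets, seed=0):
    """Build training subsets T = P ∪ N with |N| = |P|.

    P holds every known pair (PM = 1); N is drawn at random from the
    unlabeled pairs U (PM = 0), independently for each subset.
    """
    rng = np.random.default_rng(seed)
    pm = np.asarray(pm)
    positives = np.argwhere(pm == 1)   # set P as (i, j) index pairs
    unlabeled = np.argwhere(pm == 0)   # set U as (i, j) index pairs
    subsets = []
    for _ in range(n_subsets):
        chosen = rng.choice(len(unlabeled), size=len(positives), replace=False)
        negatives = unlabeled[chosen]  # set N
        pairs = np.vstack([positives, negatives])
        labels = np.concatenate([np.ones(len(positives), dtype=int),
                                 np.zeros(len(negatives), dtype=int)])
        subsets.append((pairs, labels))
    return subsets

# Toy adjacency matrix with 3 known associations (assumed values).
pm = np.array([[1, 0, 0, 0, 0],
               [0, 0, 1, 0, 0],
               [0, 0, 0, 0, 1],
               [0, 0, 0, 0, 0]])
subsets = balanced_subsets(pm, n_subsets=4)
```

Each subset shares the same positives but draws different negatives, which is what gives the individual learners their diversity.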

Sample representation.
To learn the potential feature representation of pseudogenes and miRNAs, multiple data sources are incorporated to obtain the integrated similarities for pseudogenes and miRNAs. Here, a pseudogene-miRNA pair is taken as a sample. The feature vector of the i-th pseudogene, FP(p(i)), is defined from its integrated similarities to all pseudogenes, where $N_p$ represents the number of pseudogenes. Similarly, the feature vector of the j-th miRNA, FM(m(j)), is defined from its integrated similarities to all miRNAs, where $N_m$ represents the number of miRNAs. Then, the feature vector of each pseudogene-miRNA pair (p(i), m(j)) is defined by combining FP(p(i)) and FM(m(j)).

Soft voting for pseudogene-miRNA association prediction.
Ensemble learning combines multiple individual learners to improve the prediction performance over that of individual models. Because the training subsets differ and the feature spaces of the subsets are heterogeneous, the trained individual learners also differ from each other. In this study, an ensemble learning framework is developed by using XGBoost as the individual learner on the multiple sample subsets. XGBoost is a gradient boosting algorithm that uses regression trees as base functions 29 .
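Returning to the sample representation, the pair feature vector can be sketched as a concatenation. This assumes that FP(p(i)) is row i of the fused pseudogene similarity matrix and FM(m(j)) is row j of the fused miRNA similarity matrix; the function name `pair_features` and the toy matrices are hypothetical.

```python
import numpy as np

def pair_features(sp_fused, sm_fused, pairs):
    """Concatenate row i of the pseudogene similarity matrix with row j of the
    miRNA similarity matrix for every pair (i, j)."""
    sp_fused = np.asarray(sp_fused, dtype=float)
    sm_fused = np.asarray(sm_fused, dtype=float)
    return np.array([np.concatenate([sp_fused[i], sm_fused[j]]) for i, j in pairs])

# Toy fused similarity matrices (assumed values).
SP = np.array([[1.0, 0.2, 0.5],
               [0.2, 1.0, 0.1],
               [0.5, 0.1, 1.0]])
SM = np.array([[1.0, 0.3],
               [0.3, 1.0]])
X = pair_features(SP, SM, pairs=[(0, 1), (2, 0)])
```

Under this assumption, each sample in the real data set would have 318 + 260 = 578 features.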
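Set the output of a tree as $f(x_i) = w_{q(x_i)}$, where $x_i$ is the input vector, q represents the structure of the tree, and $w_q$ represents the scores of the leaf nodes. The output of the set of K trees is $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, where K is the number of regression functions. The objective function for learning the set of $f_k$ is $Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$ with $\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$, where l represents the loss function between the observed value $y_i$ and the predicted value $\hat{y}_i$, $\Omega(f_k)$ is the regularization term that helps avoid overfitting, γ is the pseudo-regularization hyperparameter, λ is the L2 regularization coefficient on the leaf weights, and T is the total number of leaf nodes. The optimal objective function value can be written as $Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$, where $I_j$ is the set of samples assigned to leaf node j, and $g_i$ and $h_i$ are the first and second derivatives of l with respect to the prediction.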
Here, the outputs of XGBoost are taken as primitive results. Soft voting is then used to make the final decision: the prediction scores of the individual learners are averaged to determine whether a pseudogene and a miRNA are associated. Taking an unknown pseudogene-miRNA pair as the input sample, n individual learners produce n prediction results, which are integrated by the soft voting strategy 30 . Specifically, the output for the i-th sample by soft voting is defined as $O(i) = \frac{1}{n} \sum_{j=1}^{n} O(i, j)$, where O(i, j) is the prediction score of the j-th individual learner for the i-th sample and n represents the number of training subsets. O(i) > 0.5 indicates that the pseudogene-miRNA pair is associated; otherwise, the pair is considered not associated.
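The soft voting rule can be written in a few lines. The scores below are made-up numbers standing in for the outputs of three individual learners; in ELPMA they would come from the trained XGBoost models. The function name `soft_vote` is hypothetical.

```python
import numpy as np

def soft_vote(scores, threshold=0.5):
    """Average the per-learner scores and threshold the mean.

    scores has shape (n_learners, n_samples); a pair is called associated
    when its mean score exceeds the threshold (0.5 in the text).
    """
    scores = np.asarray(scores, dtype=float)
    mean_scores = scores.mean(axis=0)
    return mean_scores, mean_scores > threshold

# Hypothetical scores from 3 learners for 3 candidate pairs.
scores = [[0.9, 0.4, 0.2],
          [0.7, 0.6, 0.1],
          [0.8, 0.3, 0.6]]
O, associated = soft_vote(scores)
```

In this toy example the second pair is rejected even though one learner scored it above 0.5, and the third pair is rejected for the same reason; averaging damps the influence of a single outlying learner.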

Results
Performance evaluation. In this work, k-fold cross validation is employed to evaluate the performance of the ELPMA model. The validated pseudogene-miRNA associations are regarded as the positive set, and an equal number of samples are randomly selected from the negative sample set as negative samples. In each round of cross validation, (k − 1) positive subsets and the same number of negative subsets are taken from the k subsets to train the models; the remaining positive subset and negative subset are used for testing. Specifically, fivefold and tenfold cross validation are used to evaluate the prediction performance of the ELPMA model. Moreover, several metrics are used to measure the prediction performance of ELPMA, including precision (Pre), sensitivity (Sen), accuracy (Acc), F1-score, AUC (area under the receiver operating characteristic curve), AUPR (area under the precision-recall curve), and MCC (Matthews correlation coefficient). The threshold-based metrics are calculated as follows: $Pre = \frac{TP}{TP + FP}$, $Sen = \frac{TP}{TP + FN}$, $Acc = \frac{TP + TN}{TP + TN + FP + FN}$, $F1 = \frac{2 \cdot Pre \cdot Sen}{Pre + Sen}$, and $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$, where TP and TN represent the numbers of true positives and true negatives, respectively, and FP and FN represent the numbers of false positives and false negatives, respectively.
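As a worked instance of the threshold-based metrics, the sketch below computes them from confusion-matrix counts using their standard definitions. The counts are invented for illustration and are not results from the paper; returning 0 when a denominator is zero is an assumed convention.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Precision, sensitivity, accuracy, F1-score and MCC from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty denominators

    pre = ratio(tp, tp + fp)
    sen = ratio(tp, tp + fn)
    acc = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * pre * sen, pre + sen)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Pre": pre, "Sen": sen, "Acc": acc, "F1": f1, "MCC": mcc}

# Invented counts for a balanced test fold of 200 pairs.
m = binary_metrics(tp=90, tn=95, fp=5, fn=10)
```

AUC and AUPR are not covered here because they depend on the full ranking of prediction scores rather than on a single threshold.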

Performance analysis of ELPMA method with different individual learners.
To assess the ability of ELPMA to predict associations between pseudogenes and miRNAs, fivefold cross validation is implemented on the gold standard data set. In the ensemble framework, the choice of individual learner could affect the prediction performance. In addition, different hyper-parameters of the ELPMA-AB and ELPMA-RF models are selected to obtain optimal performance. The prediction performance of the ELPMA model with different individual learners is listed in Table 1. When the number of individual learners, n_estimators, and the learning rate are set to 10, 400, and 0.2, respectively, ELPMA-XGB yields a Precision of 0.9716, a Recall of 0.9369, an F1-score of 0.9540, an Acc of 0.9548, an AUC of 0.9897, and an AUPR of 0.9914. As shown in Table 1, ELPMA-XGB outperforms the other models on all seven metrics.
In addition, the ROC curves of ELPMA-XGB under k-fold cross validation are plotted. The experimental results show that ELPMA-XGB achieves mean AUC values of 0.9897 and 0.9906 for fivefold and tenfold cross validation, respectively (Fig. 2). Therefore, XGBoost is an appropriate individual learner for the ELPMA method in predicting pseudogene-miRNA associations.
Influence of training data on model performance. The results in Table S1 show that the performance of the ELPMA model improves as the amount of training data increases. Therefore, the size of the training data has a great influence on the prediction performance of the ELPMA model.

Effectiveness of soft voting for the ensemble learning framework.
To demonstrate the effectiveness of soft voting for the ensemble learning method, the performance of soft voting is compared with that of the individual learners in the ELPMA model. Detailed results of the comparison are shown in Fig. 3. In the figure, the horizontal axis represents the index of the individual learners, and the vertical axis represents the AUC and AUPR values. Fig. 3 shows that, under fivefold cross validation, the AUC of the individual learners is between 0.9823 and 0.9849 and the AUPR is between 0.9849 and 0.9873, both below the AUC of 0.9897 and the AUPR of 0.9914 obtained by ELPMA-XGB with soft voting. The results indicate that soft voting could improve the prediction performance of the ELPMA model. They also indicate that ELPMA is an effective framework for predicting pseudogene-miRNA interactions.
Comparison with other existing methods. To illustrate the superiority of ELPMA, GBDT-LR 10 , ABMDA 31 , CD_LNLP 17 , and LAGCN 20 are compared with ELPMA in predicting pseudogene-miRNA interactions. The five methods are individually evaluated on the gold standard data set with k-fold cross validation and their recommended hyperparameters. As shown in Fig. 4, ELPMA achieves the best average AUC values under fivefold and tenfold cross validation, and its ROC curve lies above those of GBDT-LR, ABMDA, CD_LNLP, and LAGCN in most cases. The average AUC scores of ELPMA reach 0.9897 and 0.9906 for fivefold and tenfold cross validation, respectively, which are superior to those of the other four methods (Fig. 4). In addition, the results for evaluation metrics such as F1-score, Acc, and MCC are shown in Table 2 for fivefold and tenfold cross validation. Although the Precision of ELPMA is inferior to that of ABMDA and the Acc of ELPMA is inferior to those of CD_LNLP and LAGCN, the other evaluation metrics of ELPMA are higher than those of the other methods (Table 2). Furthermore, we used a paired t-test based on 10 runs of fivefold and tenfold cross validation to compare ELPMA with the other methods. Table 3 shows that ELPMA is significantly better than the other computational methods in terms of Sensitivity, F1-score, AUC, AUPR, and MCC. Therefore, the above results show that ELPMA provides a substantial improvement in predicting pseudogene-miRNA interactions.
Case studies. To illustrate the prediction performance of ELPMA in screening pseudogene-miRNA interactions, case studies of miRNAs related to three pseudogenes are conducted for further validation. For each investigated pseudogene, its interactions are treated as unknown among all known associations.
In this section, the known associations of the pseudogenes MSTO2P and MTND4P12 are removed, and the remaining associations are used to train the model and predict the probability that each miRNA is associated with the investigated pseudogene. The candidate associations between the pseudogene and miRNAs scored by ELPMA are sorted in descending order. Then, the top 10 results with the highest probability scores are selected for the three investigated pseudogenes, and the predicted associations are verified with the starBase database. Pseudogene MSTO2P is found to be implicated in several diseases, including lung cancer 32 and colorectal cancer 33 . MSTO2P could function as a miR-128-3p sponge in non-small cell lung cancer (NSCLC) cells, and MSTO2P/miR-128-3p regulates the coptisine sensitivity of NSCLC cells via the TGF-β pathway. In addition, 9 of the top 10 MSTO2P-related miRNAs are confirmed by starBase (Table 4).
MTND4P12 is considered an oncogenic pseudogene that is upregulated in skin cutaneous melanoma, and it can upregulate the expression of the oncogene AURKB by serving as a ceRNA 34 . Hsa-let-7e-5p is also identified as a candidate miRNA regulated by MTND4P12, and hsa-let-7e-5p and MTND4P12 are co-expressed in skin cutaneous melanoma. As shown in Table 4, the top 10 MTND4P12-related miRNAs are supported by starBase.
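The case-study protocol (hide one pseudogene's known associations, score all miRNAs, and inspect the top-ranked candidates) can be sketched as below. The helper names, the toy matrix, and the score vector are hypothetical; the scores stand in for the outputs of a model retrained on the masked matrix.

```python
import numpy as np

def mask_pseudogene(pm, index):
    """Return a copy of PM in which every association of one pseudogene is hidden."""
    masked = np.array(pm, copy=True)
    masked[index, :] = 0
    return masked

def top_k(scores, names, k=10):
    """Rank miRNAs by predicted score in descending order and keep the first k."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")[:k]
    return [(names[j], float(scores[j])) for j in order]

# Toy data (assumed values): 3 pseudogenes x 4 miRNAs.
pm = np.array([[1, 0, 0, 1],
               [0, 1, 0, 1],
               [0, 0, 1, 0]])
names = ["mir_1", "mir_2", "mir_3", "mir_4"]
masked = mask_pseudogene(pm, index=1)     # train on this matrix instead of pm
scores = [0.2, 0.9, 0.5, 0.7]             # placeholder scores for pseudogene 1
ranked = top_k(scores, names, k=2)
held_out = {names[j] for j in np.flatnonzero(pm[1])}
recovered = [name for name, _ in ranked if name in held_out]
```

Counting how many top-ranked candidates fall in the held-out set corresponds to the "9 of the top 10" style of validation reported in Table 4.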

Conclusion
Increasing evidence shows that both pseudogenes and miRNAs play oncogenic or tumor-suppressive roles in disease progression. Predicting pseudogene-miRNA associations will contribute to understanding the pathological mechanisms, diagnosis, and treatment of diseases. In this work, a computational method named ELPMA, which employs an ensemble learning framework with similarity kernel fusion, is proposed to infer the associations between pseudogenes and miRNAs. Compared with four other models, the proposed method achieves strong performance in predicting pseudogene-miRNA interactions. The case studies of MSTO2P- and MTND4P12-related miRNAs also indicate that the ELPMA method is reliable and effective.
The good performance of the ELPMA method is attributed to three main factors: (1) ELPMA integrates biological information, including pseudogene expression profiles and miRNA-target interactions. (2) ELPMA introduces the resampling method to address the problem caused by the imbalanced pseudogene-miRNA dataset.
(3) The use of XGBoost as the individual learner of the ensemble learning framework helps the model learn informative combinations of features from the feature representation.
There are also some limitations in the ELPMA method. First, the gold standard pseudogene-miRNA associations may contain noise, and the negative samples are randomly selected from the unconfirmed associations, which limits the prediction performance. In addition, the ELPMA method relies on the known pseudogene-miRNA interaction network, so it cannot predict interactions for novel pseudogenes or miRNAs that have no known associations. Therefore, developing more effective frameworks is essential for inferring the associations between pseudogenes and miRNAs.

Data availability
The data will be made available on request from the corresponding author.