MRDPDA: A multi‐Laplacian regularized deepFM model for predicting piRNA‐disease associations

Abstract PIWI-interacting RNAs (piRNAs) are a typical class of small non-coding RNAs that are essential for gene regulation, genome stability and other processes. Accumulating studies have revealed that piRNAs have significant potential as biomarkers and therapeutic targets for a variety of diseases. However, current computational methods face challenges in effectively capturing piRNA-disease associations (PDAs) from limited data. In this study, we propose a novel method, MRDPDA, for predicting PDAs based on limited data from multiple sources. Specifically, MRDPDA integrates a deep factorization machine (deepFM) model with regularizations derived from multiple yet limited datasets, utilizing separate Laplacians instead of a simple averaged similarity network. Moreover, a unified objective function that combines the embedding losses of the similarities is proposed to ensure that the embedding is suitable for the prediction task. In addition, a balanced benchmark dataset based on piRPheno is constructed and a deep autoencoder is applied to create a reliable negative set from the unlabeled data. Compared with three recent methods, MRDPDA achieves the best performance on the piRPheno dataset in the five-fold cross-validation test and on the independent test set, and case studies further demonstrate the effectiveness of MRDPDA.

Recent studies [4][5][6] systematically evaluate 29 state-of-the-art miRNA-disease association (MDA) prediction models that utilize data and model fusion, and also discuss trends in MDA research, such as multiclass MDA classification.
CircRNA and lncRNA are two important kinds of ncRNA that have attracted widespread attention in gene regulation and disease research. More and more circRNA-disease associations (CDAs) and lncRNA-disease associations (LDAs) have been identified by biological experiments, and over the past decade scientists have proposed a number of computational CDA and LDA models. Wang et al. 7 introduce 27 CDA models, which are divided into two categories, namely network algorithm-based and machine learning-based models. These two classes of methods are also widely used in LDA studies. 8 Nowadays, deep learning techniques, particularly graph neural networks (GNNs) such as the graph autoencoder, are becoming increasingly prevalent. 9 In summary, MDA, CDA and LDA studies have matured after a long period of development, and researchers have started to turn their attention to PDA studies. piRNAs are the largest and most heterogeneous class of small non-coding RNAs (sncRNAs), 24-32 nt in length, and have been discovered only in animals. 10 Different from the above-mentioned ncRNAs, piRNAs bind to PIWI proteins to form RNA-protein complexes, which are important players in the regulation of various biological processes, including the silencing of transposable elements (TEs) 11,12 and the expression of mRNAs and lncRNAs. 13,14 Over the past decade, growing evidence has shown that piRNAs are expressed in cancer cells 15 and that the regulation of PIWI-piRNA complexes is related to human complex diseases. Ding et al. 16 find that piR-823 induces cancer cell stemness in the luminal subtype of breast cancer cells by increasing the expression of DNMT genes. Amaar et al. 17 identify specific piRNAs that play oncogenic (piR-34871 and piR-52200) and tumour suppressor (piR-35127 and piR-46545) roles in lung cancer. In short, piRNAs are potential biomarkers and drug targets for the diagnosis, prognosis and therapeutic assessment of complex diseases.

So far, many biological experimental techniques have been used to explore associations between piRNAs and specific human diseases, such as small RNA sequencing and ribosome experiments. Although these experiments can accurately identify genuine PDAs, they are time consuming and laborious. With the development of piRNA studies and the accumulation of data, public databases of PDAs have been established, which makes it feasible to detect PDAs with computational methods. However, current computational models for other ncRNAs are not fully applicable to PDA research for two reasons. Firstly, piRNA research started relatively late, and PDA data, although accumulating, are not as abundant as MDA data. Secondly, piRNAs are much shorter than circRNAs and lncRNAs, so piRNA sequences carry less information and are less robust to noise.
Consequently, researchers have started to develop specialized PDA methods.
To the best of our knowledge, the first PDA prediction model was proposed in 2020, and several machine learning algorithms have been developed since then. Eighteen existing PDA methods (published before December 2023) are listed in Table 1 and grouped into three categories: traditional machine learning methods, GNNs and other deep learning models.
In the early stage of PDA research, traditional machine learning methods such as random forest, SVM and logistic regression were mainly used. iPiDi-PUL 18 is a representative method and the first PDA predictor based on positive-unlabeled (PU) learning, using random forest with key features extracted by PCA. This group also includes SPRDA (2020), 19 iPiDA-LTR 20 and SPRDA (2022). 21 These methods are theoretically clear, but their performance needs further improvement and few web services are available.
Recently, deep learning models with excellent representation learning ability have been widely applied to PDA studies, and GNNs are the most popular technique. Methods include GAPDA, 22 GAPDA-LGAT, 23 iPiDA-GCN, 24 PDA-PRGCN, 25 ETGPDA, 26 iPiDA-SWGCN, 27 PDA-GCN 28 and CLPiDA. 29 These studies treat PDA data as dichotomous or heterogeneous networks and use GNNs to cast the PDA problem as a link prediction task, capturing the nonlinear relationships between piRNAs and diseases. However, some of these methods achieve good performance only on a tailored PDA dataset in which the degree of every piRNA node is greater than 1. Furthermore, GNN-based methods may suffer from memory explosion due to full-graph training.
In addition to GNNs, other deep learning technologies such as CNNs and stacked autoencoders have been used in PDA studies. These methods include iPiDA-sHN, 30 APDA, 31 iPiDA-GBNN, 32 DFL-PiDA, 33 piRDA 34 and MSRDA. 35 Although they have contributed to the discovery of PDAs, their generalization ability needs to be strengthened and no code is available.
Besides the limitations of existing PDA methods, the quality and quantity of PDA data also deserve attention. There are only three PDA-specific databases (piRDisease 1.0, piRPheno 2.0 and MNDR 3.0), and just two of them (piRDisease 1.0 and MNDR 3.0) have been used to construct PDA datasets. Among these, the datasets based on piRDisease are the most popular. However, the number of samples in piRDisease is limited and it has not been updated for five years. In addition, owing to experimental conditions and literature records, these databases contain bias and uncertainty. The datasets are highly imbalanced and only a very small number of diseases are associated with piRNAs. It is also worth noting that there are strong inclusion relations between diseases, which do not appear among piRNAs.
In this study, we design a novel method, MRDPDA, for predicting PDAs. Specifically, MRDPDA effectively integrates a deep factorization machine (deepFM) model with regularizations from several separate Laplacians calculated from multiple yet limited datasets. MRDPDA utilizes three types of piRNA information (piRNA k-mer, piRNA local alignment and Gaussian interaction profile (GIP) kernel information) and three types of disease information (disease semantic, disease symptom and GIP kernel information).
Instead of simply calculating an average similarity network from these multiple similarities, 36 MRDPDA regularizes the model with each similarity separately and combines the corresponding embedding losses in a unified objective function, so that the embedding is suitable for the prediction task. In addition, a relatively balanced benchmark dataset is built from piRPheno, where a deep autoencoder is used to construct a reliable negative set from the unlabeled samples. Experiments with cross-validation, an independent test and case studies fully demonstrate that MRDPDA outperforms the competing methods in predicting potential PDAs. The major contributions of this study are as follows:
• Propose a novel method (MRDPDA) to predict PDAs, where a deepFM is regularized by multiple Laplacians.
• Construct a relatively balanced benchmark dataset based on piR-Pheno and build a reliable negative set by a deep autoencoder.
• Demonstrate that MRDPDA can effectively integrate multisource yet limited data.

| Benchmark dataset construction
With the development of PDA studies, three PDA-specific databases with experimental validation (piRDisease 1.0, piRPheno 2.0 and MNDR 3.0) have been created. Following the positive-unlabeled formulation of Wei et al., 18 the total sample set can be written as D_all = D_P ∪ D_U, where D_all contains all piRNA-disease pairs over all piRNAs and diseases, D_P is the positive set and D_U is the unlabeled set. The relationship among the three PDA databases is represented by a Venn diagram 37 as shown in Figure 1.
In our study, the PDA data are represented as a PDA matrix A: if the i-th piRNA is experimentally validated to be associated with the j-th disease, the value of a_ij is 1, and 0 otherwise.
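As a minimal illustration (the function and variable names below are ours, not those of the original implementation), the matrix A can be assembled from a list of validated pairs as follows:

```python
import numpy as np

def build_association_matrix(pairs, piRNAs, diseases):
    """Build the binary PDA matrix A (m piRNAs x n diseases).

    pairs    : iterable of (piRNA_name, disease_name) validated associations
    piRNAs   : list of m piRNA names defining the row order
    diseases : list of n disease names defining the column order
    """
    p_idx = {p: i for i, p in enumerate(piRNAs)}
    d_idx = {d: j for j, d in enumerate(diseases)}
    A = np.zeros((len(piRNAs), len(diseases)), dtype=np.int8)
    for p, d in pairs:
        A[p_idx[p], d_idx[d]] = 1   # a_ij = 1 for experimentally validated pairs
    return A
```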

| Similarity calculation
To obtain both piRNA-related and disease-related information from comprehensive views, piRNA-piRNA and disease-disease similarities were calculated with various datasets, which are introduced as follows:

| piRNA-piRNA similarity
By name mapping, we collected piRNA sequences from piRBase V3.0 and calculated three piRNA similarities, which are described below.

• piRNA sequence local alignment similarity
The sequence information contains the attribute information of piRNAs, and the Smith-Waterman alignment algorithm can effectively capture functional similarities among piRNAs. In our study, the piRNA sequence similarity PS_1(p_i, p_j) between the i-th and j-th piRNAs is computed as:

PS_1(p_i, p_j) = SW(p_i, p_j) / sqrt(SW(p_i, p_i) × SW(p_j, p_j)),

where SW(p_i, p_j) is the sequence alignment score between the i-th and j-th piRNAs, calculated by the Smith-Waterman alignment algorithm.
• piRNA sequence k-mer similarity

The k-mer term frequency (TF) typically refers to the frequency of specific k-grams in nucleic acid or amino acid sequences and has been widely used to describe the components of biosequences. Inspired by natural language processing methods, we treat a piRNA sequence as a short text and a k-mer as a word. The word frequency is the number of times a word occurs in a corpus of texts; accordingly, we use a sliding window to obtain the k-mer frequency vector of a piRNA as its word frequency vector. The cosine similarity between two such vectors is used to measure the piRNA sequence similarity:

PS_2(p_i, p_j) = (v_{p_i} · v_{p_j}) / (||v_{p_i}|| ||v_{p_j}||),

where v_{p_i} and v_{p_j} denote the k-mer frequency vectors of the i-th and j-th piRNAs.
• piRNA GIP kernel similarity

The GIP kernel measures the similarity of biological entities based on their interaction profiles and has been successfully applied in PDA studies. We compute the GIP kernel similarity as follows:

PS_3(p_i, p_j) = exp(−γ_p ||A(i,·) − A(j,·)||²), with γ_p = p′ / ((1/m) Σ_{i=1}^{m} ||A(i,·)||²),

where A is the association adjacency matrix between m piRNAs and n diseases, A(i,·) is the i-th row of A representing the disease-interaction profile of the i-th piRNA, and the parameter p′ for controlling the kernel bandwidth is set to 1.
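As a rough sketch of the three piRNA similarities described above, the following Python code computes PS_1, PS_2 and PS_3 for a set of sequences and the association matrix A. It assumes Biopython's local aligner for the Smith-Waterman score (the scoring parameters are placeholders) and a DNA-style ACGT alphabet for the k-mer vocabulary; it is illustrative rather than the authors' implementation.

```python
import numpy as np
from itertools import product
from Bio import Align

def sw_similarity(seqs):
    """PS1: normalized Smith-Waterman local alignment similarity."""
    aligner = Align.PairwiseAligner()
    aligner.mode = "local"                       # Smith-Waterman local alignment
    aligner.match_score, aligner.mismatch_score = 2, -1   # placeholder scoring
    aligner.open_gap_score = aligner.extend_gap_score = -1
    m = len(seqs)
    sw = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):
            sw[i, j] = sw[j, i] = aligner.score(seqs[i], seqs[j])
    self_sqrt = np.sqrt(np.diag(sw))
    return sw / np.outer(self_sqrt, self_sqrt)   # SW(i,j)/sqrt(SW(i,i)*SW(j,j))

def kmer_similarity(seqs, k=3, alphabet="ACGT"):
    """PS2: cosine similarity between k-mer term-frequency vectors."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    V = np.zeros((len(seqs), len(index)))
    for r, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            km = s[i:i + k]
            if km in index:                      # skip k-mers with ambiguous bases
                V[r, index[km]] += 1
    V /= np.clip(np.linalg.norm(V, axis=1, keepdims=True), 1e-12, None)
    return V @ V.T

def gip_kernel(A, bandwidth=1.0, axis="piRNA"):
    """PS3 / DS3: Gaussian interaction profile kernel similarity.

    axis="piRNA" uses the rows A(i,·); axis="disease" uses the columns A(·,j).
    The bandwidth argument corresponds to p' (or d'), set to 1 in this study.
    """
    P = A if axis == "piRNA" else A.T
    gamma = bandwidth / np.mean(np.sum(P ** 2, axis=1))
    sq = np.sum(P ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2 * P @ P.T
    return np.exp(-gamma * dists)
```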

| Disease-disease similarity
By collecting disease information from MeSH and the human symptoms-disease network, we calculate three disease similarities, which are described below.

• Disease symptom similarity
The disease symptom similarity is downloaded from the human symptoms-disease network data of Zhou et al. 38 Here, the weight DS_1(d_i, d_j) between the i-th and j-th diseases quantifies the similarity of their respective symptoms.

• Disease semantic similarity

Disease semantic similarity has been successfully used in various ncRNA-disease association prediction studies. Using MeSH information, a directed acyclic graph (DAG) is constructed from the hierarchical descriptors, in which nodes represent disease terms and edges represent the relationships between each node and its ancestors. DAG(d_i) = (T_{d_i}, E_{d_i}) describes the structure of the i-th disease d_i, where T_{d_i} is the set of nodes including all ancestors of d_i as well as d_i itself, and E_{d_i} stands for the corresponding direct edges. D_{d_i}(t) denotes the semantic contribution of the disease term t ∈ T_{d_i} to the i-th disease and is computed as:

D_{d_i}(t) = 1 if t = d_i; D_{d_i}(t) = max{Δ · D_{d_i}(t′) | t′ is a child of t in DAG(d_i)} otherwise,

where Δ is the semantic contribution factor (Δ = 0.5 is the default in this study). The disease semantic similarity between the i-th and j-th diseases is then defined as:

DS_2(d_i, d_j) = Σ_{t ∈ T_{d_i} ∩ T_{d_j}} (D_{d_i}(t) + D_{d_j}(t)) / (Σ_{t ∈ T_{d_i}} D_{d_i}(t) + Σ_{t ∈ T_{d_j}} D_{d_j}(t)).

• Disease GIP kernel similarity

The disease GIP kernel is computed in the same way as the piRNA GIP kernel similarity:

DS_3(d_i, d_j) = exp(−γ_d ||A(·,i) − A(·,j)||²), with γ_d = d′ / ((1/n) Σ_{i=1}^{n} ||A(·,i)||²),

where A(·,i) is the i-th column of A representing the piRNA-interaction profile of the i-th disease. The parameter d′ for controlling the kernel bandwidth is also set to 1.
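The DAG-based semantic contribution and similarity can be sketched as below. This is a minimal sketch assuming a `parents` mapping from each MeSH disease term to its direct ancestors has already been extracted from the MeSH descriptors; the function names are ours (the disease GIP kernel can be obtained from the `gip_kernel` sketch above with axis="disease").

```python
def semantic_contributions(d, parents, delta=0.5):
    """D_d(t) for every ancestor t of disease d, following the DAG definition."""
    contrib = {d: 1.0}
    frontier = [d]
    while frontier:
        t = frontier.pop()
        for anc in parents.get(t, []):
            value = delta * contrib[t]
            if value > contrib.get(anc, 0.0):   # keep the maximum contribution
                contrib[anc] = value
                frontier.append(anc)
    return contrib

def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """DS2(d_i, d_j): shared contributions over the total semantic values."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    num = sum(c_i[t] + c_j[t] for t in shared)
    den = sum(c_i.values()) + sum(c_j.values())
    return num / den if den else 0.0
```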

| Selection of reliable negative samples
If negative samples are randomly selected from the unlabeled samples, noise may be introduced because not all unlabeled samples are truly negative. To build an accurate model, reliable negative samples are chosen with a deep autoencoder-based negative sample selection model in this study.
First, each row of the PDA matrix A is used as a piRNA feature representation, while each column of A is used as a disease feature representation. A deep autoencoder is trained to learn the latent features of all positive PDA samples in D_P. The i-th piRNA-disease sample is defined as x_i = [D_d, P_p] ∈ R^{m+n}, where D_d is the feature representation of the disease in x_i and P_p is the feature representation of the piRNA in x_i. Using x_i as input, the encoder extracts features in a low-dimensional latent space z_i and the decoder reconstructs the input x_i from the latent representation z_i. The loss of the autoencoder is the average reconstruction error over all training samples:

Loss = (1/K) Σ_{i=1}^{K} (1/(m+n)) Σ_{j=1}^{m+n} (x_ij − x′_ij)²,   (3)

where x′_i is the reconstruction of the input x_i, K is the total number of PDA samples in D_P, m + n is the length of the representation vector of a piRNA-disease pair, and x_ij is the j-th element of the vector x_i. All parameters of the autoencoder are updated iteratively by minimizing this loss.
Finally, all unknown PDA pairs are fed into the trained model and the reconstruction error is calculated for each unknown piRNA-disease pair. The reconstruction error scores are sorted in descending order and the unlabeled samples are divided into three clusters of nearly equal size according to the scores. The second cluster is considered to contain the samples with the smallest chance of being false negatives. To keep the negative and positive training samples balanced, samples randomly selected from the second cluster, with the same size as the positive set, are taken as reliable negative samples.
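A condensed PyTorch sketch of this selection procedure is given below. Layer sizes follow the hyper-parameters reported later in the paper, but full-batch training is used for brevity whereas the study trains with mini-batches of 64, so this is an approximation of the described pipeline rather than the released code.

```python
import numpy as np
import torch
import torch.nn as nn

class PairAutoencoder(nn.Module):
    """Autoencoder over concatenated [disease profile, piRNA profile] vectors."""
    def __init__(self, dim, hidden=(256, 64, 32)):
        super().__init__()
        h1, h2, latent = hidden
        self.encoder = nn.Sequential(
            nn.Linear(dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, latent))
        self.decoder = nn.Sequential(
            nn.Linear(latent, h2), nn.ReLU(),
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def select_reliable_negatives(pos_x, unl_x, n_neg, epochs=200, seed=0):
    """Train on positive pairs, score unlabeled pairs, sample from the middle cluster."""
    torch.manual_seed(seed)
    model = PairAutoencoder(pos_x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    pos = torch.as_tensor(pos_x, dtype=torch.float32)
    for _ in range(epochs):                       # full-batch training for brevity
        opt.zero_grad()
        loss = loss_fn(model(pos), pos)           # mean reconstruction error over D_P
        loss.backward()
        opt.step()
    with torch.no_grad():
        unl = torch.as_tensor(unl_x, dtype=torch.float32)
        errors = ((model(unl) - unl) ** 2).mean(dim=1).numpy()
    order = np.argsort(-errors)                   # sort by reconstruction error, descending
    clusters = np.array_split(order, 3)           # three clusters of roughly equal size
    rng = np.random.default_rng(seed)
    return rng.choice(clusters[1], size=n_neg, replace=False)   # indices into unl_x
```

The returned indices select as many reliable negatives as there are positives, matching the balance requirement described above.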

| Prediction model construction
DeepFM is an end-to-end learning framework combining a factorization machine (FM) and a multi-layer perceptron (MLP). It can automatically learn both low- and high-order feature interactions and has been successfully used in click-through rate (CTR) prediction 39 and MDA prediction. 36 To further enhance the prediction performance, we present an improved deepFM model, MRDPDA, which is regularized separately by each of multiple Laplacians instead of one Laplacian obtained by simply superimposing the multiple piRNA and disease similarities.
The proposed MRDPDA includes three parts: the embedding layer, the FM component and the MLP component. The FM and MLP components share the same inputs from the embedding layer and the final output is computed from the outputs of both components:

ŷ = sigmoid(y_FM + y_MLP).

The proposed framework is shown in Figure 2 and the details are described below.

| Embedding layer
Ding et al. 36 demonstrate that the embedding layer plays an important role in deepFM, so we aim to improve the embedding layer by Laplacian regularization of the separate similarities. The structure of the embedding layer with similarity-level data fusion is shown in Figure 3. As described previously, PS^1_ij, PS^2_ij and PS^3_ij are the local alignment, k-mer and GIP kernel similarities of the i-th and j-th piRNAs, respectively. Taking the piRNA sequence local alignment similarity as an example, the Laplacian regularization is formulated as the following minimization problem:

min_{Q^1} R_1 = (1/2) Σ_{i,j} PS^1_ij ||Q^1_i − Q^1_j||² = Tr(Q^1 L^1_p (Q^1)^T), with L^1_p = D_p − PS^1,

where D_p is a diagonal matrix with D_p,ii = Σ_{j=1}^{N} PS^1_ij, ||·||² is the squared 2-norm, Tr is the matrix trace, Q^1 is the latent feature matrix whose i-th column is the latent vector of the i-th piRNA, and Q^1_i and Q^1_j are the column latent vectors of the i-th and j-th piRNAs with respect to the piRNA sequence local alignment similarity. In the same way, the Laplacian regularization for the k-th piRNA similarity and the t-th disease similarity can be written as the minimization of R_k and R_t:

min_{Q^k} R_k = Tr(Q^k L^k_p (Q^k)^T),   min_{U^t} R_t = Tr(U^t L^t_d (U^t)^T),

where Q^k_i and Q^k_j are the column latent vectors of the i-th and j-th piRNAs for the k-th piRNA similarity, U^t_i and U^t_j are the column latent vectors of the i-th and j-th diseases for the t-th disease similarity, Q^k is the latent feature matrix whose i-th column is the latent vector of the i-th piRNA, U^t is the corresponding matrix for the t-th disease similarity, and L^k_p and L^t_d are the Laplacian matrices.

Suppose the loss function of MRDPDA is J(Q^k, U^t, O), where Q^k and U^t are the latent feature matrices of piRNAs and diseases in the embedding layer, O is the set of all remaining parameters and J is the cross-entropy loss. We propose a unified and extended objective function F(Q^k, U^t, O), which combines the cross-entropy loss and the Laplacian regularizations:

F(Q^k, U^t, O) = J(Q^k, U^t, O) + Σ_k λ_{p_k} R_k + Σ_t λ_{d_t} R_t,

where λ_{p_k} and λ_{d_t} are regularization parameters that balance the MRDPDA loss term and the regularization terms. It is noteworthy that the sparse disease similarity matrix from MeSH can cause the Laplacian regularization to fail: if a disease has no similarity information, a small perturbation of 0.001 is placed in the corresponding row and column of that disease to keep the regularization well defined. Similarity values from different views reflect associations under different criteria, and the latent vectors obtained by regularizing each similarity individually are used for the node weights in the different fields of the embedding layer. The comprehensive embedding vectors of a piRNA and a disease are obtained by averaging the corresponding embedding vectors. Following previous practice, 36 low-dimensional feature representations from the Laplacian eigenmaps of each node are used to initialize the node's weights in the dense embedding layer.
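A small numpy sketch of these Laplacian regularizers and the unified objective is given below; λ stands in for the regularization coefficients (whose exact symbol is not legible in the source), and each embedding matrix stores one latent vector per column, as in the text.

```python
import numpy as np

def laplacian(S):
    """Graph Laplacian L = D - S of a similarity matrix S."""
    return np.diag(S.sum(axis=1)) - S

def laplacian_reg(Q, S):
    """Tr(Q L Q^T): small when similar nodes receive similar latent vectors.

    Q : latent feature matrix whose i-th column embeds node i
    S : the corresponding similarity matrix
    """
    return np.trace(Q @ laplacian(S) @ Q.T)

def unified_objective(ce_loss, piRNA_embeds, piRNA_sims,
                      disease_embeds, disease_sims, lam=0.01):
    """F = cross-entropy loss + weighted sum of the six Laplacian regularizers."""
    reg = sum(laplacian_reg(Q, S) for Q, S in zip(piRNA_embeds, piRNA_sims))
    reg += sum(laplacian_reg(U, S) for U, S in zip(disease_embeds, disease_sims))
    return ce_loss + lam * reg
```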

| FM and MLP components
Standard FM and MLP components are used in our model; the FM component is responsible for the interactions among low-order features, while the MLP component captures high-order interactions.
As shown in Figure 2, the output of the FM component is the sum of an addition unit and inner product units. The addition unit reflects the linear (order-1) interactions among features and the inner products represent the effect of order-2 feature interactions. The output of FM is:

y_FM = ⟨w, x⟩ + Σ_{i=1}^{d} Σ_{j=i+1}^{d} ⟨V_i, V_j⟩ x_i x_j,

where x = [x_Field_piRNA, x_Field_disease] is the d-dimensional vector containing the one-hot raw feature vectors of a piRNA and a disease, w ∈ R^d contains the network parameters measuring the impact of order-1 feature interactions, and V_i and V_j are the latent vectors corresponding to the raw features x_i and x_j, respectively.
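For intuition, the order-2 term can be evaluated without the explicit double sum by using the standard FM identity; the following numpy sketch (names are ours) illustrates this for a single input vector.

```python
import numpy as np

def fm_output(x, w, V):
    """FM output: order-1 term plus pairwise (order-2) interactions.

    x : (d,)   one-hot input vector for one piRNA-disease pair
    w : (d,)   order-1 weights
    V : (d, k) latent vectors V_i of the raw features
    """
    order1 = w @ x
    # sum_{i<j} <V_i, V_j> x_i x_j computed in O(d k) via the FM identity
    xv = V * x[:, None]
    order2 = 0.5 * float(np.sum(np.sum(xv, axis=0) ** 2 - np.sum(xv ** 2, axis=0)))
    return order1 + order2
```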
The MLP component is a feed-forward multi-layer neural network used to learn high-order feature interactions from the embedding vectors. The comprehensive embedding vectors of the piRNA and the disease are denoted e_p and e_d, respectively. The input of the MLP component x^(0), the forward process x^(l+1) and the output of the MLP y_MLP are:

x^(0) = [e_p, e_d],   x^(l+1) = σ(W^(l) x^(l) + b^(l)),   y_MLP = W^(H) x^(H) + b^(H),

where σ is the activation function, W^(l) and b^(l) are the weights and bias at the l-th layer, and W^(H) and b^(H) are those after the H hidden layers of the MLP.
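A compact PyTorch sketch of how the FM and MLP components can be combined for one piRNA field and one disease field is shown below. Dimensions and layer sizes mirror the reported hyper-parameters, but this is an illustrative simplification rather than the authors' released model.

```python
import torch
import torch.nn as nn

class SimpleDeepFM(nn.Module):
    """Two-field deepFM: FM models order-1/order-2 interactions, MLP higher orders."""
    def __init__(self, n_piRNA, n_disease, emb_dim=20, hidden=(64, 32)):
        super().__init__()
        self.n_piRNA = n_piRNA
        n_features = n_piRNA + n_disease
        self.w = nn.Embedding(n_features, 1)          # order-1 weights
        self.emb = nn.Embedding(n_features, emb_dim)  # latent vectors V_i / dense embeddings
        layers, dim = [], 2 * emb_dim
        for h in hidden:
            layers += [nn.Linear(dim, h), nn.ReLU(), nn.Dropout(0.5)]
            dim = h
        layers.append(nn.Linear(dim, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, p_idx, d_idx):
        # two one-hot fields, the disease indices offset past the piRNA indices
        idx = torch.stack([p_idx, self.n_piRNA + d_idx], dim=1)   # (batch, 2)
        e = self.emb(idx)                                         # (batch, 2, emb_dim)
        # FM: order-1 term plus the inner product of the two fields' latent vectors
        y_fm = self.w(idx).sum(dim=(1, 2)) + (e[:, 0] * e[:, 1]).sum(dim=1)
        # MLP: high-order interactions over the concatenated embeddings
        y_mlp = self.mlp(e.flatten(start_dim=1)).squeeze(1)
        return torch.sigmoid(y_fm + y_mlp)
```

In training, the binary cross-entropy loss on this sigmoid output would be combined with the Laplacian regularizers from the previous sketch to form the unified objective F.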

| Evaluation methods and hyper-parameters
In our study, five-fold cross-validation and an independent test are employed to evaluate model performance. All experiments are run on a computer with an NVIDIA GeForce RTX 4070 graphics card, an i7-10700KF CPU, 32 GB of RAM and 4.5 TB of storage. The area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) are used as performance metrics; the higher the AUC and AUPR values, the better the predictive performance of the model.
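For reference, both metrics can be computed directly with scikit-learn; average precision is used here as the AUPR estimate.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_score):
    """AUC and AUPR for one cross-validation fold or test set."""
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "AUPR": average_precision_score(y_true, y_score),
    }
```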
Many hyper-parameters influence the prediction performance.
These parameters were empirically selected or chosen based on performance evaluation in this study. In the similarity calculation step, the hyper-parameter k denotes the length of the k-mer fragment. If k is too small, the k-mer fragments cannot fully reflect the characteristics of a piRNA sample, which affects the prediction performance; if k is too large, it brings a large computational overhead. The predictive results for different values of k are illustrated in Figure 4. Taking both prediction performance and computational efficiency into account, k is empirically set to 3.

FIGURE 4 The predictive results for different values of k.

In the reliable negative sample selection step, an autoencoder with a latent size of 32 is used. The hidden layer dimensions are 256, 64, 32, 64 and 256, and the input and output dimensions are 564. Mini-batch gradient descent with a batch size of 64 and the Adam optimizer are applied. In the prediction model construction part, the embedding size embedding_size is the first key hyper-parameter and is searched over {5, 10, 15, …, 55, 60}; the five-fold CV results show that the optimal embedding_size value is 20. The regularization coefficients λ_{p_k} and λ_{d_t} are set to 0.01. The hidden layer dimensions of the MLP part are 64 and 32, the activation function is ReLU, the learning rate is 0.001, the batch size is 64 and the dropout rate is set to 0.5.
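For convenience, the reported hyper-parameters can be gathered into a single configuration sketch; the dictionary layout and key names are ours, not those of the original code.

```python
# Hyper-parameters reported in the study, gathered in one place for clarity.
CONFIG = {
    "kmer_k": 3,                    # k-mer length for sequence similarity
    "autoencoder_dims": [564, 256, 64, 32, 64, 256, 564],
    "embedding_size": 20,           # selected by five-fold CV over {5, 10, ..., 60}
    "laplacian_reg_coef": 0.01,     # regularization coefficients for piRNA/disease terms
    "mlp_hidden_dims": [64, 32],
    "activation": "ReLU",
    "learning_rate": 0.001,
    "batch_size": 64,
    "dropout": 0.5,
}
```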

| Effectiveness of the improvements
MLRDFM is an improved deepFM model that has been successfully used in MDA prediction. 36 MRDPDA improves MLRDFM performance on PDA prediction through structural modification of the embedding layer and selection of a reliable negative set. To illustrate the effectiveness of these two improvements, we compare the proposed method with MLRDFM trained with random unlabeled samples and with MLRDFM trained with reliable negative samples. Five-fold cross-validation experiments are conducted on the benchmark dataset proposed in this paper. The AUC and AUPR of the three models are shown in Figure 5, from which we can see the following: (1) compared with the orange curves (MLRDFM with random unlabeled samples), the red curves (MLRDFM with reliable negative samples) achieve better performance, indicating that selecting a reliable negative set by learning from positive samples contributes to the improvement; (2) compared with simple superimposition and random initialization, the six Laplacian regularizations of the separate similarities and the initialization with Laplacian eigenmaps in the embedding layer further increase performance; (3) overall, MRDPDA yields the best identification performance, with an AUC of 0.9533 and an AUPR of 0.9483.

| Performance comparison with existing methods
Various PDA methods have been proposed, and the three PDA methods with publicly available code (iPiDA-GCN, 24 GAPDA-LGAT 23 and ETGPDA 26 ) are chosen for comparison. iPiDA-GCN is a PDA predictor based on graph convolutional networks (GCN) that learns hidden association patterns from different biological networks. GAPDA-LGAT is a PDA method based on graph attention networks that extends heterogeneous networks to replace dichotomous networks. ETGPDA is a GCN-based method with an embedding transformation, which converts piRNA and disease embeddings into the same space.
First, a five-fold cross-validation (5-CV) test on the benchmark dataset proposed in this paper is used and the computation time of each method is measured. Furthermore, an independent test dataset downloaded from Hou et al. 24 is used for performance comparison. Table 2 shows the comparison results.
From the 5-CV results, MRDPDA achieves the highest AUC (95.33%) and AUPR (94.83%). Additionally, our method takes 141,202 ms, the fastest of the four methods: it is slightly faster than ETGPDA (142,418 ms) and approximately twofold faster than GAPDA-LGAT (289,607 ms) and iPiDA-GCN (343,612 ms). This shows that our method is both more effective and faster than the other three methods.
In the independent test, the AUC of MRDPDA is still the highest and its AUPR is slightly lower than that of iPiDA-GCN. It is worth noting that GAPDA-LGAT is designed for a limited dataset and only predicts piRNAs associated with 13 diseases in the independent test dataset, so its prediction results are evaluated on these associations only.

| Case studies
To evaluate the ability of MRDPDA to predict associated piRNAs for new diseases, case studies of two important diseases, 'Alzheimer's disease' and 'Gastric cancer', are reported. As in iPiDA-sHN, 30 we removed the pairs related to Alzheimer's disease and re-trained the model with the remaining samples; the trained model was then used to identify the removed known associations of Alzheimer's disease. The top 10 piRNAs associated with these two diseases are listed in Table 3. Six of the identified piRNAs associated with 'Alzheimer's disease' and eight identified associations of 'Gastric cancer' have been validated by literature mining. For example, high-throughput studies suggest that the upregulation of piR-hsa-22564 and piR-hsa-14621 and the downregulation of piR-hsa-14962 may be correlated with an increased risk of Alzheimer's disease. 40 The dysregulation of piR-hsa-1282 associated with gastric cancer was validated by RT-PCR and in vivo animal model experiments. 41 These prediction results demonstrate the effectiveness of the proposed method. The unverified piRNA-disease relationships can be considered as candidates for further investigation.

| DISCUSSION AND FUTURE DIRECTIONS
Compared with the available MDA, CDA and LDA computational tools, PDA research is still in its initial phase. 45 Challenges remain in representation learning capacity and in the quality and quantity of the benchmark set. In this study, we have developed a novel method, MRDPDA, for predicting PDAs based on limited data from multiple sources. This method effectively integrates a deepFM model with regularizations from several separate Laplacians calculated from multiple yet limited datasets, instead of simply calculating an average similarity network to regularize the deepFM. Moreover, a unified objective function is proposed that combines the embedding losses of the similarities to ensure that the embedding is suitable for the prediction task. In addition, we have constructed a relatively balanced benchmark from piRPheno, with a deep autoencoder used to select reliable negative samples. The method achieves the best performance among recent computational methods in both the 5-CV and the independent test, and case studies further demonstrate its effectiveness.
Clearly, there is still some room for improvement in the future:

| Data perspective
The quality and quantity of current PDA datasets still need to be improved. Because of the limited number of diseases and the strong inclusion relations among them, we plan to design a rational merging method and build a PDA data management platform. In addition, we plan to train a unified model based on three or more datasets in a meta-learning setting.

| Algorithm perspective
Future PDA algorithm research can proceed in two directions.
On the one hand, we will try to develop new prediction models suitable for highly imbalanced PDA data. It is noteworthy that PDA databases contain highly abundant one-to-one associations: 88.62% and 86.13% of piRNAs are associated with only one disease in piRDisease and MNDR, respectively. On the other hand, we will consider various learning strategies to improve PDA prediction performance. For example, boosting models have achieved better performance in predicting ligand-receptor interactions [46][47][48] by combining multiple weak learners. Once sufficient PDA data are available, ensemble learning strategies will be applied.

TABLE 2 The performance comparison of the four methods. Bold values indicate the best performance.

FIGURE 3 The structure of the embedding layer with similarity-level data fusion. Three types of piRNA similarities (local alignment, k-mer and GIP kernel) and three types of disease similarities (semantic, symptom and GIP kernel) are taken into consideration in this study. One-hot encoding representations of a piRNA and a disease are used as inputs to the proposed model. Six Laplacian regularizations of the separate similarities are applied and Laplacian eigenmaps are used to initialize the weights of the embedding layer.

FIGURE 5 The comparison of AUC and AUPR of the three models.
TABLE 3 The top 10 piRNAs associated with Alzheimer's disease (AD) and Gastric cancer (GC) predicted by our study.