piRNA-disease association prediction based on multi-channel graph variational autoencoder

Piwi-interacting RNA (piRNA) is a type of non-coding small RNA that is highly expressed in mammalian testis. PiRNA has been implicated in various human diseases, but the experimental validation of piRNA-disease associations is costly and time-consuming. In this article, a novel computational method for predicting piRNA-disease associations using a multi-channel graph variational autoencoder (MC-GVAE) is proposed. This method integrates four types of similarity networks for piRNAs and diseases, which are derived from piRNA sequences, disease semantics, piRNA Gaussian Interaction Profile (GIP) kernel, and disease GIP kernel, respectively. These networks are modeled by a graph VAE framework, which can learn low-dimensional and informative feature representations for piRNAs and diseases. Then, a multi-channel method is used to fuse the feature representations from different networks. Finally, a three-layer neural network classifier is applied to predict the potential associations between piRNAs and diseases. The method was evaluated on a benchmark dataset containing 5,002 experimentally validated associations with 4,350 piRNAs and 21 diseases, constructed from the piRDisease v1.0 database. It achieved state-of-the-art performance, with an average AUC value of 0.9310 and an AUPR value of 0.9247 under five-fold cross-validation. This demonstrates the method’s effectiveness and superiority in piRNA-disease association prediction.


INTRODUCTION
The PIWI-interacting RNAs (piRNAs) are non-coding small RNAs with 24-32 nucleotides (nt) (Ghosh et al., 2022), which interact with P-element-induced wimpy testes (PIWI) genes among proteins of the Argonaute family (Yang, Cho & Zheng, 2020).Recent research evidence demonstrates that piRNA plays an important role in many biological processes such as maintaining germline DNA integrity (Wang et al., 2023), epigenetic regulation (Sun, Lee & Li, 2022), transcriptional silencing, and heterochromatin formation learning To rank (LTR) model, to identify the potential association between piRNA and disease.It uilitized gradient increasing decision tree to acquire feature representations of piRNAs and diseases, and then employed multiple models, i.e., Random Forest and support vector machine (SVM), to compute the association score between the unknown piRNA and each known disease.
In recent years, more and more deep learning computational methods have been introduced into this research field.Wei, Ding & Liu (2020) utilized a convolutional neural network (CNN) model to get features of piRNA and disease and adopt a two-step PUL method for association prediction.At the first step, an SVM is employed to get highquality negative samples, while these samples and known positive samples were fed into another SVM classifier to predict potentially disease-related piRNAs in the second step.Ali, Tayara & Chong (2022) utilized one-hot encoding method to encode piRNA sequences and disease semantics, and employed CNN to learn feature representation.In addition, some researchers also use graph embedding methods to obtain the features of the piRNAdisease association network.For example, Zheng et al. (2020b) proposed to utilize graph attention network (GAT) (Veličković et al., 2017) to learn the feature representation of piRNAs and diseases.Hou, Wei & Liu (2022) introduced a multiple-view learning framework by graph convolution network (GCN) (Kipf & Welling, 2016a).They constructed two GCN models, one was Asso-GCN, which learned feature representation of association information from heterogeneous nodes in the association network, and the other was Sim-GCN, which captured similarity features from homogeneous nodes of piRNA or disease.The experimental results showed that graph models helped capture non-linear association information over piRNA-disease association networks.
On the other hand, although existing graph-based methods achieve good performance, they still have limitations.Firstly, it is challenging to fuse multi-layer similarity information between piRNAs and between diseases.Current methods always combined similarity features with an average operation (Wei, Xu & Liu, 2021;Ji et al., 2021), which may be detrimental to those important association information of piRNAs and diseases.Secondly, current representation learning methods always employ sophisticated neural networks such as CNN and GAT, which requires rich labeled data, whereas piRNA data are sparse and imbalanced.In other words, such models are difficult to get underlying feature representation by leveraging on unknown association information in graphs.
In this article, a novel model named the Multi-Channel Graph Variational Autoencoder (MC-GVAE) is proposed to predict potential piRNA-disease associations.The motivation lies in two folds: (1) the graph VAE model learn feature representation in an unsupervised manner (Kipf & Welling, 2016b), which helps get underlying relationships in sparse piRNA-disease association networks; (2) the multi-channel method helps fuse multiple similarity information.Specifically, MC-GVAE first constructs four similarity networks by measuring sequence or semantic similarity and GIP kernel similarity of piRNA and disease (Van Laarhoven, Nabuurs & Marchiori, 2011).Then, the graphs are modeled via a graph VAE to obtain deep feature representations of piRNAs and diseases, and the homogeneous similarity networks are integrated by a multi-channel method, where parameters are shared over different graph VAEs for homogeneous similarity networks.Finally, the feature representations are concatenated and fed into a full-connected neural network to prediction.The contributions of this article are listed as follows: 1) This article adopts a graph variational autoencoder (graph VAE) model to learn the feature representation of piRNA and disease, trying to capture the data distribution features in the graph.
2) This article proposes a multi-channel method to fuse information in homogeneous similarity networks in order to better acquire the relations in piRNAs and diseases, respectively.
3) Computational experiments show that the state-of-the-art performance for piRNAdisease association prediction is achieved by this model.
In this study, we used the same data preprocessing steps as iPiDA-sHN (Wei, Ding & Liu, 2020).These steps include filtering redundant and non-human piRNAs, removing piRNAs that have no sequence information in the database.After applying these steps, we obtained a benchmark dataset that is identical to the datasets used by iPiDi-PUL (Wei, Xu & Liu, 2021), iPiDA-sHN (Wei, Ding & Liu, 2020), and DFL-PiDA (Ji et al., 2021), containing 5,002 experimentally validated associations with 4,350 piRNAs and 21 diseases.The benchmark dataset is a publicly accessible resource at http://bliulab.net/iPiDi-PUL/dataset/.

Computation method
This study proposes a MC-GVAE model to predict the potential associations between piRNAs and diseases.The model is a pipeline with four sequential steps: data preprocessing, similarity network construction, representation learning based on Multichannel graph VAE, and prediction.In the step of data preprocessing, the piRNA sequence expression and the adjacency matrix of piRNA-disease associations are extracted from the piRDisease V1.0 database.The adjacency matrix is as follows: where a i;j ¼ 1 if the i-th RNA is associated with the j-th disease, otherwise a i;j ¼ 0.Then, the k-mer algorithm is used to calculate the sequence features of each piRNA, and the Pearson product-moment correlation coefficient is used to obtain the sequence similarity matrix of piRNAs.Meanwhile, based on the adjacency matrix of piRNA-disease associations, the Gaussian kernel function is used to calculate the GIP similarity matrix of piRNAs and diseases.In addition, the directed acyclic graph (DAG) of each disease in the associated matrix is obtained from the disease descriptor vocabulary MeSH, and the semantic value of each disease is calculated, resulting in the semantic similarity matrix of diseases.In the step of similarity network construction, four different similarity subnetworks are constructed based on the four different similarity features obtained in the previous step.In the step of representation learning, four identical VGAE models are used to separately learn the feature representations of piRNAs and diseases from four different similarity subnetworks, and then concatenate the four different feature representations to better mine the associations between piRNAs and diseases.In the step of prediction, a three-layer fully connected neural network is used to classify and predict all sample pairs, obtaining the final prediction results.The flowchart of MC-GVAE's computation process is shown in Fig. 1.The following sections will give a detailed introduction to each step's specific implementation.

Similarity network construction
To construct the similarity network, we computed the similarity features between piRNAs and diseases from four different perspectives, namely piRNA sequence similarity, disease semantic similarity, piRNA GIP kernel similarity, and disease GIP kernel similarity.These similarity features can reflect the different correlation patterns between piRNAs and Step 2 involves constructing four similarity networks based on piRNA sequence similarity, piRNA GIP kernel similarity, disease semantic similarity, and disease GIP kernel similarity.
Step 3 introduces the Multi-Channel Graph Variational Autoencoder, where graph VAE learns feature representations in an unsupervised manner and integrates various similarity networks using a multi-channel approach.In Step 4, the feature representations are put into a three-layer neural network for predicting piRNA-disease associations.
Full-size  DOI: 10.7717/peerj-cs.2216/fig- 1 diseases, which can help our model to better mine the associations between piRNAs and diseases.
piRNA sequence similarity network piRNA sequence similarity network is an undirected graph, where each node represents a piRNA sequence and each edge represents the relation of two nodes.Generally, the relation can be measured by the similarity of features of two sequences.One sophisticated method to obtain the features of piRNA sequences is K-mer (Li et al., 2016), an algorithm to extract sequence-derived features using the overlapping shifting window.In this study, K-mer is utilized to divide RNA nucleotide sequences into subsequences of length ð and count their occurrence frequencies (Kirk et al., 2018), to obtain the sequence feature vectors of RNA.Following the algorithm of Ji et al. (2021), the value of K was set to 3 in Kmer to obtain a 64-dimensional vector.
Pearson product-moment correlation coefficient is employed in this study to calculate the similarity of piRNA sequence features.The calculation method is as follows: where F P i and F P j denotes the sequence feature vector of i-th and j-th piRNA, respectively.Cov F P i ; F P j À Á denotes covariance between F P i and F P j .r F P i and r F P j denotes the standard deviation between F P i and F P j .

Disease semantic similarity network
In this study, the disease semantic similarity is computed via the disease database MeSH (Lipscomb, 2000).The knowledge of disease in such database can be organized as a directed acyclic graph G ¼ V; E ð Þ, where V is the set of nodes represent diseases and E the set of edges represent the relations of diseases.The semantic contribution value D d t ð Þ of the disease t to the disease d is measured as follows: where a denotes the semantic contribution factor, with a value of 0.5 (Wang et al., 2010).t denotes a disease node in the node set V of the directed acyclic graph, and t' denotes a child node of t.For example, in the directed acyclic graph of Brain Neoplasms, Central Nervous System Disease is a subclass of Brain Diseases.Therefore, Central Nervous System Disease represents t' and Brain Diseases represents t in this case.
The disease semantic similarity D d i ; d j À Á is based on an assumption that the more DAGs shared by disease pairs, the larger their similarity (Wang et al., 2010).The similarity is defined as follows: where D d i ð Þ and D d j À Á denotes the semantic value of disease d i and d j , respectively, and can be obtained as follows:

GIP kernel similarity networks
Experimental evidence shows that similar piRNAs usually show similar correlation patterns with diseases (Chen et al., 2016).On the other hand, such patterns always mean nonlinear relationship between piRNAs and diseases.Therefore, The GIP kernel similarity was utilized to measure the association of piRNAs and diseases, based on the topology of the known association knowledge of piRNAs and diseases and the interaction profile corresponding to each piRNA and disease (Van Laarhoven, Nabuurs & Marchiori, 2011).
In other words, piRNAs having higher GIP kernel similarities with those disease-related ones could be the positive candidates in piRNA-disease association prediction.The GIP kernel similarity is defined as follows: where A p i ð Þ denotes the association between piRNA p i and all diseases in the sample.A d i ð Þ denotes the association between disease d i and all piRNA in the sample.N p and N d denotes the number of piRNA and disease in the sample, respectively.

Representation learning based on multi-channel graph VAE
A graph VAE is a kind of variational autoencoder applied to graphs composed of GCN and VAE and is an unsupervised learning framework.Therefore, the graph VAE not only learns graph structure features but also generates data distribution features of piRNA and disease.In the section that follows, GCN and graph VAE will be briefly introduced.

GCN
Graph convolution network (GCN) is a powerful representation learning method in graph structure data.Here a GCN model was employed to extract four feature vectors of piRNA and disease from four similarity networks, including piRNA sequence, disease semantics, piRNA GIP kernel, and disease GIP kernel similarity networks.Firstly, an adjacency matrix was constructed with the weight based on piRNA sequence similarity.Then, the piRNA-disease association matrix A 2 R 4,350 Â 21 is set as graph features, in which each row of the piRNA-disease association matrix is initialized as each piRNA feature H. Finally, a two layers GCN is utilized to obtain a low-dimensional feature matrix.H i and W i denotes the node feature and the parameter matrix of the i-th layer.Therefore, the node features vectors H lþ1 in each layer is represented as follows: where the ReLu() function serves as an activation mechanism to introduce non-linearity into the model.

Graph VAE
The graph VAE model is an end-to-end framework, including an encoder and a decoder.
In the encoder stage, a GCN is employed to learn the feature representation vector.In the decoder one, the similarity matrix was reconstructed by calculating the inner product of the vector.The encoder contains two layers.The first layer is to obtain the low dimensional vector of each node, while the second layer generates its distribution feature.The two layers are formalized as follows: where GCN l ðÞ and GCN r ðÞ are two GCNs and share the weight, and X is the feature matrix of nodes.The latent representation Z is defined as follows: where e $ N 0; 1 ð Þ. Optimal parameters were obtained by reconstructing the piRNA similarity matrix through the calculation of the inner product between the feature embedding F and its transpose F T .During the learning time, consideration was given to two loss functions, binary cross-entropy and KL-divergence, to minimize the distance between the target and the reconstructed matrix.The loss function L is denoted as follows:

Multi-channel representation learning
Essentially, the piRNA sequence similarity network and the piRNA GIP similarity network demonstrate the relevance of piRNAs with different perspective.That is the same situation between the disease semantic similarity network and the disease GIP similarity network.Therefore, A tensor can be used to represent such similarity networks, each slice denoting one kind of relevance meaning.A multi-channel feature representation model can also be built to learn from such tensor network (Chen et al., 2022).The aim of the multi-channel method is to model information from multiple aspects, which may lead to sufficient feature representation learning.Average pooling, parameter sharing and cross-attention are always leveraged for multi-channel modeling.In this study, a multi-channel learning model is built, where the module of each channel learns the representation of piRNA by leveraging on one similarity network respectively.

Predicting piRNA-disease association based on neural network
Firstly, the representation of piRNA-disease association pairs is obtained by concatenating four features Z PS , Z PG , Z DS , and Z dG , where Z PS denotes the distribution representation learned from the piRNA sequence similarity network, Z DS from the disease semantic similarity network, Z PG from the piRNA GIP kernel similarity network, and Z dG from the disease GIP kernel similarity network: Then, the final representation F was fed into a three layers full connection neural network classifier, and get the prediction score ŷi as follows: where sigmoidðÞ is an activation function to transform ŷi to be a number between 0 and 1.In this study, the piRNA-disease association pairs with ŷi > 0:5 as 1 (positive sample).The otherwise labeled as 0 (negative sample).However, unlike graph VAE, only the binary cross-entropy was utilized as loss function for prediction.

Experimental setup and hyperparameter configuration
In the experiments, a five-fold cross-validation is conducted in the training stage in order to make this model more reliable.More specifically, samples in the training data are divided equally into five subsets, four of which are utilized for training and one of which for testing.Note that each subset has one chance to be the test data.The evaluation metrics that are mostly employed in association prediction tasks, including accuracy, recall, precision, F-1 score, and Matthews correlation coefficient (MCC).
To visually illustrate the performance, this study employs the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve.The ROC curve reflects the relationship between the true positive rate and the false positive rate of the model at different thresholds, while the PR curve reflects the relationship between the precision and recall of the model at different thresholds.In addition, the Area Under Curve (AUC) and the area under PR curve (AUPR) were also employed as evaluation metrics, which are frequently used to measure the performance of classification or prediction tasks.AUC and AUPR are obtained by calculating the area under the ROC curve and the PR curve, respectively, and they can comprehensively reflect the performance of the model at different thresholds.
The experiment is based on the Python deep learning framework PyTorh v1.11.0, and the GCN model is performed by adopting Deep Graph Library (DGL) v0.8.1, a highlyperformant package for graph neural networks (Wang et al., 2019b).The computational method of this article was performed on the Ubuntu 20.04 with two 1080Ti GPUs.Moreover, several important hyperparameters settings of this study were presented.
According to the previous work (Kipf & Welling, 2016b), GCN was also adopted as an encoder, and the experiment demonstrates that if the number of hidden layers is over 2, the prediction performance of the model will be declined.The dimension of the vector in hidden layer1 and hidden layer2 were set as 48 and 16, and the learning rate is determined as 0.001.The training epoch is detected as 50, and the Adam optimizer and loss function with cross entropy and KL divergence were employed to update the parameters of MC-GVAE in each epoch.The following Table 1.provides a detailed overview of the model architecture and hyperparameter settings for GVAE and the classifier.

Predicting piRNA-disease association based on neural network
A benchmark dataset (Wei, Xu & Liu, 2021) based on the piRDisease database v1.0 was made use of.The benchmark dataset included 10,004 piRNA-disease association pairs, in which the positive sample pairs contained 5,002 associations with 4,350 piRNAs and 21 diseases.In addition, 5,002 negative sample pairs were randomly selected from unknown piRNA-disease association sample pairs.Finally, the detail of the results were shown in Table 2, and the results of AUC and AUPR are shown in Figs. 2 and 3, respectively.Table 2 shows the average performance of model MC-GVAE on the benchmark dataset based on five-fold cross-validation.
Table 2 shows that MC-GVAE achieved high levels of performance on all metrics, with average values of 0.9310, 0.9247, 0.9020, 0.9390, 0.8601, 0.8977, and 0.8070, respectively.This indicates that MC-GVAE can effectively learn the complex features of piRNA and disease, and use the graph structure information to improve the prediction accuracy.From the description of Fig. 2, the ROC curve is close to the upper left corner, the P-R curve is close to the upper right corner, and the curves of each fold are very close, which shows that MC-GVAE has excellent accuracy and robustness in distinguishing positive and negative samples and precisely retrieving positive samples.

The effect of GCN encoding module
In order to explore the impact of different encoders, we make a comparative experiment, where two encoders are built: one is based on a GCN model, and the other is implemented with a linear layer neural network.The decoder and the training objective are kept It can be seen from the Figs. 4 and 5 that the AUC and AUPR scores of MC-GVAE are higher than those of VAE, respectively, under the same benchmark dataset based on 5-fold cross validation, which also means that the prediction performance of MC-GVAE is significantly better than that of MC-GVAE.It indicates that the GCN encoder can effectively obtain the structure information of four views graph, and have an important impact on MC-GVAE model.

Performance comparison of graph representation learning
To investigate the performance of different graph neural networks in modeling the similarity knowledge with four similarity networks, three models, GCN, GAT and Graph Transformer model-Graphormer (Ying et al., 2021), were employed.A two-layer GCN model was built to obtain low-dimensional feature representations of each view, and the numbers of the hidden units in the first and the second layer were 48 and 16, respectively.A single-layer GAT with an eight-head self-attention mechanism was also applied to learn the feature representations of four different views.Furthermore, the Graphormer model was also applied to learn feature representations, utilizing the GraphormerLayer provided in the DGL library.The hidden layer was designed with 48 units, num_heads set to 3, and a Dropout rate of 0.1 for the feature layer.
The number of parameters in all models is the same in the hidden layer, and the dimension of the input and the output vector were set to 48 and 16, respectively.The learning rate of the model is 0.001 during the training process.
Using a five-fold cross-validation method, four GNN models were trained on the benchmark dataset and their prediction performance was compared using common evaluation indicators, including AUC, AUPR, precision, recall, F1-scores, and MCC, for the purpose of comparing the performance of representation learning.Finally, the average of these indicators was listed as experimental results in Table 3.
Experimental results indicate that the graph variational autoencoder (GVAE) model surpasses other models across all evaluation metrics, particularly in terms of area under the curve (AUC), area under the precision-recall curve (AUPR), precision, recall, F1 score, and MCC.These findings affirm the efficacy and superiority of the graph VAE in the task of predicting piRNA-disease associations.Although the Graphormer, as a Graph Transformer model, demonstrates potential abilities in processing graph-structured data, it remains inferior to the GVAE models, which is built for capturing complex relations over graphs.Moreover, the generative learning approach of the graph VAE offers enhanced interpretability in understanding the associations between piRNAs and diseases.Consequently, this research underscores the significance of graph VAE in such tasks and provides new directions for future studies.Ablation study The feature representation of two different views was concatenated as the final classification feature of piRNA-disease association pairs in this model.In order to explore the contribution of the features of two views to the overall performance of MC-GVAE, an ablation study was performed under the benchmark dataset on five-fold cross validation.Some variants of MC-GVAE model were as follows: . MC-GVAE without graph VAE: the graph VAE is removed, and four similarity features are directly fed into a three-layer full-connected neural network to predict the piRNAdisease association.
. MC-GVAE without NN: it only uses the inner product to reconstruct two prediction matrices based on the features of two views and gets the final prediction matrix by averaging two re-construction matrices.
. MC-GVAE without association: only two similarity calculation data, piRNA sequence and disease semantics, are employed to build the input data to learn node features of piRNA and disease.
. MC-GVAE without attribution: only the GIP similarity data for piRNA and disease are employed as the input data to learn node features of piRNA and disease.
Figure 6 shows the results of the models mentioned above.It can be seen that the original model still achieves the best performance, while MC-GVAE without graph VAE has the lowest AUC and AUPR performance.It can also be seen that, the performance of MC-GVAE without association view and MC-GVAE without attribution view based on different views by using the same computing framework is similar, and both achieve remarkable results.In addition, the best performance based on two different view features was achieved by this model MC-GVAE.
To quantify the impact of each component, statistical significance tests were conducted for the four variants and MC-GVAE.As demonstrated by the T-test results in Table 4, the MC-GVAE without graph VAE component and the MC-GVAE without NN component resulted in a significant decrease in performance, with statistical values of 61.905 and 68.509, respectively, and p-values of 0.000, indicating the importance of these two components to the model's performance.In contrast, the variants without association and without attribution did not show statistically significant performance decreases, with statistical values of 0.849 and −0.382, respectively, and p-values of 0.396 and 0.703.These results underscore the significance of the graph VAE and n three-layer fullconnected neural network components in our model.The original MC-GVAE model achieved optimal performance in integrating feature representations from two different views, demonstrating that our proposed multi-view computational framework can significantly enhance the performance of piRNA-disease association prediction tasks.
. iPiDi-PUL is a widely used and cited method for piRNA-disease association prediction, which also adopts four types of features like our method.Specifically, it adopts an ensemble learning method, which employs multiple random forests with different depths to predict piRNA-disease association.It extracts key features by principal component analysis (PCA) (Abdi & Williams, 2010), and gets various prediction scores via multiple random forests classifiers.Finally, averaging all classifier scores is used by the model.Different from our study, it does not use a graph-based learning model to capture the complex relationships between piRNAs and diseases, but relies on PCA to extract features.Choosing iPiDi-PUL as a comparative method may help to study the performance of graph-based learning models against models of reduction dimensions in representation learning.
. iPiDA-sHN is a state-of-the-art method for piRNA-disease association prediction.It is a two-step PUL model.It trains the initial classifier with positive samples and randomly selected negative samples, and then trains another classifier using positive samples and negative samples acquired in the previous step.Support Vector Machine (SVM) is leveraged to build the classifier.Different from our method, it does not use a graph variational autoencoder, but a CNN, to learn the latent representations of piRNAs and diseases.Choosing iPiDA-sHN as a comparative method may help to study the performance of graph variational autoencoder against other graph-based learning models.
. DFL-PiDA is constructed from a convolutional denoising autoencoder neural network combined with an extreme learning machine (ELM) to predict disease-related piRNAs.
It utilizes a single hidden layer feedforward neural network to build the ELM.Different from our method, it use a convolutional denoising autoencoder to learn feature representations.Choosing DFL-PiDA as a comparative method may help to study the performance of graph variational autoencoder against other autoencoder models.
Our MC-GVAE method was performed on the Ubuntu 20.04 with two 1080Ti GPUs environment.The results are shown in Table 5.We can see that, the MC-GVAE model outperforms other methods in terms of both average AUC and AUPR metrics, demonstrating that it can effectively predicts piRNA-disease associations and has higher prediction accuracy and stability.Meanwhile, there are some possible reasons of low performance of the three models: (1) iPiDi-PUL employs PCA to extract key features, which may overlook some relevant information; (2) iPiDA-sHN adopts SVM to construct the classifier, which may be influenced by the sample imbalance; (3) DFL-PiDA merely fuses four similarity matrices, which fails to learn the rich feature representations of piRNA and disease.In addition, iPiDA-sHN and DFL-PiDA outperform iPiDi-PUL, which show that neural networks are effective than traditional models.

Case study
In this part, the prediction performance of MC-GVAE was further evaluated, three diseases, renal cell carcinoma, Alzheimer's disease, and cardiovascular diseases with the largest number of associations in the association matrix were selected as case studies.Specifically, the association information of three diseases-related was firstly removed from the benchmark dataset, and the data remained was employed as the positive samples for training.Then, three diseases-related associations were used as positive samples in each prediction process, and the negative samples from unknown associations of three diseases with other piRNAs.Finally, the prediction scores between three diseases and other known piRNAs were obtained, respectively, and the top 20 piRNAs-related with the highest prediction score were selected for analysis.
Tables 6-8 shows the top 20 renal cell carcinoma-related, the top 20 predicted Alzheimer's diseases-related, and the top 20 predicted cardio-vascular diseases-related piRNAs which achieves the high scores of disease-associated prediction, respectively.As

CONCLUSIONS
Being a significant biomarker for the diagnosis of diseases, piRNA has attracted much attention of academic community.Compared to traditional biological experiments, computational methods are more effective to find potential associations between piRNAs and diseases.In this article, a novel computational model named MC-GVAE was proposed to predict disease-related piRNAs based on a multi-channel graph VAE.An end-to-end computational framework was adopted, and four similarity networks were employed in order to learn latent association knowledge.For homogeneous networks, a multi-channel method was proposed to integrate them for learning.The experimental results showed that the proposed model was effective to predict piRNA-disease association.
Overall, the advantage of MC-GVAE lies in two folds: (1) it use a graph variational autoencoder, which helps to learn feature representations over graphs; (2) it use a multichannel method, which helps to capture semantic relations of multiple aspects.On the other hand, there are also potential limitations of MC-GVAE: (1) it depends on rich and high-quality data, which are not always available; (2) it is based on the assumption of static piRNA-disease associations like other GNN-based models, whereas the knowledge could be updated for new biological findings; (3) it lacks the ability of casual inference, which is useful for some applications.
There are some directions for the further improvement of this model.Firstly, the similarity knowledge derived from computational methods, such as sequence similarity, disease semantic similarity and GIP kernel similarity does not always accurate, and the incorrect knowledge maybe noisy to the model.Therefore, more accurate knowledge from biological experiment evidence is needed, such as position-specific information of piRNA sequence thermodynamic and physicochemical properties of piRNAs etc.Secondly, there are few ground-truth negative samples in the current knowledge bases of piRNA, and pseudo negative samples via random selection may lead to poor robustness of the model.In the furure work, more reasonable sampling methods, such as positive unlabeled learning (PUL), or non-sampling learning methods, could be adopted to make a more robust model.Thirdly, considering that our model is independent from concrete RNA sequences and diseases, it can be easily transferred to other prediction tasks of RNA-disease association.To this end, more experiements with different ncRNA databases will be made to validated the performance of the proposed model in the future work.

Figure 1
Figure1The flowchart of MC-GVAE.The diagram illustrates the core processes of the proposed MC-GVAE model.In the diagram, blue dots represent piRNA entities, and green diamonds represent disease entities.In Step 1, data is pre-processed from the piRDiseaseV1.0database to form an association Matrix.Step 2 involves constructing four similarity networks based on piRNA sequence similarity, piRNA GIP kernel similarity, disease semantic similarity, and disease GIP kernel similarity.Step 3 introduces the Multi-Channel Graph Variational Autoencoder, where graph VAE learns feature representations in an unsupervised manner and integrates various similarity networks using a multi-channel approach.In Step 4, the feature representations are put into a three-layer neural network for predicting piRNA-disease associations.Full-size  DOI: 10.7717/peerj-cs.2216/fig-1

Table 2
The performance of MC-GVAE based on five-fold CV.

Table 3
Performance of graph representation learning models.

Table 4
Significance testing results for models of ablation test.

Table 5
The performance comaprison of MC-GVAE and other methods.

Table 7
Top 20 predicted Alzheimer's diseases-related piRNAs.Sun et al. (2024), PeerJ Comput.Sci., DOI 10.7717/peerj-cs.2216shown in Table6, only one association is still unconfirmed in renal cell carcinoma.In Table7, eight piRNA candidates of the top 20 in Alzheimer's disease are verified by piRDisease v1.0 database, and the other five candidates' piRNAs are confirmed by published literature.However, seven piRNA candidates with the possible association are unconfirmed.In Table8, all piRNA candidates in top 20 are predicted for cardio-vascular diseases.In other words, those piRNA-disease associations predicted only according to the known knowledge by our model are verified by biological experiments.It indicates that iPiDA-VAGE is effective in piRNA-disease association prediction.