Projection-Based Neighborhood Non-Negative Matrix Factorization for lncRNA-Protein Interaction Prediction

Many long non-coding RNAs (lncRNAs) exert their functions by interacting with the corresponding RNA-binding proteins, so identifying lncRNA-protein interactions is important for understanding lncRNA function. Because experimental methods are time-consuming and laborious, a growing number of computational models have been proposed to predict lncRNA-protein interactions. However, few models effectively combine the biological network topology of lncRNAs (proteins) with their sequence structure features, and most models cannot make effective predictions for new proteins (lncRNAs) that have no known interaction with any lncRNA (protein). In this study, we propose a projection-based neighborhood non-negative matrix decomposition model (PMKDN) to predict potential lncRNA-protein interactions by integrating multiple biological features of lncRNAs (proteins). First, from lncRNA (protein) sequences and lncRNA expression profile data, we extracted multiple features of lncRNAs (proteins). Second, based on protein Gene Ontology (GO) annotation, lncRNA sequences, lncRNA (protein) feature information, and the modified lncRNA-protein interaction network, we calculated multiple similarities of lncRNAs (proteins) and fused them to obtain a more accurate lncRNA (protein) similarity network. Finally, combining the similarity and feature information of lncRNAs (proteins) with the modified interaction network, we proposed a projection-based neighborhood non-negative matrix decomposition algorithm to predict potential lncRNA-protein interactions. On two benchmark datasets, PMKDN outperformed other state-of-the-art methods in predicting new lncRNA-protein interactions, new lncRNAs, and new proteins. A case study further indicates that PMKDN can be used as an effective tool for lncRNA-protein interaction prediction.


INTRODUCTION
RNA represents the direct output of the genetic information encoded in the genome, and a large part of the regulatory capacity of cells is devoted to the synthesis, processing, transport, modification, and translation of RNA. With the continuous improvement of RNA analysis, cell type isolation, and culture technology, our understanding of the many biological functions of RNA has deepened (Djebali et al., 2012). Studies have shown that up to 85% of the human genome is transcribed, but the proportion of transcripts that encode proteins is extremely low, suggesting that most RNA transcripts are non-coding (Fang and Fullwood, 2016). A large part of the human genome functions through non-coding RNA (ncRNA) (Mattick, 2005). Transcribed ncRNA has chromosome modification functions similar to those of protein-coding genes. At multiple sites of the human genome, the deletion of an ncRNA leads to a decline in the specificity of adjacent protein-coding genes (Ørom et al., 2010). Long non-coding RNA (lncRNA) is an important type of ncRNA, consisting of transcripts longer than 200 nucleotides with no obvious protein-coding function (Volders et al., 2013). With the development of bioinformatics, the important role of lncRNA in various biological processes is increasingly recognized; lncRNA is involved in the regulation of gene expression and function in multiple networks, affects the formation of nuclear structure domains and the transcriptional state of whole chromosomes, and participates in the interaction of two different chromosomal regions through mechanisms that directly regulate chromosome structure (Batista and Chang, 2013). In addition, a growing number of studies have shown that mutations and dysregulation of lncRNA are associated with different human diseases.
Changes in the primary structure, secondary structure, or expression level of lncRNA, or in its cognate binding proteins, can lead to a variety of diseases ranging from neuropathy to cancer (Wapinski and Chang, 2011). More and more lncRNAs have been discovered, but their functions and mechanisms are still poorly understood. In general, almost all lncRNA functions are carried out through interactions with the corresponding RNA-binding proteins, and their functions and mechanisms depend on their interactions with various protein complexes in cells (Khalil and Rinn, 2011). Therefore, determining the potential interactions between lncRNAs and proteins is important for studying lncRNA function. Detecting lncRNA-protein interactions on a large scale by experimental means is expensive and time-consuming, so a large number of computational models have been proposed based on existing experimental data (Suresh et al., 2015).
Based on the physicochemical properties of peptide chains and nucleotide chains, Bellucci et al. (2011) proposed catRAPID, which combined secondary structure, hydrogen bonding, and van der Waals forces to predict the interactions between lncRNAs and proteins. Subsequently, Lu et al. (2013) proposed the lncPro model, which used secondary structure, hydrogen bonding, van der Waals forces, and other features to encode nucleotide and amino acid sequences into feature vectors, and calculated the interaction scores between lncRNAs and proteins by Fisher's linear discriminant method. Suresh et al. (2015) proposed RPI-Pred, which combined the secondary structural features of RNA sequences with the three-dimensional structural features of proteins and used a support vector machine (SVM) model for prediction. Xiao et al. (2017) proposed the PLPIHS model, which constructed a heterogeneous network from the lncRNA-lncRNA similarity network, the lncRNA-protein interaction network, and the protein-protein interaction network, and then trained an SVM classifier on HeteSim scores to predict lncRNA-protein interactions. Subsequently, Deng et al. (2018) improved on PLPIHS and proposed the PLIPCOM model, which obtained low-dimensional features of lncRNAs (proteins) by random walk with restart and singular value decomposition on heterogeneous networks, and then used a gradient boosting tree algorithm to predict interactions by combining the HeteSim scores and the low-dimensional features. Both algorithms achieved high AUC values, but they used the known lncRNA-protein interaction information to construct the heterogeneous network, which led to the reuse of known interactions. Recently, Hu et al. (2018) proposed an ensemble strategy to predict potential lncRNA-protein interactions (HLPI-Ensemble), which generated negative samples of lncRNA-protein interactions by random pairing and integrated three mainstream machine learning algorithms, support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGB), to predict interaction scores. This ensemble learning strategy can not only improve the prediction performance of the model but also prevent over-fitting to some extent. Pan and Shen (2017) used a hybrid convolutional neural network and deep belief network to predict RNA-protein binding sites on RNAs; the method used multimodal deep learning to fuse shared features from different data sources and identified interpretable binding motifs. These supervised learning methods have achieved some success in predicting lncRNA-protein interactions, but problems remain. First, the key to supervised learning is to construct positive and negative samples that are as balanced as possible, but at present most databases only provide lncRNA-protein interaction information, and the construction of negative samples is still an open problem. Second, lncRNA-protein interaction prediction is a severely imbalanced classification problem: the known interactions account for less than 1% of all lncRNA-protein pairs, yet many supervised models choose equal numbers of positive and negative samples for the training and test sets, which artificially narrows the prediction range of the model. Finally, both lncRNAs and proteins exist within a whole biological network, and the rational use of lncRNA (protein) network topology can greatly improve the predictive performance of a model.
Recently, many network-based models have been proposed for predicting lncRNA-protein interactions. Li et al. (2015) proposed a heterogeneous network model that constructed an lncRNA similarity network from lncRNA expression profiles and a protein similarity network from weighted protein-protein interactions (PPIs), combined them with the known lncRNA-protein interaction network, and used a random walk with restart model to make predictions. Ge et al. (2016) proposed a bipartite network inference algorithm (LPBNI) that uses only the known lncRNA-protein interactions to infer potential lncRNA-associated proteins. Zheng et al. (2017) predicted potential lncRNA-protein interactions by fusing multiple sources of network information. Specifically, based on protein sequences, protein domains, protein GO terms, and the STRING dataset, the method constructed four protein similarity networks, integrated them with the similarity network fusion (SNF) algorithm, and then used a random walk algorithm to calculate the scores. Recently, Zhang et al. (2018a) proposed a linear neighborhood propagation algorithm (LPLNP) to predict potential lncRNA-protein interactions. Specifically, based on the various extracted features, LPLNP calculated the linear neighborhood similarity of the corresponding lncRNAs (proteins), used a label propagation algorithm to calculate the interaction scores, and took the linear combination of all prediction scores as the final result. Subsequently, Zhang et al. (2018b) proposed a sequence-based feature projection ensemble learning algorithm (SFPEL-LPI). Specifically, based on lncRNA sequences, protein sequences, and known lncRNA-protein interactions, SFPEL-LPI extracted a variety of lncRNA (protein) features and similarity information and used a feature projection ensemble learning framework to predict lncRNA-protein interaction scores.
Compared to LPLNP, SFPEL-LPI has fewer parameters and higher precision and can predict new lncRNAs and new proteins. Most network-based models build similarity networks by mining lncRNA (protein) related information and use the network topology together with known lncRNA-protein interaction information for prediction, which has the advantage of not requiring negative sample construction. In addition, this type of method is global: the prediction results give a ranking of all unknown interaction pairs, which makes it convenient to study the higher-ranking unknown interactions. However, apart from SFPEL-LPI, network-based methods focus only on the construction of similarity networks and ignore important feature information. Although SFPEL-LPI makes use of both feature information and similarity information, it treats the lncRNA network and the protein network separately for prediction, which limits the improvement of model performance.
Based on this, this study proposes a projection-based neighborhood non-negative matrix factorization (PMDKN) to predict potential lncRNA-protein interactions in heterogeneous omics data, which is also applicable to the prediction of new lncRNAs and new proteins. First, based on lncRNA sequences, lncRNA expression profiles, and protein sequences, we extracted a variety of features of lncRNAs and proteins. Second, based on multiple features of lncRNAs and proteins, lncRNA sequences, gene ontology annotation of the proteins, and the modified lncRNA-protein interaction network, we calculated multiple similarities of lncRNAs and proteins and fused them to obtain a more accurate lncRNA (protein) similarity network. Finally, PMDKN uses these features and the fused similarity networks to predict lncRNA-protein interaction scores. The results indicate that PMDKN exhibits higher predictive performance than other state-of-the-art methods for the prediction of lncRNA-protein interactions, new lncRNAs, and new proteins. A case study further demonstrates that PMDKN can be an effective tool for lncRNA-protein interaction prediction.

Dataset
The noncoding RNAs and protein related biomacromolecules interaction database (NPInter) (Wu et al., 2006) provides a large number of experimentally verified interactions between non-coding RNAs and other biomolecules. NPInter has been updated to version 3.0, which includes more lncRNA-protein interactions than the previous version (Hao et al., 2016). To evaluate the predictive performance of the algorithm, we performed cross validation using the interaction data provided in NPInter v2.0 as the benchmark dataset and used NPInter v3.0 to test the final prediction ability of the model. Li et al. (2015) extracted interactions from NPInter v2.0 by limiting the organism to 'Homo sapiens' and the ncRNA type to 'NONCODE', obtaining 4,870 interactions between 1,113 lncRNAs and 96 proteins. On this basis, Zhang et al. (2018a) deleted lncRNAs and proteins with no sequence information or with only one interaction, resulting in 4,158 interactions between 990 lncRNAs and 27 proteins. Various features and similarity information were then extracted from the sequence data of the lncRNAs and proteins. To facilitate experimental comparison, we used the dataset provided by Zhang et al. (2018a) as benchmark DATASET 1 for verification.
In benchmark DATASET 1, all lncRNAs (proteins) interact with at least two proteins (lncRNAs), so the lncRNA-protein interactions are relatively dense. To investigate the predictive ability of the algorithm for sparse interactions, lncRNAs without sequence information were deleted from the data provided by Li et al. (2015), and a total of 4,679 interactions between 1,068 lncRNAs and 90 proteins were finally obtained. Sequence information of the lncRNAs and their expression profiles in 24 human tissues and cells were extracted from the integrated knowledge database of noncoding RNAs (NONCODE) (Liu, 2004; Xie et al., 2013; Fang et al., 2018), and sequence information and gene ontology annotation of the proteins were extracted from the protein-protein interaction networks dataset (STRING 9.1) (Franceschini et al., 2012). Based on this information, multiple features and similarities of lncRNAs (proteins) were calculated to construct benchmark DATASET 2. Let $L = \{l_1, l_2, \ldots, l_{N_l}\}$ and $P = \{p_1, p_2, \ldots, p_{N_p}\}$ represent the sets of $N_l$ lncRNAs and $N_p$ proteins obtained, respectively. In this section, we introduce the three features of lncRNA, the two features of protein, and the similarities of lncRNAs and of proteins.

Features of lncRNA
We extracted three features of lncRNA, namely the expression profile feature and two sequence-based features: pseudo k-tuple nucleotide composition (PseKNC) and parallel correlation pseudo dinucleotide composition (PCPseDNC) (Guo et al., 2014). For lncRNA, k-mers (nucleotide subsequences of length k) are generally used to describe the short-range order information of a sequence, while the global or long-range information of the sequence is described by the physicochemical properties of its nucleotides. PseKNC and PCPseDNC describe an lncRNA by integrating the short-range and long-range features of its sequence. We calculated the PseKNC and PCPseDNC of each lncRNA using the Python package "repDNA".

Features of Protein
The hydrophilicity and hydrophobicity of proteins play an important role in protein folding, in interactions with the environment and with other molecules, and in catalysis. Combining the normalized frequencies of the 20 amino acids in the protein sequence with the distribution pattern of hydrophilicity and hydrophobicity along the protein chain, we calculated two protein features: the amphiphilic pseudo amino acid composition (APseAAC) (Chou, 2001; Chou, 2005) and the conjoint triad descriptor (CTriad). CTriad was proposed by Shen et al. (2007) to predict protein-protein interactions. First, to reduce the size of the feature space, the 20 amino acids were grouped into 7 classes according to the dipole and volume of their side chains. Second, the amino acid classes were used to distinguish every conjoint triad (a combination of three consecutive amino acids), and the frequency $f(v_i)$, $i = 1, 2, \ldots, 7^3$, of each conjoint triad in the amino acid sequence was counted, where $v_i$ represents the $i$-th conjoint triad. Finally, normalizing $f(v_i)$ using the minimum and maximum frequencies over all conjoint triads gives the conjoint triad descriptor $CTriad(P) = [q_1, q_2, \ldots, q_{343}]$ of protein $P$. It should be noted that, to prevent overfitting caused by high-dimensional lncRNA (protein) features, we applied principal component analysis (PCA) to reduce the dimensionality of the high-dimensional features.
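The conjoint triad computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the seven amino acid classes are those given by Shen et al. (2007) rather than by this text, the min-max scaling is an assumption (only the use of the minimum and maximum frequencies is stated here), and the function name `ctriad` is hypothetical.

```python
from itertools import product

# Seven amino acid classes (side-chain dipole and volume), as listed by
# Shen et al. (2007); they are not spelled out in the present text.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# All 7^3 = 343 conjoint triads in a fixed order.
TRIADS = list(product(range(7), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def ctriad(sequence):
    """Return a 343-dimensional conjoint triad descriptor of a protein."""
    # Assumption: residues outside the 20 standard amino acids are dropped.
    classes = [AA_TO_CLASS[aa] for aa in sequence.upper() if aa in AA_TO_CLASS]
    counts = [0] * len(TRIADS)
    for i in range(len(classes) - 2):
        counts[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # Assumed convention: a constant count vector maps to all zeros.
        return [0.0] * len(TRIADS)
    # Assumed min-max scaling of the triad frequencies.
    return [(c - lo) / (hi - lo) for c in counts]
```

Calling `ctriad` on a protein sequence string returns a list of 343 values in [0, 1], which could then be passed to a dimensionality reduction step such as PCA.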

Similarities for lncRNAs and Proteins
In this section, we introduce the lncRNA-lncRNA similarity and the protein-protein similarity.

lncRNA-lncRNA Sequence Similarity
Kirk et al. (2018) found that lncRNAs with related functions, although lacking linear homology, often have similar k-tuple spectra, which are related to the proteins that the lncRNAs bind and to their subcellular localization. Song et al. (2014) reviewed a variety of alignment-free genome and metagenome comparison methods based on word frequency and showed that $d_2^*$ has stronger statistical power for measuring sequence relatedness. Therefore, $d_2^*$ was used in this study to calculate the sequence similarity between lncRNAs. For any two lncRNA sequences $L_1$ and $L_2$ with $m$ and $n$ nucleotides, respectively, the dissimilarity $d_2^*(L_1, L_2)$ is derived from the $D_2^*$ statistic of $L_1$ and $L_2$, where $p_w^X$ and $p_w^Y$ represent the probability of k-tuple $w$ occurring in $L_1$ and $L_2$, respectively, under the background model, and $X_w$ and $Y_w$ represent the frequencies at which k-tuple $w$ occurs in $L_1$ and $L_2$, respectively. The similarity of $L_1$ and $L_2$ is then $1 - d_2^*(L_1, L_2)$. We used the program provided by Ahlgren et al. (2016) to calculate the $d_2^*$ similarity of lncRNAs.
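For orientation, the following sketch computes a $d_2^*$-style dissimilarity in its commonly stated form: half of one minus the cosine between the centred k-tuple counts of the two sequences, each scaled by the square root of its expected count. It is not the program of Ahlgren et al. (2016). The independent-nucleotide background model with a pseudocount, the default `k = 3`, and all function names are assumptions made for illustration.

```python
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping k-tuples in seq."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts


def background_probs(seq, k):
    """Probability of each k-tuple under an assumed i.i.d. nucleotide model."""
    # Pseudocount of 1 per base keeps every probability strictly positive.
    freq = {b: (seq.count(b) + 1) / (len(seq) + 4) for b in "ACGT"}
    probs = {}
    for letters in product("ACGT", repeat=k):
        p = 1.0
        for b in letters:
            p *= freq[b]
        probs["".join(letters)] = p
    return probs


def d2star(seq1, seq2, k=3):
    """d2*-style dissimilarity in [0, 1]; 0 means identical k-tuple profiles."""
    seq1 = seq1.upper().replace("U", "T")
    seq2 = seq2.upper().replace("U", "T")
    if len(seq1) < k or len(seq2) < k:
        raise ValueError("sequences must contain at least k nucleotides")
    x, y = kmer_counts(seq1, k), kmer_counts(seq2, k)
    px, py = background_probs(seq1, k), background_probs(seq2, k)
    n_x, n_y = len(seq1) - k + 1, len(seq2) - k + 1
    dot = norm_x = norm_y = 0.0
    for word in px:
        # Centre each count by its expectation and scale by sqrt(expectation).
        a = (x.get(word, 0) - n_x * px[word]) / sqrt(n_x * px[word])
        b = (y.get(word, 0) - n_y * py[word]) / sqrt(n_y * py[word])
        dot += a * b
        norm_x += a * a
        norm_y += b * b
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.5  # assumed convention when the cosine is undefined
    return 0.5 * (1.0 - dot / (sqrt(norm_x) * sqrt(norm_y)))


def d2star_similarity(seq1, seq2, k=3):
    """Similarity as used in the text: 1 - d2*."""
    return 1.0 - d2star(seq1, seq2, k)
```

Applying `d2star_similarity` to every pair of lncRNA sequences would fill a symmetric sequence similarity matrix; a production analysis would more likely rely on a dedicated tool with a Markov background model.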

Protein-Protein Semantic Similarity
The semantic comparison of gene ontology annotations provides a quantitative method for calculating the semantic similarity of gene products (Yu et al., 2010). There are currently two classic families of methods for computing the semantic similarity of GO annotation items: information-based methods (Jiang and Conrath, 1997; Lin, 1998; Resnik, 1999) and graph-based methods (Wang et al., 2007). In this study, the graph-based method was first used to calculate the semantic similarity of GO items, and then the semantic similarity of proteins was calculated according to the association between proteins and GO items. Specifically, any GO item $A$ can be expressed as $DAG(A) = (A, T_A, E_A)$, where $T_A$ represents the set containing item $A$ and all its ancestor items in the GO graph, and $E_A$ represents the set of edges connecting the GO items in $DAG(A)$. Then, for any two GO annotation items $A$ and $B$, their semantic similarity can be defined as:
$$S_{GO}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)},$$
where $S_A(t)$ and $S_B(t)$ represent the S-values of GO item $t$ related to item $A$ and item $B$, respectively, and $SV(A) = \sum_{t \in T_A} S_A(t)$ is the semantic value of GO item $A$. According to the association between proteins and GO terms, we can then obtain the semantic similarity of proteins. We used the R package "protr" to obtain the semantic similarity of proteins; more details are given in Xiao et al. (2015).
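A toy example may help make the graph-based similarity concrete. The sketch below follows the S-value recursion of Wang et al. (2007), in which $S_A(A) = 1$ and each ancestor receives the largest weighted contribution from its children. The edge weights 0.8 ("is_a") and 0.6 ("part_of") come from that paper, not from this text, and the four-term graph and function names are invented for illustration.

```python
def s_values(term, parents):
    """S-values of `term` and of all its ancestors in the GO graph."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, weight in parents.get(child, []):
            candidate = weight * s[child]
            # Keep the best (maximum) contribution reaching each ancestor.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s


def go_similarity(a, b, parents):
    """Semantic similarity of GO terms a and b from their shared ancestors."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    shared = set(sa) & set(sb)
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Hypothetical toy graph: child -> [(parent, edge weight)].
# Assumed weights: 0.8 for "is_a" edges, 0.6 for "part_of" edges.
TOY_PARENTS = {
    "B": [("root", 0.8)],
    "C": [("root", 0.8)],
    "D": [("B", 0.8), ("C", 0.6)],
}

if __name__ == "__main__":
    print(go_similarity("D", "B", TOY_PARENTS))
```

In this toy graph, term D has S-values 1, 0.8, 0.6 and 0.64 for D, B, C and root, so $SV(D) = 3.04$, while $SV(B) = 1.8$; protein-level similarity would then be aggregated from such term-level scores over the GO terms annotated to each protein.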

Kernel Neighborhood Similarity
In the sections "Features of lncRNA" and "Features of Protein", we obtained three features of lncRNA and two features of protein; the known lncRNA-protein interaction network also contains important lncRNA (protein) feature information. Many methods are available for calculating similarities from such feature vectors, such as Gaussian similarity and linear neighborhood similarity (LNS) (Zhang et al., 2018a). Here, we adopt kernel neighborhood similarity (KSNS) (Ma et al., 2018a; Ma et al., 2018b), which not only considers the neighbor and non-neighbor similarity of samples hierarchically but also captures nonlinear relations, and which has been applied successfully to a variety of biological problems. It should be noted that the currently known lncRNA-protein interaction matrix is incomplete. Therefore, to reduce the error caused by missing information, we first used weighted K nearest neighbor profiles (WKNNP) to complete the known interaction matrix and then calculated the KSNS of the lncRNA (protein) interaction profiles. Based on the above steps, we obtained a total of 5 similarities of lncRNAs and 4 similarities of proteins, which reflect the similarity relationships of lncRNAs (proteins) from different perspectives. Because of the limitations of the data and of the chosen computational methods, these similarity networks may contain noise. Hence, we adopted clusDCA, proposed by Wang et al. (2015), for similarity network fusion; it can not only eliminate network noise and effectively capture network topology but is also computationally efficient on large-scale networks. The general procedure for predicting lncRNA-protein interactions using PMDKN is shown in Figure 1.

Prediction of lncRNA-Protein Interaction
Based on various features of lncRNA (protein) and the integrated lncRNA (protein) similarity network, we proposed projection-based neighborhood non-negative matrix factorization (PMDKN) to predict potential lncRNA-protein interactions.
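Let $FL_i$ ($i = 1, 2, \ldots, N_1$) represent the $N_1$ feature matrices of lncRNA and $FP_j$ ($j = 1, 2, \ldots, N_2$) represent the $N_2$ feature matrices of protein. The similarity matrices of lncRNA and protein are $SL$ and $SP$, respectively; $A$ represents the known lncRNA-protein interaction matrix, and $\bar{A}$ represents the lncRNA-protein interaction matrix completed by WKNNP.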
First, we mapped lncRNAs and proteins to a common non-negative space $R^d$; that is, any lncRNA $l_i$ and protein $p_j$ can be represented by non-negative latent vectors $u_i \in R^{d \times 1}$ and $v_j \in R^{d \times 1}$. These vectors form the latent matrices $U$ and $V$, and the product $UV^T$ is used to approximate the modified interaction matrix $\bar{A}$. Since the observed interactions have been verified by experiments and are more reliable than the unknown ones, the observed lncRNA-protein interactions are assigned a higher level of importance through the importance level matrix $C$: if there is an interaction between lncRNA $l_i$ and protein $p_j$, $C_{i,j} = \delta$; otherwise, $C_{i,j} = 1$, where $\delta > 1$ is the importance level parameter. In the resulting weighted factorization term, $\|\cdot\|_F$ denotes the Frobenius norm and $\gamma$ denotes the regularization parameter of the latent vectors. In addition, to integrate different types of lncRNA features, we project all lncRNA features onto the non-negative space $R^d$ through non-negative projection matrices $PL_i$, weighted by $\alpha_i$, and require the difference between each projection and $U$ to be as small as possible. Similarly, for proteins, $FP_j$ represents the $j$-th feature matrix of the protein, and the non-negative matrix $PP_j \in R^{r_{p_j} \times d}$ represents the corresponding projection matrix, where $r_{p_j}$ is the dimension of the $j$-th protein feature. The weight vector $\beta = (\beta_1, \beta_2, \ldots, \beta_{N_2})$ controls the effect of feature projection on $V$. It is generally believed that lncRNAs with higher similarity are more likely to interact with the same protein, but because the dataset is incomplete, the similarity network of lncRNAs (proteins) may contain noise. To eliminate the influence of non-neighborhood noise and improve prediction accuracy, we consider only the strong neighborhood similarity relationships of the samples and construct the lncRNA neighborhood similarity matrix $\widehat{SL}$ (equation (5)), where $\widehat{SL}_{i,j}$ represents the local similarity of lncRNAs $l_i$ and $l_j$, and $N(l_i)$ represents the set of the $K$ nearest neighbors of lncRNA $l_i$.
To select the number of neighbors adaptively according to the sample size, we set $K = \lceil 0.3 \times N_l \rceil$, where $\lceil \cdot \rceil$ indicates rounding up. It is known from equation (5) that $\widehat{SL}$ is a symmetric matrix; the protein neighborhood similarity matrix $\widehat{SP}$ is constructed in the same way (equation (7)). Because the latent features of lncRNAs with higher similarity should be as close as possible, a neighborhood regularization term is added, and combining all of the above terms gives the overall objective function (9). We use a two-step method to solve (9). First, fixing $\alpha_i$ and $\beta_j$ and using Lagrange multipliers and the KKT conditions, we obtain the iterative update formulas for $U$, $V$, $PL_i$ and $PP_j$.

Second, fixing $U$, $V$, $PL_i$ and $PP_j$, let $C_1$ represent the terms of the objective function that are unrelated to $\alpha_i$ and $\beta_j$; minimizing the remaining terms gives the update formulas (14) and (15) for $\alpha_i$ and $\beta_j$. According to (14) and (15), $\alpha_i$ and $\beta_j$ always satisfy the non-negativity constraints. In formula (9), $U$ and $V$ are obtained from the decomposition of the known lncRNA-protein interaction matrix. To reduce the prediction error for new lncRNAs (lncRNAs without any protein interaction information) and new proteins, we used the method proposed by Liu et al. (2016), in which the latent vector of an lncRNA (protein) is modified using the latent vectors of its neighbors. Let $\tilde{u}_i$ be the modified latent vector of lncRNA $l_i$, calculated according to formula (16), and let $\tilde{v}_j$ be the modified latent vector of protein $p_j$, calculated according to formula (17). With the modified latent matrices $\tilde{U}$ and $\tilde{V}$, we obtain the final lncRNA-protein interaction score $Y = \tilde{U}\tilde{V}^T$.
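To illustrate only the importance-weighted factorization at the core of this section, the sketch below minimizes $\sum_{i,j} C_{i,j}(A_{i,j} - (UV^T)_{i,j})^2 + \gamma(\|U\|_F^2 + \|V\|_F^2)$ with standard multiplicative updates. It is a deliberately simplified stand-in, not the PMDKN update rules: the feature projection terms, the neighborhood regularization, the WKNNP completion, and the correction for new lncRNAs and proteins are all omitted. Applying $C$ directly to the squared residuals is an assumption, and every function name is hypothetical.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def hadamard(X, Y):
    """Element-wise product of two equally shaped matrices."""
    return [[a * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def importance_matrix(A, delta):
    """C[i][j] = delta for observed interactions and 1 otherwise."""
    return [[delta if a > 0 else 1.0 for a in row] for row in A]


def weighted_loss(A, C, U, V, gamma):
    """Weighted squared reconstruction error plus L2 penalty on U and V."""
    R = matmul(U, transpose(V))
    fit = sum(c * (a - r) ** 2
              for ra, rc, rr in zip(A, C, R)
              for a, c, r in zip(ra, rc, rr))
    reg = sum(x * x for M in (U, V) for row in M for x in row)
    return fit + gamma * reg


def scaled_update(M, numer, denom, gamma, eps=1e-12):
    """Multiplicative step M <- M * numer / (denom + gamma * M)."""
    return [[m * p / (q + gamma * m + eps) for m, p, q in zip(rm, rp, rq)]
            for rm, rp, rq in zip(M, numer, denom)]


def weighted_nmf(A, delta=2.0, d=2, gamma=0.1, iters=200, seed=0):
    """Return non-negative factors U (rows of A) and V (columns of A)."""
    rng = random.Random(seed)
    C = importance_matrix(A, delta)
    CA = hadamard(C, A)
    U = [[rng.random() + 0.1 for _ in range(d)] for _ in A]
    V = [[rng.random() + 0.1 for _ in range(d)] for _ in A[0]]
    for _ in range(iters):
        CR = hadamard(C, matmul(U, transpose(V)))
        U = scaled_update(U, matmul(CA, V), matmul(CR, V), gamma)
        CR = hadamard(C, matmul(U, transpose(V)))
        V = scaled_update(V, matmul(transpose(CA), U), matmul(transpose(CR), U), gamma)
    return U, V
```

Given a small binary interaction matrix `A`, `U, V = weighted_nmf(A)` followed by `matmul(U, transpose(V))` yields a score for every lncRNA-protein pair. Multiplicative updates are used here because they keep the factors non-negative without a projection step.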

Algorithm
In the derivation of the model, we assume that the features of lncRNA and protein are non-negative, so the original features need to be normalized before running the algorithm. Let $\hat{F} \in R^{N \times M}$ represent the original feature matrix of lncRNA (protein), where $\hat{F}_{i,j}$ represents the $j$-th dimension of the $i$-th sample; then the normalized feature matrix $F$ is as follows:
$$F_{i,j} = \frac{\hat{F}_{i,j} - \min(\hat{F}_{\cdot j})}{\max(\hat{F}_{\cdot j}) - \min(\hat{F}_{\cdot j})},$$
where $\min(\hat{F}_{\cdot j})$ and $\max(\hat{F}_{\cdot j})$ represent the minimum and maximum of the $j$-th dimension, respectively. Algorithm 1 summarizes the general process of solving lncRNA-protein interaction prediction by PMDKN.
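The column-wise normalization can be sketched in a few lines. The sketch assumes min-max scaling of each feature dimension, and mapping constant columns to zero is an added convention that the text does not specify; the function name is hypothetical.

```python
def min_max_normalize(features):
    """Scale each column of a feature matrix (list of rows) into [0, 1]."""
    columns = list(zip(*features))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # Assumed convention: a constant column is mapped to 0.
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in features
    ]
```

Because every output value lies in [0, 1], the non-negativity assumption of the factorization holds for the normalized features.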

RESULTS AND DISCUSSION

Experimental Settings
Following previous studies, the performance of the interaction prediction methods was evaluated by 5-fold cross validation (CV), and the area under the ROC curve (AUC), the area under the precision-recall curve (AUPR), and the F1 value (F1) were used as evaluation indexes. Since the known lncRNA-protein interactions are far fewer than the unknown ones, AUPR, which penalizes false positives, is usually used as the most important evaluation index (Zhang et al., 2018a; Zhang et al., 2018b). In addition, to eliminate the influence of random partitioning on the cross-validation results, we followed the method of Liu et al. (2016): we set 5 random seeds for CV and took the mean of the cross-validation results over all random seeds as the final prediction result. Specifically, the lncRNA-protein interaction matrix $A \in R^{N_l \times N_p}$ has $N_l$ rows for lncRNAs and $N_p$ columns for proteins. To investigate the prediction ability for lncRNA-protein interactions, new lncRNAs, and new proteins, we performed CV under three different settings:
1. $CV_a$: CV on known lncRNA-protein interaction pairs. The known lncRNA-protein interactions are randomly divided into 5 equal parts. In turn, one part together with all the unknown interactions forms the test set, and the remaining four parts together with all the unknown interactions form the training set (that is, the 1s in $A$ corresponding to the test set are changed to 0 for training).
2. $CV_l$: CV on lncRNAs. All lncRNAs are randomly divided into five equal parts; each part is selected as the test set in turn, and the remaining four parts form the training set (that is, all the rows of $A$ corresponding to the test set are set to zero).
3. $CV_p$: CV on proteins. All proteins are randomly divided into five equal parts; each part is selected as the test set in turn, and the remaining four parts form the training set (that is, all the columns of $A$ corresponding to the test set are set to zero).
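The three settings can be sketched as a single generator that masks the interaction matrix. This is an illustrative reading of the description above rather than the authors' code; the function name, the string labels "a", "l" and "p", and the round-robin split after shuffling are assumptions.

```python
import random


def cv_folds(A, setting, n_folds=5, seed=0):
    """Yield (training matrix, test cells) for CV_a ('a'), CV_l ('l') or CV_p ('p')."""
    rng = random.Random(seed)
    n_rows, n_cols = len(A), len(A[0])
    if setting == "a":
        # Units are the known interactions (cells equal to 1).
        units = [(i, j) for i in range(n_rows) for j in range(n_cols) if A[i][j] == 1]
    elif setting == "l":
        units = list(range(n_rows))  # units are lncRNAs (rows)
    elif setting == "p":
        units = list(range(n_cols))  # units are proteins (columns)
    else:
        raise ValueError("setting must be 'a', 'l' or 'p'")
    rng.shuffle(units)
    for fold in range(n_folds):
        held_out = set(units[fold::n_folds])
        train = [row[:] for row in A]
        test_cells = []
        for i in range(n_rows):
            for j in range(n_cols):
                if setting == "a":
                    hidden = (i, j) in held_out
                    # CV_a tests held-out positives plus every unknown pair.
                    in_test = hidden or A[i][j] == 0
                elif setting == "l":
                    hidden = in_test = i in held_out
                else:
                    hidden = in_test = j in held_out
                if hidden:
                    train[i][j] = 0  # hide the interaction during training
                if in_test:
                    test_cells.append((i, j))
        yield train, test_cells
```

Repeating the loop with several values of `seed` and averaging the resulting metrics corresponds to the five-seed protocol described above. Under $CV_l$ and $CV_p$, the held-out rows or columns are entirely zero in the training matrix, which is what makes those lncRNAs or proteins "new" to the model.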
It should be noted that with regard to $CV_a$, we included all zeros in $A$ in the test set. For example, for DATASET 2, the test set of each cross-validation fold contains 4,870/5 = 947 known interactions and 97,658 unknown interactions (that is, the ratio of positive and negative examples is approximately 1:100). This selection method ensures that all the unknown interactions are included in each cross-validation fold, which expands the search range and is also in line with the actual situation.

Algorithm 1: PMDKN for lncRNA-protein interaction prediction
Based on $SL$ and $SP$, the neighborhood similarity matrices $\widehat{SL}$ and $\widehat{SP}$ of lncRNA and protein were obtained using equations (5) and (7).
Fix $PP_j$ and $V$, and update $\beta_j$ according to formula (15).
end for
until convergence
7. $\tilde{U}$ was obtained by completing the subspace feature $U$ of lncRNA according to formula (16).
8. $\tilde{V}$ was obtained by completing the subspace feature $V$ of protein according to formula (17).
9. $Y = \tilde{U}\tilde{V}^T$

Parameter Setting
The PMDKN algorithm has six parameters, namely the projection index parameter η, the projection regularization parameter μ, the latent vector regularization parameter γ, the neighborhood Laplacian regularization parameter λ, the latent subspace dimension d, and the importance level parameter δ for known interactions. Among them, η and μ control the influence of feature projection, γ controls the contribution of the subspace features, λ describes the effect of the similarity network, and δ controls the importance level of observed interactions. To study the effect of the parameters on the prediction results, we evaluated all parameter combinations. Specifically, η was selected from $\{2, 3, 4, 5, 6\}$, μ from $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}$, γ from $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}$, and λ from $\{2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}\}$; following previous research on matrix decomposition methods (Zheng et al., 2013; Liu et al., 2016), the latent subspace dimension was set to d = 100; and δ was selected from $\{1, 2, \ldots, 6\}$. It should be noted that, unlike DATASET 1, DATASET 2 contains more lncRNAs and proteins, and the initially constructed lncRNA (protein) similarity network does not use any known interaction information and therefore has higher predictive value. In addition, since $CV_a$, $CV_l$, and $CV_p$ assess the predictive power of the algorithm for new interactions, new lncRNAs, and new proteins, respectively, we consider the three experimental settings equally important for algorithm evaluation. Therefore, based on DATASET 2, the average of the evaluation index over the three experimental settings is used as the final evaluation standard for each parameter combination. We take AUPR as the evaluation index, and the influence of the parameters on the prediction results is shown in Figure 2.
As shown in Figure 2, the optimal parameters obtained for the PMDKN algorithm are η = 5, µ = 100, λ = 1, γ = 1, δ = 2, and the corresponding average AUPR value over the three experimental settings is 0.4735. Specifically, we first analyzed the influence of the projection parameters η and µ. We fixed λ = 1, γ = 1, δ = 2 and calculated the AUPR value of the model under all possible combinations of η and µ. As shown in Figure 2A, the AUPR value of the model increases as η becomes larger, but the overall AUPR value fluctuates only slightly. Then, we fixed η = 5, µ = 100, γ = 1, δ = 2 and analyzed the influence of λ on the AUPR value. As shown in Figure 2B, as λ increases, the AUPR value of the model first increases and then decreases, reaching its maximum at λ = 1. Similarly, as shown in Figure 2C, when γ < 1 the change in AUPR is relatively flat, whereas when γ > 1 the AUPR value decreases sharply as γ increases. In Figure 2D, δ = 1 indicates that the known interactions and the unknown interactions are equally important, and the corresponding AUPR value of the model is only 0.42; when δ = 2, the model attains its maximum AUPR value, which further shows that setting δ > 1 is necessary to improve the performance of the model. Based on the above discussion, in the following study we select η = 5, µ = 100, λ = 1, γ = 1, d = 100, and δ = 2 as the parameters of PMDKN.

Comparison With State-of-the-Art Prediction Methods
To evaluate the predictive ability of the PMDKN algorithm fairly, we conducted 5-fold cross-validation on DATASET 1 and DATASET 2, and compared it with the following methods: SFPEL-LPI (Zhang et al., 2018b), LPLNP (Zhang et al., 2018a), LPBNI (Ge et al., 2016), and LKSNF (Ma et al., 2018b). Since DATASET 1 itself was the benchmark dataset for SFPEL-LPI, LPLNP, and LKSNF, we did not need to re-extract the features. For DATASET 2, we calculated the PCPseDNC and SCPseDNC of the lncRNAs and the PCPseAAC and SCPseAAC of the proteins, according to the requirements of SFPEL-LPI.
Since SWSS similarity leads to the reuse of known interaction information, only the Smith-Waterman similarity of lncRNA (protein) was calculated. For LPLNP and LKSNF, we calculated the sequence features and expression profile features of lncRNAs and the CTD of proteins according to their requirements. Since LPBNI uses only known lncRNA-protein interactions for prediction, we did not need to extract additional information for it. According to previous studies, LPLNP, LPBNI, and LKSNF only predict unknown lncRNA-protein interactions, while SFPEL-LPI predicts not only unknown lncRNA-protein interactions but also new lncRNAs and new proteins. Therefore, on DATASET 1 and DATASET 2, we performed CV_a on all models, and CV_l and CV_p only on SFPEL-LPI. We performed the cross-validation experiments using the experimental setup in Section "Experimental Settings" and used the mean of the five-fold cross-validation results over five random seeds as the evaluation index of each algorithm; the parameters of the compared models were set to their recommended values. Table 1 compares the predictive performance of PMDKN with that of other state-of-the-art methods for new lncRNA-protein interaction prediction. On both DATASET 1 and DATASET 2, the AUPR, AUC, and F1 values of PMDKN are higher than those of the other models. Specifically, on DATASET 1, for the most important evaluation index, AUPR, PMDKN reaches 0.4959, which is 50.46%, 8.37%, 4.31%, and 6.07% higher than LPBNI's 0.3296, LPLNP's 0.4576, LKSNF's 0.4754, and SFPEL-LPI's 0.4675, respectively. For the commonly used evaluation index AUC, PMDKN reaches 0.9223, which is higher than the 0.8546 of LPBNI, 0.9095 of LPLNP, 0.9150 of LKSNF, and 0.9201 of SFPEL-LPI. The F1 value of PMDKN reaches 0.4814, which is 24.04%, 6.50%, 4.00%, and 3.37% higher than the 0.3881 of LPBNI, 0.4520 of LPLNP, 0.4629 of LKSNF, and 0.4657 of SFPEL-LPI, respectively.
On DATASET 2, the AUPR of PMDKN reaches 0.4808, which is 40.67%, 2.45%, 6.18%, and 14.07% higher than the 0.3418 of LPBNI, 0.4693 of LPLNP, 0.4528 of LKSNF, and 0.4215 of SFPEL-LPI, respectively. The AUC value of PMDKN reaches 0.9732, higher than the 0.9340 of LPBNI, 0.9700 of LPLNP, 0.9710 of LKSNF, and 0.9728 of SFPEL-LPI. The F1 value of PMDKN reaches 0.4761, which is 19.71%, 3.37%, 2.67%, and 7.04% higher than the 0.3977 of LPBNI, 0.4606 of LPLNP, 0.4637 of LKSNF, and 0.4448 of SFPEL-LPI, respectively. These results demonstrate that the PMDKN algorithm has good predictive power for unknown lncRNA-protein interactions.
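The percentages reported for the comparison are relative improvements over each baseline, that is, (PMDKN value − baseline value) / baseline value. The short sketch below recomputes the DATASET 1 AUPR improvements from the values quoted in the text; the function name is chosen here for illustration.

```python
def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# AUPR values on DATASET 1 (CV_a) as quoted in the text.
PMDKN_AUPR = 0.4959
BASELINE_AUPR = {
    "LPBNI": 0.3296,
    "LPLNP": 0.4576,
    "LKSNF": 0.4754,
    "SFPEL-LPI": 0.4675,
}

for name, value in BASELINE_AUPR.items():
    print(f"{name}: {relative_improvement(PMDKN_AUPR, value):.2f}%")
```

The same calculation applies to the F1 values and to the DATASET 2 results.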
The prediction of new lncRNAs and new proteins is also an important criterion for evaluating the performance of a method. Among the four comparison algorithms above, only SFPEL-LPI performs the prediction of new lncRNAs and new proteins. Therefore, we compare only SFPEL-LPI and PMDKN on CV_l and CV_p. As shown in Table 2, except for the F1 value of PMDKN for CV_l on DATASET 2, which at 0.4864 is slightly lower than the 0.4892 of SFPEL-LPI, PMDKN outperforms SFPEL-LPI on all other evaluation indicators, especially for the prediction of new proteins (CV_p). Specifically, on DATASET 1, the AUPR values of PMDKN for CV_l and CV_p reach 0.6301 and 0.4918, which are 30.92% and 49.71% higher than the 0.4813 and 0.3285 of SFPEL-LPI, respectively. The AUC values of PMDKN for CV_l and CV_p reach 0.8907 and 0.7843, which are 7.52% and 17.66% higher than the 0.8284 and 0.6666 of SFPEL-LPI, respectively. The F1 values of PMDKN for CV_l and CV_p reach 0.6081 and 0.5251, which are 23.32% and 38.95% higher than the 0.4931 and 0.3779 of SFPEL-LPI, respectively. Similarly, on DATASET 2, the AUPR and AUC values of PMDKN for CV_l are higher than those of SFPEL-LPI. The advantage is largest for CV_p, where the AUPR, AUC, and F1 values of PMDKN reach 0.4604, 0.9019, and 0.4818, respectively, improvements of 281.13%, 37.78%, and 148.35% over the 0.1208, 0.6546, and 0.1940 of SFPEL-LPI.
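The difference between the three cross-validation settings can be sketched as different ways of choosing the held-out lncRNA-protein pairs. The code below is a simplified illustration under stated assumptions: CV_a holds out folds of known interactions, CV_l holds out all pairs of whole lncRNAs, and CV_p holds out all pairs of whole proteins. The exact protocol is defined in Section "Experimental Settings", and the function names here are hypothetical.

```python
import random


def make_folds(items, n_folds=5, seed=0):
    """Shuffle items and split them into n_folds roughly equal folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::n_folds] for i in range(n_folds)]


def cv_test_sets(n_lncrna, n_protein, known_pairs, setting, n_folds=5, seed=0):
    """Return one set of held-out (lncRNA, protein) index pairs per fold."""
    if setting == "CV_a":
        # Assumption: hide random subsets of the known interactions.
        folds = make_folds(sorted(known_pairs), n_folds, seed)
        return [set(fold) for fold in folds]
    if setting == "CV_l":
        # Hide every pair of the held-out lncRNAs (they act as new lncRNAs).
        folds = make_folds(range(n_lncrna), n_folds, seed)
        return [{(i, j) for i in fold for j in range(n_protein)} for fold in folds]
    if setting == "CV_p":
        # Hide every pair of the held-out proteins (they act as new proteins).
        folds = make_folds(range(n_protein), n_folds, seed)
        return [{(i, j) for i in range(n_lncrna) for j in fold} for fold in folds]
    raise ValueError(f"unknown setting: {setting}")
```

Under CV_l and CV_p, a held-out entity has no remaining known interactions in the training data, which is why these settings test the prediction of new lncRNAs and new proteins.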

Comparative Analysis of Model Stability
Due to technical limitations, the known lncRNA-protein interactions may contain noise, such as missing or spurious interactions. To test how strongly the prediction performance of the model depends on the known interactions, we followed the method of Zhang et al. (2018b) and randomly deleted some of the known interactions to represent the missing

Case Study
The lncRNA-protein interactions in DATASET 1 and DATASET 2 used in this paper were extracted from NPInter v2.0, and the database has since been updated to NPInter v3.0 (Hao et al., 2016). Compared with version 2.0, NPInter v3.0 contains more lncRNAs, more proteins, and more interaction information. To test the predictive ability for new proteins, we extracted 95 new proteins that did not exist in DATASET 2 from NPInter v3.0, obtained the amino acid sequences and gene ontology annotations of these new proteins, and combined them with the DATASET 2 information to predict the interactions between these new proteins and lncRNAs. For the prediction scores of each new protein, we calculated the AUPR and AUC values and the hit rates of the top 10, 20, 50, and 100 candidate lncRNAs (Nourania et al., 2016); the results are shown in Figure 4. As shown in Figure 4, for the prediction of new proteins, PMDKN not only has higher AUPR and AUC values than SFPEL-LPI, but its hit rates for the top 10, 20, 50, and 100 candidate lncRNAs are also much higher. Specifically, the average AUPR and AUC values of PMDKN were 0.204 and 0.839, which were 20.66% and 8.49% higher than the 0.169 and 0.773 of SFPEL-LPI, respectively. The hit rates of candidate lncRNAs in the top 10, top 20, top 50, and top 100 reached 42.8%, 47.1%, 52.1%, and 57.2%, increases of 266.32%, 264.37%, 125.75%, and 80.68% over the 11.7%, 12.9%, 23.1%, and 31.7% of SFPEL-LPI, respectively, which further demonstrates that PMDKN has strong predictive ability.
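A top-k hit rate of this kind can be computed as sketched below. The text does not spell out the definition, so this sketch assumes that a new protein counts as a hit when at least one of its true interacting lncRNAs appears among its k highest-scoring candidates, and that the hit rate is the fraction of proteins with a hit. The function name and input layout are hypothetical.

```python
def top_k_hit_rate(scores, truth, k):
    """Fraction of proteins with at least one true lncRNA in their top-k list.

    scores: {protein: {lncRNA: predicted score}}
    truth:  {protein: set of lncRNAs known to interact with it}
    The "at least one true partner in the top k" criterion is an assumption.
    """
    hits = 0
    for protein, lnc_scores in scores.items():
        # Rank candidate lncRNAs from highest to lowest predicted score.
        ranked = sorted(lnc_scores, key=lnc_scores.get, reverse=True)
        if truth.get(protein, set()) & set(ranked[:k]):
            hits += 1
    return hits / len(scores)
```

Evaluating the function for k = 10, 20, 50, and 100 over the prediction scores of the new proteins would give one hit rate per cutoff, comparable in form to the values reported above.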

DISCUSSION
In this study, we proposed a new lncRNA-protein interaction prediction model, which not only predicts unknown interactions between lncRNAs and proteins but also has strong predictive ability for new lncRNAs and new proteins. To evaluate the predictive performance of the model fairly, we performed three types of 5-fold cross-validation on the two benchmark datasets, namely, CV_a for new lncRNA-protein interactions, CV_l for new lncRNAs, and CV_p for new proteins. The results show that, on DATASET 1, the AUPR values of PMDKN under the three experimental settings reach 0.4959 (on CV_a), 0.6301 (on CV_l), and 0.4918 (on CV_p), respectively; on DATASET 2, they reach 0.4808 (on CV_a), 0.4794 (on CV_l), and 0.4604 (on CV_p), respectively, higher than those of other state-of-the-art methods. In the case study, interactions were predicted for 95 new proteins, and the results showed that for the top-10 candidate lncRNAs the hit rate of the PMDKN algorithm reached 42.8%, much higher than that of the compared method. Therefore, PMDKN can be used as an effective tool for lncRNA-protein interaction prediction.
The good performance of PMDKN may be explained by the following reasons. First, feature extraction and network construction: we extract multiple features to describe lncRNAs and proteins from different perspectives and integrate multiple sources of information to construct a more accurate lncRNA (protein) similarity network, which effectively avoids the over-fitting problem that may be caused by the bias of a single data source. Second, the use of neighborhood information: we modified the initial lncRNA-protein interaction network to overcome the network sparsity problem, and used the adaptive neighborhood completion strategy to eliminate the errors caused by the lack of information in the latent vectors of new lncRNAs (new proteins), so as to ensure the predictive ability for new proteins and new lncRNAs. Finally, the construction of the ensemble predictive model: we combine the multiple sequence features of lncRNA (protein) and the integrated similarity networks to construct the predictive model, which distinguishes positive and negative observations by setting importance levels and establishes the relationship between features and latent vectors through the projection of the features, so as to improve the accuracy of model prediction.

DATA AVAILABILITY STATEMENT
The source code and datasets used in the paper can be found in the Supplementary Files.

AUTHOR CONTRIBUTIONS
YM and XJ designed the projection-based neighborhood nonnegative matrix factorization for lncRNA-protein interaction prediction. YM and XJ designed the experiment and wrote the manuscript. TH and XJ supervised and helped conceive the study. All authors read and approved the final manuscript.