Probing lncRNA–Protein Interactions: Data Repositories, Models, and Algorithms

Identifying lncRNA–protein interactions (LPIs) is vital to understanding various key biological processes. Wet-lab experiments have identified a few LPIs, but experimental methods remain costly and time-consuming. Therefore, computational methods are increasingly exploited to capture LPI candidates. We introduce relevant data repositories and focus on two types of LPI prediction models: network-based methods and machine learning-based methods; the latter comprise matrix factorization-based and ensemble learning-based techniques. To assess the performance of computational methods, we compared representative LPI prediction models under leave-one-out cross-validation (LOOCV) and fivefold cross-validation. The results show that SFPEL-LPI obtained the best AUC. Although computational models have efficiently unraveled some LPI candidates, they still have many limitations. We discuss future directions to further boost LPI predictive performance.


INTRODUCTION
Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that lack protein-coding capacity (Sanchez Calle et al., 2018). lncRNAs are closely associated with various key biological processes, such as cell cycle regulation, immune response, and embryonic stem cell pluripotency (Agirre et al., 2019; Li et al., 2019b). More importantly, lncRNAs play an important role in the pathogenesis of various diseases, especially tumors (Chen et al., 2016a; Fu et al., 2017; Jiang et al., 2018; He et al., 2018a; Dallner et al., 2019). Although lncRNAs play a spectrum of regulatory roles across different cellular pathways, our understanding of their regulatory mechanisms remains very limited (Munschauer et al., 2018).
Recently, one broad theme is that lncRNAs can drive the assembly of RNA-protein complexes and thereby facilitate the regulation of gene expression (Rinn and Chang, 2012; Chen and Yan, 2013; Hentze et al., 2018; Munschauer et al., 2018; Nozawa and Gilbert, 2019). lncRNAs achieve their specific functions by interacting with multiple proteins and thus regulating multiple cellular processes (Zhang et al., 2018c; Pyfrom et al., 2019). Studies reported that lncRNAs can modulate post-transcriptional gene regulation, splicing, and translation by binding to proteins (Zhang et al., 2018c; Li et al., 2019a). Therefore, identifying possible lncRNA-protein interactions (LPIs) is essential for unraveling lncRNA-related activities (Qian et al., 2018; Zhang et al., 2018c; Zhao et al., 2018c). Wet experiments have validated parts of LPIs, but experimental methods remain costly and time-consuming. Therefore, different computational models have been explored to infer potential LPIs (Cheng et al., 2018; Zhang et al., 2018c; Zhao et al., 2018c). There exist numerous unexplored lncRNAs and proteins in public databases, which makes it possible to efficiently identify their underlying associations.
In this study, we introduced relevant repositories, summarized computational models and algorithms for LPI prediction, discussed their advantages and weaknesses by comparison, and presented further directions for boosting LPI prediction performance. We focused on two categories of computational models: network-based methods and machine learning-based methods. The machine learning-based methods contain matrix factorization-based methods and ensemble learning-based methods.

RELEVANT REPOSITORIES
There are abundant repositories related to LPI prediction. These repositories provide diverse information for efficiently uncovering potential LPIs. Several of them also offer search interfaces in which users can look up the targets of a particular lncRNA or the lncRNAs associated with a particular gene.

lncRNAdb
The lncRNAdb database (Quek et al., 2014) (http://lncrnadb.org) is a comprehensive database in compliance with the International Nucleotide Sequence Database Collaboration. It provides 287 eukaryotic lncRNAs and an interface enabling users to access sequence data, expression information, and the literature. The latest update of lncRNAdb integrated nucleotide sequence information, Illumina Body Atlas expression profiles, and a BLAST search tool.

lncRNASNP2
The lncRNASNP2 database (Miao et al., 2017) (http://bioinfo.life.hust.edu.cn/lncRNASNP2) provides 7,260,238 single nucleotide polymorphisms (SNPs) on 141,353 human lncRNA transcripts and 3,921,448 SNPs on 117,405 mouse lncRNA transcripts. More importantly, it contains abundant information about mutations in lncRNAs and their impacts on lncRNA structure and function. It also provides online tools for analyzing new variants in lncRNAs.

lncRNAWiki
The lncRNAWiki database (Ma et al., 2014) (http://lncrna.big.ac.cn) integrates various human lncRNAs from different resources. It allows existing lncRNA entries to be updated, edited, and curated by diverse users. More importantly, any user can add newly uncovered lncRNAs.

lncRNADisease
The lncRNADisease database (Bao et al., 2018) (http://www.rnanut.net/lncrnadisease/) integrates experimentally validated ncRNA-disease associations, including circular RNA-disease associations, and regulatory relationships among mRNAs, miRNAs, and ncRNAs. In particular, it contains more than 200,000 lncRNA-disease associations. In addition, it gives confidence scores for all ncRNA-disease associations and maps each disease to Disease Ontology and Medical Subject Headings terms.

UniProt
The UniProt database (Consortium et al., 2018) (http://www.uniprot.org/) is an important database providing protein sequences and annotations. It provides about 80 million sequences. Users can use a proteome identifier to find a particular assembly for a species or subspecies. It also provides an effective measurement for computing an annotation score for all entries.

METHODS
Most computational methods contain two procedures: data extraction and model selection. In the first procedure, computational methods usually extract human LPIs, lncRNA sequences, and protein sequences from NPInter (Hao et al., 2016), NONCODE (Zhao et al., 2015), and UniProt (Consortium et al., 2018), respectively. Computational methods then filter LPIs by removing lncRNAs/proteins that interact with only one protein/lncRNA. In the second procedure, computational methods design various models to uncover potential LPIs. These models can be roughly classified into two categories: network-based methods and machine learning-based methods.

Data Representation
Computational methods utilize an lncRNA set L = {l_1, l_2, …, l_n}, a protein set P = {p_1, p_2, …, p_m}, and an LPI matrix Y ∈ {0, 1}^(n×m), where y_ij = 1 if there is a known association between lncRNA l_i and protein p_j; otherwise, y_ij = 0.
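This representation can be sketched directly; a minimal example (the lncRNA/protein names and interaction pairs are illustrative, not from any repository):

```python
import numpy as np

def build_lpi_matrix(lncrnas, proteins, interactions):
    """Build the binary LPI matrix Y (n x m): y_ij = 1 iff lncRNA i
    is known to interact with protein j."""
    l_index = {l: i for i, l in enumerate(lncrnas)}
    p_index = {p: j for j, p in enumerate(proteins)}
    Y = np.zeros((len(lncrnas), len(proteins)), dtype=int)
    for l, p in interactions:
        Y[l_index[l], p_index[p]] = 1
    return Y

# Toy example: 3 lncRNAs, 2 proteins, 3 known interactions.
Y = build_lpi_matrix(["l1", "l2", "l3"], ["p1", "p2"],
                     [("l1", "p1"), ("l2", "p2"), ("l3", "p1")])
```

Rows of Y are lncRNA interaction profiles and columns are protein interaction profiles, which is exactly how several methods below derive their features.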

Network-Based Methods
Network-based methods obtain better performance by effectively integrating related biological information and network propagation algorithms into a unified framework. Li et al. (2015) developed an LPI prediction method, LPIHN, combining a heterogeneous network model and random walk with restart. LPIHN can be broken down into four steps:

LPIHN
Step 1 Extracting known ncRNA-protein associations from the NPInter 2.0 database (Hao et al., 2016) and filtering the ncRNAs and their associated proteins based on organism and ncRNA type. LPIHN then selects lncRNAs from the filtered ncRNAs based on the human lncRNA dataset provided by the NONCODE database (Zhao et al., 2015).
Step 2 Obtaining lncRNA expression profiles from the NONCODE 4.0 database (Zhao et al., 2015). Given the expression profiles E_1 and E_2 of two lncRNAs, LPIHN calculates lncRNA expression similarity based on the Pearson correlation coefficient:

sim(E_1, E_2) = cov(E_1, E_2) / (s_E1 · s_E2)

where cov(E_1, E_2) is the covariance of E_1 and E_2, and s_E1 and s_E2 are the standard deviations of E_1 and E_2, respectively.
Step 3 Extracting protein-protein interactions (PPIs) from STRING 9.1 (Szklarczyk et al., 2016), obtaining 804 PPIs and the corresponding score matrix SP. SP is normalized as SP* = M^(-1) SP, where M is a diagonal matrix and M(i, i) is the sum of row i in SP.
Step 4 Propagating the random walk to score unknown lncRNA-protein pairs based on the iterative equation

p_(t+1) = (1 − r) W p_t + r p_0

where W is the normalized transition matrix of the heterogeneous network, p_0 is the initial probability vector, and r is the restart probability. The details are shown in Figure 1.
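The random walk with restart in Step 4 follows the standard iterative update p ← (1 − r)Wp + rp_0; a minimal sketch, assuming a column-normalized transition matrix and an illustrative restart probability (the toy network is not LPIHN's heterogeneous network):

```python
import numpy as np

def random_walk_with_restart(W, p0, r=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1-r) * W @ p + r * p0 until convergence.
    W: column-normalized transition matrix; r: restart probability."""
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - r) * W @ p + r * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy 3-node network: node 0 is a hub linked to nodes 1 and 2.
A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
W = A / A.sum(axis=0, keepdims=True)    # column-normalize
p0 = np.array([1., 0., 0.])             # restart at node 0
scores = random_walk_with_restart(W, p0)
```

Because W is column-stochastic, the score vector stays a probability distribution; entries rank nodes by proximity to the restart node.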
LPLNP
Zhang et al. (2018b) proposed a linear neighborhood propagation-based method, LPLNP, to probe potential LPIs. LPLNP finds novel LPIs through the following steps.
Step 2 Obtaining three types of features for lncRNAs (interaction profile, expression profile, and sequence composition) and two types of features for proteins [interaction profile and CTD (composition, transition, and distribution)].
Step 3 Computing the linear neighborhood similarity and regularized linear neighborhood similarity between lncRNAs/proteins by Eqs. (4) and (5), respectively. The linear neighborhood similarity reconstructs each sample from its neighbors:

min_w ||X_i − Σ_(X_j ∈ N(X_i)) w_ij X_j||², s.t. Σ_j w_ij = 1, w_ij ≥ 0

where X_i denotes the feature vector of the ith lncRNA and N(X_i) is the set of K nearest neighbors of X_i.
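The reconstruction-weight idea in Step 3 can be sketched as follows; this is a simplification, with the nonnegativity constraint relaxed to a plain sum-to-one normalization, and the regularization value and toy data are illustrative:

```python
import numpy as np

def linear_neighborhood_similarity(X, k=2, reg=1e-6):
    """For each sample x_i, solve for weights over its k nearest
    neighbors that best reconstruct x_i (weights normalized to sum
    to 1); the weights serve as similarities."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the sample itself
        nbrs = np.argsort(d)[:k]
        Z = X[nbrs] - X[i]                   # neighbors centered on x_i
        G = Z @ Z.T + reg * np.eye(k)        # regularized local Gram matrix
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()             # normalize to sum to 1
    return W

X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])
W = linear_neighborhood_similarity(X, k=2)
```

Each row of W sums to 1 and is nonzero only on that sample's nearest neighbors, which is the sparsity structure LPLNP propagates labels over.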
Step 4 Computing the interaction probabilities for unobserved lncRNA-protein pairs by propagating known interactions over the learned similarities. The details are shown in Figure 2.

LPI-BNPRA
Step 1 Extracting 4,158 high-confidence LPIs between 990 lncRNAs and 27 proteins from NPInter (Hao et al., 2016) and NONCODE (Zhao et al., 2015) by filtering unreliable lncRNA sequences and removing lncRNAs/proteins associated with only one protein/lncRNA.
Step 2 Calculating lncRNA-lncRNA similarity based on the Smith-Waterman technique:

sl(l_i, l_j) = sw(l_i, l_j) / sqrt(sw(l_i, l_i) · sw(l_j, l_j))

where sw(l_i, l_j) denotes the Smith-Waterman score between two lncRNAs l_i and l_j.
Step 3 Calculating the protein-protein similarity matrix analogously:

sp(p_i, p_j) = sw(p_i, p_j) / sqrt(sw(p_i, p_i) · sw(p_j, p_j))

where sw(p_i, p_j) denotes the Smith-Waterman score between two proteins p_i and p_j.
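The normalized Smith-Waterman similarity used in Steps 2 and 3 can be sketched as follows; the match/mismatch/gap scoring values are illustrative, and real pipelines would use an optimized aligner rather than this textbook dynamic program:

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Minimal Smith-Waterman local-alignment score (linear gap
    penalty; scoring parameters are illustrative)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

def sw_similarity(a, b):
    """Normalized score: sw(a, b) / sqrt(sw(a, a) * sw(b, b))."""
    return sw_score(a, b) / (sw_score(a, a) * sw_score(b, b)) ** 0.5

s = sw_similarity("ACGU", "ACGU")   # identical sequences -> 1.0
```

The normalization bounds the similarity by 1 and makes scores comparable across sequences of different lengths.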
Step 4 For a given lncRNA l_j, computing its bias ratings over lncRNAs for a protein p_i with agglomerative hierarchical clustering under the minimum variance criterion, where n_cr is the number of lncRNAs in the cluster cr containing l_j, and T_pi is the number of all lncRNAs interacting with p_i.
Step 5 Finding LPI candidates based on the recommended bipartite network projection technique and the bias ratings of every lncRNA for proteins. The details are shown in Figure 3.

LPISNFHS
LPISNFHS integrates the similarity network fusion (SNF) technique, the HeteSim algorithm, and the known LPI network into a unified framework. LPISNFHS can be broken down into three steps.
Step 1 Obtaining 4,467 LPIs between 1,050 unique lncRNAs and 84 unique proteins from NPInter (Hao et al., 2016) and NONCODE (Zhao et al., 2015) by manually filtering LPIs not involving lncRNAs and removing the lncRNAs only associated with one protein.
Step 2 Constructing a protein-protein similarity network. LPISNFHS fuses the sequence similarity, functional annotation semantic similarity (GO), domain similarity, and STRING similarity into a unified protein-protein similarity network based on the SNF technique.
Step 3 Inferring novel LPIs by combining the HeteSim algorithm and the heterogeneous LPI network.

Xie et al. (2019) developed an LPI prediction model, LPI-IBNRA, which integrates lncRNA-protein interactions, protein-protein interactions, similarity matrices for proteins and lncRNAs, and an improved bipartite network recommender algorithm. LPI-IBNRA can be broken down into seven steps.
Step 2 Computing the lncRNA similarity matrix sim_L based on lncRNA expression similarity and Gaussian interaction profile (GIP) kernel similarity, and the protein similarity matrix sim_P based on protein interaction similarity and GIP kernel similarity.
Step 3 Computing the score between protein p i and lncRNA l j based on protein similarity and lncRNA similarity by Eqs. (18) and (19), respectively.
Step 4 Obtaining the initialized association score matrix.
Step 5 Computing the first-round scores of lncRNA l_k over all proteins.
Step 6 Computing the second-round scores of protein p_i over all lncRNAs.
Step 7 Computing the final association score matrix, where W′ = W + aW² and a ∈ (−1, 0). The details are shown in Figure 4.

Ge et al. (2016) proposed an lncRNA-protein bipartite network inference method, LPBNI, to find potential LPIs. LPBNI can be broken down into five steps.
Step 2 Utilizing the LPI network to construct a bipartite graph G(L, P, Y).
Step 3 Propagating known biological information in G. For an lncRNA l_j, S_L(l_j) denotes the score on l_j after the first step of propagation, where S_0(i) = s_ij, i ∈ {1, 2, …, m}, denotes the initial resource of the proteins for the given lncRNA l_j; s_ij = 1 if p_i associates with l_j and s_ij = 0 otherwise; and d(p_i) denotes the number of lncRNAs associated with p_i.
Step 4 Propagating all the information in L back to P. S_F(p_i) represents the final information on protein p_i, denoting the association score between p_i and l_j, where d(l_j) = Σ_(i=1)^(m) a_ij is the number of proteins interacting with l_j.
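The two-step resource propagation in Steps 3 and 4 can be written compactly in matrix form; this is a ProbS-style sketch of the idea, not the authors' implementation, and the toy LPI matrix is illustrative:

```python
import numpy as np

def bipartite_propagation(Y):
    """Two-step resource propagation on the lncRNA-protein bipartite
    graph: resource flows from proteins to lncRNAs (divided by protein
    degree) and back to proteins (divided by lncRNA degree)."""
    k_l = Y.sum(axis=1, keepdims=True)    # lncRNA degrees (n x 1)
    k_p = Y.sum(axis=0, keepdims=True)    # protein degrees (1 x m)
    W = (Y / k_l).T @ (Y / k_p)           # m x m transfer matrix
    return Y @ W.T                        # final n x m score matrix

# Toy LPI matrix: 3 lncRNAs x 3 proteins, every degree nonzero.
Y = np.array([[1., 1., 0.], [0., 1., 1.], [1., 0., 1.]])
S = bipartite_propagation(Y)
```

Since each protein distributes all of its initial resource and each lncRNA redistributes what it received, each lncRNA's total resource is conserved across the two steps.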
Step 5 Computing the final association score S_F after the above two-step information propagation. The details are shown in Figure 5.

Step 1 Describing lncRNA interaction profiles and protein interaction profiles as row vectors and column vectors of the LPI matrix, respectively.
Step 2 Calculating the probability that two entities x_i and x_j belong to the same cluster based on the ant colony clustering method, where r is the cluster radius, c_j is the center of the jth cluster, and the parameters satisfy a ∈ (0, 5), b ∈ (0, 5), r ∈ (0.1, 0.99), and Q ∈ (1, 10000).
Step 3 Applying the lncRNA-protein bipartite network to identify LPI candidates. Given a protein p_k, its association scores with all lncRNAs at the tth iteration, P_k^t, can be computed iteratively, where W is a similarity matrix. The association scores for all proteins {p_1, p_2, …, p_m} can be represented analogously.

Machine Learning-Based Methods
Machine learning-based LPI prediction methods utilize machine learning models and algorithms to uncover potential LPIs. This type of method can be roughly classified into two categories: matrix factorization-based methods and ensemble learning-based methods.

Matrix Factorization-Based Models
Matrix factorization is exploited in recommendation systems and has been widely applied in bioinformatics (Zhang et al., 2018a; Zhao et al., 2018b; Cantini et al., 2019). Matrix factorization-based LPI prediction techniques transform the problem of LPI identification into a recommendation task and adopt a matrix factorization model to capture unobserved LPIs. Given an LPI matrix Y and two nonnegative matrices W ∈ ℝ^(k×n) and H ∈ ℝ^(k×m), the problem of predicting LPIs can be formulated as the following objective function:

min_(W,H ≥ 0) ||Y − W^T H||_F²

A few LPI identification methods have been designed based on matrix factorization. Zhang et al. (2018a) designed a graph regularized nonnegative matrix factorization-based (NMF) method to predict potential LPIs, LPGNMF. LPGNMF consists of three steps.
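The basic NMF objective above can be minimized with the classical Lee-Seung multiplicative updates; this sketch omits the graph regularization terms that LPGNMF adds, and the rank and iteration count are illustrative:

```python
import numpy as np

def nmf(Y, k=2, n_iter=200, eps=1e-9):
    """Plain multiplicative-update NMF for Y ~ A @ B (Lee-Seung).
    A (n x k) and B (k x m) stay nonnegative because the updates
    only multiply by nonnegative ratios."""
    rng = np.random.default_rng(0)
    n, m = Y.shape
    A = rng.random((n, k))
    B = rng.random((k, m))
    for _ in range(n_iter):
        A *= (Y @ B.T) / (A @ B @ B.T + eps)
        B *= (A.T @ Y) / (A.T @ A @ B + eps)
    return A, B

Y = np.array([[1., 1., 0.], [1., 0., 1.], [0., 1., 1.]])
A, B = nmf(Y, k=2)
scores = A @ B    # reconstructed association scores for all pairs
```

Unobserved pairs receive nonzero reconstructed scores, which is exactly how factorization-based predictors rank LPI candidates.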
Step 2 Computing lncRNA similarity and protein similarity.
LPGNMF computes the lncRNA expression profile similarity S_l(i, j) based on the Pearson correlation coefficient:

S_l(i, j) = cov(E_i, E_j) / (s_Ei · s_Ej)

where cov(E_i, E_j) is the covariance of the expression profiles E_i and E_j of lncRNAs l_i and l_j, and s_Ei and s_Ej are their standard deviations. LPGNMF computes a weight matrix based on lncRNA similarity, where N(l_i) and N(l_j) denote the p nearest neighbors of l_i and l_j. LPGNMF then calculates the sparse similarity matrix of lncRNAs S_l*. Similarly, LPGNMF calculates the sparse similarity matrix of proteins S_p*.
Step 3 Building an optimization model based on the graph regularized nonnegative matrix factorization method. The details are shown in Figure 6.

Liu et al. (2017) designed a novel LPI identification model based on neighborhood regularized logistic matrix factorization, LPI-NRLMF. LPI-NRLMF can be roughly broken down into the following steps.
Step 2 Computing the lncRNA sequence similarity matrix LSM and the protein sequence similarity matrix PSM based on the Smith-Waterman algorithm.
Step 3 Defining neighborhood information for lncRNAs and obtaining the adjacency matrix A of lncRNAs, whose entries mark the K nearest neighbors of each lncRNA. Similarly, LPI-NRLMF computes the adjacency matrix B of proteins.
Step 4 Computing association scores S_N for unknown lncRNA-protein pairs based on the neighborhood regularized logistic matrix factorization model. Here, u_i ∈ ℝ^(1×r) and v_j ∈ ℝ^(1×r) are the latent vectors learned by the model, and the interaction probability is the logistic function of their inner product:

P_ij = exp(u_i v_j^T) / (1 + exp(u_i v_j^T))

The details are shown in Figure 7.
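The scoring step of logistic matrix factorization can be sketched as follows; only scoring is shown (not the neighborhood-regularized training), and the latent factors here are random placeholders rather than trained values:

```python
import numpy as np

def lmf_scores(U, V):
    """Logistic matrix factorization scores: the probability that
    lncRNA i interacts with protein j is sigmoid(u_i . v_j)."""
    Z = U @ V.T
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))   # 4 lncRNAs in a rank-3 latent space
V = rng.normal(size=(5, 3))   # 5 proteins in the same latent space
P = lmf_scores(U, V)          # 4 x 5 interaction probabilities
```

The sigmoid squashes every latent dot product into (0, 1), so the output can be read directly as interaction probabilities.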
IRWNRLPI
Zhao et al. (2018b) fused the random walk into LPI-NRLMF and developed a novel LPI prediction model, IRWNRLPI. IRWNRLPI is a semi-supervised learning-based model and does not require negative samples. IRWNRLPI contains the following five steps.
Step 2 Computing the lncRNA sequence similarity matrix LS and the protein sequence similarity matrix PS based on the Smith-Waterman algorithm.
Step 3 Building a random walk model to compute association scores S_R for unknown lncRNA-protein pairs, where r_ij represents the extent of association between a node v_i and its neighbor v_j. The transition matrix L = (l_ij)_(M×M) is computed by l_ij = r_ij / Σ_(j=1)^(N) r_ij. IRWNRLPI divides L into two arrays, L_U and L_Q.
Step 4 Computing association scores S_N for unknown lncRNA-protein pairs based on the neighborhood regularized logistic matrix factorization model; u_i ∈ ℝ^(1×r) and v_j ∈ ℝ^(1×r) can be computed by the model, where U ∈ ℝ^(m×r) and V ∈ ℝ^(n×r).
Step 5 Computing the final association scores for unknown lncRNA-protein pairs by combining S_R and S_N. The details are shown in Figure 8.

LPI-KTASLP
Step 1 Computing lncRNA kernels and protein kernels at four levels.
Level 1 GIP kernel: the GIP kernels between two lncRNAs and between two proteins are defined as

K_GIP(l_i, l_j) = exp(−γ_l ||IP(l_i) − IP(l_j)||²), K_GIP(p_i, p_j) = exp(−γ_p ||IP(p_i) − IP(p_j)||²)

where IP(·) denotes an interaction profile and γ_l and γ_p are bandwidth parameters.
Level 2 Sequence kernel: the sequence kernels of two lncRNAs and of two proteins are defined as SW(S_i, S_j)/sqrt(SW(S_i, S_i) · SW(S_j, S_j)), where SW(·,·) is the Smith-Waterman score and S_i represents the sequence information of an lncRNA/protein.
Level 3 Sequence feature kernel: constructing radial basis function kernels K_SF^lnc and K_SF^pro for lncRNAs and proteins based on the conjoint triad and the pseudo position-specific scoring matrix, respectively.
Level 4 lncRNA expression kernel: calculating the lncRNA expression kernel K_EXP^lnc based on the expression profiles of lncRNAs provided by the NONCODE database (Zhao et al., 2015).
Step 2 Fusing the above kernels to generate an optimal kernel based on kernel target alignment.
Step 3 Constructing a model to compute interaction probabilities for unobserved lncRNA-protein pairs based on matrix factorization, low-rank approximation, and eigendecomposition. The details are shown in Figure 9.
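The GIP kernel (Level 1) and an alignment-based kernel weighting (Step 2) can be sketched together; the weighting here is a simplified stand-in for LPI-KTASLP's kernel target alignment optimization, and the bandwidth convention, toy matrix, and choice of a second kernel are illustrative:

```python
import numpy as np

def gip_kernel(Y):
    """Gaussian interaction profile kernel over the rows of Y:
    K(i, j) = exp(-gamma * ||IP_i - IP_j||^2), with gamma set so the
    average squared profile norm is 1 (a common convention)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    gamma = 1.0 / np.mean(np.sum(Y ** 2, axis=1))
    return np.exp(-gamma * sq)

def alignment_weights(kernels, K_ideal):
    """Weight each kernel by its Frobenius alignment with the ideal
    kernel, <K, K*>_F / (||K||_F ||K*||_F), then normalize."""
    w = np.array([np.sum(K * K_ideal) /
                  (np.linalg.norm(K) * np.linalg.norm(K_ideal))
                  for K in kernels])
    return w / w.sum()

Y = np.array([[1., 0., 1.], [1., 1., 0.], [0., 1., 1.], [1., 0., 0.]])
K_gip = gip_kernel(Y)             # lncRNA GIP kernel (4 x 4)
K_ideal = Y @ Y.T                 # ideal kernel built from known LPIs
w = alignment_weights([K_gip, np.eye(4)], K_ideal)
K_fused = w[0] * K_gip + w[1] * np.eye(4)
```

Kernels that agree better with the known-interaction structure receive larger weights in the fused kernel.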

Ensemble-Based Methods
Ensemble learning methods are widely applied to LPI prediction. HLPI-Ensemble (Hu et al., 2018) and SFPEL-LPI (Zhang et al., 2018c) are two state-of-the-art ensemble-based LPI prediction methods. Hu et al. (2018) developed the HLPI-Ensemble method for human LPI identification. HLPI-Ensemble consists of two major processes: benchmark dataset construction and HLPI-Ensemble model construction.
In the second process, HLPI-Ensemble utilizes the ensemble technique and generates three ensemble learning frameworks, HLPI-SVM, HLPI-XGB, and HLPI-RF. These three frameworks are based on support vector machines (SVMs), extreme gradient boosting (XGB), and random forests (RFs), respectively. The details are shown in Figure 10.

Zhang et al. (2018c) exploited a sequence-based feature projection ensemble learning framework, SFPEL-LPI, to uncover novel LPIs. SFPEL-LPI integrates ℓ1,2-norm regularization, ensemble graph Laplacian regularization, and various biological information into a unified framework. It can be roughly broken down into five steps.
Step 2 Describing lncRNA and protein features based on sequence information and known LPIs.
SFPEL-LPI describes lncRNA features based on parallel correlation pseudo dinucleotide composition (PseDNC). Given the occurrence frequencies of different dinucleotides and the physicochemical properties of each dinucleotide, the PseDNC feature vector for an RNA sequence L can be constructed accordingly. In addition, SFPEL-LPI represents the interaction profile of an lncRNA as a row vector of the LPI matrix Y: IP_L(i) = Y(i, :). Similarly, the interaction profile of a protein is defined as a column vector of the LPI matrix Y: IP_P(i) = Y(:, i).
Therefore, the a types of features for lncRNAs/proteins can be represented as feature matrices {X_i}_(i=1)^(a).
Step 4 Computing lncRNA similarity and protein similarity.

SFPEL-LPI first computes the linear neighborhood similarity of lncRNAs based on PseDNC and IP.
SFPEL-LPI then computes the Smith-Waterman subgraph similarity (SWSS) of lncRNAs. Similarly, the PseAAC similarity, IP similarity, and SWSS of proteins can be computed.
Therefore, the b types of similarities of lncRNAs/proteins can be represented as b similarity matrices {W_i}_(i=1)^(b).
G_i, R, and q can be obtained by solving an optimization model. The details are shown in Figure 11.

Other Methods
Besides matrix factorization-based and ensemble learning-based methods, several other methods have been used to predict possible LPIs, for example, the Fisher's linear discriminant-based LPI prediction method lncPro (Lu et al., 2013), the eigenvalue transformation-based semi-supervised model LPI-ETSLP (Hu et al., 2017), and the kernel ridge regression model based on fast kernel learning, LPI-FKLKRR (Shen et al., 2018).
lncPro
Lu et al. (2013) explored a Fisher's linear discriminant-based LPI prediction method, lncPro. lncPro finds new LPIs through the following four steps.
Step 1 Downloading RNA-protein complex data from the PDB database.
Step 2 Encoding sequence information into numerical feature vectors for lncRNAs and proteins based on secondary structure, van der Waals propensities, and hydrogen-bonding propensities.
Step 3 Transforming the feature vectors to a unified dimension based on Fourier series, where L is the length of the feature vector of lncRNAs/proteins.
Step 4 Calculating the final score ⟨p|M|r⟩ for an RNA feature vector r and a protein feature vector p based on Fisher's linear discriminant method:

⟨p|M|r⟩ = M_1 p_1 r_1 + M_2 p_1 r_2 + M_3 p_2 r_1 + M_4 p_2 r_2 (67)

Hu et al. (2017) presented an eigenvalue transformation-based semi-supervised model, LPI-ETSLP, to uncover underlying LPIs. LPI-ETSLP can be broken down into three steps.
L_l = I − LSM and L_p = I − PSM denote the Laplacian matrices of lncRNAs and proteins, respectively.
LPI-ETSLP obtains the final scores between unobserved lncRNA-protein pairs by integrating eigenvalue transformation into Eq. 70, where Ū_l is a diagonal matrix with [Ū_l]_ii = (1 + s(1 − λ_i^l))^(−1), L_l = I − D_l^(−0.5) K_l D_l^(−0.5), and the eigendecomposition of K_l can be expressed as K_l = V_l U_l V_l^T. Similarly, K_p = V_p U_p V_p^T and Ū_p can be defined.
The details are shown in Figure 12.

Shen et al. (2018) developed an LPI prediction algorithm, LPI-FKLKRR, which combines kernel ridge regression with fast kernel learning. LPI-FKLKRR can be broken down into six steps:
Step 2 Computing the protein GIP kernel K_GIP^pro, sequence similarity kernel K_SW^pro, sequence feature kernel K_SF^pro, and GO kernel K_GO^pro.
Step 3 Generating the optimal lncRNA and protein kernels with fast kernel learning, where w_a^lnc and w_a^pro represent the elements of w^lnc and w^pro, respectively, and K_a^lnc and K_a^pro denote the corresponding normalized similarity matrices in the lncRNA and protein spaces.
Step 4 Constructing the optimization model to compute the optimal solution for w^lnc or w^pro, where w denotes the optimal solution, K_u and K_v denote two different kernel matrices, and tr(·) denotes the trace function.
Step 5 Computing the lncRNA-protein association score matrix.
Step 6 Producing the optimal F* by adjusting the parameters λ_l and λ_p.
The details are shown in Figure 13.
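The kernel ridge regression scoring at the heart of LPI-FKLKRR can be sketched as follows, assuming precomputed lncRNA- and protein-space kernels; the plain averaging of the two spaces, the toy kernels, and the regularization values are illustrative simplifications:

```python
import numpy as np

def krr_scores(K_lnc, K_pro, Y, lam_l=1.0, lam_p=1.0):
    """Kernel ridge regression in both spaces, then averaging:
    F_l = K_l (K_l + lam*I)^(-1) Y on the lncRNA side and the
    symmetric solution on the protein side."""
    n, m = Y.shape
    F_l = K_lnc @ np.linalg.solve(K_lnc + lam_l * np.eye(n), Y)
    F_p = (K_pro @ np.linalg.solve(K_pro + lam_p * np.eye(m), Y.T)).T
    return (F_l + F_p) / 2

Y = np.array([[1., 0.], [0., 1.], [1., 1.]])
K_lnc = Y @ Y.T + np.eye(3)      # toy positive-definite kernels
K_pro = Y.T @ Y + np.eye(2)
F = krr_scores(K_lnc, K_pro, Y)
```

The regression smooths the observed interaction labels over the kernel structure, so known pairs keep high scores while similar unobserved pairs gain nonzero scores.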

DISCUSSION
lncRNAs play important regulatory roles in diverse biological processes, such as protein modification, DNA methylation, and chromosome organization (Weber et al., 2018; Huang et al., 2018a; He et al., 2018b; Zhao et al., 2018c). However, their regulatory mechanisms remain largely unknown (Esteller, 2011; Jiang et al., 2018; Agirre et al., 2019). Studies reported that identifying the protein molecules binding specific lncRNAs helps to probe the mechanisms of lncRNAs (Lu et al., 2013; Ge et al., 2016; Chen et al., 2018). Therefore, identifying possible LPIs plays an important role in understanding lncRNA-related activities (Lu et al., 2013; Pan et al., 2016; Peng et al., 2017; Zhang et al., 2018c). However, experimental methods are expensive and time-consuming. Given the limited existing knowledge, computational methods have become vital for capturing LPIs on a large scale, which helps prioritize LPI candidates for further experimental validation.
In this study, databases involved in LPI identification are summarized. More importantly, the components of state-of-the-art computational models for LPI prediction, such as network-based methods and machine learning-based methods, are introduced. In particular, machine learning-based models can be broken into matrix factorization-based methods and ensemble learning-based methods. To assess the performance of LPI prediction methods, we compared nine models (IRWNRLPI, LPBNI, LPGNMF, LPI-BNPRA, LPI-ETSLP, LPIHN, LPI-NRLMF, LPLNP, and SFPEL-LPI) under leave-one-out cross-validation (LOOCV). These nine models were evaluated on the datasets provided by the corresponding papers, with parameters set to the values recommended by the corresponding studies. Table 1 shows the comparison results based on AUC, precision, accuracy, and F1. In Table 1, SFPEL-LPI obtained the best AUC and accuracy, while LPGNMF obtained the best precision and F1. The results demonstrate that SFPEL-LPI can correctly predict LPIs at a relatively high proportion, and LPGNMF better identifies potential LPIs when jointly considering the proportions of correctly and successfully predicted LPIs.
To further assess the performance of SFPEL-LPI, we compared it with four representative LPI prediction methods, LPBNI, LPI-ETSLP, LPIHN, and LPLNP, under fivefold cross-validation. The experiments were conducted on the same dataset, i.e., LPIs, lncRNA sequences, and protein sequences from NPInter (Hao et al., 2016), NONCODE (Zhao et al., 2015), and SUPERFAMILY (Pandurangan et al., 2018), respectively. The details are shown in Table 2. The results demonstrate that SFPEL-LPI obtained the best AUC and can better identify possible LPIs.
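The AUC reported in such comparisons can be computed directly from score ranks; a minimal sketch (the labels and scores below are illustrative, not results from Table 2):

```python
import numpy as np

def auc(y_true, y_score):
    """Rank-based AUC: the probability that a random positive pair is
    scored above a random negative pair (ties count 0.5)."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    for s in np.unique(y_score):          # average ranks over ties
        m = y_score == s
        ranks[m] = ranks[m].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([1, 1, 0, 0, 0])             # held-out pair labels
s = np.array([0.9, 0.4, 0.5, 0.2, 0.1])   # model scores for those pairs
score = auc(y, s)
```

In fivefold cross-validation, one fifth of the known LPIs is masked, the model scores the masked pairs against unknown pairs, and this statistic is averaged over the five folds.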
In general, network-based methods have become an effective tool for possible LPI identification by utilizing the LPI network, lncRNA similarity network, and protein similarity network. Although network-based methods efficiently discovered unknown LPIs and obtained promising results from the perspective of propagation (Ge et al., 2016; Zheng et al., 2017; Zhao et al., 2018b), this type of method has some weaknesses.
1. Some computational methods tested their performance only on one database, which may result in biased predictions because of the sparse nature of LPI data. More importantly, the lack of known LPIs limits further research on LPI prediction in larger networks (Ge et al., 2016).
2. It is important to unravel potential LPIs for lncRNAs/proteins without any known association information (we refer to these as new lncRNAs/proteins); however, most network-based models fail to capture LPI candidates for them.
3. Current network-based methods tend to be biased toward lncRNAs/proteins with more known associated proteins. Some lncRNAs/proteins interact with multiple proteins/lncRNAs, while others interact with a few or even only one protein/lncRNA in an LPI network. The unbalanced degree distribution of the LPI network may affect prediction performance. Increasing resistance in the random walk may improve the predictive accuracy of LPI prediction models.
4. Some methods compute lncRNA similarities based on expression profiles and may produce incomplete coverage of the lncRNA similarity network when new LPI datasets are added. This problem may be solved by incorporating appropriate additional data, including LPIs.
5. Network-based methods can only be applied to an LPI network in which there exists at least one link between two nodes. Especially for a bipartite network, network-based methods require each node in the network to have at least two linkages. However, the LPI network is usually composed of a few isolated subnetworks, and most existing network-based models fail to identify LPIs between the lncRNAs in one subnetwork and the proteins in another (Ge et al., 2016).
6. Most current network-based methods utilize local network information and show better performance; however, many previous computational biology studies showed that global network information contributes to capturing the associations between two entities, such as LPIs (Karuza et al., 2016; Meng et al., 2016; Shi et al., 2017).
7. Biology ultimately aims to provide personalized medicine for cancer patients, and it is a key issue to predict relevant drugs/targets for a certain disease by integrating multiple heterogeneous networks and constructing multipartite biological networks, such as protein-lncRNA-disease association networks and drug-protein-lncRNA-disease networks. However, current network-based methods have not yet been applied to this type of prediction (Yao et al., 2016; Yang et al., 2017; Bester et al., 2018; Lu et al., 2018; Ping et al., 2018; Fan et al., 2019).
Machine learning-based LPI prediction methods likewise have some limitations.
1. There are no experimentally validated non-LPIs (negative samples); therefore, most supervised learning-based LPI prediction models can only randomly select unknown lncRNA-protein pairs as negative LPIs. However, these randomly selected negative LPIs may contain true LPIs (positive samples), which significantly influences predictive performance (Zhao et al., 2018a; Zhao et al., 2018b; Zhang et al., 2018c; Shen et al., 2019). Although semi-supervised learning-based models utilize unlabeled information to reduce the limitations of negative LPI selection, they still share this disadvantage with classifier combination (Zhang et al., 2018a; Shen et al., 2019).
2. Some machine learning-based methods constructed two different classifiers based on lncRNAs and proteins, respectively, and take the final results as an average of the performances of the two predictive models. This type of model can produce biased results.
3. Many lncRNAs/proteins have no known association with any proteins/lncRNAs (new lncRNAs/proteins). Most current predictive models are unable to capture possible proteins/lncRNAs for new lncRNAs/proteins (Zhang et al., 2018c).
4. The proposed methods rely heavily on known LPI data; however, the current number of known LPIs is still very low. Therefore, most machine learning-based models are trained using RNA-protein interaction information instead of LPI data, which limits predictive performance (Zhao et al., 2018a). As experimentally validated LPIs accumulate, the prediction performance of these models will improve.
5. The performance of existing machine learning methods relies heavily on features (Goodfellow et al., 2016). Current computational methods utilize various lncRNA and protein features; however, identifying more appropriate features for a given task is still a challenge (Min et al., 2017). More importantly, these features are not available for all proteins or lncRNAs (Zhang et al., 2018c).
6. Most experimental data are provided by the NPInter database. NPInter is a relatively abundant database for lncRNA and protein data, but it only provides gene-protein interaction data corresponding to relevant lncRNAs instead of direct LPIs. Gene-protein interactions were directly applied in machine learning-based methods to find possible ncRNA-protein associations, and thus they did not discover true LPIs (Zhao et al., 2018a; Zhao et al., 2018b).
7. Most current computational models for LPI prediction are evaluated by cross-validation. Park and Marcotte (2012) used a proteochemometrics model (Wikberg and Mutulis, 2008) for drug-protein interaction prediction and observed that the paired nature of input samples has significant implications for the cross-validation of such pair-input methods. That is, there are significant cross-validation differences between in-sample and out-of-sample interactions (Park and Marcotte, 2012). For drug-target interaction identification, the paired nature of input samples produces a natural partition of test pairs, and thus pair-input methods may obtain significantly different prediction accuracies for different test classes (Chen et al., 2015). The same applies to LPI prediction, which is also a pair-input computational identification problem.
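Park and Marcotte's pair-input observation can be made concrete by partitioning test pairs according to whether their lncRNA and protein occur in training; a minimal sketch with illustrative identifiers:

```python
def pair_test_classes(train_pairs, test_pairs):
    """Partition test lncRNA-protein pairs into Park & Marcotte's
    pair-input classes: C1 = both entities seen in training,
    C2 = exactly one seen, C3 = neither seen.  Predictive accuracy
    typically differs sharply across these classes."""
    seen_l = {l for l, _ in train_pairs}
    seen_p = {p for _, p in train_pairs}
    classes = {"C1": [], "C2": [], "C3": []}
    for l, p in test_pairs:
        seen = (l in seen_l) + (p in seen_p)
        key = "C1" if seen == 2 else ("C2" if seen == 1 else "C3")
        classes[key].append((l, p))
    return classes

train = [("l1", "p1"), ("l2", "p2")]
test = [("l1", "p2"), ("l1", "p9"), ("l9", "p9")]
classes = pair_test_classes(train, test)
```

Reporting a single cross-validation AUC averages over these classes; reporting them separately reveals how much of the performance depends on both entities being seen during training.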

CONCLUSION AND FURTHER RESEARCH
Existing databases contain only a few known LPIs and numerous unknown lncRNA-protein pairs not validated by experimental methods. In addition, similar lncRNAs tend to interact with similar proteins, and vice versa (Xiao et al., 2017;Zhang et al., 2018a). Therefore, LPI data have a sparse, low-rank, and unbalanced nature (Zhang et al., 2018a; Shen et al., 2019). With the development of experimental technology, more LPIs will be confirmed, and thus the prediction accuracy of computational models will increase. In this section, we present some suggestions for further research based on the nature of LPI data.

Fusing Comprehensive LPI Datasets
Some computational methods tested their performance on only one database, which may result in biased predictions because of the sparse nature of LPI data. More importantly, existing computational models utilize various biological information from proteins and lncRNAs, for example, physicochemical properties including hydrogen bonding, secondary structure, and van der Waals propensities (Bellucci et al., 2011;Xiao et al., 2017). It is important to utilize diverse biological features to improve the performance of LPI prediction models. However, these features are not available for all proteins or lncRNAs, and thus computational methods cannot capture LPI candidates when this information is unavailable (Zhang et al., 2018c). Therefore, exploring advanced data fusion methods to integrate more available data sources may further boost the performance of LPI identification. Focusing on the drawbacks of current network-based LPI identification methods, future research can begin with
integrating more heterogeneous networks, such as the protein-protein interaction network (Zhang et al., 2019a), lncRNA-miRNA interaction network (Zeng et al., 2016;Huang et al., 2018c;Zhao et al., 2019), lncRNA-mRNA interaction network (Alaei et al., 2019), lncRNA-disease association network (Fu et al., 2017;Wang et al., 2019), and lncRNA-miRNA-mRNA regulatory network (Zhang et al., 2019b). However, how to address data conflict problems while integrating diverse LPI data from different repositories is a challenge. Although there are currently no data conflict solutions for LPI prediction, we can find clues in other bioinformatics problems. For example, Liu et al. (2015) set a confidence level for each DTI and gave a higher score to a DTI from a more reliable data repository. Specifically, the STITCH database assigns a score in the range [0, 1,000] to each DTI based on four types of sources: model prediction, text mining, manually curated databases, and experimental validation. In particular, Liu et al. (2015) gave DTIs from Matador and DrugBank the highest value (1,000) because DTIs from these two databases are reported by biochemical experiments and relevant studies. Lou et al. (2017) exploited another type of data fusion from a multiple-views perspective. This involved five steps: screening relevant information from different data sources; removing isolated nodes without edges in the networks; fusing various types of nodes and edges to build a heterogeneous network; constructing multiple similarity networks to boost the network heterogeneity; and excluding homologous nodes from the constructed heterogeneous networks to further reduce possible redundancy in the associated information. Inspired by these two methods, we can fuse diverse heterogeneous data to improve performance in future research.
More importantly, newly exploited network-based methods should be implemented on a constructed heterogeneous network rather than a single network.
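The confidence-weighted scoring idea above can be sketched in a few lines. The database names, records, and scores below are illustrative assumptions, not real repository contents; only the [0, 1,000] score range and the source categories follow the STITCH-style scheme described in the text.

```python
# Minimal sketch of confidence-weighted interaction fusion, inspired by the
# STITCH-style scoring scheme described above. All records are hypothetical.

def normalize_score(score, max_score=1000.0):
    """Map a repository confidence score from [0, 1000] onto [0, 1]."""
    return min(max(score / max_score, 0.0), 1.0)

def fuse_interactions(records):
    """Merge (lncRNA, protein, source, score) records from multiple databases,
    keeping the highest normalized confidence seen for each pair."""
    fused = {}
    for lnc, prot, source, score in records:
        s = normalize_score(score)
        key = (lnc, prot)
        if key not in fused or s > fused[key][0]:
            fused[key] = (s, source)
    return fused

# Hypothetical records: experimentally validated entries get the top score,
# mirroring how Liu et al. (2015) weighted Matador/DrugBank DTIs.
records = [
    ("lncRNA-1", "P53", "experimental", 1000),
    ("lncRNA-1", "P53", "text_mining", 420),
    ("lncRNA-2", "HuR", "prediction", 650),
]
fused = fuse_interactions(records)
# fused[("lncRNA-1", "P53")] keeps the experimental evidence with score 1.0
```

A max-confidence merge is only one resolution policy; averaging or evidence-combining rules could equally be plugged into the same structure.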

Screening Credible Negative Samples
There are some known LPIs (positive samples) and abundant unknown lncRNA-protein pairs in existing LPI data resources. More importantly, there are no experimentally validated non-LPIs, and thus most supervised learning-based models have no other choice but to randomly screen negative LPIs from unlabeled lncRNA-protein pairs, or even regard all unlabeled lncRNA-protein pairs as negative samples (Zhao et al., 2018b). However, the randomly screened negative LPIs may contain positive LPIs as well, and thus there are severe biases in supervised learning-based techniques. Therefore, exploiting an efficient model to select high-quality negative samples is a challenging task for boosting LPI prediction accuracy. Cheng et al. (2017) designed a FInding Reliable nEgative samples method (FIRE) to select negative RNA-protein interactions. FIRE was based on the following assumption: given a known RNA-protein interaction between an RNA i and a protein j, for an RNA k, the more differences between i and k, the less possibility that k interacts with j, and vice versa. FIRE screened negative RNA-protein interactions through the following steps: computing the protein similarity matrix, building a positive sample set based on known interaction information, scoring each unknown RNA-protein pair not included in the positive sample set based on protein similarities, sorting these pairs by their scores in increasing order, and selecting the top-m pairs as negative samples. Similarly, we may generate negative LPIs based on lncRNA-lncRNA similarities, protein-protein similarities, and the above assumption.
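The FIRE-style steps above can be sketched as follows. This is a minimal illustration under the stated assumption, with a toy similarity matrix and toy interactions; in practice the similarities would come from sequence or structure comparison, and the exact scoring in FIRE may differ in detail.

```python
# Sketch of FIRE-style negative sampling: score each unknown (RNA, protein)
# pair by the maximum similarity between the candidate protein and any protein
# the RNA is known to bind, then keep the m lowest-scoring pairs as negatives.
import numpy as np

def select_negatives(sim_p, positives, n_rna, m):
    known = {}
    for r, p in positives:
        known.setdefault(r, []).append(p)
    scored = []
    for r in range(n_rna):
        partners = known.get(r, [])
        if not partners:          # no evidence for this RNA; skip it here
            continue
        for p in range(sim_p.shape[0]):
            if (r, p) in positives:
                continue
            score = max(sim_p[p, q] for q in partners)
            scored.append((score, r, p))
    scored.sort()                 # increasing score: least similar first
    return [(r, p) for _, r, p in scored[:m]]

# Toy protein-protein similarity matrix and known interactions.
sim_p = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
positives = {(0, 0), (1, 2)}      # (RNA index, protein index)
negatives = select_negatives(sim_p, positives, n_rna=2, m=2)
# the selected pairs involve proteins least similar to each RNA's known partners
```

The same scheme transfers to LPIs by swapping in lncRNA-lncRNA similarities, as the text suggests.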
Positive-unlabeled (PU) learning (de Campos et al., 2018;Sansone et al., 2018;Yang et al., 2018) has been applied in various situations. In PU learning, a supervised learning-based method is designed to learn a classification model from a positive sample set and an unlabeled dataset of unknown class. Yang et al. (2018) designed an adaptive sampling framework with class label noise based on PU learning and introduced two new bioinformatic applications: identifying kinase substrates and identifying transcription factor target genes. Therefore, PU learning may be a promising way to address the lack of negative LPIs.
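A common, simple PU strategy is bagging: repeatedly treat a bootstrap of the unlabeled set as negatives, train a classifier against the positives, and average out-of-bag scores. The sketch below uses a nearest-centroid base learner on synthetic data purely for illustration; it is not the adaptive framework of Yang et al. (2018), which would use a stronger classifier and noise-aware sampling.

```python
# Minimal PU-bagging sketch with a nearest-centroid base learner (toy data).
import numpy as np

def pu_bagging_scores(X_pos, X_unl, n_rounds=50, seed=0):
    """Average out-of-bag 'positive-like' votes for each unlabeled sample.
    Higher scores mean the sample resembles the positive set."""
    rng = np.random.default_rng(seed)
    n_unl = X_unl.shape[0]
    votes = np.zeros(n_unl)
    counts = np.zeros(n_unl)
    for _ in range(n_rounds):
        idx = rng.choice(n_unl, size=len(X_pos), replace=True)
        oob = np.setdiff1d(np.arange(n_unl), idx)  # out-of-bag samples
        if oob.size == 0:
            continue
        mu_pos = X_pos.mean(axis=0)                # positive centroid
        mu_neg = X_unl[idx].mean(axis=0)           # pseudo-negative centroid
        d_pos = np.linalg.norm(X_unl[oob] - mu_pos, axis=1)
        d_neg = np.linalg.norm(X_unl[oob] - mu_neg, axis=1)
        votes[oob] += (d_pos < d_neg)              # closer to positives?
        counts[oob] += 1
    return votes / np.maximum(counts, 1)

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=2.0, size=(20, 5))            # labeled positives
X_unl = np.vstack([rng.normal(2.0, size=(10, 5)),    # hidden positives
                   rng.normal(-2.0, size=(30, 5))])  # likely negatives
scores = pu_bagging_scores(X_pos, X_unl)
# the first 10 unlabeled rows (hidden positives) should score higher on average
```

For LPI data, the unlabeled lncRNA-protein pairs with the lowest averaged scores would be the most credible negative candidates.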

Deep Learning
Existing computational methods have utilized different lncRNA and protein features. For example, Bellucci et al. (2011) integrated three types of physicochemical properties, including hydrogen bonding, secondary structure, and van der Waals propensities, while Lu et al. (2013) used six types of RNA secondary structures (besides physicochemical properties), which were provided by Bellucci et al. (2011). Therefore, designing more powerful models to integrate relevant biological features is a key issue. However, features are typically hand-crafted by human biomedical engineers, and determining which features are more suitable for LPI prediction remains difficult. Moreover, encoding vectors that are too short may restrict the prediction accuracy of the classification model. In addition, most computational models only use sequence information and do not consider structure information (Peng et al., 2019).
Deep learning-based computational models, composed of multiple processing layers, require very little engineering knowledge and can efficiently extract features from raw data and construct high-level representations (Wei et al., 2018;Peng et al., 2019). These models have been applied to diverse analysis problems and have obtained better performance due to their excellent feature-learning power (Jurtz et al., 2017;Min et al., 2017;Peng et al., 2019). Therefore, it is valuable and feasible to exploit deep learning-based methods to effectively represent biological features for relevant entities in bioinformatics (Min et al., 2017;Zhang et al., 2018d;Peng et al., 2019;Zeng et al., 2019), such as information relevant to LPI prediction (Xiao et al., 2017;Shen et al., 2019;Zhu et al., 2019). However, although deep learning has demonstrated promising performance, it is not a silver bullet for LPI prediction. There still exist many challenges in LPI identification, such as the imbalanced nature of LPI data, limited LPI data, appropriate architecture selection, hyperparameter selection, and interpretation of learning results (Min et al., 2017). Therefore, solving these problems is the key to promoting deep learning-based LPI prediction models in future research.
In particular, deep learning can be combined with PU learning to improve the performance of computational models (Bepler et al., 2018;Pati et al., 2018). For example, Bepler et al. (2018) designed the first such particle-picking framework, Topaz. Topaz combined a convolutional neural network with a generalized expectation-binomial-based objective function. The convolutional neural network was used to train classification models using only positive and unlabeled samples, while the objective function was used to learn model parameters from positive and unlabeled samples. Topaz fit convolutional neural network classifiers to the labeled particles (samples) and the remaining unlabeled samples using mini-batched stochastic gradient descent. Deep learning methods based on PU learning provide valuable insight and may be a starting point for applying deep learning to LPI prediction in future research.

Capturing LPI Candidates for New LncRNAs/Proteins
Network-based methods can be applied to an LPI network that has at least one link between two nodes. For a bipartite network especially, network-based methods require that each node in the network have at least two linkages. That is, network-based methods cannot discover possible proteins for any lncRNA-protein pair without a known reachable path in the LPI network (Ge et al., 2016;Zhang et al., 2018c). These lncRNAs/proteins without any interaction information are regarded as new lncRNAs/proteins (Zhang et al., 2018c).
Given a known LPI dataset, we aim to predict (S1) LPIs between known lncRNAs and known proteins; (S2) LPIs between new lncRNAs and known proteins; (S3) LPIs between known lncRNAs and new proteins; and (S4) LPIs between new lncRNAs and new proteins. S1 has the most abundant association information, S2 and S3 have less data, and S4 has the least data. Computational models appropriate for S2 can still be applied to S3, and vice versa.
To the best of our knowledge, SFPEL-LPI provided by Zhang et al. (2018c) may be one of the rare computational methods for predicting possible LPIs for new lncRNAs/proteins. Although few computational models can be applied to the last three situations, some methods have been designed to solve similar problems in other areas in bioinformatics, and thus provide some clues for LPI prediction. For example, Shi et al. (2015) enhanced the similarity measures and introduced the concept of a "super-target" to capture the missing interactions for new drugs/targets. Furthermore, Chen et al. (2016b) exploited a miRNA-disease association prediction model based on within and between scores (WBSMDA) to uncover possible miRNA-disease associations for new miRNAs/diseases. These solutions provide clues for capturing LPI candidates for new lncRNAs/proteins.

Cross-Validation
Inspired by the evaluation methods proposed by Park and Marcotte (2012) and Chen et al. (2015), the test samples of LPIs could be categorized into four different groups: C1 is composed of the test samples sharing both lncRNAs and proteins with the training samples; C2 is composed of the test samples sharing only lncRNAs with the training samples; C3 is composed of the test samples sharing only proteins with the training samples; and C4 is composed of the test samples sharing neither lncRNAs nor proteins with the training samples (Chen et al., 2015). Therefore, it is vital to give cross-validation results under the above four independent test classes for LPI prediction.
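The C1-C4 partition described above reduces to a simple set-membership test on the training pairs. The lncRNA and protein names below are placeholders for illustration.

```python
# Partition test lncRNA-protein pairs into the four classes C1-C4 defined
# above, based on whether each pair's lncRNA and/or protein also appears
# somewhere in the training pairs.

def partition_test_pairs(train_pairs, test_pairs):
    train_lncs = {l for l, _ in train_pairs}
    train_prots = {p for _, p in train_pairs}
    groups = {"C1": [], "C2": [], "C3": [], "C4": []}
    for l, p in test_pairs:
        if l in train_lncs and p in train_prots:
            groups["C1"].append((l, p))   # both shared with training
        elif l in train_lncs:
            groups["C2"].append((l, p))   # only the lncRNA is shared
        elif p in train_prots:
            groups["C3"].append((l, p))   # only the protein is shared
        else:
            groups["C4"].append((l, p))   # neither is shared
    return groups

train = [("lnc1", "protA"), ("lnc2", "protB")]
test = [("lnc1", "protB"), ("lnc1", "protC"),
        ("lnc3", "protA"), ("lnc3", "protC")]
groups = partition_test_pairs(train, test)
# each test pair lands in exactly one of C1, C2, C3, C4
```

Reporting a separate AUC per group, rather than one pooled value, exposes the pair-input effect that Park and Marcotte (2012) identified.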

AUTHOR CONTRIBUTIONS
LP and FL contributed equally to this work. LP, FL, XD, CP, and LZ introduced the LPI data repositories and computational models. LP and FL wrote the paper. XL and YM revised the original draft. LP, JY, GT, and LZ discussed the computational models and wrote the conclusion and further research. All authors read and approved the final manuscript.