Similarity-based Regularized Latent Feature Model for Link Prediction in Bipartite Networks

Link prediction is an attractive research topic in the field of data mining, with significant applications in improving the performance of recommender systems and in exploring the evolving mechanisms of complex networks. A variety of real-world complex systems can be abstractly represented as bipartite networks, in which there are two types of nodes and no links connect nodes of the same type. In this paper, we propose a framework for link prediction in bipartite networks that combines similarity-based structure with the latent feature model from a new perspective. The framework, called Similarity Regularized Nonnegative Matrix Factorization (SRNMF), explicitly takes local characteristics into consideration and encodes the geometrical information of the network by constructing a similarity-based matrix. We also develop an iterative scheme based on gradient descent to solve the objective function. Extensive experiments on a variety of real-world bipartite networks show that the proposed framework achieves more competitive, preferable and stable performance than state-of-the-art methods.

, where the element A_ij = 1 if nodes v_i and w_j are connected and A_ij = 0 otherwise. To test an algorithm's accuracy, the set of observed links E is randomly divided into two parts: the training set E^T is treated as known information, while the probe set E^P is used for testing the prediction performance. Clearly, E^T ∪ E^P = E and E^T ∩ E^P = ∅. The adjacency matrices corresponding to the training set and the probe set are denoted by A^T and A^P respectively; note that both have the same size as A.
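This split of the observed links into a training matrix A^T and a probe matrix A^P can be sketched as follows (a minimal illustration; the function name and the probe fraction argument are ours):

```python
import numpy as np

def split_links(A, probe_frac=0.1, seed=0):
    """Randomly split the observed links E of a bipartite adjacency
    matrix A into a training matrix A_T and a probe matrix A_P,
    so that E_T ∪ E_P = E and E_T ∩ E_P = ∅."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(A)                      # observed links E
    n_probe = int(round(probe_frac * len(rows)))
    probe_idx = rng.choice(len(rows), size=n_probe, replace=False)
    A_P = np.zeros_like(A)
    A_P[rows[probe_idx], cols[probe_idx]] = 1
    A_T = A - A_P                                   # same size as A
    return A_T, A_P
```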
A framework of similarity-based regularized latent feature model. In this paper, we propose a framework for link prediction in bipartite networks by combining the topological structure and the latent feature model from a new perspective. The framework exploits the intrinsic similarity structure of the nodes, which is incorporated as an additional regularization term. By preserving the similarity structure, our framework has more discriminating power than the plain latent feature model. The framework is shown in Fig. 1. In detail, for each pair of nodes i ∈ V, j ∈ W, we assign a score S_ij according to a given similarity measure; a higher score means higher similarity between i and j, and vice versa. Figure 1(c) gives an example of calculating the CN measure in a bipartite network. The CN measure between nodes v_4 and w_4 is 6: it counts the number of neighbours touched by the quadrangles that pass through v_4 and w_4.
Combining this similarity-based regularizer with the latent feature model, we obtain an objective function of the general form O = Σ_{(i,j)} l(A_ij, θ̂_ij) + γΩ(θ), where θ is the parameter vector, l(·, ·) is a loss function, Ω(·) is a regularization term that prevents overfitting, θ̂_ij is the model's predicted score for the pair (i, j), and the regularization parameter γ controls the smoothness of the new representation. Such a loss function can be constructed from a measure of the distance between the two matrices A and A*. For example, the cost function with the square of the Euclidean distance can be written as Σ_ij (A_ij − A*_ij)², and the cost function with the Kullback-Leibler divergence as Σ_ij (A_ij log(A_ij/A*_ij) − A_ij + A*_ij). Specifically, in this paper we formulate the objective function of the framework using nonnegative matrix factorization with the squared-Euclidean cost, thereby transforming the computation of A* into an NMF optimization problem. By optimizing the objective function we obtain the basis matrix X and the coefficient matrix Y, and finally the reconstructed adjacency matrix A* = XY. Details are given in the Methods section.
SciEnTific REPORTS | 7: 16996 | DOI: 10.1038/s41598-017-17157-9
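The two candidate cost functions above can be evaluated directly; a minimal sketch (the function names are ours):

```python
import numpy as np

def euclidean_cost(A, A_star):
    # squared Euclidean (Frobenius) distance between A and its reconstruction A*
    return np.sum((A - A_star) ** 2)

def kl_cost(A, A_star, eps=1e-12):
    # generalized Kullback-Leibler divergence D(A || A*);
    # eps guards against log(0) and division by zero
    A = np.asarray(A, dtype=float) + eps
    A_star = np.asarray(A_star, dtype=float) + eps
    return np.sum(A * np.log(A / A_star) - A + A_star)
```

Both distances vanish when A* reproduces A exactly, which is why either can serve as the loss l(·, ·) in the framework.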

Evaluation Metrics.
To quantify prediction accuracy, we use precision 26 and AUC (the area under the receiver operating characteristic curve) 27 . Precision is the ratio of correctly recovered edges among the top-L edges in the candidate list generated by each link predictor. This operation is repeated 100 times for each network and the mean for each method is reported. Given the ranking of the unobserved links, precision is defined as Precision = L_r / L, where L is the number of predicted links (i.e. the number of links in A^P) and L_r is the number of correctly predicted links. Clearly, a larger precision value indicates a more accurate method. The AUC metric can be interpreted as the probability that a randomly chosen link in E^P (i.e., a missing link that indeed exists but is not yet observed) is ranked higher than a randomly chosen link in U − E (i.e., a nonexistent link) 7 , where U is the set of all possible node pairs in the network. Among n independent comparisons, if there are n′ occurrences of the missing link having a higher score and n″ occurrences of the missing link and the nonexistent link having the same score, the AUC is computed as AUC = (n′ + 0.5 n″)/n. In general, a larger AUC value indicates better performance: the AUC of a perfect predictor is 1.0, whereas the AUC generated by a random predictor is 0.5.
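Both metrics can be computed as follows (a sketch; `scores` is assumed to be a matrix of predicted scores for all node pairs, and the AUC uses n random comparisons exactly as in the definition above):

```python
import numpy as np

def precision_at_L(scores, A_T, A_P):
    """Precision = L_r / L over the top-L unobserved node pairs,
    with L equal to the number of probe links."""
    cand = np.where(A_T == 0)                        # unobserved pairs
    L = int(A_P.sum())
    order = np.argsort(scores[cand])[::-1][:L]       # top-L by score
    L_r = A_P[cand[0][order], cand[1][order]].sum()  # correctly predicted
    return L_r / L

def auc(scores, A_T, A_P, n=10000, seed=0):
    """AUC = (n' + 0.5 n'') / n over n random comparisons between
    a missing link (in E_P) and a nonexistent link (in U - E)."""
    rng = np.random.default_rng(seed)
    miss = scores[A_P == 1]
    none = scores[(A_T == 0) & (A_P == 0)]
    m = rng.choice(miss, n)
    u = rng.choice(none, n)
    return (np.sum(m > u) + 0.5 * np.sum(m == u)) / n
```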

Datasets and Baseline Algorithms.
To test the performance of our proposed framework, we consider the following eight real-world networks. (i) G-protein coupled receptors (GPC Receptors) 28 : the biological network of drugs binding GPC receptors. (ii) Ion channels 28 . The user-movie datasets are bipartite networks of users and movies, in which each user rates movies on a scale of 1-5; if a rating is not less than 3, we draw a link between the user and the movie. Detailed information about these datasets is given in Table 1.
For comparison, we introduce the benchmark methods defined below. The first ten methods are based on topological structure. NMF, the eleventh method, directly predicts links on the bipartite adjacency matrix by learning latent features from the network. The last six methods are projection-based.
• Common Neighbors (CN) 18 , which defines the similarity of two nodes x and y of different types through their neighbourhoods, where N(x) and N(y) indicate the first-order neighbours, and N(N(x)) and N(N(y)) the second-order neighbours, of x and y respectively. In bipartite networks the CN measure counts the neighbours touched by the quadrangles that pass through x and y 18 . For instance, in Fig. 1(c) the CN index of the nodes v_4 and w_4 equals 6.
• Jaccard Coefficient (JC) 18 , which normalizes the CN count by the size of the union of the corresponding neighbourhoods.
• Adamic-Adar (AA) 18 , which uses the degrees of the common neighbours of x and y and assigns more weight to low-degree neighbours.
• Resource Allocation (RA) 18 is denoted as RA_xy = Σ_{z ∈ (N(x)∩N(N(y))) ∪ (N(y)∩N(N(x)))} 1/k(z), where k(z) is the degree of z; the RA index thus assigns different weights to the common neighbours of the two nodes.

Table 1. Statistics of the networks studied in this paper. |V| and |W| denote the numbers of the two types of nodes, and |E|, LD, AD, LAD and RAD are the number of edges, the link density, the average degree, the left average degree and the right average degree, respectively.
• Preferential Attachment (PA) 16 , which scores a pair by the product of the node degrees, s_xy = k(x)·k(y).
• CAR 18 , which combines CN with the local community links (LCL) term s_LCL, counting the links (purple colour) between the common neighbours; in Fig. 1(c) the LCL index of the nodes x and y equals 7.
• Cannistraci-Jaccard (CJC) 18 , the LCL-based variant of JC, where |γ(z)| is the local community degree of z and corresponds to the LCL links that originate from z.
• Cannistraci Adamic-Adar (CAA) 18 , the LCL-based variant of AA.
• Cannistraci resource allocation (CRA) 18 , the LCL-based variant of RA.
• Cannistraci preferential attachment (CPA) 18 , the LCL-based variant of PA, where e(x) is the external degree of node x, presented in Fig. 1(c) (red edges).
• Nonnegative Matrix Factorization (NMF) 35 , which learns a parts-based representation of the original network by approximating the adjacency matrix as the product of two low-rank matrices, and has been developed to predict links via this low-rank approximation.
• Jaccard (Jac) 36 measures the similarity between two nodes x_1 and x_2 of the same type as the size of the intersection of their neighbourhoods divided by the size of their union.
• Euclidean (Euc) 36 measures the similarity between same-type nodes x_1 and x_2 via the Euclidean distance.
• Cosine (Cos) 36 is based on the cosine similarity between same-type nodes x_1 and x_2.
• Pearson (Pea) 36 is based on the well-known Pearson correlation coefficient.
• Bipartite projection via Random-walk (BPR) 36 defines a new similarity measure that utilizes a practical procedure to extract monopartite graphs without making a priori assumptions about underlying distributions. • Network-based inference (NBI) 13 computes the similarity between nodes in a projected network. NBI is based on resource allocation, and also takes the network structure into account.
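As an example of the projection-based measures listed above, the Jaccard similarity between all pairs of same-type (row) nodes can be computed directly from the biadjacency matrix (a vectorized sketch; the function name is ours):

```python
import numpy as np

def jaccard_projection(A):
    """Jaccard similarity between all pairs of row nodes:
    |N(x1) ∩ N(x2)| / |N(x1) ∪ N(x2)|, computed from the
    biadjacency matrix A of a bipartite network."""
    inter = A @ A.T                            # shared neighbours
    deg = A.sum(axis=1)                        # node degrees
    union = deg[:, None] + deg[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        S = np.where(union > 0, inter / union, 0.0)
    return S
```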
Experiment results. In this section, we compare our SRNMF method with seventeen widely used link prediction algorithms for bipartite networks, consisting of topological-structure-based methods (CN, JC, AA, RA, PA, CAR, CJC, CPA, CAA and CRA), projection-based methods (Jac, Euc, Cos, Pea, BPR and NBI) and NMF. See "Baseline Algorithms" for details. In our experiments, we set γ = 1 2, λ = 2. The prediction accuracies measured by precision and AUC are shown in Tables 2 and 3, respectively. For each of the nine networks, the training set contains 90% of the links and the remaining 10% constitute the probe set. Among all compared indices, the overall prediction performance of SRNMF is significantly superior. Table 2 shows the comparison of precision on the nine real-world networks; our SRNMF methods are shown in red and the baseline methods in black. The number behind the slash denotes the ranking: for example, 0.31\18 means the precision is 0.31 and its overall rank among all methods is 18. The table shows that the proposed SRNMF framework (SRNMF-CN, SRNMF-RA, SRNMF-AA, SRNMF-JC, SRNMF-PA, SRNMF-CAR, SRNMF-CRA, SRNMF-CAA, SRNMF-CJC and SRNMF-CPA) outperforms the LCP-based (CAR, CRA, CAA, CJC and CPA), CN-based (CN, RA, AA, JC, PA), projection-based (Jac, Euc, Cos, Pea, BPR, NBI) and NMF algorithms. Based on the results, we conclude that the LCP-based methods perform better than the CN-based and NMF methods, because they additionally exploit the information derived from node-neighbourhood connectivity. NBI and BPR perform better than the other projection-based methods. In addition, by adding similarity-based regularization, the proposed SRNMF framework performs better than the LCP-based methods.
Moreover, SRNMF is superior to the projection-based methods, which lose part of the original topological information of the bipartite network structure. For example, on the Enzymes network SRNMF offers an improvement of 106% in average precision over the similarity-based methods and of 125.6% over the projection-based methods. This finding provides strong evidence that methods using manifold learning and similarity regularization are more robust than the other baselines.
Moreover, Table 3 again demonstrates a clear superiority in terms of the AUC index. Based on the results, we conclude that the LCP-based methods (CAR, CRA, CAA, CJC and CPA) almost always perform better than the CN-based (CN, RA, AA, JC, PA), projection-based (Jac, Euc, Cos, Pea, BPR, NBI) and NMF methods, and that our proposed SRNMF algorithms perform best. For example, on the SW network SRNMF offers an improvement of 12.6% in average AUC over the similarity-based methods and of 11.7% over the projection-based methods. Besides, our SRNMF methods outperform the benchmark methods (shown in black) in terms of stability.
Experiments under different training fractions (from 40% to 90%) on four datasets (drug-target, GPC, Ion channel and malaria) are conducted to test the accuracy of link prediction in bipartite networks; the results are shown in Figs 2 and 3, respectively. Each accuracy value is the average over 100 runs with independent random divisions into training and probe sets. The number of predicted links, L, is always set equal to the size of the probe set. According to Figs 2 and 3, as the size of the training set varies, the prediction accuracies of all ten SRNMF variants (SRNMF-CN, SRNMF-RA, SRNMF-AA, SRNMF-JC, SRNMF-PA, SRNMF-CAR, SRNMF-CRA, SRNMF-CAA, SRNMF-CJC and SRNMF-CPA) are either the best or very close to the best, whereas some benchmark algorithms (especially PA, NMF and Pea) give very poor predictions on some networks. Usually a larger training set contains more information, which should make prediction easier; however, as shown in Figs 2 and 3, precision and AUC do not always increase with the size of the training set.
As is well known, the choice of parameters influences the evaluation results. Our SRNMF model has two regularization parameters, γ and λ. To show how the precision of SRNMF varies with γ and λ, we take the drug-target network as an example; the results are depicted in Fig. 4. As seen from Fig. 4, SRNMF achieves consistently good performance across the different similarity measures when λ varies from 1.5 to 2.5 and γ varies from 1.5 to 2.

Discussion
In this paper, we investigate the problem of link prediction in bipartite networks and propose a framework based on a similarity-regularized latent feature model (SRNMF), which exploits the intrinsic topological structure of the nodes and encodes the geometrical information of the networks by constructing a similarity-based matrix. By preserving the similarity structure, our framework has more discriminating power than the latent feature model. The new framework takes advantage of both the latent feature model and the topological structure. A unified objective function is proposed to derive SRNMF in terms of an NMF loss function and similarity-based regularization, and it can be optimized by gradient descent. The results demonstrate the more effective, robust and stable performance of our SRNMF framework compared with state-of-the-art methods. We compare the proposed SRNMF framework with seventeen other baseline methods on nine real-world datasets. These methods can be classified into bipartite-based and projection-based methods: bipartite-based methods predict links directly in the bipartite network, whereas projection-based methods project the bipartite network into two monopartite networks before predicting new links. Cos, Euc, Jac, Pea, BPR and NBI are projection-based; the remaining baselines and our SRNMF methods are bipartite-based. In general, bipartite-based methods perform better than projection-based methods, because the projection loses part of the original topological information of the bipartite network structure. By adding similarity-based regularization, our SRNMF methods are significantly superior to the other bipartite-based methods in terms of accuracy and stability. Among the projection-based methods, despite their overall passable performance, NBI and BPR exhibit the highest AUC and precision values.

Table 2. The prediction accuracy measured by precision on the 9 real networks. We compare our SRNMF method with the seventeen well-known methods presented in Baseline Algorithms. For each real network, 10% of the links are randomly selected to constitute the probe set and the rest form the training set. The numbers behind the slash denote the ranking.
Some extensions of this work can be explored. One concern is a drawback of NMF, namely the high computational complexity of its iterative updates. To reduce this complexity, parallelization 37,38 and sampling methods can be adopted; more efficient optimization algorithms can also be considered to obtain the global optimum of the NMF problem. Moreover, the use of link weights to improve prediction accuracy in bipartite networks has not been researched systematically, and this is an important direction for future work.

Similarity-based Regularized Nonnegative Matrix Factorization (SRNMF). NMF obtains a parts-based representation due to its nonnegativity constraints. However, it cannot reveal the intrinsic geometrical and discriminating structure of the node space. In this section, we introduce our SRNMF algorithm, which avoids this limitation by incorporating a similarity-based regularizer.

Table 3. The prediction accuracy measured by AUC on the 9 real networks. We compare our SRNMF method with the seventeen well-known methods presented in Baseline Algorithms. For each real network, 10% of the links are randomly selected to constitute the probe set and the rest form the training set. The numbers behind the slash denote the ranking.
Determination of the number of latent features. There are many methods to determine the number of latent features, such as partition density, the Bayesian information criterion and cross-validation. These methods need to evaluate the model for each candidate number of latent features, and are therefore too computationally expensive for real networks. PCA 39 reduces the dimensionality of a matrix consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the matrix. This is achieved by transforming the original matrix into a new set of variables, named principal components (PCs), which are uncorrelated and ordered so that the first few components explain most of the variation present in the original variables. The eigenvalues of the matrix are used to calculate the cumulative contribution rate, which determines the number of dimensions. In this paper, we therefore determine the number of latent features by calculating the cumulative contribution rate, and a cumulative contribution rate of 95% is adopted to choose the PCs.
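The PCA-based choice of the number of latent features K can be sketched as follows (a sketch assuming the cumulative contribution rate is computed from the squared singular values of the column-centred matrix; the function name is ours):

```python
import numpy as np

def num_latent_features(A, threshold=0.95):
    """Pick K as the smallest number of principal components whose
    cumulative contribution rate of the eigenvalues reaches `threshold`."""
    # eigenvalues of the covariance matrix, via singular values of the
    # column-centred matrix (eigenvalue_i ∝ singular_value_i ** 2)
    sv = np.linalg.svd(A - A.mean(axis=0), compute_uv=False)
    var = sv ** 2
    cum = np.cumsum(var) / var.sum()          # cumulative contribution rate
    return int(np.searchsorted(cum, threshold) + 1)
```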
NMF with Manifold Regularization. NMF aims to find two nonnegative matrices whose product provides a good approximation to the original matrix. A natural assumption here is that if two nodes u_i, v_j are close in the intrinsic geometry of the node distribution, then A_ij and (XY)_ij are also close to each other, where A_ij and (XY)_ij are the connection representations of these two nodes in the original network and in the low-dimensional approximation derived from NMF. This is the so-called local invariance assumption 40,41 . It has been shown that learning performance can be significantly enhanced if the topological similarity structure is exploited and local invariance is respected. S_ij is used to measure the closeness of two nodes u_i and v_j; different similarity measures such as CN, AA, RA, JC, PA, CAR, CRA, CAA, CJC and CPA can be used (for details see Baseline Algorithms). With the similarity matrix S defined above, we can use the following term to measure the smoothness of the low-dimensional representation:

R = Σ_{i,j} S_ij (A_ij − Σ_{k=1}^{K} x_ik y_kj)².

By minimizing R, we expect that if two nodes u_i and v_j are close (i.e., S_ij is large), then A_ij and Σ_{k=1}^{K} x_ik y_kj are also close to each other. Combining this similarity-based regularizer with the original NMF objective function leads to our SRNMF. Using the Euclidean-distance formulation of the NMF latent feature model, the proposed model can be defined as the following constrained nonlinear program:

min_{X ≥ 0, Y ≥ 0} O(X, Y) = ||A − XY||² + λ Σ_{i,j} S_ij (A_ij − (XY)_ij)² + γ ||XY||_*.

Here, λ ≥ 0 and γ ≥ 0 are balance parameters and ||XY||_* is the nuclear norm, i.e. the sum of the singular values of XY. The benefit of nuclear-norm regularization is that, with a sufficiently large regularization parameter, the final solution will be low-rank 42 .
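Under the form stated above, the value of the objective can be evaluated directly (a sketch; the function name and the λ, γ defaults are illustrative, not the paper's tuned values):

```python
import numpy as np

def srnmf_objective(A, X, Y, S, lam=2.0, gamma=0.5):
    """SRNMF objective value: squared-Euclidean NMF loss
    + λ · similarity-weighted smoothness term R
    + γ · nuclear norm of the reconstruction XY."""
    XY = X @ Y
    loss = np.sum((A - XY) ** 2)
    R = np.sum(S * (A - XY) ** 2)                      # regularizer R
    nuc = np.linalg.svd(XY, compute_uv=False).sum()    # nuclear norm ||XY||_*
    return loss + lam * R + gamma * nuc
```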
We utilize a standard reformulation of the nuclear norm which is more flexible to manipulate 43 .
Combining Equation (17) and Equation (18) yields the final objective. The objective function O(x, y) in (19) is not convex in x and y jointly, so it is unrealistic to expect an algorithm to find the global minimum; instead, two iterative algorithms are introduced. Let ϕ_ik and ψ_kj be the Lagrange multipliers for the constraints x_ik ≥ 0 and y_kj ≥ 0, respectively; the Lagrangian L collects the objective together with these multiplier terms. The proposed SRNMF framework. The low-dimensional approximation matrix A* of the network can be obtained by the above optimization procedure; the pseudocode is presented in Algorithm 1.
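A minimal multiplicative-update sketch of the resulting iteration is given below. It handles the similarity-weighted loss Σ (1 + λS_ij)(A_ij − (XY)_ij)² but omits the nuclear-norm term for brevity, so it is a simplified variant rather than the paper's exact Algorithm 1:

```python
import numpy as np

def srnmf_fit(A, S, K=10, lam=2.0, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative updates for the similarity-weighted NMF loss
    sum_ij (1 + lam * S_ij) * (A_ij - (XY)_ij)^2  (nuclear-norm term
    omitted for brevity; a simplified sketch, not the paper's exact rule)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    X = rng.random((n, K))
    Y = rng.random((K, m))
    W = 1.0 + lam * S                      # elementwise weights
    for _ in range(n_iter):
        XY = X @ Y
        X *= ((W * A) @ Y.T) / (((W * XY) @ Y.T) + eps)
        XY = X @ Y
        Y *= (X.T @ (W * A)) / ((X.T @ (W * XY)) + eps)
    return X, Y
```

These are the standard weighted-NMF multiplicative rules, which keep X and Y nonnegative and do not increase the weighted loss at each step.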

Complexity analysis.
Here we give a brief complexity analysis of the proposed SRNMF framework. The most time-consuming part is updating X and Y. In each iteration, computing terms such as AY^T and (S ∘ A)Y^T costs O(|V||W|K), so the total time cost of the algorithm is O(N_iter |V||W|K), where N_iter is the number of iterations and |V| and |W| denote the numbers of the two types of nodes. Many real-world networks are known to be sparse, so the final time cost reduces to O(N_iter |E|K), where |E| is the number of edges in the bipartite network.