Neural graph embeddings as explicit low-rank matrix factorization for link prediction

Learning high-quality neural graph embeddings has long been achieved by minimizing the point-wise mutual information (PMI) objective for co-occurring nodes in simulated random walks. This design choice was popularized by the direct application of the highly successful word embedding algorithm word2vec to predicting the formation of new links in social, co-citation, and biological networks. However, such a skeuomorphic design of graph embedding methods entails a truncation of information coming from pairs of nodes with low PMI. To circumvent this issue, we propose an improved approach to learning low-rank factorization embeddings that incorporates information from such unlikely pairs of nodes, and we show that it can improve the link prediction performance of baseline methods by 1.2% to 24.2%. Based on our results and observations, we outline further steps that could improve the design of future graph embedding algorithms based on matrix factorization.


Introduction
Networks are a convenient representation of the relationships between complex system components: the interconnectivity of nodes and edges inherently encodes valuable information about the system. This representation is flexible enough to capture complex emerging patterns in disparate domains, from molecular biology to the social sciences [1]. Perhaps one reason for this flexibility is that the information stored within a network can be understood through the analysis of its topology. Link prediction [2] has become a popular pattern recognition problem: given the current topology of the network, we predict the most likely links to emerge next. A common way to address this problem is to treat the network as a graph G = (V, E) with nodes u ∈ V and edges (u, v) ∈ E; u, v ∈ V. To perform link prediction, one splits the graph edges E into two partitions (not necessarily of the same size), E_train and E_test, which serve as positive examples. Next, one samples non-existent edges (u, v) ∉ E to make Ē_train and Ē_test, which serve as negative examples. Typically, one then builds a prediction model that learns patterns from the training dataset, usually by learning to score positive edges higher than negative ones. The prediction performance of such a model is assessed on the test set with an accuracy metric. Through this partitioning, one hopes that the prediction model generalizes enough to predict future states of the system, i.e., to predict the formation of new links [3].
The conventional way to efficiently process the network structure when analyzing a complex system relies on explicit topological descriptors, such as betweenness centrality and triangle count. The accuracy of such ad hoc descriptors is directly proportional to the expertise and domain knowledge put into their design. Recently, however, the field of representation learning has proposed a more flexible alternative: learning latent representations of networks that embed the graph structure into a latent space. More specifically, given a graph G, the goal is to learn a dictionary mapping V → R^d that embeds each vertex into a d-dimensional vector. Link prediction is then performed by a classifier (V × V → R) that, for a pair of input vertices u, v, scores the likelihood of them being connected (i.e., the likelihood of forming a link).
The roots of this line of work can be traced back to spectral graph theory [4] and social dimension learning [5]. However, more recent approaches (e.g., deepwalk [6], node2vec [7]) have been based on word embedding algorithms, in particular the skip-gram with negative sampling (SGNS) version of the word2vec model [8]. The central idea is that node sequences w_1, w_2, ..., w_n generated from random walk simulations on graphs can be treated as a text corpus and fed into the word2vec model, whose output is a latent vector representation for each node (technically, a word in the text generated from the random walks). While deepwalk-inspired approaches have been demonstrated to be effective in link prediction problems, their theoretical mechanisms remained relatively understudied until Qiu et al. [9] explicitly demonstrated the connection between deepwalk-inspired approaches and matrix factorization methods. The connection relies on the fact that the backbone of these approaches, word2vec, has been shown to implicitly factorize a matrix containing the collocations of words in the text. Concretely, Levy et al. [10] formally demonstrated that word2vec factorizes the point-wise mutual information (PMI) matrix, a well-known matrix in information theory. Based on this demonstration, Qiu et al. [9] gave closed-form expressions, in graph-theoretic terms, for the PMI matrix that deepwalk-inspired methods factorize. This expression of the PMI matrix is dubbed the deepwalk matrix, and its closed-form derivation is tightly connected to random walks and graph Laplacians. Finally, the authors proposed the netmf algorithm, which learns network embeddings for nodes by performing a singular value decomposition (SVD) of the deepwalk matrix.

Challenges
An important caveat of this general approach to low-rank approximation of the PMI matrix is that the matrix may pose computational issues. The trouble arises when we have a pair of words (w, c) that is never observed. These negative pairs lead to the ill-posed task of decomposing PMI(w, c) = log 0 = −∞ entries. As noted in [10], solutions to this problem include a Dirichlet prior, which adds a small "fake" count to the underlying counts matrix and thus smooths the negative entries, or truncating the PMI matrix, such that PMI(w, c) = 0 for the unobserved pairs. Qiu et al. [9] use the latter, mainly because they want to leverage the fact that a truncated matrix is very sparse, and thus the SVD is accelerated. However, by forcing zeroes into the matrix, we ignore the information from most of its entries, as PMI scores for low-frequency pairs of nodes become indistinguishable from those of pairs that truly never appear in the same context. We show that this may lead to a drop in performance on link prediction problems.

Contribution of this work
In this work, we extend the work of Qiu et al. [9] and propose a new approach to learning low-rank factorization embeddings that incorporates information from unlikely pairs of nodes. The contributions of our work may be summarized as follows:
• We apply smoothing to shifted PMI scores to enable a low-rank factorization of matrix entries that does not penalize low-frequency node pairs. We show that this correction may drastically improve accuracy.
• By considering different transformations of the PMI matrix, and by deriving the matrix form of the joint probability matrix, we demonstrate superior link prediction performance. These results lead us to argue that the linguistic collocation metric (PMI) may not be the best metric for link prediction.
• Finally, we compare our results with the state-of-the-art accuracy scores of Abu-El-Haija et al. [11] and conclude that our approach is on par. Their algorithm performs stochastic (implicit) matrix factorization using neural networks, while ours is based on explicit SVD.

Organization of this manuscript
This manuscript is organized as follows: Section 2 highlights the theoretical notions necessary for a fluent reading of our work; Section 3 presents our approach; Section 4 reviews the specific steps we took to evaluate our approach on link prediction tasks; Section 5 covers our results; finally, we close with a discussion (Section 6).

Theory
To keep this manuscript self-contained, we briefly highlight the most salient parts of the required theoretical concepts.

Neural word embeddings with skip-gram negative sampling
Neural graph embedding research started with the introduction of the deepwalk [6] algorithm, which adapted the famous word representation algorithm word2vec [8] to latent graph representation learning. Recall that the goal of word2vec is to learn a dictionary mapping Θ : W → R^d, which embeds a word w_i ∈ W into a d-dimensional vector. These vectors encode generic linguistic information about words in a text corpus and can be used in downstream tasks, such as document classification.
Given a word sequence w_{c_1}, w_{c_2}, w_t, w_{c_3}, w_{c_4}, a word2vec version based on the skip-gram training objective tries to predict context words w_{c_i} given the target word w_t. Formally, it models the conditional probability

P_Θ(w_{c_i} | w_t) = exp(⟨Θ(w_t), Θ(w_{c_i})⟩) / Σ_k exp(⟨Θ(w_t), Θ(w_k)⟩)

using inner products (e.g., the Euclidean dot product) of word embeddings, e.g., Θ(w_t), and a softmax function in the denominator. The exact computation of the softmax function is computationally infeasible for any real-world text corpus, i.e., k would range over millions of words. To approximate the softmax function, for one target-context pair of words (w_t, w_c), Mikolov et al. [12] proposed to sub-sample b words from a unigram context distribution, w_k ∼ U, which do not appear in the context of w_t, and to minimize the negative log-likelihood loss

L_{(w_t, w_c)} = −log σ(⟨Θ(w_t), Θ(w_c)⟩) − Σ_{k=1}^{b} log σ(−⟨Θ(w_t), Θ(w_k)⟩), w_k ∼ U, (1)

where the sigmoid function σ(x) = 1/(1 + e^{−x}) squashes everything into the (0, 1) range. This approximation is called skip-gram with negative sampling (SGNS). In SGNS, the model parameters Θ, i.e., the word embeddings, are learned implicitly by training a neural network that minimizes the aggregated negative log-likelihood, argmin_Θ Σ_{(t,c)∈Ω} L_{(w_t, w_c)}, over all target-context pairs Ω in the training corpus. In this model, words that appear in close proximity in the text tend to get embedded close to each other in the embedding space.
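To make the objective concrete, here is a minimal numpy sketch of the per-pair SGNS loss in Eq. (1); the function and variable names are ours, and the embeddings below are random stand-ins rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(theta_t, theta_c, theta_negs):
    """Per-pair SGNS loss of Eq. (1): pull the observed context vector
    towards the target, push b negative vectors away. theta_t, theta_c
    are d-dimensional vectors; theta_negs has shape (b, d)."""
    positive = np.log(sigmoid(theta_t @ theta_c))
    negatives = np.sum(np.log(sigmoid(-(theta_negs @ theta_t))))
    return -(positive + negatives)

# Toy usage with random stand-in embeddings (d = 8, b = 5 negatives):
rng = np.random.default_rng(0)
loss = sgns_pair_loss(rng.normal(size=8), rng.normal(size=8),
                      rng.normal(size=(5, 8)))
```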

Connection to neural graph embeddings
The natural connection between graph and word embeddings lies in the fact that, given a graph G = (V, E) and a starting node w_t ∈ V, we can launch a random walk of length l to generate a context node sequence w_t, w_{c_1}, ..., w_{c_l}. Performing such random walk simulations for all nodes in the graph generates many context node sequences, which we group into a multiset Ω consisting of target-context node pairs (w_t, w_c), w_t, w_c ∈ V. Effectively, we can treat Ω as a training text corpus and use the SGNS word2vec model to learn a dictionary mapping Θ : V → R^d for each node w_i ∈ V. Nodes that appear more frequently in the same context node sequence get embedded closer to each other. This is the central idea of neural graph embedding models that rely on word2vec, such as deepwalk [6] and node2vec [7].
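A minimal sketch of this corpus-generation step, assuming uniform random walks over a dense adjacency matrix (the function name and defaults are ours, for illustration only):

```python
import numpy as np

def random_walk_pairs(A, walk_length=10, window=2, seed=0):
    """Simulate one uniform random walk per starting node on the graph
    with adjacency matrix A, and collect the (target, context) node pairs
    within `window` positions of each other, i.e., the multiset Omega."""
    rng = np.random.default_rng(seed)
    pairs = []
    for start in range(A.shape[0]):
        walk = [start]
        for _ in range(walk_length - 1):
            neighbors = np.flatnonzero(A[walk[-1]])
            if neighbors.size == 0:        # dead end (isolated node)
                break
            walk.append(int(rng.choice(neighbors)))
        for i, target in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    pairs.append((target, walk[j]))
    return pairs
```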

Connection to information theory and low-rank matrix approximation
Levy et al. [10] have formally demonstrated that, for a target-context pair (w_t, w_c), Eq. 1 is minimized when the inner product equals

⟨Θ(w_t), Θ(w_c)⟩ = log [P(w_t, w_c) / (P(w_t) P(w_c))] − log b,

where b is the number of negative samples per target word w_t. This can be obtained by setting the gradient of Eq. 1 with respect to the inner product to zero, ∂L_{(w_t,w_c)} / ∂⟨Θ(w_t), Θ(w_c)⟩ = 0, and simplifying algebraically. In fact, PMI(w_t, w_c) = log [P(w_t, w_c) / (P(w_t) P(w_c))] is the point-wise mutual information (PMI), a well-known quantity in information theory, which measures the discrepancy between the probability of the coincidence of two random variables given their joint distribution and their individual distributions. PMI is used in linguistics to measure the collocation, or linguistic association, of two words. Good collocation pairs have high PMI because their probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word.
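For completeness, here is a short sketch of that derivation in the probability notation used above; it maximizes the expected per-pair log-likelihood ℓ (the negative of Eq. 1, weighted by the occurrence probabilities) with respect to the inner product x:

```latex
\begin{align*}
\ell(x) &= P(w_t, w_c)\,\log\sigma(x) + b\,P(w_t)\,P(w_c)\,\log\sigma(-x),
  \qquad x := \langle \Theta(w_t), \Theta(w_c) \rangle \\
0 &= \frac{\partial \ell}{\partial x}
   = P(w_t, w_c)\,\sigma(-x) - b\,P(w_t)\,P(w_c)\,\sigma(x) \\
\Rightarrow\quad e^{x} &= \frac{P(w_t, w_c)}{b\,P(w_t)\,P(w_c)}
   \quad\Rightarrow\quad
   x = \log\frac{P(w_t, w_c)}{P(w_t)\,P(w_c)} - \log b .
\end{align*}
```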
This theoretical result allows us to interpret the optimality condition of the SGNS word2vec model. Effectively, the best model is obtained when, for each target-context pair (w_t, w_c), the inner product approaches the shifted point-wise mutual information statistic

⟨Θ(w_t), Θ(w_c)⟩ ≈ PMI(w_t, w_c) − log b. (2)

By grouping target-context pairs (w_t, w_c) as entries of a matrix M_{w_t, w_c} = PMI(w_t, w_c) − log b, we can treat the minimization of the loss function in Eq. 1 as an implicit factorization of a huge matrix M filled with PMI statistics. We can formalize this as

argmin_{Y,R} ‖M − Y R^T‖_F, (3)

where Y, R are the weight matrices of the hidden layer of the word2vec neural network and ‖·‖_F is the Frobenius norm. By optimizing the weights Y, R of the neural network, we implicitly find word embeddings such that their pairwise inner products approach the shifted PMI (see Eq. 2).

Connection to random walks on graphs
Since neural graph embeddings use SGNS word2vec to learn vector representations for nodes, it was a natural step to re-interpret Eq. 2 in terms of intrinsic properties of a graph G. Qiu et al. [9] have re-expressed the shifted PMI of a target-context pair of nodes (w_t, w_c) in graph-theoretic terms as

PMI(w_t, w_c) − log b = log [ (vol(G) / (b T d_{w_c})) Σ_{r=1}^{T} P^r_{w_t, w_c} ], (4)

where d_{w_t} denotes the degree of node w_t, vol(G) = Σ_i d(i) is the volume of graph G (the sum of all degrees), T is the size of the context window of a random walk (how far to the left and right we search for context nodes), and b is the number of negative nodes per target node w_t. The transition probability matrix P = D^{−1}A is obtained from the adjacency matrix A and the degree matrix D of the graph, such that an entry P_{w_t, w_c} holds the probability of transitioning from w_t to w_c in one step. Matrix powers P^r give the conditional probabilities of transitioning from one node to another in r steps, i.e., P^r_{w_t, w_c} = P(x_r = w_c | x_0 = w_t) for a random walk x_0, x_1, ..., x_r. Effectively, Qiu et al. have shown that deepwalk is optimized when the inner product of a target-context node pair (w_t, w_c) approaches PMI(w_t, w_c) − log b. After a series of derivations, they finally propose the closed form of the shifted PMI matrix in graph-theoretic terms:

M = log Q, Q = (vol(G) / (b T)) (Σ_{r=1}^{T} P^r) D^{−1}. (5)

Throughout this manuscript, we will refer to log Q as the deepwalk matrix. As mentioned previously, the matrix M is ill-posed, since pairs of nodes that never occur in Ω lead to log 0 = −∞ entries. To circumvent this computational trouble, Qiu et al. [9] consider the truncated version M′ = log(max(Q, 1)), leading to a highly sparse representation. Consequently, only pairs of nodes that occur in Ω are considered during the decomposition process, or, put more dramatically, the information from most of the entries of M is ignored. To obtain a rank-d approximation M′ ≈ L × R^T, Qiu et al. [9] apply singular value decomposition (SVD) directly on M′, instead of learning the factors implicitly by minimizing Eq. 3 with a neural network:

M′ ≈ U_d Σ_d V_d^T, Y = U_d √Σ_d, (6)

where U_d, Σ_d, V_d retain the top d singular values and vectors, and the rows of Y are the node embeddings. By explicitly factorizing the closed form of the deepwalk matrix, netmf obtains node embeddings without simulating random walks or training a neural network.
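A compact sketch of this explicit factorization pipeline, assuming the closed form of Eq. (5) and a small dense graph; this is an illustration of the idea, not the reference NetMF implementation:

```python
import numpy as np

def netmf_embeddings(A, T=10, b=1, d=32):
    """Sketch of the explicit factorization pipeline described above:
    build the deepwalk matrix in closed form (Eq. 5), truncate the
    negative entries, and factorize with SVD (Eq. 6). Dense numpy
    version for small graphs; assumes no isolated nodes."""
    degrees = A.sum(axis=1)
    vol = degrees.sum()                      # vol(G), the sum of all degrees
    D_inv = np.diag(1.0 / degrees)
    P = D_inv @ A                            # transition matrix P = D^-1 A
    S = np.zeros_like(P)
    P_r = np.eye(A.shape[0])
    for _ in range(T):                       # S = P + P^2 + ... + P^T
        P_r = P_r @ P
        S += P_r
    Q = (vol / (b * T)) * (S @ D_inv)        # Eq. (5), before taking the log
    M = np.log(np.maximum(Q, 1.0))           # truncation used by Qiu et al. [9]
    U, s, _ = np.linalg.svd(M)
    return U[:, :d] * np.sqrt(s[:d])         # rank-d embeddings Y = U_d sqrt(Sigma_d)
```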

Proposed approach
Skip-gram powered neural graph embedding approaches rely on word2vec to learn latent vector representations by implicitly factorizing the shifted PMI matrix. Driven by the goal of theoretically explaining deepwalk-like neural graph embedding algorithms, Qiu et al. [9] settled on the PMI matrix, and especially on its closed-form expression in graph-theoretic terms. However, the choice of the PMI matrix for learning graph embeddings was implicitly imposed by the design of word2vec. In developing our approach, we ask whether the matrix factorization of the PMI matrix is the best way to obtain node representations suitable for link prediction in real-world networks. To test our hypothesis, we first show why the PMI matrix might be a poor option, and then use this rationale to derive our method.

Contribution of low-frequency pairs to learning good graph embeddings
We start our investigation into the effect of truncating or smoothing negatives on link prediction by visually inspecting the reconstruction of the PMI matrix. First, we apply the original NetMF (log[max(Q, 1)]) algorithm on the karate club network [13], consisting of 34 nodes, to obtain graph embeddings Y (Eq. 6). Then, we compare the ground-truth PMI matrix (Figure 1, left) and the reconstruction Y · Y^T (Figure 1, right). We can see that Y · Y^T ignores the contribution of low-frequency pairs (u, v) for which Q_{uv} < 1. While the NetMF (log[max(Q, 1)]) reconstruction is fairly good for highly frequent pairs of nodes, there is little variation between the non-occurring pairs and the pairs that occur with a low PMI.
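The setup behind Figure 1 can be sketched as follows, reusing the netmf_embeddings() sketch above and assuming networkx's built-in copy of the karate club network [13]; the embedding dimension here is an arbitrary illustrative choice:

```python
import networkx as nx
import numpy as np

# Reuses the netmf_embeddings() sketch above; d = 16 is an arbitrary choice.
G = nx.karate_club_graph()            # the 34-node network from [13]
A = nx.to_numpy_array(G)
Y = netmf_embeddings(A, T=10, b=1, d=16)
reconstruction = Y @ Y.T              # right panel of Figure 1; compare to the PMI matrix
```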
The matrix form of J, the joint probability matrix whose entries record the probability p(i, j) that nodes i and j co-occur within the same context in the simulated random walks, gives us a closed-form solution for computing random walk co-occurrence probabilities, without the need to generate random walk sentences and sample context pairs. Table 1 gives an overview of the considered matrices, derived from the original NetMF algorithm, and details what one entry in each matrix means. Entries in these matrices represent statistical quantities that can be computed directly from the original graph structure.
Table 1: Matrices whose SVD low-rank factorization is considered in our link prediction experiments. An entry M_ij in a considered matrix M records a statistic of co-occurrence of two nodes i, j; graph statistics are derived from random walks.

matrix M               | meaning of entry M_ij
NetMF(log[max(Q, 1)])  | truncated shifted point-wise mutual information, log[max(Q_ij, 1)], for nodes i, j
NetMF(σ(Q))            | sigmoid σ(x) = 1/(1 + e^{−x}) applied to each entry of Q (smoothened negatives)
J                      | joint probability p(i, j) that nodes i, j co-occur within the same context in all simulated random walks
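A sketch of how the three candidate matrices in Table 1 can be derived from the same Q; note that the expression for J below is our plausible reconstruction from Eq. (5) (J proportional to D Q D), consistent with the definitions above but not necessarily the exact closed form of our derivation:

```python
import numpy as np

def table1_matrices(Q, degrees, b=1):
    """The three candidate matrices from Table 1, all derived from the
    same Q of Eq. (5). The expression for J is a plausible reconstruction
    (J proportional to D Q D), not guaranteed to match the derivation in
    the text."""
    vol = degrees.sum()
    truncated = np.log(np.maximum(Q, 1.0))      # NetMF(log[max(Q, 1)])
    smoothed = 1.0 / (1.0 + np.exp(-Q))         # NetMF(sigma(Q)), dense but lossless
    D = np.diag(degrees)
    J = (b / vol**2) * (D @ Q @ D)              # joint co-occurrence probabilities p(i, j)
    return truncated, smoothed, J
```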

Methods

Datasets
To perform an unbiased comparison of link prediction performance using different graph embedding models, we used exactly the same train and test splits as Abu-El-Haija et al. [11]. In [11], 5 networks from SNAP (Stanford Large Network Dataset Collection, accessed May 27, 2020) were considered, which we summarize in Table 2. To use the same datasets that were tested in this work, one has to download the versions available in [11]. The PPI dataset was originally published in [14].

Graph embedding models
We compared the link prediction performance of our approach to the original NetMF algorithm and to the following state-of-the-art graph embedding models: BigClam [15], NNSED [29], SocioDim [17], RandNE [18], NMFADMM [16], GLEE [19], and NodeSketch [20]. All of our experiments were primarily implemented in Python (version 3.8). We used the usual packages from the scientific Python stack, such as numpy, scipy, and scikit-learn. To compare our approach with NetMF, we used the official implementation available on GitHub. For the other models, we used karateclub [21], a Python library for unsupervised learning on graph-structured data.
Furthermore, we used the link prediction scores reported in [11] to directly compare neural network-based graph embedding models to the proposed method. Specifically, the link prediction performance of the following models was considered: DNGR [22], node2vec [7], Asym Proj [23], and WatchYourStep [11]. Note that we did not perform hyperparameter optimization for these models; instead, we took the best scores from [11].

Link prediction, evaluation, and Bayesian hyperparameter optimization
To make sure that all graph embedding models were compared head-to-head in a fair manner, we performed model selection, optimizing the hyperparameters of each model on every considered graph. Since all considered graph embedding models are sensitive to the specific choice of the associated hyperparameters, model selection was performed using a computationally intensive Bayesian hyperparameter optimization pipeline, summarized in Figure 2. For example, our proposed approach has only one parameter to optimize: the random walk length, i.e., the highest power of the transition matrix. NetMF adds another parameter, the number of negative samples in the computation of the point-wise mutual information metric. Graph embedding models based on non-negative matrix factorization have other hyperparameters, such as the number of iterations to solve the NMF optimization problem. We provide the full list of sampled hyperparameter configurations and the underlying distributions for each graph embedding approach in the Supplementary Material. Our hyperparameter optimization pipeline was implemented using optuna, a Python hyperparameter optimization library.
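A minimal optuna sketch of one such optimization loop; train_and_validate() is a hypothetical helper standing in for fitting an embedding model on the training split with the sampled hyperparameters and returning the validation ROC AUC, and the search ranges shown are illustrative rather than the exact distributions from our Supplementary Material:

```python
import optuna

def objective(trial):
    """One trial of the Bayesian (TPE) search. train_and_validate() is a
    hypothetical helper that fits an embedding model on the 80% training
    split and returns the ROC AUC on the remaining 20% validation split."""
    params = {
        "window": trial.suggest_int("window", 1, 10),        # highest power T of P
        "negatives": trial.suggest_int("negatives", 1, 10),  # b, NetMF versions only
        "dim": trial.suggest_categorical("dim", [32, 64, 128, 256]),
    }
    return train_and_validate(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # 50 trials, as in our pipeline
best_params = study.best_params
```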
We took the train and test splits provided by Abu-El-Haija et al. [11]. The original train set was used to learn graph embeddings Y and to perform hyperparameter tuning, and the original test set was used only for the final link prediction evaluation of the best selected model. For classifying positive and negative links using graph embeddings Y, obtained with a graph embedding model M, we used a classifier g_M(Y) = σ(Y · Y^T), where σ(x) = 1/(1 + e^{−x}). The performance of the classifier g_M was measured with the area under the receiver operating characteristic curve (ROC AUC). To report a signed difference in ROC AUC scores between two classifiers g_M and g_{M′}, we used the metric

φ(g_M, g_{M′}) = [ROC AUC(g_M) − ROC AUC(g_{M′})] / ROC AUC(g_{M′}).

Overall, we made sure that there was no double-dipping into the data and that no information leaked from the test set. For each graph embedding model, to determine the best model parameters, we let the Bayesian hyperparameter optimization pipeline run for 50 trials on a fraction (80%) of the original training set. Model selection was performed by optimizing the ROC AUC score on the remaining fraction (20%) of the original training set, commonly referred to as the validation set. The best selected model parameters were then used in the final link prediction evaluation on the holdout test set. Before this evaluation, each graph embedding model was retrained on the full training set using the optimized parameters from the optimization pipeline.
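A minimal sketch of this evaluation step, assuming embeddings Y as a numpy array indexed by node id and edge lists of (u, v) pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_prediction_auc(Y, pos_edges, neg_edges):
    """Score candidate edges with g_M(Y) = sigma(Y Y^T) and evaluate the
    ROC AUC, as described above. Y is an (n, d) embedding matrix indexed
    by node id; pos_edges/neg_edges are iterables of (u, v) pairs."""
    pos_edges, neg_edges = list(pos_edges), list(neg_edges)
    scores = [sigmoid(Y[u] @ Y[v]) for u, v in pos_edges + neg_edges]
    labels = [1] * len(pos_edges) + [0] * len(neg_edges)
    return roc_auc_score(labels, scores)
```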

Link prediction
Our approach outperformed all the considered models on all graphs, as documented by the highest ROC AUC scores on the holdout test set (Table 3, Figure 3). Compared to NetMF (log[max(Q, 1)]), the proposed approach, J, increased the ROC AUC score by 24.2% and 15.6% on the graphs wiki-vote and ppi, respectively. The improvements in link prediction performance on ca-AstroPh, ca-HepTh, and soc-facebook were 1.2%, 1.4%, and 2.8%, respectively. The BigClam, NNSED, and NMFADMM models had better ROC AUC scores than NetMF (log[max(Q, 1)]) on wiki-vote and ppi. On the remaining graphs, the baseline version of NetMF outperformed all other state-of-the-art link prediction models (except ours).
Table 3: Link prediction performance comparison between the proposed approach and the baselines. Following a Bayesian hyperparameter optimization consisting of 50 trials on the training set, the best performing model was selected and its ROC AUC score on the holdout test set is reported; the best ROC AUC scores are in bold.

Truncating or smoothing low co-occurrence random walk pairs?
The original version of NetMF, log[max(Q, 1)], truncates the contribution of low-occurrence pairs due to the log[max(·, 1)] transformation of the matrix Q. Instead, we considered a version of NetMF, σ(Q), that smoothens these entries rather than truncating them, and compared its best-optimized model to the best-optimized log[max(Q, 1)] model. To smoothen the entries of the Q matrix, we used the sigmoid function, σ(x) = 1/(1 + e^{−x}), which maps inputs from (−∞, ∞) into the (0, 1) range. Thanks to this transformation, the information from unobserved pairs, or pairs with a very low joint probability, is not lost. However, this comes at the cost of decomposing a dense matrix to obtain the graph embeddings Y. Note that we cannot decompose the original shifted PMI matrix log Q, since that leads to decomposing ill-defined log 0 = −∞ entries. The NetMF version that smoothens low-occurrence pairs outperformed the baseline on wiki-vote and ppi, had the same performance on soc-facebook, and performed slightly worse on the ca-AstroPh and ca-HepTh graphs. Table 4 shows a head-to-head comparison of the two versions' ROC AUC scores.

Generalization gap
We looked into the generalization capability of J and the baseline NetMF models, NetMF(log[max(Q, 1)]) and NetMF(σ(Q)). In statistical learning theory, under the bias-variance tradeoff paradigm, we expect the test score to be slightly worse than the training score. Moreover, very large negative differences may point to overfitting, where a prediction model has learned the training dataset but fails to generalize to the test set. Figure 4 plots the train, validation, and test ROC AUC scores for the proposed approach and the two versions of NetMF. Compared to the baseline, the proposed approach had better absolute scores on all graphs, and specifically on the graphs wiki-vote and ppi. On these two graphs, the test ROC AUC score dropped, with respect to the train ROC AUC score, by 6% and 32%, respectively, for the approach of Qiu et al. [9]. For comparison, it only dropped by 1% and 9% for the proposed approach. For these three approaches, it is interesting to note that the graph soc-facebook had the best generalization gap, as the differences between the ROC AUC scores were less than 1%. The graph ca-HepTh had the biggest difference between the validation and train ROC AUC scores, dropping ~36% for all three models; however, the gaps between the train and test scores were much smaller (5-9%).

Table 5 documents the optimization duration for all the considered models. NodeSketch was the most computationally intensive model to optimize; RandNE was the fastest. The two NetMF versions and the proposed approach took more time to optimize than the median optimization duration. Notably, the sigmoid version of NetMF (σ(Q)) took less time to optimize than the original version that uses the truncated log transformation (log[max(Q, 1)]). The optimization durations for the proposed approach J and the original NetMF were comparable.

Figure 5 shows the best hyperparameters for the proposed approach and the two NetMF versions. While J required more dimensions, i.e., a bigger representation capacity, it required a smaller random walk length. It is interesting to see that the best ROC AUC scores were obtained when only 1 negative sample was taken into account for the two versions of the NetMF algorithm.
Figure 5: Optimized hyperparameters that lead to the best ROC AUC scores.

Comparison with graph embedding models based on neural networks
We compared our approach with graph embedding models based on neural networks, using the published results of Abu-El-Haija et al. [11] (Table 6). In their work, the authors used a neural graph embedding model, WatchYourStep, that implicitly factorizes the matrix of expected co-occurrence counts E[C], where each entry E[C]_{wc} represents the expected number of times that nodes w and c co-occur in all random walks [11]. In addition, it was reported that WatchYourStep largely outperformed other graph embedding models, including DNGR [22], node2vec [7], and Asym Proj [23]. We used exactly the same training and test splits as in [11] and found that our approach also outperformed DNGR, node2vec, and Asym Proj. Compared to WatchYourStep head-to-head, our proposed method, J, obtained comparable scores (within ~3% ROC AUC), and on the graph wiki-vote it even outperformed WatchYourStep by more than 3%.
Table 6: Comparison of test ROC AUC scores of the proposed approach and the scores obtained with state-of-the-art neural graph embedding algorithms, as reported in [11]; † best scores per model. Train and test splits are exactly those from [11]. Overall best ROC AUC scores in bold, second best in italic.

Discussion
In the last decade, research on learning neural graph embeddings has been deeply influenced by the success of the word2vec algorithm. This word embedding algorithm uses neural networks to perform a low-rank factorization of the shifted point-wise mutual information matrix, where each entry represents the linguistic collocation of two words. In this work, we questioned whether this collocation metric is the right fit in the realm of real-world networks. Instead, we proposed a graph embedding approach that relies on factorizing the joint probability matrix J, where each entry encodes the probability that two nodes co-occur in the same context of one random walk. Benchmarked on link prediction tasks, our approach improved the performance of baseline graph embeddings obtained by factorizing the shifted point-wise mutual information matrix. Depending on the network topology, the improvement ranged from 1.2% to 24.2%. In the following, we discuss a few insights from our work that could be important for designing graph embedding algorithms.
Explicit modeling of negative samples has little effect on the quality of graph embeddings

Our work questions the importance of explicitly modeling negative samples for link prediction problems on graphs. Qiu et al. [9] have shown that deepwalk is optimized when the inner product of a target-context node pair (w_t, w_c) approaches PMI(w_t, w_c) − log b. The shifted PMI approach, which stems from natural language processing, requires the explicit term log b, where b is the number of negative examples sampled during stochastic training. However, in our experiments, we saw that the negative term in the shifted PMI formula plays a limited role. The best-optimized models for all graphs consistently chose b = 1, such that the term log[b = 1] = 0 did not contribute any new information.

Smoothing low-frequency pairs yields better link prediction performance than truncating
The original NetMF algorithm truncates the contribution of node pairs with a low shifted PMI score using the log[max(1, ·)] transformation. This technical choice was made to avoid the ill-posed decomposition of lim_{x→0} log x = −∞ entries in the matrix for pairs of nodes that never appear together in a random walk. However, due to the log[max(1, ·)] transformation, the contribution of low-frequency pairs to learning graph embeddings is undermined. In fact, replacing the truncated log transformation with the sigmoid function σ(·) applied directly to the entries of Q yields comparable (<1% difference in ROC AUC score) or, in some cases, much better link prediction performance (7.8% to 15.5% improvement). We believe this happens because all the variance of the low-frequency pairs, which make up most entries of the graph co-occurrence statistics matrix, is taken into account during the matrix factorization with SVD. Therefore, we believe that smoothening low-frequency pairs should be favored over truncating them for link prediction.

Pointwise mutual information metric and scale-free property of real-world networks
Interestingly, the joint probability matrix J and the shifted PMI matrix are closely related; both encode co-occurrence information from the same graph and the same simulated random walks, so why do we see a difference in performance? We believe that, to explain this, we need to analyze under which circumstances a relatively high joint probability of i and j appearing together encodes more useful information for the link prediction problem. The point-wise mutual information of a pair of outcomes x, y of discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence (co-occurrence) given their joint and individual distributions. In computational linguistics, PMI is used for finding collocations and associations between words. Good collocation pairs have high PMI because the probability of co-occurrence p(i, j) is only slightly lower than the probabilities of occurrence of each word (p(i) and p(j)). PMI(i, j) is zero when the two outcomes are independent, i.e., when p(i, j) = p(i)p(j): PMI(i, j) = log [p(i, j) / (p(i)p(j))] = log 1 = 0. It is undefined when i and j never co-occur: PMI(i, j) = log [p(i, j) / (p(i)p(j))] = log 0 = −∞ when p(i, j) = 0. On the other hand, PMI is maximized when two outcomes are perfectly associated, i.e., p(i|j) = 1 or p(j|i) = 1. For instance, the pair of words "puerto rico" has a very high PMI in Wikipedia articles, since these words are rarely used individually. From the same source, a pair of words like "it the", despite having the same joint probability as "puerto rico", will have a very low PMI score because its joint probability is small in comparison with the individual probabilities.
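A toy numeric illustration of this asymmetry (the probabilities below are invented for illustration): two pairs with identical joint probability get very different PMI scores once the marginals differ.

```python
import numpy as np

# Invented toy marginals: both pairs share the same joint probability.
p_joint = 1e-4
p_puerto, p_rico = 2e-4, 3e-4   # rare words, almost always seen together
p_it, p_the = 2e-2, 6e-2        # frequent words, co-occur mostly by chance

pmi_puerto_rico = np.log(p_joint / (p_puerto * p_rico))  # ~ +7.4, strong collocation
pmi_it_the = np.log(p_joint / (p_it * p_the))            # ~ -2.5, weak association
```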
Translating this to the graph domain, consider two connected nodes i, j such that both i and j are extremely frequent (very high centrality). Computing the PMI of this pair will give us a very low score, because the numerator (the joint probability) is dominated by the denominator (the large individual probabilities). Therefore, the dot product of the embeddings of these two nodes will be low, which might be interpreted as the two nodes not being connected (while they are). This can seriously undermine the link predictor. It is well known that many real-world networks, such as protein interaction, social, and citation networks, exhibit a scale-free nature, where the degree distribution follows a power law, resulting in networks that have a few hubs to which all other nodes tend to attach [24], [25], [26]. Therefore, we believe that a high joint probability for two highly frequent nodes should not be penalized, as it might be very indicative of a plausible link between the two nodes. Hence, we recommend J, the joint probability matrix, as the main choice for computing node embeddings for link prediction.

The importance of non-linear feature extraction in link prediction
Improving the capacity of neural graph embedding models to extract non-linear features from graph data is an active area of pattern recognition research [27]. The insights from our work may help train non-linear link predictors more efficiently, in particular when it comes to hyperparameter optimization. Indeed, non-linear neural embedding approaches require extensive computational resources to find an optimal choice of hyperparameters, e.g., the embedding dimension. Neural graph embedding algorithms based on explicit matrix factorization with SVD, such as the one presented in this work, can be used as a baseline model to determine the best embedding dimension. In particular, the Eckart-Young-Mirsky theorem guarantees that the truncated SVD of rank d (i.e., embedding dimension d) gives the best rank-d approximation of the factorized matrix.
However, we should also be aware that some real-world networks may benefit only marginally from complex non-linear feature extraction. Our proposed linear approach obtained performance similar to that of the non-linear WatchYourStep approach of Abu-El-Haija et al. [11].

Related work
Our work builds on significant previous efforts to connect neural embeddings for symbolic data with the solid theory of matrix factorization [10, 9]. Perhaps the work most similar to ours, to the best of our knowledge, comes from the NLP domain [28], where it was shown that the SVD decomposition of the truncated PMI matrix, as in [10], may worsen the quality of word embeddings.
Among the most popular approaches to learning node embeddings and applying them to link prediction, we can outline three main classes:
• Methods that are not based on neural networks. For instance, BigClam [15] and NNSED [29] use community detection algorithms to learn node embeddings. SocioDim [17], RandNE [18], NMFADMM [16], NodeSketch [20], and GLEE [19] exploit neighborhood information. In our comparison, we found that our approach outperforms these baselines. Other methods that rely on feature engineering for individual nodes [30], [31], [32], [33] have also been popular for link prediction problems. We have not included these methods in our comparison.
• Methods that are based on neural networks. Deepwalk [6] inspired approaches include node2vec [7], NetMF [9], DNGR [22], Asym Proj [23], and WatchYourStep [11]. We showed that the performance of our approach compares favorably.
• Methods that require per-node feature information. Some algorithms require not only graph structure information but also per-node features, e.g., gene expression values in protein interaction networks. Prominent examples include graph convolution methods for semi-supervised node classification [34], [35], [36] and graph representation learning using variational autoencoders [37]. We have not included these algorithms in the comparison since they require additional per-node feature information.
It is worth mentioning that the predictive performance of graph convolutional neural networks can be significantly improved by incorporating structural information about the input graph into the training process [38], [39], [27]. Similarly, findings from our work may be used to optimize the training process of graph convolutional neural networks.

Limitations
Our analysis is conditioned on the linear low-rank decomposition assumption of the SVD algorithm. Another matrix factorization algorithm could decompose the truncated log[max(Q, 1)] matrix in such a way as to learn better node embeddings for link prediction. However, it would be surprising if the information from only high-frequency pairs were sufficient for link prediction. On another note, the original NetMF [9] was not tested on link prediction problems; i.e., the matrix factorization of the truncated PMI matrix (log[max(Q, 1)]) was benchmarked on node classification problems. Lastly, our proposal to smoothen the information from low-frequency pairs yields a highly dense matrix, which might not scale to big graphs.

Conclusion and future work
Our work contributes to a better understanding of the interplay between skip-gram powered neural graph embedding algorithms and matrix factorization. Neural graph embeddings implicitly decompose matrices that represent quantities derived from the co-occurrence frequencies in simulated random walks. We showed that the link prediction accuracy of graph embeddings strongly depends on the transformations of the original graph co-occurrence matrix that they decompose. In particular, smoothening entries corresponding to low-frequency pairs, as opposed to truncating them, may lead to better link prediction performance.
Future work will be focused on addressing the limitations and devising workaround strategies to make sure that our approach can scale to very big graphs. In particular, we would like to incorporate spectral graph sparsification [40] into our algorithm.

Acknowledgements
The computational results presented have been achieved in part using the Vienna Scientific Cluster (VSC).