Nonparametric Link Prediction in Large Scale Dynamic Networks

We propose a nonparametric approach to link prediction in large-scale dynamic networks. Our model uses graph-based features of pairs of nodes as well as those of their local neighborhoods to predict whether those nodes will be linked at each time step. The model allows for different types of evolution in different parts of the graph (e.g, growing or shrinking communities). We focus on large-scale graphs and present an implementation of our model that makes use of locality-sensitive hashing to allow it to be scaled to large problems. Experiments with simulated data as well as five real-world dynamic graphs show that we outperform the state of the art, especially when sharp fluctuations or nonlinearities are present. We also establish theoretical properties of our estimator, in particular consistency and weak convergence, the latter making use of an elaboration of Stein's method for dependency graphs.


Introduction
The problem of predicting links in a graph occurs in many settings -recommending friends in social networks, predicting movies or songs to users, market analysis, and so on. This has led to widespread interest in the problem. However, the state of the art methods suffer from two weaknesses. First, the literature mostly consists of heuristics (common neighbors, weighted common neighbors, counting paths, etc.) which, while working very well in practice, have not been thoroughly analyzed in terms of their underlying models, asymptotic consistency guarantees, rates of convergence to the asymptotic estimates, etc. ( [19] is one step in this direction). Second, almost all the heuristics are meant for predicting links from one static snapshot of the graph. However, graph datasets often carry additional temporal information such as the creation and deletion times of nodes and edges, so the data is better viewed as a sequence of snapshots of an evolving graph. The historical evolution of the graph is usually ignored, apart from a few studies such as [15]. We focus on link prediction in such dynamic graphs, and propose a non-parametric method that (a) makes no model assumptions about the graph generation process, (b) leads to formal guarantees of consistency, and (c) offers an extremely fast and scalable implementation via locality sensitive hashing (LSH).
Our approach falls under the framework of non-parametric time series prediction, which models the evolution of a sequence x t over time [16]. Each x t is modeled as a function of a moving window (x t−1 , . . . , x t−p ), and so x t is assumed to be independent of the rest of the time series given this window; the function itself is learnt via kernel regression. In our case, however, there is a graph snapshot in each timestep. The obvious extension of modeling each graph as a multi-dimensional x t quickly runs into problems of high dimensionality, and is not scalable. Instead, we appeal to the following intuition: the graphs can be thought of as providing a "spatial" dimension that is orthogonal to the time axis. In the spirit of the time series model discussed above, our model makes the additional assumption that the linkage behavior of any node i is independent of the rest of the graph given its "local" neighborhood or cluster N (i); in effect, local neighborhoods are to the spatial dimension what moving windows are to the time dimension. Thus, the out-edges of i at time t are modeled as a function of the local neighborhood of i over a moving window, resulting in a much more tractable problem. This model also allows for different types of neighborhoods to exist in the same graph, e.g., regions of slow versus fast change in links, assortative versus disassortative regions (where high-degree nodes are more/less likely to connect to other high-degree nodes), densifying versus sparsifying regions, and so on. An additional advantage of our non-parametric model is that it can easily incorporate node and link features which are not based on the graph topology (e.g., labels, in labeled graphs).
Our contributions are as follows: (1) Non-parametric problem formulation: We offer, to our knowledge, the first non-parametric model for link prediction in dynamic graphs. The model is powerful enough to accommodate regions with very different evolution profiles, which would be impossible for any single link prediction rule or heuristic. It also enables learning based on both topological as well as other externally available features (such as labels).
(2) Consistency of the estimator: Using arguments from the literature on strong mixing, we show mean square consistency of our estimator. The closest analysis akin to ours is the analysis of subsampling from random fields [17], which also makes similar assumptions.
(3) Fast implementation via LSH: Non-parametric methods such as kernel regression can be very slow when the kernel must be computed between a query and all points in the training set. We adapt the locality sensitive hashing algorithm of [11] for our particular kernel function, which allows the link prediction algorithm to scale to large graphs and long sequences.
(4) Empirical improvements over previous methods: We present our empirical results in two parts. First we show that on graphs with non-linearities, such as seasonally fluctuating linkage patterns, we outperform all of the state-of-the-art heuristic measures for static and dynamic graphs. On real-world datasets whose evolution is far smoother and simpler, we perform as well as the best competitor.
The rest of the paper is organized as follows. We present the model and prove consistency in Sections 2 and 3. We discuss our LSH implementation in Section 4. We give empirical results in Section 5, followed by related work and conclusions in Sections 6 and 7.

Proposed Method
Let the observed sequence of directed graphs be G = {G 1 , G 2 , . . . , G t }. Let G t,p = {G t , G t−1 , . . . , G t−p+1 }. A naive extension of prior work on time series would model G t+1 as a stochastic function of G t,p . However, this has very high dimensionality and is difficult to scale.
One solution is to learn a local prediction model for each possible edge i → j using some features, and patch these predictions together to form the entire graph. However, the features have to be chosen carefully. If they only encode narrow relationships between i and j (e.g., their degrees and number of common neighbors), the predictor cannot adapt to local variations in linkage patterns (e.g., different communities in the graph behaving differently). On the other hand, adding in a large number of features describing properties of the entire graph would again render the problem unscalable. Instead, we propose a model that combines some features specific to i → j along with others that describe the local neighborhood of i. This keeps the dimensionality under control while still being powerful enough to distinguish between local variations.
Specifically, let Y t (i, j) = 1 if the edge i → j exists at time t, and let Y t (i, j) = 0 otherwise. Let N t (i) be the local neighborhood of node i in G t ; in our experiments, we define it to be the set of nodes within 2 hops of i (up to a maximum of some large value ∆), and all edges between them. Note that the neighborhoods of nearby nodes can overlap. Let N t,p (i) = {N t (i), . . . , N t−p+1 (i)}. Then, our model is: where 0 ≤ g(.) ≤ 1 is a function of two sets of features: those specific to the pair of nodes (i, j) under consideration (s t (i, j)), and those for the local neighborhood of the endpoint i (d t (i)). We require that both of these be functions of N t,p (i). Thus, Y t+1 (i, j) is assumed to be independent of G given N t,p (i), limiting the dimensionality of the problem. Also, two pairs of nodes (i, j) and (i ′ , j ′ ) that are close to each other in terms of graph distance are likely to have overlapping neighborhoods, and hence higher chances of sharing neighborhood-specific features. Thus, link prediction probabilities for pairs of nodes from the same graph region are likely to be dependent, which agrees with intuition. Next, we will discuss the forms of the features and the estimator. Then we discuss some implementation details and extensions for very sparse graphs. The asymptotic consistency of our estimator will be shown in the next section. Features. Let s t (i, j) be any pair-specific feature mapping; we only assume that the range of s t (i, j) is a finite set S. Let d t (i) = {η it (s) , η + it (s) ∀s ∈ S}, where η it (s) are the number of node pairs in N t−1 (i) with feature vector s, and η + it (s) the number of such pairs which were also linked by an edge in the next timestep t. Thus, d t (i) tells us the chances of an edge being created in t given its features in t − 1, averaged over the whole neighborhood N t−1 (i) -in other words, it captures the evolution of the neighborhood around i over one timestep. The intuition behind this choice of d t (i) is that past evolution patterns of a neighborhood are the most important characteristics in predicting future evolution. In the following, we often refer to each feature s as the "cell" s with contents (η it (s) , η + it (s)), and d t (i) as the "datacube" of all these cells. Estimator. Our estimator of the function g(.) is: . (1) To reduce dimensionality, we factor Sim(.) into pair-specific and neighborhood-specific parts: In other words, the similarity measure Sim(.) computes the similarity between the two neighborhood evolutions (i.e., the "datacubes"), but only for pairs (i ′ , j ′ ) that had exactly the same features as the query pair (i, j) at time t (i.e., pairs belonging to the "cell" s = s t (i, j)). Combining Equations 1 and 2, we can get a different interpretation of the estimator: Intuitively, given the query pair (i, j) at time t, we look only inside cells for the query feature s = s t (i, j) in all neighborhood datacubes, and compute the linkage probability η + i ′ t ′ (s) /η i ′ t ′ (s) in these cells; these probabilities are then averaged by the similarities of the datacubes to the query neighborhood datacube. Thus, the probabilities are computed from historical instances where (a) the feature vector of the historical node pair matches the query, and (b) the local neighborhood was similar as well. Now, we need a measure of the closeness between neighborhoods. Two neighborhoods are close if they have similar probabilities p(s) of generating links between node pairs with feature vector s, for any s ∈ S. We could simply compare point estimates p(s) = η + . (s) /η . (s), but this does not account for the variance in these estimates. Instead, we consider a normal approximation to the posterior of p(s), and use the total variation distance between these normals as a measure of the closeness. More precisely, where TV(., .) represents the total variation distance between its two argument distributions, and 0 < b < 1 is a bandwidth parameter such that Implementation Details and Extensions. While our theoretical results (presented in the next section) apply to any feature vector s t (i, j) that has finite support S, the precise form is clearly of significant practical interest. Choosing features that were found to be useful in prior work [21,15], represents the number of common neighbors of i and j at time t, and ℓℓ t (i, j) the number of timesteps elapsed since the edge i → j last occurred in G t,p ; this is 0 if the edge occurs in time t, and is set to p if it never occurs in G t,p . Features based on node and edge labels can also be used here. All feature values are binned logarithmically in order to combat sparsity in the tails of the feature distributions (this is particularly true for degrees, which are commonly observed to follow power laws).
If the graphs are too sparse, then two practical problems can arise. First, a node i could have zero degree and hence an empty neighborhood. In such cases, we extend the definition of neighborhood to include nodes j that were connected to i in the previous p timesteps (p > 1). Note that the datacube d t (i) still remains a function of N , (i) as required by the model. Second, η . (s) and η + . (s) could be zero for many different s ∈ S. Hence, many historical neighborhoods that are close to the query neighborhood might not have any data for the query feature s = s t (i, j), rendering them useless for estimation purposes. In such cases, we combine η . (s) and η + . (s) with a weighted average of the corresponding values for any s ′ that are "close" to s, the weights encoding the similarity between s ′ and s. This ensures that some estimates are available even in extremely sparse portions of the neighborhoods.

Consistency of Kernel Estimator
Now, we prove that the estimator g defined in Eq. 3 is consistent. Recall that our model is as follows: For simplicity, assume that the number of nodes in a graph at all timesteps is n. From Eq. 3, the kernel estimator of g for query pair (q, q ′ ) at time T can be written as: The estimatorĝ is defined only when f > 0. The kernel was defined earlier as , where b is the bandwidth which tends to zero as the total number of datapoints approach infinity, and D(.) is the distance function defined in Eq. 4. This is similar to other discrete kernels [3,22], and, analogous to kernels in R, this kernel has the following property 1 : Theorem 3.1 (Consistency). If E[ f ] > 0, then g is a consistent estimator of g, i.e., g P −→ g.

Proof.
The condition E[ f ] > 0 is necessary to ensure that enough data matching the query is seen for the estimates to converge. This corresponds to the assumption in traditional kernel regression of requiring that the true density is strictly positive. The proof is in two parts. (1), and then Thm. 3.5 that g − g − B P −→ 0. Together, they imply that g − g P −→ 0, thus proving that g is a consistent estimator of g.
We introduce the first two of our assumptions here. (A1) Finite dimensionality: The graphs have a bounded intrinsic dimensionality ρ that is independent of n and T . Consequently, our two-hop neighborhood sizes are also bounded by some value ∆. (A2) Bandwidth convergence: The kernel bandwidth parameter b → 0 as nT → ∞. Proof. For t ∈ [p, T − 2]; i ∈ [1, N ]; s = s T (q, q ′ ), the numerator of B is an average of the terms: Taking expectations w.r.t. d t (i), and denoting K b (d t (i) , d T (q)) by γ, the first term becomes: where the last equality follows from the fact that E η + it+1 (s) |d t (i) = η it+1 (s) · g(s, d t (i)), as can be seen by summing Eq. 5 over all pairs (i, j) in a neighborhood with identical s t (i, j), and then taking expectations. Thus, the numerator of B becomes an average of terms of the following form: E [K b (dt (i) , dT (q))ηit+1 (s) · (g(s, dt (i)) − g(s, dT (q)))] . This expectation is over all possible configurations of the neighborhoods N t (i) and N t+1 (i). Since our neighborhood sizes are bounded by ∆ which is constant w.r.t N and T (assumption (A1)), the expectation is a finite sum of terms (independent of N or T ).
We now let b → 0. Since the expectation is over a finite sum, so we can take the limit inside the expectation. Since lim b→0 K b (d t (i) , d T (q)) is zero unless d t (i) = d T (q) (Eq. 6), we find that E [K b (d t (i) , d T (q))η it+1 (s) · (g(s, d t (i)) − g(s, d T (q)))] → 0, for any i and t. So the numerator of B goes to zero asymptotically. Hence, the estimator g is unbiased.
Before proving the second part of Thm. 3.1, we prove the following lemmas.
Proof. The variance can be expressed as: Since q is bounded, so are the variances, so the total contribution from the variances is O(1/nT ). Among the covariance terms, some pairs {i, t, t + 1}, {i ′ , t ′ , t ′ + 1} will share random variables, whereas the others will be disjoint. For example, if j ∈ N t (i), then N t (i) intersects N t (j) as well. Similarly, the datacubes summarizing the evolution of i from t → t + 1 and from t − 1 → t are also overlapping. Consider the graph obtained by "stacking" up all the individual graphs, such that every node i is now also connected to node i at time t − 1 and time t + 1 (we consider a lag of p = 1; the extension to arbitrary p is simple). Denote by dist({i, t, t + 1}, {i ′ , t ′ , t ′ + 1}) the length of the shortest path between any pair of nodes in the datacubes d t (i) and d t ′ (i ′ ). For example, with i ′ = i and t ∈ [t − 1, t + 1] this distance is zero, whereas for t ′ = t + 2, dist is one, and so on. We will now split the covariances into two groups, the pairs at dist zero and those at dist > 0. The number of overlapping datacubes with a given datacube is upper-bounded by a constant since the neighborhood size ∆ is bounded (assumption (A1)), and the lag p in the time series is a constant w.r.t. T . Also since q is bounded, cov(q(i, t, t + 1), q(i ′ , t ′ , t ′ + 1)) is bounded by some constant. Hence, For the remaining covariance terms, we introduce our final assumption. (A3) Strong mixing: Define the strong mixing coefficients α(k) .
is the sigma algebra generated by the random variables in the datacube for i, t. Then, following [17], we assume that ∞ k=1 α(k)k ρ−1 < ∞. We will use the well known mixing inequality for covariances of bounded random variables (see [4] Let N (i, t, dist) denote the total number of pairs (i ′ , t ′ ) at distance dist from (i, t). Note that for a given (i, t) this part of the covariance can be written as a sum over dist. Now using assumption (A3) and Lemma 3.4 below, we obtain: Thus, all terms in the initial variance expression are O(1/(nT )), which tends to zero.

Lemma 3.4. For the "stacked" time series graph with lag p (which is a constant w.r.t T ) built from a sequence of temporal graphs with bounded intrinsic dimensionality
Proof. This follows by a direct application of the definition of intrinsic dimensionality in [14], and the fact that we are using 2-hop neighborhoods.
Proof. Since our neighborhood sizes are bounded (assumption (A1)), η + it+1 (s) and η + it (s) are also bounded. By definition, g is a probability and is bounded, as is the kernel K b . Using Lemma 3.3 with q(.) = K b (d t (i) , d T (q))η it+1 (s) and q(.
qm −→ 0. Since convergence in quadratic mean implies convergence in probability, we have:

Fast search using LSH
A naive implementation of the non-parametric estimator (eq 3) searches over all n datacubes for each of the T timesteps for each prediction, which can be very slow for large graphs. In most practical situations, the top-k closest neighborhoods should suffice (in our case k = 100). Thus, we need a fast sublinear-time method to quickly find the top-k closest neighborhoods.
We achieve this via locality sensitive hashing (LSH) [11]. The key step is to devise a hashing function for neighborhoods such that neighborhoods that are close according to our similarity function are likely to be mapped to the same hash bucket. Recall that the similarity between two datacubes is the sum of the total variation distances between the distributions in each corresponding cell. We now represent the distribution in each cell as a sequence of bits, so that the Hamming distance between the bit representations of two distributions approximates their total variation distance. LSH schemes from [11] for hamming distances can then be used. Conversion to bit sequence. The key idea is to approximate the linkage probability distribution by its histogram. We first discretize the range [0, 1] (since we deal with probabilities) into B 1 buckets, and compute the probability mass that falls in each bucket. To encode the probability mass, each bucket is encoded in B 2 bits; if the probability mass in the bucket is p, then the first ⌊p · B 2 ⌋ bits are set to 1, and the others to 0. Thus, the entire distribution (i.e., one cell) is represented by B 1 · B 2 bits. The entire datacube can be stored in |S| · B 1 · B 2 bits. However, in all our experiments, datacubes were very sparse with only M ≪ |S| cells ever being non-empty (usually, 10-50); thus, we use only M · B 1 · B 2 bits in practice.
We use this particular representation because the total variation distance between discrete distributions is half the L 1 distance between the corresponding probability mass functions. The L 1 distance, in turn, is just the Hamming distance between the bit representations above. Thus the distance between two datacubes can be approximated by computing the Hamming distance between two pairs of M · B 1 · B 2 bit vectors, using the following LSH scheme. Distances via LSH. We create a hash function by just picking a uniformly random sample of k bits out of M ·B 1 ·B 2 . For each hash function, we create a hash table that stores all datacubes whose hashes are identical in these k bits. We use ℓ such hash functions. Then, given a query datacube, we hash it using each of these ℓ functions, and then create a candidate set of up to O(max(ℓ, top-k)) of distinct datacubes that share any of these ℓ hashes. The theoretical guarantees on finding the nearest neighbors in this set with high probability can be found in [11]. The total variation distance of these candidates to the query datacube is computed explicitly, yielding the closest matching historical datacubes.
Finally, we underscore a few points. First, the entire bit representation of M · B 1 · B 2 bits never needs to be created explicitly; only the hashes need to be computed, and this takes only O(k · ℓ) time. Second, the main cost in the algorithm is in creating the hash table, which needs to be done once as a preprocessing step. Query processing is extremely fast and sublinear, since the candidate set is much smaller than the size of the training set. Finally, we have found the loss due to approximation to be minimal, in all our experiments.

Experiments
We first introduce several baseline algorithms, and the evaluation metric. We then show via simulations that our algorithm outperforms prior work in a variety of situations modeling non-linearities in linkage patterns, such as seasonality in link formation. Finally, we demonstrate high link prediction accuracy on real-world dynamic graphs as well, in particular, the co-authorship graphs in the Physics and NIPS communities, and the "machine learning" community according to Citeseer. Baselines and metrics. We compare our non-parametric link prediction algorithm to the following well-known baselines, each of which gives a ranking of node pairs in terms of their probability of forming a link in the next timestep: (a) LastLink (LL), which orders node pairs according to the last time when they were linked, with more recently linked pairs being ranked higher [21], (b) Common Neighbors (CN), which gives higher ranking to pairs with more common neighbors in the last timestep [15], (c) Adamic/Adar (AA), which considers common neighbors weighted by their degrees [2], and (d) Katz measure (Katz), which extends CN to longer paths but gives exponentially smaller weights to such paths [12]. We compare these against the non-parametric link prediction algorithm described above (Non-Param), and also against a random ranker (Random).
For any graph sequence (G 1 , . . . , G T ), we test link prediction accuracy on G T for a subset S >0 of  Table 1: Average AUC on simulated graphs nodes with non-zero degree in G T . Each algorithm is provided training data until timestep T − 1, and must output, for each node i ∈ S >0 , a ranked list of nodes in descending order of probability of linking with i in G T . For purposes of efficiency, we only require a ranking on the nodes that have ever been within two hops of i; all algorithms under consideration predict the absence of a link for nodes outside this subset. Now, for each i ∈ S >0 , we compute the AUC score for the predicted ranking against the actual edges formed in G T , and we report the mean AUC over S >0 .

Simulations
One unique aspect of Non-Param is its ability to predict even in the presence of sharp fluctuations in the data. As an example, we focus on seasonal patterns. While co-authorship graphs have a minor seasonality component, our experiments on such graphs use a longer duration per timestep that smears out the seasonality, to be fair to the competing algorithms. Hence, the effects of seasonality are best shown via simulations, described next. We simulate a model of Hoff et al. [8], that posits an independently drawn "feature vector" for each node. Time moves over a repeating sequence of seasons, with a different set of features being "active" in each. Nodes with these features are more likely to be linked in that season, though some links are also formed due to noise. There are two parameters: the number of timesteps T , and the strength of the seasonality φ (seasonal linkages become likelier as φ increases). All graphs have 50 nodes, and results are averaged over multiple runs. Effect of seasonality. Clearly, Seasonal has nonlinear linkage patterns; the best predictor of links at time T are the links at times T − 3, T − 6, and so on. However, all the baseline algorithms are biased towards predicting links between pairs which were linked (or had short paths connecting them) at the previous timestep T − 1; this implicit smoothness assumption makes them suffer heavily. On the other hand, for Stationary, links formed in the last few timesteps of the training data are good predictors of future links, and so the difference between the algorithms is marginal. We note, however, that Non-Param performs very well in all cases. Effect of longer history. For T = 4, Non-Param performs poorly, because the training data includes only the shift from seasons 1 to 2, and 2 to 3, but not from 3 to 1 again, and this is what it needs to predict. The error is rectified as T increases, with performance eventually flattening out near T = 24. LL also improves, simply because it sees more links from previous instances of the test season; edges seen in other seasons hurt its performance, but these are relatively sparse. As before, CN, Katz, and AA are mostly unaffected: they use only the structure of the graph in the last timestep to predict future links. Effect of increased block weight. Finally, we vary the importance of the block in determining links by testing several values of the seasonality strength φ. As it decreases, Non-Param is hurt more. This is expected, since most edges are now being created due to baseline noise.

Real-world dynamic graphs
The simulation experiments above demonstrate the utility of non-parametric link prediction especially in the presence of non-linearities such as seasonality. Now, we show that the accuracy of this method extends even to real-world dynamic graphs, for which the baseline algorithms are among the best known. In particular, we perform experiments on the co-authorship graphs of the subset of the Physics community ("HepTh") [1], authors in NIPS [6], and authors of papers indexed by Citeseer that contain "machine learning" in their abstracts. Each timestep looks at 2 years of papers (1 year for Citeseer), with nodes being linked if they were co-authors on any paper in that time period. Table 2 shows the average AUC for all the algorithms. Non-Param and LL are better than all other algorithms; between the two, Non-Param is better on NIPS and HepTh. This is expected, since LL has been shown to perform very well in many real-world graphs [21], probably because their evolution over time tends to be smooth and linear (seasonality effects are not important in co-authorship graphs).

Related Work
Link prediction in static networks is a well studied problem, whereas link prediction in evolving networks is slowly becoming important with the growing popularity of large social networks (LiveJournal, Facebook) networks; evolving citation networks(Citeseer, DBLP) etc. Existing work on link prediction in dynamic social networks can be broadly divided into two categories: generative model based [20,7,5,13,9] and graph structure based [10,21]. Generative models. These include extensions of Exponential Family Random Graph models [7] by using evolution statistics like "edge stability", "reciprocity", "transitivity"; extension of the latent space models for static networks [18] by allowing smooth transitions in latent space [20], and extensions of the mixed membership block model to allow a linear Gaussian trend in the model parameters [5]. In other work, the structure of evolving networks is learnt from node attributes changing over time. This includes nonparametric estimation of time-varying Gaussian random fields with a smoothly changing covariance matrix [23] and its discrete analog [13] where the graph structure of time-varying Markov Random Fields are estimated. Although these models are both intuitive and rich, they generally a) make strong model assumptions, b) require computationally intractable posterior inference, and c) explicitly model linear trends in the dynamics of the networks. Models based on structure. Huang et al. [10] proposed a linear autoregressive model for individual links, and also built hybrids using static graph similarity features. In [21] the authors examined simple temporal extensions of existing static measures for link prediction in dynamic networks. In both of these works it was empirically shown that LL and its variants are often better or among the best heuristic measures for link prediction. Our non-parametric method has the advantage of presenting a formal model, with consistency guarantees, that also performs as well as LL.

Conclusions
We proposed a non-parametric model for link prediction in dynamic graphs, and showed that it performs as well as the state of the art for several real-world graphs, and exhibits important advantages over them in the presence of non-linearities such as seasonality patterns. We are currently working on longitudinal econometric datasets where seasonality patterns are more marked. Non-Param also allows us to incorporate features external to graph topology into the link prediction algorithm, and its asymptotic convergence to the true link probability is guaranteed. Together, these make a compelling case for Non-Param being a useful tool for link prediction in dynamic graphs.