Efficient algorithm to compute Markov transitional probabilities for a desired PageRank

We propose an efficient algorithm to learn the transition probabilities of a Markov chain in a way that its weighted PageRank scores meet some predefined target values. Our algorithm does not require any additional information about the nodes and the edges in the form of features, i.e., it solely considers the network topology for calibrating the transition probabilities of the Markov chain for obtaining the desired PageRank scores. Our experiments reveal that we can reliably and efficiently approximate the probabilities of the transition matrix, resulting in the weighted PageRank scores of the nodes to closely match some target distribution. We demonstrate our findings on both quantitative and qualitative evaluations by reporting experimental results on web traffic (the English Wikipedia and a Hungarian news portal) and the bicycle sharing network of New York City.

Given a network with its topology, PageRank algorithm determines a probability distribution over its nodes by calculating the stationary distribution of a random walk which is assumed to make the decision about which node to visit next based on local vicinity information of nodes in a uniform manner, i.e., inversely proportional to the number of its neighboring nodes.
From a real-world web browsing perspective, this assumption implies that users are believed to click on hyperlinks located at some website with equal probability, which most likely is not the case in reality. Substantial research has been conducted on biasing the transition probabilities of random walk models in an application-specific manner in order to obtain modified PageRank scores for the nodes of certain types of networks [15][16][17][18][19][20][21].
In this work, we take a different approach as our goal is not to devise such transition probabilities which yield useful PageRank scores for a particular application. Our goal instead is to set the weights of the transition matrix such that it converges to a predefined PageRank vector provided as input. This latter problem has been in the focus of [22,23]. In this paper, we propose an efficient algorithm-being linear in the number of edges of the input graph-for the above task, which we shall refer as the reverse PageRank problem.
From an application-oriented point of view, finding such transition probabilities which results in a predefined PageRank distribution for some Markov chain can help in designing better routing protocols. For instance, one might want to spread around some resource over a communication or social network with the additional (soft) constraint that all the nodes should have an even (or similar) amount of access to the resource in the long run. In such a scenario, we are interested in predicting the transition probabilities of the network such that the desired distribution of the PageRank scores over the nodes is as uniform as possible, i.e., the random walk visits nodes independently of their centrality within the graph.
Another potential use case for the investigated problem is to gain insights into the underlying principles of some partially observable Markov chain. Placing the problem in the original context of the PageRank algorithm, even if we have information about the relative popularity of the individual websites, it is not necessary that we also have access to the actual flow of network traffic between pairs of websites (hence partial observability). We might, however, want to still be able to approximate click-through rates between websites in the absence of access to web server logs with referral statistics. In such a scenario, solving for the reverse PageRank problem, we can approximate the transition probabilities between websites such that the stationary distribution over them closely matches our initial beliefs for the relative popularity of the websites. Our large scale experiment on English Wikipedia is precisely of this form, i.e., we trained our model such that it attempts to predict click-through rates between pairs of Wikipedia articles. In order to do so, we were merely relying on the overall popularity of the articles.
The main contributions of this paper can be summarized in the followings: • we propose an efficient algorithm to calculate the approximate gradient for our problem, which can be calculated in linear time as a function of the number of possible transitions of the Markov chain, • we illustrate the effectiveness of the proposed algorithm via rigorous quantitative and qualitative experiments on three real-world complex networks, i.e., the English Wikipedia, a Hungarian news portal and the bicycle sharing network of New York City, • we release our source code at https://github.com/begab/reversePageRank in order to foster reproducibility and additional experimentation.

Illustration of the problem
This section provides a small example which demonstrates the problem and its difficulty. Suppose that we have a graph G = (V , E) as displayed in Fig. 1. Let us further assume that there is some resource traversing the network, and this resource should be present at vertices A through F with probabilities proportional to the radii of the solid circles marking them, i.e., 0.26, 0.24, 0.12, 0.09, 0.16 and 0.13, respectively. Now the question that we face is the following: how should we guide the resource, so that its expected presence over the nodes (marked by dashed circles) match the desired input distribution (solid circles)?
Illustration of the problem of finding transition probabilities for some desired stationary distribution. The radii of the nodes is proportional to their share from the stationary distribution Figure 1(a) illustrates that letting the resource to choose its upcoming destination uniformly at random from the directly accessible neighboring states-as it is done in vanilla PageRank algorithm-can provide some states with excess importance (e.g., states C and D) and leave some of the states under-represented at the same time (cf. states A and E).
An alternative to the uniform routing scheme could be if we chose succeeding states proportional to the desired stationary distribution of the neighboring states. As Fig. 1(b) depicts, however, such a strategy might not provide any real benefit over uniform routing. Figure 1(c) demonstrates the behavior of our proposed algorithm. Note that in this figure the radii of the dashed and solid circles practically concur as opposed to Fig. 1(a) and Fig. 1(b). This means that performing resource routing according to the proposed transition probabilities would result in a stationary distribution closely matching the desired one.
For this previous motivating example our algorithm is able to find a close to perfect solution, however, we should note that in the general case it is possible that for a certain network topology and expected PageRank scores no perfect solution exists even in theory. We deal with the general solvability of our problem in Sect. 5.3.

Related work
There have also been previous work with similar motivation to ours [22,23], i.e., to determine transition weights in some complex network. The core motivation in [22] and [23] are identical to ours, however, they employ different approaches.
Inspired by Luce's choice theory [22][23][24] learn weights for the vertices of some network in an iterative fashion and infer transitional probabilities for the edges based on them. The reliance on Luce's choice theory lets [22,23] to learn only O(|V |) parameters for some network G = (V , E) and use those parameters for inferring the O(|E|) > O(|V |) transition probabilities. Note that the classical Metropolis-Hastings algorithm [25,26] is also applicable to obtain state transition probabilities of a Markov chain for some desired stationary distribution, however, as stated in [22], "very little is known about it from a formal point of view, especially, its mixing time and rate of convergence". ChoiceRank [23] is based on and involves an iterative EM-type inference algorithm for trying to recover the marginal popularity of the nodes in the network provided as an input to the algorithm. Unlike [22], the authors of [23] do not require the input network to be strongly connected.
There exist further work on the estimation of transition probabilities in Markov chains that assume temporal access to observing transitions from the network [27,28]. [29] proposed the inverse Markov chain problem, in which the authors try to predict the number of transitions in a transportation network in a partially observable setting. The problem formulation in [29] is similar to ours, but also differs in important aspects. [29] assumes that the visit frequencies for a subset of verticies V o ⊂ V are observed, whereas our algorithm requires a distribution (the PageRank vector) defined for all the vertices. Due to the different setting, the loss function employed in [29] also differs from the one we utilize. Partial observability in [29] also apply to the edge transition frequencies over V o × V o , i.e., the algorithm has an additional (and optional) term in the loss function regarding the actual edge transition frequencies over a subset of the edges in the network. Another difference is that the objective function applied in [29] was optimized using the natural gradient, whereas we used a quasi-Newton optimization technique, which suites better problems of the scale our experiments involved. Perhaps most importantly, the calculation of the gradient of the objective in [29] is quadratic in the number of (observable) vertices, which makes its application prohibitive even for moderate sized networks.
The Supervised Random Walk approach [2] aims at learning an appropriate weighting function over the edges of complex networks by minimizing the sum of margin-based soft losses between positively and negatively labeled nodes. The margin-based loss employed in [2] includes the stationary distribution of random walks with restarts from multiple source nodes. [2] assumes that each edge (u, v) can be described by a low dimensional feature vector ψ uv , whereas our model does not require access to any additional meta-data in order to calculate edge features. Since we do not regard edge weights as a parameterized function of feature values, our proposed approach is also applicable in scenarios when edge features, such as edge age-an example feature used in [2]-are not accessible.
Various studies have been conducted recently related to Wikipedia about trying to predict missing links [30,31], improving and analyzing navigability [32][33][34][35]. More generally, the field of network tomography [36] focuses on statistical methods that can be used for making inferences about unknown edge properties of (typically computer) networks that could be based on node-level aggregated information. Network tomography applications vary vastly, i.e., the measurements may be collected passively by monitoring natural traffic flows or actively by generating probe traffic. Our setting belongs to the former category of passive approaches. Typical network tomography approaches, such as [37,38] for recovering the unobserved random variables operate via Expectation Maximization and Markov Chain Monte Carlo Methods.

Notation and background
In this section, we briefly describe the original PageRank algorithm and also introduce our notation for the rest of the paper. Markov chains can be used to define stochastic processes by providing the state space S = {S 1 , . . . , S n } and a row stochastic transition matrix P ∈ R n×n , the p ij element of which can be interpreted as the probability of observing state S j in time t given that state S i was observed at time t -1. The stationary distribution of a Markov chain characterized by transition matrix P can be determined by calculating the left principal eigenvector of P.
Given a graph G = (V , E), the PageRank algorithm determines a distribution of importance over the nodes in the network that we denote by π . We use subscripts for indexing a particular element of a vector. That is, π v denotes the PageRank score for node v.
The PageRank model assumes that the prestige of any node depends on the importance of those nodes which link to it in a recursive manner. Formally, π v can be defined as where p uv denotes the probability of traversing from state u to v. For some node u, p uv is uniformly set to 1 d u , d u denoting the outdegree of node u. According to the matrix formulation, the PageRank vector is the stationary distribution of the Markov chain parameterized by P, which is guaranteed to uniquely exist for ergodic Markov chains. In order to handle non-ergodic networks as well, Page et al. [12] suggests the following calculation for the PageRank scores: where the damping factor β is typically chosen from the interval [0.1, 0.2]. This extension ensures the Markov chain to be aperiodic and irreducible as a non-zero probability (proportional to the damping factor) is reserved for direct transitioning between any pair of nodes in the network. Equation (2) reveals that all the outgoing edges of a particular node are still weighted in a uniform manner, i.e., proportional to the outdegree of that node. In the following, we introduce our model for learning edge-specific transition weights for a given target distribution over the network.

Reverse PageRank algorithm
We next formulate the model utilized for learning the transition weights of Markov processes with predefined PageRank scores. We coin our approach as the Reverse PageRank algorithm that we shall refer as ReversePR for short. Throughout the rest of the paper, we denote the desired PageRank vector we wish our weighted random walk with restarts to converge by π * .
In the following, we denote the parameters we optimize for and the particular PageRank scores the nodes receive after convergence by Θ and π(Θ), respectively. The edge weight associated for a particular edge (u, v) ∈ E is going to be proportional to exp(θ uv ). More concretely, the probability of directly traversing between any pair of nodes (u, v) ∈ V × V is given by As the above formulation is invariant to a constant additive factor, that is p uv (Θ) = p uv (Θ + c) for any c ∈ R, infinitely many Θ can yield the same p uv values. In order to overcome this and to decrease the number of variables in the model by O(|V |), we ensure that for each node u, there is one parameter in θ u * , i.e., among the parameters controlling for the transition probabilities starting from node u, which is constantly kept at the value of zero (cf. the comment in line 11 of Algorithm 1). Our overall goal is thus to find a feasible value of Θ for which π(Θ) lies as close to π * as possible. More formally, given a graph G = (V , E), the objective function we seek to minimize is i.e., the Kullback-Leibler (KL) divergence between distributions π * and π(Θ).
Algorithm 1 Efficient computation of the approximate gradient of Eq. (5) π ← WeightedPageRank(G, Θ, β) β denotes the damping factor 5: for all u ∈ V do 6: for all v ∈ AdjacentVertices(G, u) do 8: i ← 0 10: for all v ∈ AdjacentVertices(G, u) do 11: if i > 0 then we keep one of the θ u * values as 0 12: 13: In case of Eq. (5), the partial derivative with respect to a single parameter θ ij has the form revealing that we need to sum over all the vertices in order to calculate a single partial derivative. Furthermore, since π v (Θ) is recursively entangled with θ ij , the partial derivative within the summation would require the application of a PageRank-style iterative approach [2,39]. As a consequence, the gradient of Eq. (5) can be calculated in O(|E| · (|E| + |V |)) time, implying that the exact calculation of the gradient is beyond practical applicability even for moderate sized graphs. In order to circumvent the difficulty of calculating the gradient of Eq. (5), we propose an alternative approximate gradient which can be computed in a highly effective O(|E|) manner. As stated earlier, the main bottleneck during the computation of the gradient of Eq. (5) arises from the presence of ∂π v (Θ) ∂θ ij in Eq. (6). Due to the recursive formulation of PageRank, π v (Θ) can be expressed as (u,v)∈E p uv (Θ)π u (Θ), which means that In our approximation, we make the simplifying assumption that the π u values are not dependent on Θ, canceling this way out the (otherwise computationally still expensive) second term from Eq. (7). Together with the fact that the transition probabilities parameterized by Θ have the partial derivative the partial derivative from Eq. (6) can be conveniently approximated as The nice property of Eq. (9) is that its value can be calculated in O(1) amortized time. This is due to the fact that the last sum in Eq. (9) can be reused for the calculation of multiple partial derivatives-i.e., for any of the variables θ i * . This observation implies that the entire gradient of the objective function can be calculated in O(|E|) time as also illustrated in Algorithm 1.

Details on the optimization
A key property of optimization algorithms is their time needed for convergence. As opposed to first order methods, second order methods are known to have the ability of converging faster to an optimum. The application of second order methods for optimizing Eq. (5), however, would be strictly prohibitive. This is because second order methods rely on the inverse of the Hessian of the objective function which has O(|E|) parameters according to our problem formulation-with |E| potentially being in the range of 10 9 s for large real-world networks.
In order our algorithm to be applicable for large-scale networks as well, we applied the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [40] quasi-Newton optimization technique. The usage of L-BFGS is advantageous since it tries to mimic the behavior and the fast convergence of second order methods, while only requiring the computation and storage of the gradients of the objective function up to a fixed number of previous iterations.
Similar to first-order optimization methods, L-BFGS also has hyperparameters, such as the step size for standard gradient descent, however, we did not investigate tuning these hyperparameters for obtaining better results. Instead, we relied on the default hyperparameter settings in this regard of the MALLET package that we used for performing the optimization.
Confirmed by our preliminary experiments, relying on the calculation of the proposed approximate gradient not only sped up the optimization process by orders of magnitude, but we also managed to obtain better solution in terms of our objective function introduced in (5). Our conjecture for the latter phenomenon is that the proposed approximate gradient is strictly smaller than the actual gradient, thus missing good local optima during the optimization of the non-convex objective function was less likely.

The non-convexity of the objective
For an illustration of the non-convexity of the objective function already with a single parameter, see Fig. 2. Note that the non-convexity of the objective implies that different initializations for Θ can result in convergence to different local optima. According to our empirical evidences-that we discuss in more details in Sect. 6.3.2-the quality of the local optima according to multiple evaluation criteria did not differ substantially.
Furthermore, we have experienced that initializing Θ homogeneously with all zeros can be regarded as a generally robust way to initialize model parameters. We should also add that the optimization procedure we employ is deterministic in the sense that once initialized identically, it always converges to the same (local) optimum when applying the same hyperparameters for L-BFGS (that we kept constant at their default values throughout our experiments).

The general solvability of the problem
We finally comment on the existence and the general solvability for the proposed problem formulation. A trivial solution always exists-i.e., choosing p uv proportional to π * vfor fully connected networks with π * obeying the inequality β |V | ≤ min v∈V π * v . Note that when the inequality is violated, then the node corresponding to the smallest value in π * already receives a larger then desired amount of distribution based on its share from the damping factor ( β |V | ), hence a perfect solution cannot be given. On the other hand, for any fully connected network obeying the above inequality, it suffices to set all the rows of its corresponding transition matrix identically proportional to π * -β |V | in order to obtain an objective value of zero.
In the more common scenario, when we are not given a fully connected network, an optimal solution with a KL-divergence of zero might not exist for certain input distributions. It can be easily verified that whenever node u has a single outgoing edge towards node v, π v (Θ) ≥ π u (Θ) should always follow. If the desired distribution behaves contradictory for such a pair of vertices, that is π * v < π * u , then a perfect solution in terms of KL-divergence cannot be given. Figure 3 illustrates a further example for which no perfect solution can be determined, even though it obeys the previously stated necessity condition.

Comparison to existing algorithms
ReversePR differs from [22,23] in that it treats the transition probability of every edge in the network as a separately tuneable parameter. As the number of edges are typically larger than the number of vertices, it makes the problem formulation of ReversePR underdetermined. [22] and [23] follows a different route, as they learn a separate parameter for each node in the network and they infer the edge transition probabilities based on the node-level parameters they learn.
Another difference between ChoiceRank and ReversePR is that the former receives the aggregate incoming traffic and outgoing traffic of each individual node as input, whereas for the ReversePR algorithm it suffices to have access solely to π * . Unlike [29], ChoiceRank and ReversePR share the common trait that they both rely on the full topology, i.e., the entire edge structure are assumed to be observable (without the actual traffic flowing over them).
Finally, the algorithm in [22] assumed that the input network is strongly connected. Note that ReversePR does not require such an assumption as it conveniently turns any input network into ergodic by employing a damping factor similar to the PageRank algorithm.

Experiments
This section contains our experimental results conducted over the English Wikipedia, a Hungarian news portal and the bicycle sharing network of New York City. We used the machine learning package MALLET [41] to perform the L-BFGS optimization over our objective function.
As already mentioned in Sect. 4, PageRank implementations often employ a damping factor β to overcome difficulties arising with non-ergodic networks. Unless stated otherwise, we conducted all our experiments using the default damping factor of β = 0.01. We purposely chose such a small damping factor in order not to severely alter the random walk applied over the original network, while ensuring ergodicity.

Baselines
In order to measure the effectiveness of the proposed algorithm, we compare it to several strong baselines. As our algorithm makes use of the topology of the network (E) and the expected PageRank scores (π * ), we derived various baselines relying on these sources of information. The different approaches utilized in our experiments are summarized in Table 1.
The Random strategy is the most simplistic of all as it assigns transition probabilities in an uniform manner when determining it for some node. The Indegree strategy assigns the transition probabilities based on the number of indegrees some node has. As the number of incoming edges can be a good proxy to the hubness of the nodes, we anticipated this kind of heuristic to work particularly for transportation networks in particular, such as the Citibike network.
The heuristic dubbed Jaccard takes into consideration the extent to which the neighborhood of some potential target node v overlaps with the source node u when providing predictions for p uv . Note that this strategy can only assign some nonzero probability between pairs of nodes which have at least one neighbor in common. Another property of the Jaccard heuristic is that it assigns higher values for pairs of nodes for which there is little difference in their direct neighbors. This strategy seems less useful for transportation networks, i.e., from a real world perspective one would expect low traffic between airports from which the directly accessible destinations are highly similar. For networks, where semantics matter more, such as the Wikipedia network, the Jaccard baseline is expected to provide more accurate predictions, since high similarity in the neighborhood of nodes can be an indication towards high topical relatedness.
The Traffic baseline simply chooses the next node to visit based on the marginal popularity of the nodes, proportional to π * . Obviously, the choice which is the most popular in general is not necessarily the most popular one in every situation. Predictions for the PageRank approach are determined based on the PageRank scores of such a random surfer, which performs random walk in an unweighted manner, i.e., following the next node to visit uniformly at random.
Our baselines also include ChoiceRank [23], a recent algorithm building upon Luce's choice theorem [24] for identifying network traffic preferences. We relied on the publicly available open-source implementation of the algorithm a during our experiments. ChoiceRank operates on the marginal indegree and outdegree sequence of the nodes and tries to recover them by calibrating a score for each node in the network. As our proposed algorithm has sole access to a desired distribution over the nodes, we treated the indegree and outdegree of some node u as (v,u)∈E π * v and (u,v)∈E π * v , respectively. Although the model proposed in [22] is also intended to solve our task, we did not consider it as a potential baseline, since its assumption on the strong connectedness of the input network does not generally hold for the networks used in our experiments.

Evaluation metrics
For some node u, we measure the quality of the predicted distribution over its outgoing edges by comparing it to the ground truth transition probability distribution from that node. We rely on four distinct evaluation measures that we describe below.
Following [23], we apply the normalized link displacement scores and KL-divergence between the ground truth and the predicted distributions for evaluation. Normalized link placement for some node u is calculated as where σ u (v) * and σ u (v) returns the ordered rank of node v within the neighbors of node u according to the ground truth and the predicted transition probabilities, respectively. The KL-divergence that we use for evaluating the quality of the learned transition probabilities gets calculated for some node u in the network as An important property of (11) is that it is only defined for such cases when p * uv > 0 also implies p uv > 0 to be true. That is, we cannot measure the KL-divergence when there is a non-zero ground truth transition probability, but a model assigns zero probability for it. This is hence a major shortcoming for the Jaccard similarity based heuristic that it can only assign a non-zero transition probability between pairs of nodes which have at least one neighbor in common. Since this criterion was not fully met for all the nodes involved in our experiments, we were not able to calculate the KL-divergence based evaluation for the Jaccard baseline.
Besides the aforementioned scores, we also measure the quality of the predicted transition probability distributions by quantifying their root mean squared error (RMSE) as in [22]. For some node u, the RMSE of the predicted transitions is calculated according to Additionally, we devise a fourth evaluation criterion, namely mean reciprocal rank (MRR) having its roots in the information retrieval community. This criterion is intended to measure the extent to which a model has the ability of ranking those transitions high that are actually the most frequent ones according to the ground truth data. That is, for a given node u, we calculate the value , with σ (·) denoting the same function as in (10), i.e., it returns the rank of the node in its argument relative to the predicted transition probabilities with respect to u. MRR is then obtained by averaging the individual scores for the different nodes.

Experiments on the link structure of Wikipedia
For these experiments, we take the hyperlink structure of the English Wikipedia and try to estimate its click-through rates solely based on the distribution that we derive from the page visit statistics of the individual articles. We conducted our experiments on the 20160305 dump of the English Wikipedia, b which consist of more than 1.23 * 10 7 articles and 3.87 * 10 8 hyperlinks. Figure 4 illustrates the typical long tail degree distribution of the Wikipedia network we had access to.
As an additional resource in our evaluation, we relied on the publicly available raw page visit statistics c for calculating π * . That is, we derived the desired distribution over the pages based on the individual page visit statistics of the Wikipedia articles.
The running time of ReversePR was approximately 40 minutes on a 2.00 GHz Intel Xeon E7-4820 processor for the above described Wikipedia network. For ease of reproducibility, we make the Wikipedia network as well as the π * vector that we based our experiments on readily accessible. d

Predicting the most likely cross-article clicks
For evaluation purposes, we utilize the Wikipedia Clickstream dataset [42] which includes referral statistics for March 2016 over the entire English Wikipedia. This valuable source of information makes it possible to calculate p * uv , i.e., the ground truth click-through rate between a pair of Wikipedia articles (u, v).  Obviously, predicting the most likely article to visit next is much easier (purely by chance) for articles with a low outdegree as opposed to articles with a large numberpotentially thousands-of neighboring articles to choose from. For this reason, we report our performance metrics as a function of the outdegree of the articles.
Results are reported in Fig. 5 according to the metrics already introduced in Sect. 6.2. It can be seen in Fig. 5, that the rankings produced by the ReversePR algorithm is dominantly better compared to the alternative baselines.
The baseline which relies on the popularity of articles can only produce reliable predictions for articles with extremely low outdegree and mostly for those evaluations that focus more on the ranking capabilities of the methods instead of the accurateness of the actual transition probabilities.
In particular, the popularity-based Traffic baseline delivers the best performance in terms of its replacement and MRR scores for articles not having more than five outgoing links. Its performance, however fades out quickly for articles containing more hyperlinks. Inspecting Fig. 4 reveals that only a minority of Wikipedia articles belong to the category on which the popularity baseline could give reliable predictions.
It is the Jaccard similarity-based approach alone which shows a similar overall ranking ability to ReversePR. When approximating p * uv , the Jaccard baseline compares the extent to which the set of hyperlinks to be found on Wikipedia article u overlap with those included on article v. As a consequence, the Jaccard similarity-based approach makes much less reliable predictions for Wikipedia articles with a small number of outgoing links. The inability of the Jaccard similarity-based heuristic to give accurate predictions for nodes with low degree is most pronouncedly showcased by the RMSE results. The fact that the KL-divergence evaluation cannot be calculated for the Jaccard baseline illustrates another weakness of that baseline, i.e., it sometimes assigns zero transition probabilities to transitions for which there is a nonzero ground truth transition probability in reality. Figure 5 demonstrates that even random guessing (i.e., the Random baseline) can outperform the Jaccard approach for articles with an extremely low number of outgoing links. We can notice that a large fraction of the Wikipedia articles fall into the category for which Jaccard baseline performs poorly, i.e., half of the Wikipedia articles had an outdegree of at most 61 for the investigated snapshot of Wikipedia (cf. Fig. 4).

Sensitivity analysis of ReversePR
Our algorithm is deterministic in the sense, that when initialized by the same parameters, it converges to the same solution. There could nonetheless still be some variability in the results depending on how we initialize our parameters Θ. This potential variability comes from non-convexity of our objective (cf. Sect. 5.2).
As such we cannot guarantee that every initialization of Θ would converge to the sameor even to an equally good-solution. Our default edge transition initialization strategythat we applied in all our previously reported experiments-is to set the node transition distributions according to the vanilla PageRank algorithm, i.e., we deterministically set all edge probabilities uniformly. This means that initial transition probabilities for some node with outdegree d were initialized to have a value 1 d . Practically, this means that all parameters of Θ got initialized to zero by our default strategy (that we denote as Initialization 0 hereon).
In order to assess the sensitivity of ReversePR towards the initial choice for Θ, we repeated our experiments by applying random initializations for Θ as well. This strategy simply set the parameters in Θ independently to random values sampled uniformly from [0, 1].
Besides the previously mentioned default initialization (Initialization 0), we executed ReversePR nine additional times, having Θ randomly initialized as described above. Figure 6 contains the evaluation scores for all the initializations, suggesting that random initialization does not play a crucial role in the performance of ReversePR in terms of RMSE and KL-divergence. Furthermore, we can see practically no variability in the results for any of the evaluations between the different random initializations of Θ.
Even though the default initialization performs similarly to random initializations for the KL-divergence and RMSE evaluations, they dominantly perform better for the remaining two ranking-oriented metrics, i.e., we can see a clear benefit from initializing Θ according to the all-zeros default strategy in terms of the displacement and MRR scores. Table 2 includes the aggregated numeric metrics for the default initialization strategy as well as the summary statistics (mean and standard deviation) of the metrics for 9 random initializations.
To summarize, random initialization of the parameters we optimize for does not account for a large variability in the quality of the solution ReversePR converges to. Furthermore, our deterministic strategy to initialize all parameters to zero seems to provide as good or even better quality in all evaluation aspects, for which reason we return to the usage of our default initialization strategy for all our upcoming experiments.

The effects of damping factor
Another aspect of ReversePR which could potentially affect the quality of the solution it converges to is the choice of the damping factor β. We have mentioned earlier that initially,  we set the damping factor to 0.01 so that the random walk does not deviate considerably from the true topology of the network. Here we report our results for ReversePR when evaluated using four substantially different damping factors, β ∈ {0.01, 0.05, 0.1, 0.2}. Figure 7 illustrates the results that we recorded for our experiments when we applying different damping factors. We can draw similar conclusions for we did earlier for the effects of random initializations in Sect. 6.3.2. That is, the choice of the damping factor does not seem to have a decisive impact on the results obtained, ReversePR behaves rather robustly for the different β values.
We can nonetheless observe some small differences again for the two metrics which focus on the quality of the rankings and not the actual transition probabilities, i.e., the displacement and the MRR scores. Choices for larger values of β tend to slightly favor low degree nodes, whereas higher degree nodes benefit more from the application of smaller values of β. This can be explained by the fact that nodes with large degree are more likely to be a core member of the network, whereas lower degree nodes tend to be located at the periphery of the network. These are the nodes on the periphery which benefit more from a more frequent restart of the random walk, which corresponds to a higher value  of β. We report the aggregated metrics that we obtained for the different choices of β in Table 3, corroborating that smaller values of β seem to perform better for the rankingbased metrics and that the other two metrics (KL and RMSE) behaved mostly insensitive to the different choices of β.

Qualitative assessment of the different approaches
In this section, we evaluate the different approaches from a qualitative perspective. While these examples are not intended for drawing far-reaching conclusions, they can complement the more rigorous quantitative evaluation and possibly provide additional insights into the behavior of the different approaches.
Since the Jaccard similarity-based approach is heavily relying on the local vicinity of articles, it makes the Jaccard approach less capable for adapting to the changes in temporal user preferences compared to ReversePR. This phenomenon is illustrated in Table 4 which lists the titles of Wikipedia articles when following the most likely transitions up to two hops, starting out from the Wikipedia article about Machine learning, Stephen Hawking and Brie Larson.
The articles predicted by ReversePR-starting at the article Machine learning and traversing through Deep learning-illustrates its ability to reflect temporal dynamics of Further inspection of Table 4 reveals that multiple baseline approaches favor transitions towards nodes with high indegree centrality, such as International Standard Book Number. Not pruning such articles from the network was a deliberate choice in order to preserve the original topology of the network-instead of applying any pragmatic (but somewhat arbitrary) removal of vertices from the network. Table 4 also suggests that Re-versePR seems to treat nodes with high indegree centrality but low interest better than the baseline approaches.

Experiments on the Clickstream subnetwork of Wikipedia
In our experiments reported so far, we were relying on the entire Wikipedia network from March 2016. Our motivation for not performing any filtering of the vertices or the hyperlink structure was to preserve the "natural topology" of Wikipedia.
The Wikipedia network used in [23] covers the same period of time (March 2016), however, it contains only that subnetwork of the entire Wikipedia that is included in the Clickstream dataset [42]. As a consequence, the Wikipedia network included in [23] is approximately one order of magnitude smaller compared to the network we have reported our experiments earlier. The subnetwork that is purely based on the Clickstream dataset consists of approximately 2.32 * 10 6 nodes and 1.32 * 10 7 hyperlinks, whereas the entire Wikipedia network from the same period consists of 1.23 * 10 7 articles and 3.87 * 10 8 hyperlinks, respectively.
For the sake of better comparability with the experimental protocol included in [23], we additionally report another set of experiments on the English Wikipedia. This time we restricted our (sub)network to be the one that is included in the Clickstream dataset in the exact same manner how it was done in [23].
The gold standard transition probabilities for our previous experiments also relied on the same Clickstream data, however, the network itself could also include such nodes and edges that were not necessarily part of the Clickstream dataset. For our Clickstream experiments, we determined π * proportional to the traffic that was included in the Clickstream dataset.
We include the results according to the different metrics obtained for the Clickstream network in Fig. 8. Comparing Fig. 8 with Fig. 5-which includes the same metrics over the large Wikipedia network-we can see similar trends. That is, we can observe that Re- versePR either performs better than alternative approaches by a fair margin or it is competitive to the rest of the approaches. The metrics are in general better compared to the ones received for the entire Wikipedia network, which can be explained by the fact that the vertices in the Clickstream network have fewer connections on average and making predictions for a smaller-sized neighborhood is an easier task in general.

Experiments on the Kosarak dataset
The Kosarak dataset e includes anonymized click-stream data of a Hungarian on-line news portal including 41,001 vertices and 974,560 edges based on 7,029,013 clicks. Kosarak is similar to Wikipedia in the sense that it also contains web traffic data, but it also differs from the Wikipedia in important aspects.
On the one hand, the size of the network induced from the Kosarak dataset is substantially smaller from the hyperlink network of Wikipedia. Additionally, in the case of Kosarak dataset we see a much higher density of edges in contrast to the Wikipedia network, i.e., the fraction of edges and the number of potential edges in the network is 1 * 10 -6 and 4 * 10 -8 for Kosarak and Wikipedia, respectively. Figure 9 contains a comparison of all the evaluated approaches regarding their performances on the various evaluation metrics on the Kosarak dataset. Figure 5 contains the same evaluation metrics for the Wikipedia network, which lets us compare the two results.
Besides the resemblances, we can also spot differences between Fig. 5 and Fig. 9. We can see for instance that the Traffic baseline achieves the best MRR score on the Kosarak dataset. In contrast, the Traffic baseline was only performant up to articles with medium outdegree for Wikipedia. Another difference is that, even though the Jaccard baseline was The differences can potentially be explained by the different nature of the two networks, i.e., Wikipedia has a different link structure due to its encyclopedic nature compared to the Kosarak network which was obtained from a news portal. A further important difference is that the Kosarak network contains only such edges that were included in at least one browsing session. In contrast when experimenting with Wikipedia, we could exploit the entire network structure (including those links which received no actual traffic). To this end, we regard the Wikipedia-based evaluation more meaningful and realistic.
Interestingly, the performance of the Jaccard approach is similarly good for the Kosarak dataset when evaluated in terms of displacement. What it means that the transition probabilities inferred from the Jaccard heuristic is generally applicable, even though it is not so reliable for determining the single most likely transition compared to other approaches.
The Traffic baseline behaves oppositely on the Kosarak network, i.e., it has the highest displacement scores even though it was performing best in terms of MRR. Again, what it means, that the setting the transition probabilities could help us find the most likely transition, but in general the ranking it provides might be quite dissimilar from reality.
ChoiceRank provides a steady performance on the Kosarak dataset regarding those evaluations which solely focus on the ranking of the neighboring vertices, as it obtains one of the highest MRR and a medium displacement score. ReversePR has lower MRR and higher displacement values, however, it outperforms all alternative approaches in terms of the KL-divergence and RMSE metrics by a fair margin. These metrics can be viewed more informative for the Kosarak dataset, where providing a good ranking is much easier in most of the cases, since the median outdegree in this network is only 4. This peculiarity of the dataset explains the good performance of the Random baseline as well.

Experiments on Citibike network
We additionally considered the dataset of the bicycle sharing system of New York City . f The dataset contains sessions of bike rentals in the form of pick-up and drop-off station pairs with additional meta-data such as the date and duration for the rides.
We conducted our evaluation on the same network as in [23]. That is, we relied on those source-target destination pairs from year 2015 for which the duration of the ride was no more than one hour and which had an average daily ride frequency of at least one. In the end, we were dealing with a network of 489 nodes and 5209 edges from 3.38 million bike rentals in total.
We performed the same quality assessment on the Citibike network as for the previously analyzed networks. The result of our evaluation is depicted in Fig. 10. Note that ReversePR refers to our algorithm using its default version, i.e., when Θ gets initialized to all zeros and the damping factor β equals 0.01. Figure 10 illustrates that ReversePR scored the best for all evaluation metrics when evaluated on the Citibike network. ChoiceRank obtains the second best results for all evaluation criteria, whereas the Jaccard baseline was hardly able to beat even the Random baseline for this network. This observation verifies our anticipation for the Jaccard baseline being less suitable for transportation networks (cf. Sect. 6.1).
By looking at Fig. 10, we can additionally notice that the simple strategy of predicting transition probabilities proportional to the indegree of the neighboring vertices provides We can thus conclude that the dynamics driving the bike rental network makes different baselines perform better in predicting the transition probabilities compared to web traffic networks.
In summary, the relative performance of the individual alternative baseline approaches vary remarkably based on the peculiarities of the analyzed networks, whereas ReversePR always performed close to or better than the best performing baseline approach in all of our experiments.

Summary of the experimental results
We finally provide a thorough comparison between ReversePR and ChoiceRank in Table 5 both in terms of their speed and overall performance on the three notably different network structures we conducted our experiments on. The median outdegree was 61 for the Wikipedia network and 4 in the case of Kosarak and Citibike datasets. Table 5(a) summarizes the most important statistics of the different input networks alongside with the amount of time needed for the different approaches to approximate the transition probabilities. We can see that the size and density of the three input networks differ substantially, and that ReversePR managed to find a solution at least 1.5-2 times faster than ChoiceRank. We find it remarkable-even if part of the speedup could be accounted for the usage of different programming languages for implementation (Python for ChoiceRank and Java for ReversePR)-since the optimization problem solved by ChoiceRank involves only |V | parameters, whereas ReversePR optimizes over O(|E|) |V | parameters.  Table 5(b) and Table 5(c) compares the mean and median of the evaluation measures that were calculated over all the vertices of the different networks with an out-degree larger than one. We can see that other than for the displacement and MRR scores on the Kosarak dataset, the performance of ReversePR is as good as that of ChoiceRank for every combination of evaluation criteria and input network.
We additionally performed paired t-test and Mood's median test to see whether the differences in the mean and median scores of the two approaches differ statistically significantly. We indicate it in Table 5(b) and Table 5(c) when the difference was found to be significant (with p 0.01). Using the non-parametric Wilcoxon sign test for checking whether the differences of the means over the vertices differ significantly also yielded the same results. We can see from Table 5(b) and Table 5(c) that all differences are significant for the click-stream networks, i.e., the Wikipedia, Clickstream and Kosarak datasets. ChoiceRank performed significantly better for the displacement and MRR metrics in the case of the Kosarak, whereas the metrics obtained by ReversePR were significantly better in all the other cases. For the smallest network, the mean and median evaluation metrics did not differ significantly between the two algorithms, which was likely due to the small size of the network.
As a final comparative assessment, we provide the distribution of the per vertex evaluation scores for all the datasets as a box plot in Fig. 11. We can observe that ReversePR is indeed among the best performing models for all networks and evaluation metrics, both in terms of its median and average scores. The advantage of ReversePR is the most pronounced for the KL-divergence and RMSE metrics, i.e., in those cases when the actual transitional probabilities are also considered not just the ranking they specify. The primarily benefit of ReversePR resides in its efficiency. As also highlighted in Table 5(a), ReversePR delivered a 2-8-fold speedup compared to ChoiceRank, with its evaluation metrics being competitive or even better than those of ChoiceRank. We achieved the above-mentioned speedup despite of the fact that our approach has O(|E|) parameters, whereas ChoiceRank operates with O(|V |) O(|E|) parameters.
As our model involves O(|V |) constraints, i.e., one for each element of π * , having O(|E|) variables makes our problem formulation overparameterized. This overparameterization allows us more flexibility in determining the p uv transitional probabilities, however, it also means that multiple configuration of edge weights could yield the same objective value. A potential solution for mitigating this phenomenon could be to incorporate various regularization terms into the objective for encouraging the transitional probabilities to have a high entropy, or to the contrarily incentivize finding highly skewed transitional probabilities. The kind of inductive bias conveyed by the regularizers could be determined in an application specific manner. We treat the incorporation of various regularizes as a potential future extension of our proposed algorithm.

Conclusion
We introduced an efficient algorithm for estimating the transition probabilities between the nodes of complex networks given some desired PageRank distribution over its vertices. We managed to solve the problem by formulating an approximate gradient of our objective