Cache-Efficient Approach for Index-Free Personalized PageRank

Personalized PageRank (PPR) measures the importance of vertices with respect to a source vertex. Since real-world graphs are evolving rapidly, PPR computation methods need to be index-free and fast. Unfortunately, existing index-free methods suffer from cache misses. They follow the state-of-the-art algorithm that first performs the Forward Push (FP) phase and subsequently runs the random walk Monte-Carlo simulation (MC) phase. Although existing methods succeed in reducing cache misses in the FP phase, an inefficient data layout limits their performance improvement. Moreover, existing methods have overlooked the importance of reducing cache misses in the MC phase. In this paper, we propose a cache-efficient approach that accelerates both the FP and MC phases. In the FP phase, we first reorder the data layout with low overhead. Specifically, we utilize the Breadth First Search result so that vertices near the source vertex are co-located in the reordered data layout. We subsequently perform an optimized FP, namely Distance-Extension Forward Push (DEFP). By preferentially performing FP around the source vertex, DEFP improves memory access locality. In the MC phase, we perform an optimized MC, namely Vertex-Centric Random Walk (VCRW). VCRW aggregates random walks at each vertex to eliminate redundant memory access for repeatedly obtaining neighbor vertices. We prove that most of the random walks can be aggregated while maintaining accuracy guarantees. Experimental results show that the proposed method is up to 4.7x faster than existing index-free methods and outperforms the state-of-the-art index-oriented method under rigorous accuracy guarantees.


I. INTRODUCTION
Personalized PageRank (PPR) [1] is one of the most popular graph computations for measuring the proximity of vertices. Among several PPR variants, such as single-target PPR, pairwise PPR, and fully PPR, we are particularly interested in the single-source PPR (SSPPR). Given a graph G = (V, E), a termination probability α, a source vertex s, and a target vertex t, the SSPPR score π(s, t) is defined as the probability that an α-decay random walk starting from s terminates at t in G. When performing an α-decay random walk, a random walker starting from s terminates at the current vertex with probability α or moves to a neighbor vertex with probability 1 − α. By this definition, the SSPPR scores can also be regarded as a measure of the relative importance of all vertices with respect to s. Accordingly, SSPPR has various real-world applications, such as spam detection [2], link prediction [3], social recommendation [4], community detection [5], [6], [7], and graph learning [8], [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Chong Leong Gan.
Nowadays, massive real-world graphs are evolving rapidly. In this scenario, index-oriented SSPPR computation methods [11], [12], [13] are impractical because they need to hold huge indices and update them frequently. Therefore, SSPPR computation methods need to be index-free and fast.
FIGURE 1. Vertex 2 in Fig. 1(a) and vertex 0 in Fig. 1(b) access neighbor vertices' data. Even though the two vertices have identical positions in each graph, memory access locality in Fig. 1(b) is better than that in Fig. 1(a).
The state-of-the-art index-free methods [14], [15] follow the FORA algorithm [16] that first performs the Forward Push (FP) phase and subsequently runs the random walk Monte-Carlo simulation (MC) phase. Although these methods have steadily accelerated SSPPR computation, they still suffer from a large number of cache misses in both FP and MC phases.
In the FP phase, irregular memory access primarily causes frequent cache misses. As shown in Fig. 1(a), a real-world graph tends to have an unstructured data layout. Assuming that vertex 2 accesses its neighbor vertices, irregular memory access occurs because these neighbor vertices are not co-located on memory. On an unstructured data layout, every vertex repeatedly accesses neighbor vertices until the FP phase finishes, which leads to a significant number of cache misses in total. Since existing methods [14], [15] optimize the FP phase without considering the data layout, their performance improvement is insufficient. Graph reordering [17], [18], [19], [20], [21], [22] is widely used to optimize the data layout by relabeling vertex IDs. As shown in Fig. 1(b), a reordered data layout improves memory access locality. In general, reordering methods assign close IDs to frequently accessed vertices so that processors can reuse cached data. Although existing reordering methods can accelerate various graph computations, their time-consuming procedures to find an optimal relabeling cause an end-to-end slowdown. Therefore, to alleviate irregular memory access, we need to conduct lightweight reordering that captures the memory access pattern of the FP phase.
In the MC phase, cache misses are mainly caused by redundant memory access. For accuracy guarantees, existing methods perform a large number of α-decay random walks sequentially to memorize the starting vertex of each random walk. Through this sequential process, each random walk needs to obtain neighbor vertices for every single step to decide the next destination. Therefore, we need to obtain neighbor vertices multiple times at each vertex, which leads to redundant memory access. Notably, existing index-free methods have only focused on optimizing the FP phase, and they overlook this redundant memory access. To overcome this problem, we need to reduce the total number of operations to obtain neighbor vertices at each vertex.
In this paper, we propose a cache-efficient approach that significantly accelerates the FORA algorithm. To reduce cache misses, we focus on optimizing the computational procedure of both FP and MC phases. While some methods [14], [23] accelerate the FORA algorithm by modifying the timing at which the FP phase switches to the MC phase, examining an appropriate switching timing lies outside the scope of this paper. To optimize the FP phase, we first conduct lightweight reordering. We observe that vertices near the source vertex are frequently accessed during the FP phase. Therefore, the FP phase can be accelerated if vertices near the source vertex have close IDs. To realize this, we reorder the data layout according to the Breadth First Search (BFS) result from the source vertex. This reordering can preferentially assign close IDs to vertices near the source vertex with low overheads. To fully utilize the reordering result, we introduce the optimized FP, namely Distance-Extension Forward Push (DEFP). DEFP executes FP within k distances from the source vertex and gradually increases the k value, which can reduce the total number of operations during the FP phase. In the MC phase, we perform Vertex-Centric Random Walk (VCRW). VCRW aggregates random walks at each vertex to reduce the total number of operations to obtain neighbor vertices. This aggregation eliminates redundant memory access for repeatedly obtaining neighbor vertices at each vertex. We prove that VCRW can aggregate most of the random walks while maintaining accuracy guarantees.
Our contributions are summarized as follows:
• We propose a cache-efficient approach for fast index-free PPR computation. In the FP phase, the proposed method first conducts lightweight reordering according to the Breadth First Search result. On the reordered data layout, the proposed method performs Distance-Extension Forward Push. In the MC phase, the proposed method performs Vertex-Centric Random Walk. These techniques can significantly reduce cache misses.
• We conduct extensive experiments using six real-world graphs. Experimental results show that the proposed method is up to 4.7× faster than the existing methods. We confirm that the proposed method reduces cache misses on both the L1 cache and the L3 cache. Notably, the proposed method outperforms the state-of-the-art index-oriented method under rigorous accuracy guarantees.
This paper extends our previous work [24], which originally proposed the idea of aggregating random walks. While [24] measures the computational efficiency simply through the running time, this paper further investigates the cache performance, which is our major interest, in addition to the running time measurement. Additionally, this paper has the following three novelties. First, we focus on optimizing the FP phase by graph reordering and extend the survey to include graph reordering methods. Second, we theoretically prove that the proposed method guarantees the accuracy. Third, we conduct far more extensive experiments considering various application scenarios.
VOLUME 11, 2023
The rest of this paper is organized as follows. Section II states the problem definition. Section III discusses related work. Section IV describes the details of the proposed method. We evaluate the proposed method through extensive experiments using real-world graphs in Section V. The source code of the proposed method can be found online.¹ Finally, we conclude this paper in Section VI.

A. PROBLEM DEFINITION
Let G = (V, E) be a directed unweighted graph with a set of vertices V and a set of edges E, where n = |V| and m = |E| are the numbers of vertices and edges, respectively. For an undirected graph, we convert every undirected edge (u, v) into two directed edges (u, v) and (v, u). Let $N_{in}(v)$ (resp. $N_{out}(v)$) denote the set of in-neighbor vertices (resp. out-neighbor vertices) of v ∈ V, and let $d_{in}(v)$ (resp. $d_{out}(v)$) denote the in-degree (resp. out-degree) of v ∈ V. Given a source vertex s, the SSPPR score π(s, t) of t is defined as the probability that an α-decay random walk starting from s terminates at t in G. An α-decay random walk terminates at the current vertex with probability α or moves to an out-neighbor vertex with probability 1 − α. In addition, let dist(s, v) denote the shortest distance from s to v in G, $V_k = \{v \in V \mid dist(s, v) = k\}$ denote the set of vertices whose dist(s, v) = k, and $U_k = \{v \in V \mid dist(s, v) \le k\}$ denote the set of vertices whose dist(s, v) ≤ k. In this paper, we focus on the approximate SSPPR query, hereafter referred to simply as PPR (Definition 1). Table 1 summarizes the notations we frequently use in this paper.
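The α-decay random walk defined above can be illustrated with a minimal Python sketch. The adjacency-list input `out_neighbors`, the function name, and the rule that a walk at a dangling vertex jumps back to s (chosen to mirror how FP treats dangling vertices) are assumptions made for illustration, not part of the paper.

```python
import random

def alpha_decay_walk(out_neighbors, s, alpha, rng):
    """Run one alpha-decay random walk from s and return its termination vertex."""
    v = s
    while True:
        # Terminate at the current vertex with probability alpha.
        if rng.random() < alpha:
            return v
        nbrs = out_neighbors[v]
        # Assumption: a walk stuck at a dangling vertex restarts from s.
        v = rng.choice(nbrs) if nbrs else s

# Hypothetical 4-vertex graph; vertex 3 is dangling.
graph = [[1, 2], [2, 3], [0], []]
end_vertex = alpha_decay_walk(graph, s=0, alpha=0.2, rng=random.Random(0))
```

The fraction of walks from s that end at t approximates π(s, t) as the number of walks grows.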
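Definition 1 (PPR Query): Given a graph G = (V, E), a source vertex s, a threshold δ, an error bound ϵ, and a failure probability $p_f$, a PPR query returns the estimated PPR score $\hat{\pi}(s, t)$ for all t ∈ V, such that for any π(s, t) > δ,
$$|\pi(s, t) - \hat{\pi}(s, t)| \le \epsilon \cdot \pi(s, t)$$
holds with probability at least $1 - p_f$.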

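Algorithm 1: Forward Push
Input: Graph G, source vertex s, termination probability α, threshold $r_{\max}$
Output: residue r(s, t) and reserve $\hat{\pi}(s, t)$ for all t ∈ V
1 $\hat{\pi}(s, t)$ ← 0 and r(s, t) ← 0 for all t ∈ V;
2 r(s, s) ← 1;
3 while ∃t ∈ V such that r(s, t) > $d_{out}(t) \cdot r_{\max}$ do
    # Pushing operation
4   $\hat{\pi}(s, t)$ ← $\hat{\pi}(s, t)$ + α · r(s, t);
5   if $d_{out}(t)$ > 0 then
6     for each u ∈ $N_{out}(t)$ do
7       r(s, u) ← r(s, u) + (1 − α) · r(s, t) / $d_{out}(t)$;
8   else
9     r(s, s) ← r(s, s) + (1 − α) · r(s, t);
10  r(s, t) ← 0;
11 return r(s, t) and $\hat{\pi}(s, t)$ for all t ∈ V;

... representation for out-neighbor vertices in Fig. 2. Note that in-neighbor vertices can be represented in the same way.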

B. FORWARD PUSH
Forward Push (FP) [26] is a local update method. FP simulates random walks deterministically by repeatedly pushing probability mass to out-neighbor vertices. Algorithm 1 shows the pseudo-code of FP. FP maintains a residue r(s, t) and a reserve $\hat{\pi}(s, t)$ for each vertex t ∈ V. At the beginning, FP initializes r(s, t) and $\hat{\pi}(s, t)$ (Lines 1-2). A vertex t becomes active when r(s, t) > $r_{\max} \cdot d_{out}(t)$. An active vertex t executes a pushing operation that increases $\hat{\pi}(s, t)$ by an α portion of r(s, t) and transfers the remaining 1 − α portion of r(s, t) to $N_{out}(t)$ (Lines 4-7). If t is a dangling vertex, t transfers (1 − α) · r(s, t) to s (Lines 8-9). Vertex t finishes the pushing operation after resetting r(s, t) to zero (Line 10). When there are no active vertices, r(s, t) and $\hat{\pi}(s, t)$ are returned as the final residue and the PPR score, respectively (Line 11). The expected running time of FP is $O\left(\frac{1}{\alpha \cdot r_{\max}}\right)$. However, FP cannot provide any accuracy guarantees.
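The pushing operation described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' C++ implementation: the queue-based scheduling, the adjacency-list input, and treating the degree of a dangling vertex as 1 in the activity test (so that the loop provably stops in floating point) are all assumptions of this sketch.

```python
from collections import deque

def forward_push(out_neighbors, s, alpha, r_max):
    """Forward Push from source s; returns (reserve, residue) lists."""
    n = len(out_neighbors)
    reserve = [0.0] * n
    residue = [0.0] * n
    residue[s] = 1.0

    def is_active(v):
        # Assumption: a dangling vertex is tested against r_max (degree treated as 1).
        return residue[v] > r_max * max(len(out_neighbors[v]), 1)

    queue = deque([s])
    queued = [False] * n
    queued[s] = True
    while queue:
        t = queue.popleft()
        queued[t] = False
        if not is_active(t):
            continue
        r = residue[t]
        residue[t] = 0.0               # reset before distributing (handles self-loops)
        reserve[t] += alpha * r        # alpha portion becomes reserve
        nbrs = out_neighbors[t]
        targets = nbrs if nbrs else [s]  # a dangling vertex sends its mass back to s
        share = (1.0 - alpha) * r / len(targets)
        for u in targets:
            residue[u] += share
            if not queued[u] and is_active(u):
                queue.append(u)
                queued[u] = True
    return reserve, residue

# Hypothetical 4-vertex graph; vertex 3 is dangling.
graph = [[1, 2], [2, 3], [0], []]
reserve, residue = forward_push(graph, s=0, alpha=0.2, r_max=1e-4)
```

Every push only moves mass between residue and reserve, so the sum of all reserves and residues stays equal to 1 throughout.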

C. RANDOM WALK MONTE-CARLO SIMULATION
Random walk Monte-Carlo simulation (MC) is a classic and straightforward solution to answer PPR queries [27], [28]. MC performs ω random walks from s and measures the fraction of random walks that terminate at t. MC then uses this fraction as the estimate $\hat{\pi}(s, t)$. To satisfy Definition 1, MC needs random walks [27].
Assuming that a given graph is scale-free, where the number of edges is m = O (n · log n), the expected running time is bounded by O , which is infeasible with massive real-world graphs.
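As a rough illustration of the MC estimator, the sketch below runs a fixed number of α-decay walks from s and records where they stop. The parameter `num_walks` stands in for ω, and the restart-at-s rule for dangling vertices is an assumption of this sketch; choosing ω to meet Definition 1 is outside its scope.

```python
import random

def monte_carlo_ppr(out_neighbors, s, alpha, num_walks, seed=0):
    """Estimate pi(s, t) for all t as the fraction of walks from s ending at t."""
    rng = random.Random(seed)
    estimate = [0.0] * len(out_neighbors)
    for _ in range(num_walks):
        v = s
        # Keep moving while the walk does not terminate (probability 1 - alpha).
        while rng.random() >= alpha:
            nbrs = out_neighbors[v]
            v = rng.choice(nbrs) if nbrs else s  # assumption: dangling -> back to s
        estimate[v] += 1.0 / num_walks
    return estimate

# Hypothetical 4-vertex graph; vertex 3 is dangling.
graph = [[1, 2], [2, 3], [0], []]
scores = monte_carlo_ppr(graph, s=0, alpha=0.2, num_walks=10000)
```

Each walk fetches the neighbor list of every vertex it visits, which is the cost that grows with the number of walks.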

III. RELATED WORK
A. PERSONALIZED PAGERANK
FORA [16] is the first method that combines FP and MC. FORA first invokes FP with early termination and subsequently performs random walks to obtain accuracy guarantees. FORA utilizes the following invariant [26]:
$$\pi(s, t) = \pi^{\circ}(s, t) + \sum_{v \in V} r(s, v) \cdot \pi(v, t),$$
where $\pi^{\circ}(s, t)$ is the reserve of a vertex t after the FP phase. Since computing π(v, t) for all v ∈ V is infeasible, FORA computes $\hat{\pi}(s, t)$ as follows:
$$\hat{\pi}(s, t) = \pi^{\circ}(s, t) + \sum_{v \in V} r(s, v) \cdot \pi'(v, t),$$
where $\pi'(v, t)$ can be obtained by MC. In the MC phase, ..., which is a factor of 1/ϵ smaller than that of MC.
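A minimal sketch of how an MC phase can turn leftover residues into the estimate is given below. It assumes that each vertex v with positive residue starts ⌈r(s, v) · ω⌉ sequential walks and that each walk adds r(s, v) divided by that walk count to the vertex where it ends; the function name, the hand-made `reserve` and `residue` values (the state after a single push at s with α = 0.2), and the dangling-vertex rule are illustrative assumptions.

```python
import math
import random

def fora_mc_phase(out_neighbors, s, alpha, reserve, residue, omega, seed=0):
    """Refine FP reserves with sequential random walks started from residues."""
    rng = random.Random(seed)
    estimate = list(reserve)
    for v, r in enumerate(residue):
        if r <= 0.0:
            continue
        walks = math.ceil(r * omega)   # number of walks started from v
        increment = r / walks          # per-walk increment depends on r(s, v)
        for _ in range(walks):
            t = v
            while rng.random() >= alpha:
                nbrs = out_neighbors[t]
                t = rng.choice(nbrs) if nbrs else s  # assumption: dangling -> s
            estimate[t] += increment
    return estimate

# Hypothetical state after one push at s = 0 with alpha = 0.2.
graph = [[1, 2], [2, 3], [0], []]
reserve = [0.2, 0.0, 0.0, 0.0]
residue = [0.0, 0.4, 0.4, 0.0]
estimate = fora_mc_phase(graph, 0, 0.2, reserve, residue, omega=1000)
```

Because the increment differs per starting vertex, every walk has to remember where it started, which is why the walks run one after another in this form.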
To optimize the computational procedure of the FP phase, ResAcc [14] and SpeedPPR [15] have been proposed. Their main idea is to reduce the total number of pushing operations. Specifically, ResAcc exploits the looping phenomenon, where some residues return to s. By accumulating returned residues, ResAcc avoids multiple pushing operations at s. SpeedPPR gradually reduces $r_{\max}$ so that vertices holding a large residue execute the pushing operation. Although both ResAcc and SpeedPPR have outperformed FORA, they have overlooked the importance of reordering the data layout and optimizing the computational procedure of the MC phase.
Index-oriented methods [11], [12], [23], [29] aim to answer PPR queries rapidly by using precomputed results. Matrix-based methods [11], [29] convert the adjacency matrix so that the converted matrix has a large and easy-to-invert submatrix. Since these methods hold the converted matrix as indices, they cause huge space overheads. HubPPR [12] executes random walks from high-degree vertices and stores the results as indices. The indices are combined with Backward search [30]. FORA+ [23] is the state-of-the-art method. FORA+ samples random walks from all the vertices in advance and uses them directly in the MC phase. Despite the efficiency of query responses, index-oriented methods are impractical due to the high overhead of updating indices.

B. GRAPH REORDERING
Connectivity-based methods [18], [19] assign close vertex IDs to densely connected vertices. Rabbit-Order [19] reorders the data layout based on the community structure. Since communities consist of densely interconnected vertices, Rabbit-Order assigns consecutive IDs to vertices in each community. Gorder [18] is the state-of-the-art method that offers significant performance improvement. Gorder assigns consecutive IDs to vertices sharing a large number of common neighbor vertices. Specifically, let $R_w$ be the w vertices that were most recently relabeled with a new ID. A vertex v obtains a new ID if v shares the largest number of common neighbor vertices with $R_w$. Although connectivity-based methods can accelerate various graph computations, the time-consuming reordering procedures in these methods cause an end-to-end slowdown.
To alleviate huge reordering overheads, degree-based methods [20], [21], [22] have attracted much attention in the literature. In general, degree-based methods assign close IDs to high-degree vertices because these vertices tend to be frequently accessed in graph computations. Since degree-based methods simply use the degree information of vertices, they can reduce the reordering time. However, it remains challenging to realize enough performance improvement comparable with connectivity-based methods.
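To make the degree-based idea concrete, the following sketch relabels vertices in descending order of out-degree so that high-degree vertices receive small, adjacent IDs. It is a generic illustration and not the procedure of any of the cited methods; the function name and the use of out-degree as the sort key are assumptions.

```python
def degree_sort_relabel(out_neighbors):
    """Relabel vertices so that higher-degree vertices get smaller IDs."""
    n = len(out_neighbors)
    # Stable sort: ties keep their original relative order.
    order = sorted(range(n), key=lambda v: -len(out_neighbors[v]))
    new_id = [0] * n
    for rank, v in enumerate(order):
        new_id[v] = rank
    # Rebuild the adjacency lists under the new IDs.
    relabeled = [[] for _ in range(n)]
    for v in range(n):
        relabeled[new_id[v]] = [new_id[u] for u in out_neighbors[v]]
    return new_id, relabeled

# Hypothetical graph in which vertex 1 has the highest out-degree.
graph = [[1], [0, 2, 3], [0], []]
new_id, relabeled = degree_sort_relabel(graph)
```

Only a sort over the degree array is needed, which is why this family of methods is cheap compared with connectivity-based reordering.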

IV. PROPOSED METHOD
In this section, we present the proposed method. We first outline the overall approach in Section IV-A. The details of the proposed method are described in Section IV-B and Section IV-C. Finally, we elaborate on the running time analysis in Section IV-D.

Algorithm 2: The Proposed Method
Input: Graph G, source vertex s, termination probability α, threshold $r_{\max}$
Output: PPR score $\hat{\pi}(s, t)$ for all t ∈ V
1 G* ← BFS-based lightweight reordering with G, s;
2 invoke Distance-Extension Forward Push with G*;
3 perform Vertex-Centric Random Walk with G*;
4 return $\hat{\pi}(s, t)$ for all t ∈ V;
A. OVERVIEW
Algorithm 2 shows the pseudo-code of the proposed method. To reduce cache misses, the proposed method optimizes the computational procedure of both the FP and MC phases. In the FP phase, the proposed method first conducts lightweight BFS-based reordering (Line 1). This reordering preferentially assigns close vertex IDs to vertices with small dist(s, v) because these vertices perform a large number of pushing operations. On the reordered data layout, the proposed method invokes Distance-Extension Forward Push (DEFP) (Line 2). In the MC phase, the proposed method performs Vertex-Centric Random Walk (VCRW) (Line 3). VCRW aggregates random walks at each vertex to reduce the total number of operations to obtain neighbor vertices. Finally, the proposed method returns the PPR score $\hat{\pi}(s, t)$ satisfying Definition 1 for all t ∈ V (Line 4).

B. DISTANCE-EXTENSION FORWARD PUSH
We first analyze the workload of the FP algorithm shown in Algorithm 1, focusing on dist(s, v). Let $D_s$ denote the maximum distance from s in G. We selected one source vertex and then measured the average number of pushing operations and the average unit-cost benefit [15] of $V_k$. Note that the unit-cost benefit of the pushing operation on v is defined as $\frac{\alpha \cdot r(s, v)}{d_{out}(v)}$ because v needs to access $d_{out}(v)$ vertices to reduce the sum of residues $r_{sum}$ by α · r(s, v). We conducted the same analysis on 50 source vertices selected uniformly at random, and we confirmed that all of them show a similar tendency. Therefore, we report the results from the source vertex s whose $D_s$ is closest to the average $D_s$ computed from the 50 source vertices. Fig. 3 shows the corresponding results. We can see that vertices near s tend to perform a large number of pushing operations with large benefits.

Algorithm 3: BFS-Based Lightweight Reordering
Input: Graph G, source vertex s
Output: ...
Based on this, the proposed method reorders the data layout to improve the memory access locality of $V_k$ with small k. Moreover, the proposed method establishes the Distance-Extension strategy, which performs pushing operations within $U_k$ and gradually increases the k value. Since pushing operations on $V_k$ with small k have large benefits, our strategy can carry out the FP phase efficiently.
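The reordering and the Distance-Extension strategy can be sketched together as follows. This is an illustrative reading of the description, not the authors' pseudo-code: the function names, placing unreachable vertices at the end of the ID range, the repeated ID-order sweeps inside each distance level, and treating a dangling vertex's degree as 1 in the activity test are assumptions.

```python
def bfs_reorder(out_neighbors, s):
    """Relabel vertices in BFS discovery order from s.

    Returns (new_id, reordered, distance_idx), where distance_idx[k] is the
    largest new ID among the vertices at distance k from s.
    """
    n = len(out_neighbors)
    new_id = [-1] * n
    new_id[s] = 0
    next_id = 1
    frontier = [s]
    distance_idx = [0]
    while frontier:
        nxt = []
        for v in frontier:
            for u in out_neighbors[v]:
                if new_id[u] == -1:
                    new_id[u] = next_id
                    next_id += 1
                    nxt.append(u)
        if nxt:
            distance_idx.append(next_id - 1)
        frontier = nxt
    # Assumption: vertices unreachable from s receive the remaining IDs.
    for v in range(n):
        if new_id[v] == -1:
            new_id[v] = next_id
            next_id += 1
    reordered = [[] for _ in range(n)]
    for v in range(n):
        reordered[new_id[v]] = [new_id[u] for u in out_neighbors[v]]
    return new_id, reordered, distance_idx

def distance_extension_push(reordered, alpha, r_max, distance_idx):
    """Forward Push restricted to IDs <= distance_idx[k], for growing k."""
    n = len(reordered)
    reserve = [0.0] * n
    residue = [0.0] * n
    residue[0] = 1.0  # the source has ID 0 after reordering
    for limit in distance_idx:
        pushed = True
        while pushed:  # sweep U_k in ID order until no vertex in it is active
            pushed = False
            for t in range(limit + 1):
                deg = len(reordered[t])
                if residue[t] <= r_max * max(deg, 1):
                    continue
                r = residue[t]
                residue[t] = 0.0
                reserve[t] += alpha * r
                targets = reordered[t] if deg else [0]  # dangling -> source
                share = (1.0 - alpha) * r / len(targets)
                for u in targets:
                    residue[u] += share
                pushed = True
    return reserve, residue

# Hypothetical 4-vertex graph with source vertex 2.
graph = [[1, 2], [2, 3], [0], []]
new_id, reordered, distance_idx = bfs_reorder(graph, s=2)
reserve, residue = distance_extension_push(reordered, 0.2, 1e-4, distance_idx)
```

Because each sweep walks a contiguous ID range starting at 0, the residue and reserve entries touched for small k sit next to each other in memory.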
Algorithm 3 shows the pseudo-code of the reordering procedure. We perform BFS from s and assign new IDs in ascending order to the vertices as they are found. Considering that a vertex v ∈ $V_k$ with small k expands the search area, the majority of $N_{out}(v)$ have not been found yet. Therefore, most of $N_{out}(v)$ can obtain consecutive vertex IDs. As a result, vertices in $V_{k+1}$ have close IDs because $V_{k+1}$ consists of out-neighbor vertices of $V_k$. To quickly determine $U_k$, we also calculate DistanceIdx along with the BFS procedure, where DistanceIdx[k] denotes the maximum vertex ID of $V_k$. Considering that BFS has $O(m + n) = O\left(\sum_{v \in V} (d_{out}(v) + 1)\right)$ running time, Algorithm 3 is fairly lightweight compared with the representative reordering method Gorder [18], which has $O\left(\sum_{v \in V} (d_{out}(v))^2\right)$ running time.
Algorithm 4 shows the pseudo-code of DEFP. As mentioned before, DEFP performs FP with $U_k$ and iteratively increases the k value (Lines 3-12). Since vertices in $V_k$ with small k are assigned close vertex IDs and pushing operations on these vertices have large benefits, DEFP can rapidly reduce $r_{sum}$.

Algorithm 4: Distance-Extension Forward Push
Input: Graph G*, source vertex s, termination probability α, threshold $r_{\max}$, DistanceIdx
Output: residue r(s, t) and reserve $\hat{\pi}(s, t)$ for all t ∈ V
1 $\hat{\pi}(s, t)$ ← 0 and r(s, t) ← 0 for all t ∈ V;
2 r(s, s) ← 1;
...

Fig. 4 ...

C. VERTEX-CENTRIC RANDOM WALK
... by $I_v$. Since $I_v$ depends on r(s, v) and therefore differs for each vertex v, existing methods need to perform all random walks sequentially to memorize the starting vertex of each random walk. In this case, the total number of operations to obtain neighbor vertices is excessively large because each random walk needs to obtain neighbor vertices for every single step. We show an example in Fig. 5. Assume that there are three random walks with similar paths: 0 → 3 → 4 → 6 → 7 (red), 3 → 4 → 6 → 7 (blue), and 4 → 6 → 7 (green). Even though all these random walks pass through vertices 4, 6, and 7 in the same order, we need to obtain the neighbor vertices of each of vertices 4, 6, and 7 three times. VCRW is an effective solution to this problem. Our idea is to aggregate random walks at each vertex v so that these aggregated random walks can move by obtaining $N_{out}(v)$ only once. We show an example of the aggregation in Fig. 5(b). The three random walks shown in Fig. 5(a) can reach the termination vertex 7 at the same time. Since it suffices to obtain neighbor vertices at vertices 4, 6, and 7 only once each, we can reduce the total number of operations to obtain neighbor vertices from twelve in Fig. 5(a) to five in Fig. 5(b).
To realize the aggregation, we need to perform random walks that depend on r(s, v) as little as possible. Therefore, we unify the increment value of $\hat{\pi}(s, t)$ while maintaining accuracy guarantees. Specifically, we unify the increment value of ⌊r(s, v) · ω⌋ random walks out of all ⌈r(s, v) · ω⌉ random walks. By unifying most of the increment values, we can aggregate a large part of the random walks, which significantly reduces the total number of operations to obtain neighbor vertices. Although we still need to perform sequential random walks whose increment value depends on r(s, v), it is guaranteed that each vertex v performs ⌈r(s, v) · ω⌉ − ⌊r(s, v) · ω⌋ sequential random walks. Since the value of ⌈r(s, v) · ω⌉ − ⌊r(s, v) · ω⌋ is either 0 or 1, the running cost of sequential random walks is fairly low.
Algorithm 5 shows the pseudo-code of VCRW. VCRW consists of two types of random walk, namely Independent Random Walk (IRW) (Lines 2-13) and Dependent Random Walk (DRW) (Lines 14-18). In IRW, each vertex v initially has ⌊r(s, v) · ω⌋ random walks (Line 2). A vertex v that has one or more random walks first obtains its own neighbor vertices (Line 4) and decides the next destination of each random walk (Lines 7-8). The destination vertex u chosen by v aggregates a random walk (Line 9). If a random walk terminates at v, v's reserve is increased by $\frac{1}{\omega}$, which is a unified constant value (Lines 10-12). Since we do not need to memorize where each random walk is from, all random walks at v can move by obtaining $N_{out}(v)$ only once. In DRW, each vertex v with r(s, v) > 0 performs a sequential random walk only once. Letting $c_v$ = r(s, v) · ω − ⌊r(s, v) · ω⌋, this sequential random walk increases $\hat{\pi}(s, t)$ of a termination vertex t by $\frac{c_v}{\omega}$ (Lines 14-18).
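The two stages described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: walks are moved in synchronous rounds using two count arrays, a walk at a dangling vertex returns to s, DRW is skipped when $c_v$ is zero, and the `reserve` and `residue` inputs are hand-made values corresponding to a single push at s with α = 0.2.

```python
import math
import random

def vertex_centric_random_walk(out_neighbors, s, alpha, reserve, residue, omega, seed=0):
    """Aggregate walks per vertex (IRW), then run one leftover walk per vertex (DRW)."""
    rng = random.Random(seed)
    n = len(out_neighbors)
    estimate = list(reserve)

    # IRW: each vertex starts with floor(r(s, v) * omega) anonymous walks.
    walks = [math.floor(residue[v] * omega) for v in range(n)]
    while any(walks):
        moved = [0] * n
        for v in range(n):
            if walks[v] == 0:
                continue
            nbrs = out_neighbors[v]  # fetched once for all walks waiting at v
            for _ in range(walks[v]):
                if rng.random() > alpha:
                    u = rng.choice(nbrs) if nbrs else s  # assumption: dangling -> s
                    moved[u] += 1
                else:
                    estimate[v] += 1.0 / omega  # unified constant increment
        walks = moved

    # DRW: at most one sequential walk per vertex carries the fractional part c_v.
    for v in range(n):
        c_v = residue[v] * omega - math.floor(residue[v] * omega)
        if c_v <= 0.0:
            continue
        t = v
        while rng.random() > alpha:
            nbrs = out_neighbors[t]
            t = rng.choice(nbrs) if nbrs else s
        estimate[t] += c_v / omega
    return estimate

# Hypothetical state after one push at s = 0 with alpha = 0.2.
graph = [[1, 2], [2, 3], [0], []]
reserve = [0.2, 0.0, 0.0, 0.0]
residue = [0.0, 0.4, 0.4, 0.0]
estimate = vertex_centric_random_walk(graph, 0, 0.2, reserve, residue, omega=1003)
```

Each vertex v contributes ⌊r(s, v) · ω⌋/ω through IRW and $c_v$/ω through DRW, which add up to r(s, v), so the total mass of the estimate is unchanged by the aggregation.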
We show that VCRW returns $\hat{\pi}(s, t)$ satisfying Definition 1 for all t ∈ V. We first introduce a generalization of the Chernoff inequalities (Lemma 1), and subsequently guarantee the accuracy (Theorem 2).
FIGURE 5. Assuming that there are three random walks from vertices 0, 3, and 4 that have the same path 4 → 6 → 7, sequential random walks need to obtain neighbor vertices every time as shown in Fig. 5(a). On the other hand, VCRW can reduce the total number of operations to obtain neighbor vertices by aggregating three random walks at vertex 3 and vertex 4 as shown in Fig. 5(b).

Algorithm 5: Vertex-Centric Random Walk
Input: Graph G*, source vertex s, termination probability α, threshold $r_{\max}$
...
6 Generate a random number r between (0, 1);
7 if r > α then
8   Choose u randomly from Neighbors;
9   IRW(u) ← IRW(u) + 1;
...
15 $c_v$ ← r(s, v) · ω − ⌊r(s, v) · ω⌋;
16 Generate a sequential random walk from v;
17 Let t be the termination vertex;
18 $\hat{\pi}(s, t)$ ← $\hat{\pi}(s, t)$ + $\frac{c_v}{\omega}$;

Lemma 1 (Chernoff Bound [44]): Let $X_1, \ldots, X_n$ be independent random variables with ... $a = \max\{a_1, \ldots, a_n\}$ and λ ≥ 0, we have ...

Proof: Let $\omega_{sum}$ be the total number of random walks, and $X_j(t)$ be a random variable as follows: ... To apply Lemma 1, let $X = \sum_{j=1}^{\omega_{sum}} a_j \cdot X_j(t)$ and $\nu = \sum_{j=1}^{\omega_{sum}} a_j^2 \cdot E[X_j(t)]$, where $a_j$ takes value 1 if the j-th random walk starting from v is IRW, and value r(s, v) · ω − ⌊r(s, v) · ω⌋ if the j-th random walk starting from v is DRW.
By definition, $a_j \le 1$. Therefore, ν ≤ E[X] and a ≤ 1, where $a = \max\{a_1, \ldots, a_{\omega_{sum}}\}$. Applying them to Lemma 1, we have ... Let $c_v$ = r(s, v) · ω − ⌊r(s, v) · ω⌋, and let $Y_k(t)$ denote a random variable that takes value 1 if the k-th random walk starting from v terminates at t, and value 0 otherwise. By definition, ... Then we have ... Observe that $\sum_{k=1}^{\lfloor r(s, v) \cdot \omega \rfloor} \frac{1}{\omega} \cdot Y_k(t)$ is exactly the amount of increment that $\hat{\pi}(s, t)$ receives from a vertex v during IRW, and $\frac{c_v}{\omega} \cdot Y_k(t)$ is the amount it receives during DRW. Hence, we can rewrite (2) and (3) as follows. ...

D. RUNNING TIME ANALYSIS
It has been proved that the expected running time of the FP algorithm shown in Algorithm 1 can be bounded by $O\left(m \cdot \log \frac{1}{m \cdot r_{\max}}\right)$ [15]. Since the proposed method follows Algorithm 1, we can also define the expected running time of DEFP as $O\left(m \cdot \log \frac{1}{m \cdot r_{\max}}\right)$. Letting $\omega_{sum}$ be the total number of random walks, existing methods expect that MC has $O\left(\frac{\omega_{sum}}{\alpha}\right)$ running time. This is because every random walk is performed sequentially, and one random walk moves $\frac{1}{\alpha}$ steps on average. The expected running time of the MC phase is determined by the total number of operations to obtain neighbor vertices. Therefore, we estimate this number to define the expected running time of VCRW.
We first investigate the expected running time of DRW. In DRW, each vertex v with r(s, v) > 0 performs a random walk only once. Therefore, there are at most n random walks. Since these random walks are performed sequentially, we can define the total running time of DRW as $O\left(\frac{n}{\alpha}\right)$. Next, we investigate the expected running time of IRW. For ease of analysis, we first define the iteration of IRW. We denote by $I^{(k)}$ the set of all active vertices at the beginning of the k-th iteration, where a vertex v is active if v has one or more random walks. Therefore, $I^{(k)}$ consists of vertices that still have random walks after all vertices in $I^{(k-1)}$ move their random walks. Initially, $I^{(0)}$ contains every vertex v with ⌊r(s, v) · ω⌋ ≥ 1. In addition, we denote by $\omega^{(k)}_{sum}$ the remaining number of random walks at the beginning of the k-th iteration. Based on these definitions, we prove the following theorem.
Theorem 3: The expected running time of IRW is $O(n \cdot \log(m \cdot r_{\max} \cdot \omega))$.
Proof: Considering the (k + 1)-th iteration, we have ... Based on this, we can easily get ... Besides, $\omega_{sum}$ follows ... We assume that IRW finishes when $\omega^{(k)}_{sum} < 1$. By combining (14) and (15), we know that IRW needs $K = O(\log(m \cdot r_{\max} \cdot \omega))$ iterations to satisfy the condition $\omega^{(k)}_{sum} < 1$. In the k-th iteration, $|I^{(k)}|$ vertices are still active. Observing that $|I^{(k)}| \le n$, the total number of operations to obtain neighbor vertices follows ... Therefore, the expected running time of IRW is $O(n \cdot \log(m \cdot r_{\max} \cdot \omega))$. This completes the proof. □
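A back-of-the-envelope view of why the number of iterations is logarithmic can be written as follows. It assumes only that every remaining random walk terminates independently with probability α in each iteration, which follows from the α-decay definition; the symbol $\omega^{(0)}_{sum}$ for the initial number of aggregated walks and the numeric values are illustrative assumptions, not figures from the experiments.

```latex
\mathbb{E}\left[\omega^{(k)}_{sum}\right] = (1-\alpha)^{k}\,\omega^{(0)}_{sum}
\quad\Longrightarrow\quad
\mathbb{E}\left[\omega^{(k)}_{sum}\right] < 1
\iff
k > \frac{\ln \omega^{(0)}_{sum}}{\ln \frac{1}{1-\alpha}}

% Example with \alpha = 0.2 and \omega^{(0)}_{sum} = 10^{6}:
k > \frac{\ln 10^{6}}{\ln 1.25} \approx \frac{13.82}{0.2231} \approx 61.9
```

Under these assumed values, roughly 62 iterations suffice in expectation, and each iteration obtains the neighbor list of at most n vertices.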

A. EXPERIMENTAL SETUP
We conducted all experiments on a Linux 20.04 server with dual Intel Xeon E5-2643 processors and 94GiB memory. The size of the L1 data cache, L1 instruction cache, L2 cache, and L3 cache were 384KiB, 384KiB, 3MiB, and 50MiB, respectively. All algorithms were implemented with C++ and compiled with G++ 9.4.0 using the -O3 optimization.

2) EXPERIMENTS
We compared the proposed method with three index-free methods, FORA [16], ResAcc [14], and SpeedPPR [15], and with the state-of-the-art index-oriented method FORA+ [23]. The computational efficiency was measured through the overall running time and the cache performance. Before starting the experiments, we shuffled the vertex IDs randomly. This is because the given vertex IDs have been modified by publishers, which might influence the experimental results unintentionally [19]. After shuffling the vertex IDs, we generated 50 query source vertices uniformly at random. We report the average results over these 50 vertices. To generate random numbers efficiently, we used the latest version of the SIMD-oriented Fast Mersenne Twister library.³ For cache performance analysis, we used the perf tool.⁴ We measured five metrics on processors: L1-r, L1-m, L3-r, L3-m, and Total-m. L1-r denotes the number of L1 cache references and L1-m denotes the number of L1 cache misses. L1-r is equal to the total number of cache references because processors first reference the L1 cache. L1-r and L1-m can be measured with the perf options L1-dcache-loads and L1-dcache-misses, respectively. Similarly, L3-r and L3-m denote the corresponding numbers at the L3 cache, and these numbers can be measured with the LLC-loads and LLC-misses options, respectively. Total-m denotes the ratio of the total cache misses to the total cache references, reported as a percentage, such that Total-m = L3-m/L1-r. Note that we made sure that all caches were initialized before each experiment by executing sysctl -w vm.drop_caches=3.
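For concreteness, the Total-m metric can be computed from the two counters as in the short sketch below. The function name and the counter values are hypothetical and only illustrate the formula Total-m = L3-m/L1-r; they are not measurements from the paper.

```python
def total_miss_percentage(l1_references, l3_misses):
    """Return Total-m = L3-m / L1-r as a percentage."""
    if l1_references <= 0:
        raise ValueError("l1_references must be positive")
    return 100.0 * l3_misses / l1_references

# Hypothetical counter values read from a profiling run.
total_m = total_miss_percentage(l1_references=2_000_000, l3_misses=50_000)
```

Dividing by L1-r rather than L3-r expresses the misses that reach main memory relative to all cache references.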

3) PARAMETER SETTINGS
By default, we set α = 0.2, δ = 1/n, $p_f$ = 1/n, and ϵ = 0.5 following the existing methods. Specifically, we set ... as default because this setting can theoretically establish the lowest expected running time [15]. Moreover, ResAcc has two additional parameters h and $r^{hop}_{\max}$. In a nutshell, ResAcc first performs the FP algorithm shown in Algorithm 1 with $r_{\max} = r^{hop}_{\max}$ on vertices within distance h from the source vertex s, and accumulates residues at s. After distributing the accumulated residues, ResAcc continues running Algorithm 1 on the entire graph. We empirically set h and $r^{hop}_{\max}$ to realize the best performance.
B. OVERALL RUNNING TIME
Fig. 6 shows the overall running time of the index-free methods in log scale. The running time of the proposed method was shorter than that of the existing methods on all the datasets except for WebStan. On WebStan, the proposed method slightly underperformed the existing methods. However, this performance gap is negligible because the overall running time on WebStan is fairly low. Compared with FORA, ResAcc, and SpeedPPR, the average speedup of the proposed method was 3.2×, 3.9×, and 2.5×, and the maximum speedup was 4.5×, 4.7×, and 3.9×, respectively. It is worth pointing out that the proposed method was more effective on larger graphs. The proposed method was over ten seconds faster on LiveJournal, over 50 seconds faster on Orkut, and over 400 seconds faster on Twitter than the fastest existing method, SpeedPPR. This shows a considerable superiority of the proposed method in terms of computational efficiency.
³ http://www.math.sci.hiroshima-u.ac.jp/m-mat/MT/SFMT/
⁴ https://perf.wiki.kernel.org/index.php
To analyze the overhead of the reordering procedure, we further examined the breakdown of the overall running time. The overall running time is divided into reordering time and PPR computation time. Fig. 7 shows the corresponding results. On all the datasets, the reordering procedure completed faster than the PPR computation. Moreover, we defined the overhead of the reordering procedure as the ratio of reordering time to PPR computation time. Except for WebStan, this ratio was 0.24 on average. This low overhead enhances the computational efficiency of the proposed method.
Tables 3, 4, and 5 show the cache performance on DBLP (small size graph), LiveJournal (medium size graph), and Twitter (large size graph), respectively. The best result in each column is highlighted in bold. The proposed method greatly reduced both L1-m and L3-m on LiveJournal and Twitter. Notably, Total-m on Twitter was up to 11.6% lower than the existing methods. Since the cache reference is over two orders of magnitude faster than the main memory reference, the cache miss reduction primarily accelerates the proposed method. These results on LiveJournal and Twitter are consistent with the overall running time results shown in Fig. 6. While the proposed method succeeded in reducing L1-m on DBLP, L3-m was nearly identical to the existing methods. This is mainly because the graph size of DBLP is so small that the L3 cache can hold a large part of the data. Interestingly, the proposed method reduced L1-r compared to FORA and ResAcc, even though the proposed method needs to go through the reordering procedure before the FP phase. This shows that DEFP and VCRW improve the computational efficiency.

D. EFFECT OF EACH OPTIMIZATION
We investigated the respective effects of DEFP and VCRW on the running time. Fig. 8 shows the running time of each approach in log scale on Pokec, LiveJournal, and Orkut. We first compare the running time of Algorithm 1 (Baseline) and DEFP in Fig. 8(a). We measured the running time until no active vertices remained with respect to the default $r_{\max}$. Compared with Baseline, DEFP was 3.3×, 4.1×, and 2.4× faster on Pokec, LiveJournal, and Orkut, respectively. Considering that the running time of DEFP includes the reordering time, this speedup shows that DEFP executes the FP phase efficiently.
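To make the stopping condition concrete, the following is a minimal sketch of a generic queue-based forward push that runs until no vertex is active, that is, until no vertex v holds residue greater than r_max · d_out(v). It is not the authors' Algorithm 1 or DEFP; the names `forward_push`, `reserve`, and `residue` are illustrative, and the sketch assumes every vertex has at least one out-neighbor.

```python
from collections import deque


def forward_push(adj, source, alpha, r_max):
    """Generic queue-based forward push from `source`.

    adj maps each vertex to its list of out-neighbors (assumed non-empty).
    Returns (reserve, residue) dictionaries keyed by vertex.
    """
    reserve = {v: 0.0 for v in adj}
    residue = {v: 0.0 for v in adj}
    residue[source] = 1.0
    queue = deque([source])
    queued = {source}
    while queue:
        v = queue.popleft()
        queued.discard(v)
        degree = len(adj[v])
        mass = residue[v]
        if mass <= r_max * degree:
            continue  # v is not active, nothing to push
        # Keep an alpha fraction at v and spread the rest over its neighbors.
        reserve[v] += alpha * mass
        residue[v] = 0.0
        share = (1.0 - alpha) * mass / degree
        for u in adj[v]:
            residue[u] += share
            if u not in queued and residue[u] > r_max * len(adj[u]):
                queue.append(u)
                queued.add(u)
    return reserve, residue


# Toy graph, chosen only for illustration.
graph = {0: [1, 2], 1: [2], 2: [0]}
reserve, residue = forward_push(graph, source=0, alpha=0.2, r_max=1e-4)
```

On termination the reserve and residue together still sum to one, and every vertex satisfies the inactivity condition. The queue here visits vertices in arrival order rather than in vertex-ID order, which is the access pattern that array-based implementations avoid.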
Next, we compare the running time in the MC phase. Since the proposed method performs VCRW after conducting the reordering in the FP phase, the reordered data layout might have a positive effect on the running time in addition to the aggregation technique. To clearly separate the effect of the aggregation technique from that of the reordering, we measured the running time of VCRW on the original layout (VCRW-O) and on the reordered layout (VCRW-R). Fig. 8(b) shows the running time of Baseline, VCRW-O, and VCRW-R. Note that Baseline represents the original approach that performs random walks sequentially. The running time was measured up to the point where each vertex v has performed $\lceil r_{\max} \cdot d_{\mathrm{out}}(v) \cdot \omega \rceil$ random walks. VCRW-R was 4.7×, 5.2×, and 6.5× faster than Baseline on Pokec, LiveJournal, and Orkut, respectively, which is a significant speedup. In contrast, VCRW-R was only slightly faster than VCRW-O on the three datasets. Therefore, we confirm that the speedup of the MC phase by VCRW is mostly due to the aggregation technique.
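The idea of aggregating walks at each vertex can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' VCRW implementation: walkers currently located at the same vertex are processed together, so the neighbor list of that vertex is fetched once per round instead of once per walker. The function name `aggregated_walks` is hypothetical, and a walker at a vertex without out-neighbors is assumed to terminate there.

```python
import random
from collections import Counter


def aggregated_walks(adj, start_counts, alpha, rng):
    """Run alpha-decay random walks grouped by their current vertex.

    adj maps each vertex to its list of out-neighbors.
    start_counts maps a vertex to the number of walks starting there.
    Returns a Counter: vertex -> number of walks that terminated there.
    """
    terminated = Counter()
    frontier = Counter(start_counts)
    while frontier:
        next_frontier = Counter()
        for v, walkers in frontier.items():
            neighbors = adj[v]  # fetched once for all walkers at v
            for _ in range(walkers):
                if not neighbors or rng.random() < alpha:
                    terminated[v] += 1  # walk stops at v
                else:
                    next_frontier[rng.choice(neighbors)] += 1
        frontier = next_frontier
    return terminated


# Toy graph and walk counts, chosen only for illustration.
graph = {0: [1, 2], 1: [2], 2: [0]}
counts = aggregated_walks(graph, {0: 100}, alpha=0.2, rng=random.Random(0))
```

Each walker still makes independent stop-or-move decisions, so grouping changes only the order of memory accesses, not the distribution of termination vertices.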

E. EFFECT OF SETTING $r_{\max}$
We examined the effect of setting $r_{\max}$ on the overall running time using Pokec and Twitter. Recall that $r_{\max}$ determines when the FP phase switches to the MC phase. Theoretically, the default setting achieves the fastest running time. However, previous works have observed that the best setting differs from the default setting and that the running time is highly sensitive to $r_{\max}$. Therefore, we measured the overall running time for $r_{\max}$ varying from $10^{-2} \times r^{\bullet}_{\max}$ to $10^{1} \times r^{\bullet}_{\max}$, where $r^{\bullet}_{\max}$ denotes the default setting. Fig. 9 shows the corresponding results of SpeedPPR and the proposed method. As previous works have reported, we observed that $r^{\bullet}_{\max}$ does not achieve the fastest running time on either Pokec or Twitter. The proposed method was fastest with $r_{\max} = 0.25 \times r^{\bullet}_{\max}$ on Pokec and $r_{\max} = 0.5 \times r^{\bullet}_{\max}$ on Twitter, whereas SpeedPPR was fastest with $r_{\max} = 0.1 \times r^{\bullet}_{\max}$ on both datasets. At its fastest point, SpeedPPR was 2.1× and 2.2× faster than with the default setting on Pokec and Twitter, respectively, which shows that SpeedPPR is sensitive to $r_{\max}$. In contrast, for the proposed method, the ratio between the running time with $r^{\bullet}_{\max}$ and the fastest running time was 1.2× and 1.1× on Pokec and Twitter, respectively. These results show that the proposed method is less sensitive to $r_{\max}$ than SpeedPPR.
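A sensitivity experiment of this kind can be sketched as a simple sweep over multipliers of the default threshold. The helper below is a hypothetical harness, not the authors' benchmark code: `measure_seconds` stands in for whatever routine runs a query with a given threshold and returns its running time, and the multiplier grid is an assumption that only mirrors the range stated above.

```python
def sweep_r_max(measure_seconds, r_default, multipliers):
    """Time a query for each multiplier of the default threshold.

    measure_seconds(r_max) is a caller-supplied (hypothetical) routine that
    returns the running time in seconds for the given threshold.
    multipliers must contain 1.0 so the default setting is measured.
    Returns (best_multiplier, slowdown_at_default).
    """
    timings = {m: measure_seconds(m * r_default) for m in multipliers}
    best_multiplier = min(timings, key=timings.get)
    slowdown_at_default = timings[1.0] / timings[best_multiplier]
    return best_multiplier, slowdown_at_default


def toy_cost(r_max):
    # Stand-in cost curve for demonstration only; it is not measured data.
    return 1.0 + (r_max - 0.25) ** 2


grid = [0.01, 0.1, 0.25, 0.5, 1.0, 10.0]  # assumed grid over the stated range
best, slowdown = sweep_r_max(toy_cost, r_default=1.0, multipliers=grid)
```

The second return value corresponds to the kind of ratio reported above: how much slower the default setting is than the best setting found in the sweep.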

F. EFFECT OF PARAMETER SETTINGS FOR ACCURACY GUARANTEES
Recall that the parameters ϵ and δ determine the condition of the Approximate SSPPR query. Specifically, the value of ϵ determines the acceptable relative error, and the value of δ determines the applicable scope of the accuracy guarantees. Since applications have diverse requirements, PPR queries are conducted with various parameter settings. Motivated by this, we evaluated the overall running time against different ϵ and δ.

Fig. 10 shows the overall running time against different ϵ varying from 0.5 to 0.1. The proposed method significantly outperformed the existing methods for every ϵ on all the datasets except WebStan. Note that the overall running time of both the proposed method and SpeedPPR increased linearly. In contrast, the overall running time of FORA and ResAcc increased sharply as ϵ decreased, especially on Twitter. This is mainly due to the difference in how active vertices are accessed during the FP phase. Since FORA and ResAcc adopt a queue-based implementation, active vertices are accessed in a nonconsecutive manner with respect to vertex IDs, which leads to irregular memory access. In contrast, the proposed method and SpeedPPR adopt an array-based implementation, and both methods can therefore access the active vertices consecutively in the order of vertex IDs.

Fig. 11 shows the overall running time against different δ. Letting $\delta^{\bullet} = \frac{1}{n}$ be the default parameter, we measured the overall running time with $\delta = \frac{\delta^{\bullet}}{\delta^{*}}$. Note that we chose $\delta^{*}$ from {1, 5, 10}. In addition, Table 6 shows the percentage of vertices that are the target of the accuracy guarantees. To calculate the percentage, we need to know the ground truth score π(s, t) for all t ∈ V. Therefore, we ran Algorithm 1 until $r_{\mathrm{sum}}$ satisfied $r_{\mathrm{sum}} < \theta$, where $\theta = \min\{10^{-8}, \frac{1}{m}\}$, and we then regarded the returned result as the ground truth. Similar to the results for varying ϵ shown in Fig. 10, the proposed method outperformed the existing methods. Notably, the proposed method with $\delta^{*} = 10$ was faster than the existing methods with $\delta^{*} = 1$ on the larger graphs: LiveJournal, Orkut, and Twitter. Therefore, the proposed method can obtain more accurate results than the existing methods at the same running cost.

G. COMPARISON WITH INDEX-ORIENTED METHOD
Fig. 12 shows the overall running time of the proposed method and FORA+ [23] on LiveJournal, Orkut, and Twitter against different ϵ varying from 0.5 to 0.1. We obtained the source code of FORA+ from the authors. For FORA+, we generated the index with the smallest ϵ = 0.1 because an index generated with $\epsilon = \epsilon_{\mathrm{High}}$ cannot be reused to answer PPR queries with a smaller $\epsilon = \epsilon_{\mathrm{Low}} < \epsilon_{\mathrm{High}}$.
Overall, FORA+ was faster than the proposed method because FORA+ simply reads the index in the MC phase. Specifically, FORA+ outperformed the proposed method with larger ϵ, and FORA+ was 2.9×, 3.4×, and 1.7× faster with ϵ = 0.5 on LiveJournal, Orkut, and Twitter, respectively. One interesting finding is that the proposed method achieved a comparable running time with smaller ϵ. In particular, the proposed method outperformed FORA+ on LiveJournal and Twitter under rigorous accuracy guarantees, i.e., ϵ = 0.1. It is worth pointing out that the proposed method was over 100 seconds faster than FORA+ with ϵ = 0.1 on Twitter. Table 7 summarizes the online running time with ϵ = 0.1, the offline index generation time, the index size, and the graph size. While FORA+ requires $O\left(\frac{n \log n}{\epsilon}\right)$ space and an impractical index generation time, our index-free method has zero precomputation costs. FORA+ consumed minutes or hours of precomputation time and incurred space overheads beyond the graph size, which is unacceptable in real-world scenarios. Assuming that users request rigorous accuracy guarantees and set a small ϵ, the performance of FORA+ clearly degrades in terms of both online query time and offline precomputation overheads. These results verify that the proposed method can achieve significant computational efficiency without any precomputation.
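The roles of ϵ and δ in these comparisons can be illustrated with a small check against ground truth scores. The sketch below assumes the usual form of the approximate query condition, in which the relative error bound applies only to targets whose true score is at least δ; whether the threshold is strict, and the fact that the guarantee holds only with high probability, are not modeled here. The function name `guarantee_report` and the toy scores are hypothetical.

```python
def guarantee_report(truth, estimate, epsilon, delta):
    """Summarize how an estimate fares against ground-truth PPR scores.

    truth and estimate map each target vertex t to pi(s, t) and to its
    approximation. Returns (coverage, worst_relative_error, within_bound),
    where coverage is the fraction of vertices with truth >= delta.
    """
    covered = [t for t, score in truth.items() if score >= delta]
    coverage = len(covered) / len(truth)
    worst = max(
        (abs(estimate.get(t, 0.0) - truth[t]) / truth[t] for t in covered),
        default=0.0,
    )
    return coverage, worst, worst <= epsilon


# Toy scores for a four-vertex graph, chosen only for illustration.
truth = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}
estimate = {0: 0.52, 1: 0.29, 2: 0.2, 3: 0.0}
delta = 1.0 / len(truth)  # default setting of 1/n
coverage, worst, ok = guarantee_report(truth, estimate, epsilon=0.1, delta=delta)
```

In this toy case only two of the four vertices fall within the scope set by δ, so the large relative errors on the two low-score vertices do not affect the check. A smaller δ widens that scope, and a smaller ϵ tightens the bound inside it.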

H. COMPARISON WITH GORDER
Finally, we compared the proposed method with SpeedPPR combined with the state-of-the-art graph reordering method Gorder [18]. Gorder is a representative heavyweight reordering method that achieves remarkable speedup by deeply inspecting the connectivity of vertices. Fig. 13 shows the overall running time of SpeedPPR, SpeedPPR+Gorder, and the proposed method on LiveJournal. The result of SpeedPPR+Gorder represents the running time of SpeedPPR on the reordered LiveJournal dataset. We observe that SpeedPPR+Gorder greatly outperformed SpeedPPR: the average running time of SpeedPPR+Gorder was 1.5× shorter than that of SpeedPPR. Notably, the proposed method was still faster than SpeedPPR+Gorder. Note that the result of the proposed method includes the reordering time, whereas the result of SpeedPPR+Gorder excludes it. Considering that Gorder took 70.1 seconds to complete the reordering procedure on LiveJournal, we confirm that the proposed method is more practical than SpeedPPR+Gorder. Interestingly, we find that the running time of the proposed method is stable. The standard deviations of the running time for SpeedPPR, SpeedPPR+Gorder, and the proposed method were 0.29, 0.46, and 0.16, respectively, which shows that the proposed method performed stably regardless of the characteristics of the source vertices. This is mainly because the proposed method always reorders the data layout according to the given source vertex.
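Source-dependent reordering of this kind can be pictured as a relabeling of vertices in breadth-first order from the source, so that vertices close to the source receive small, contiguous IDs. The sketch below is a generic illustration under that assumption and does not reproduce the authors' low-overhead procedure; the names `bfs_order` and `relabel` are hypothetical, and vertices unreachable from the source are simply appended at the end.

```python
from collections import deque


def bfs_order(adj, source):
    """Map each original vertex ID to its position in BFS order from source."""
    new_id = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in new_id:
                new_id[u] = len(new_id)
                queue.append(u)
    for v in adj:  # unreachable vertices are placed after the reachable ones
        if v not in new_id:
            new_id[v] = len(new_id)
    return new_id


def relabel(adj, new_id):
    """Rebuild the adjacency lists using the new vertex IDs."""
    return {new_id[v]: [new_id[u] for u in adj[v]] for v in adj}


# Toy graph, chosen only for illustration.
graph = {0: [3], 1: [0], 2: [1], 3: [2, 0]}
mapping = bfs_order(graph, source=3)
reordered = relabel(graph, mapping)
```

Because the mapping is recomputed for each query, the layout always follows the neighborhood of the current source, whereas a source-independent reordering such as Gorder is computed once for the whole graph.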

VI. CONCLUSION
We proposed a cache-efficient approach for fast index-free PPR computation. The proposed method significantly accelerates the FORA algorithm by optimizing the computational procedure of both the FP and MC phases. Experimental results using real-world graphs showed that the proposed method reduces the cache miss ratio by up to 11.6% over existing methods on the largest dataset. As a result, the proposed method was faster than the fastest existing method by an average of 2.5× and a maximum of 3.9×. Moreover, the proposed method outperformed the state-of-the-art index-oriented method in query time under rigorous accuracy guarantees. Since various real-world applications, including link prediction, community detection, and social recommendation, utilize PPR scores, the proposed method can be used to further improve the efficiency of these applications.