Locating highly connected clusters in large networks with HyperLogLog counters

In this paper we introduce a new method to locate highly connected clusters in a network. Our proposed approach adapts the HyperBall algorithm to localize regions with a high density of small subgraph patterns in large graphs in a memory-efficient manner. We use this method to evaluate three measures of subgraph connectivity: conductance, the number of triangles, and transitivity. We demonstrate that our algorithm, applied to these measures, helps to identify clustered regions in graphs, and provides good seed sets for community detection algorithms such as PageRank-Nibble. We analytically obtain the performance guarantees of our new algorithms, and demonstrate their effectiveness in a series of numerical experiments on synthetic and real-world networks.


Introduction
Networks describe the connections between pairs of objects. Examples of networks are ubiquitous and include social networks, the Internet or communication networks. While these examples are very different from an application point of view, they share many characteristics. For example, in many real-world networks, objects have the tendency to cluster together in groups. The task of finding these densely connected spots, or clusters, in the network, has been a subject of vast research.
A quantity commonly used to measure the quality of a cluster in a graph representation of a network is the conductance of the cluster. The conductance is based on the min-cut [20] and gives the ratio of the number of edges that connect the cluster to nodes outside it to the number of edges inside the cluster. Conductance is one of the most useful and important cut-based measures for finding communities [18].
Another measure that can find clustered groups of nodes in a graph is the number of triangles, because a triangle is the most clustered subgraph on three nodes. For any subgraph in a network, one can measure its quality as a cluster by counting the number of triangles in this subgraph. Another option is to use its transitivity, the ratio of the number of triangles to the number of wedges in the subgraph, which tells us how clustered the subgraph is. Moreover, finding triangles has interesting applications in its own right, for example in biological networks [21], spam detection [4] and link recommendation [22].
Finding dense parts of a graph that have a low conductance, a high number of triangles, or a high transitivity is a computationally demanding task, because real-world networks often contain millions or even billions of nodes. In this paper we propose new methods to efficiently compute three measures of clustering (the conductance, the number of triangles, and the transitivity) for the ball subgraphs $B_r(v)$: the induced subgraphs that contain all nodes within graph distance r of node v. The identified ball subgraphs of low conductance or high transitivity reveal the locations of dense areas in the network and can be used in community detection, for example as seed sets for other, more time- and memory-consuming algorithms such as PAGERANK NIBBLE [1] or the MULTI WALKER CHAIN model [6].
Our proposed algorithms for computing conductance and transitivity use probabilistic HyperLogLog counters to estimate the number of edges, wedges and triangles in ball subgraphs. This class of randomized algorithms stems from the HYPERLOGLOG algorithm [10] for counting the number of distinct elements in large streams of data, such as the number of unique visitors on a web page or the number of different genomes in biological data. These counters give an accurate estimate of large cardinalities and are, moreover, memory-efficient. Boldi and Vigna [7] have successfully adapted the HYPERLOGLOG algorithm to the network context by developing the HYPERBALL algorithm, which counts the number of nodes in a ball subgraph $B_r(v)$ for every node v and every radius r. They used this to approximate centrality measures in large graphs and to find the distribution of distances between pairs of nodes in a network of Facebook users [2]. This idea of counting nodes in ball subgraphs has also been extended to counting distinct edges in ball subgraphs [13] or in a stream of edges [25]. In this paper, we demonstrate that the potential of HyperLogLog-type counters on graphs goes beyond counting nodes and edges: they can also count other patterns in networks and can be used to approximate popular measures of clustering.
The main contribution of this paper is in designing memory-efficient HyperLogLog-based algorithms for computing the conductance, the number of triangles, and the transitivity in ball subgraphs. We analytically derive accurate error bounds for these algorithms, and empirically confirm their high performance. Moreover, we demonstrate applications of our methods to community detection in synthetic and real-world networks. Our results show that the identified highly clustered ball subgraphs perform very well as seed sets of the PR-NIBBLE algorithm [1], improving on previously used benchmarks.
The structure of this paper is as follows. In Section 2, we briefly recap algorithms for counting nodes and edges in graphs using HyperLogLog counters. In Section 3, we extend these algorithms to other patterns in graphs, and we present our new methods for approximating the conductance, the number of triangles, and the transitivity in ball subgraphs. In Section 4, we analytically derive error bounds for the estimators of the conductance, the triangle count and the transitivity. In Section 5, we experimentally evaluate the performance of our algorithms on a number of synthetic networks. In Section 6, we demonstrate the application of our methods to community detection. We conclude in Section 7 with a discussion.

HYPERLOGLOG algorithm
The HYPERLOGLOG algorithm is a probabilistic counting technique that estimates the number of distinct elements of a large dataset, called the dataset cardinality. While a naive deterministic approach requires storage of all distinct elements observed so far, the randomized HYPERLOGLOG algorithm is extremely memory-efficient, yet delivers accurate cardinality estimations.
We will now briefly outline the idea of the HYPERLOGLOG algorithm [10]. The input of the algorithm is a multiset M, a stream of data items that are read in order of occurrence. The algorithm uses a hash function $h : M \to \{0, 1\}^{\infty}$ that assigns a binary string to every element of M. The hash function is deterministic in the sense that it assigns exactly the same value to identical elements of M. However, h is constructed in such a way that its bits can be assumed to be independent Bernoulli random variables that take the values 0 and 1 with probability 1/2 each. Then one can use the principle of bit-pattern observables: for example, in order to encounter the pattern 00001 at the beginning of a string, one needs to observe, on average, 32 different items. For such an estimate, the algorithm needs to store in memory only 'how rare' the 'most rare' observed binary sequence is, for example, the maximal number of zeros observed so far at the beginning of a binary sequence. The name HYPERLOGLOG refers to this extremely low, double-logarithmic memory requirement.
To obtain accurate estimates, the algorithm uses registers. That is, the first b bits of h identify one of the $p = 2^b$ registers, and the string after that is used to compute the cardinality estimate in this register. The algorithm initialises an empty counter with $p = 2^b$ registers, where every register corresponds to an entry of the counter. The more registers we use, the more precise the cardinality estimate will be. More precisely, the HYPERLOGLOG algorithm returns the following estimate E of the number of unique elements of multiset M:
$$E = \alpha_p \, p^2 \Big( \sum_{j=1}^{p} 2^{-M[j]} \Big)^{-1},$$
where M[j] is the value stored in register j, from which the cardinality seen by that register is estimated, and $\alpha_p$ is the bias-correction constant of [10]. The pseudocode of the HYPERLOGLOG algorithm is provided in Algorithm A.1 in Appendix A. Overall, the HYPERLOGLOG algorithm uses $(1 + o(1)) \, p \log\log(n/p)$ bits of space [10] for a set of cardinality n, making it an extremely memory-efficient algorithm to estimate cardinalities of large sets.
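The register mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [10] or Algorithm A.1: the class name `HyperLogLog`, the use of a 64-bit prefix of SHA-1 as the hash function h, and the closed-form approximation of the bias-correction constant are all choices made for this sketch.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog counter with p = 2**b registers (illustrative sketch)."""

    def __init__(self, b=10):
        self.b = b
        self.p = 1 << b
        self.registers = [0] * self.p
        # Assumption: the usual closed-form approximation of alpha_p for p >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.p)

    def add(self, item):
        # Assumption: a 64-bit prefix of SHA-1 plays the role of the hash h.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        j = x >> (64 - self.b)                  # first b bits select the register
        rest = x & ((1 << (64 - self.b)) - 1)   # remaining 64 - b bits
        # Position of the leftmost 1-bit in the remaining bits (leading zeros + 1).
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[j] = max(self.registers[j], rank)

    def union(self, other):
        # The counter of a union of two sets is the register-wise maximum.
        merged = HyperLogLog(self.b)
        merged.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return merged

    def size(self):
        raw = self.alpha * self.p ** 2 / sum(2.0 ** -m for m in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.p and zeros:
            # Small-range correction (linear counting).
            return self.p * math.log(self.p / zeros)
        return raw


counter = HyperLogLog(b=10)
for repeat in range(3):          # every item is inserted three times
    for i in range(10000):
        counter.add(i)
print(round(counter.size()))     # an estimate of the 10000 distinct items
```

Because `add` only ever raises a register to the maximum rank seen, inserting an item again leaves the counter unchanged, and `union` needs nothing beyond the two register arrays. This is the property that the HYPERBALL iteration relies on.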
The expectation and the variance of E are given in Theorem 1 of [10] and will be used later in the paper to obtain performance guarantees for our algorithms.
Definition 1 (Ideal multiset [10]). An ideal multiset of cardinality n is a sequence obtained by arbitrary replications and permutations applied to n uniform identically distributed random variables over the real interval [0, 1].
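Theorem 1 (from [10]). Let the algorithm HYPERLOGLOG be applied to an ideal multiset of (unknown) cardinality n, using p ≥ 3 registers, and let E be the resulting cardinality estimate. Then, as $n \to \infty$, the estimate is asymptotically almost unbiased, $\frac{1}{n}\mathbb{E}(E) = 1 + \delta_1(n) + o(1)$ with $|\delta_1(n)| < 5 \cdot 10^{-5}$ as soon as $p \ge 2^4$, and its standard error satisfies $\frac{1}{n}\sqrt{\mathrm{Var}(E)} = \beta_p/\sqrt{p} + \delta_2(n) + o(1)$ with $|\delta_2(n)| < 5 \cdot 10^{-4}$ as soon as $p \ge 2^4$, where the constants $\beta_p$ are bounded.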

HYPERBALL algorithm
The HYPERBALL algorithm, introduced in [8], uses HyperLogLog counters to estimate the size of the ball consisting of the nodes within graph distance r of a center node. The algorithm is an adaptation of the HYPERANF algorithm [7], which is based on the fact that the nodeball around node v with radius r, $B_r(v)$, can be found iteratively:
$$B_0(v) = \{v\}, \qquad B_{r+1}(v) = B_r(v) \cup \bigcup_{w \in N(v)} B_r(w),$$
where N(v) denotes the set of neighbours of v.
Definition 2 (Nodeball). The nodeball $B_r(v)$ consists of every node in a ball of radius r around node v.
The HYPERBALL algorithm uses one HyperLogLog counter per node, and in each iteration the counters of the neighbours of a node are added to the node's own counter. After each iteration r, the size of this counter is calculated, which gives the estimator of $|B_{r+1}(v)|$. By Theorem 1, every estimator is almost unbiased and has a relative error of at most $\beta_p/\sqrt{p}$, with $\beta_p < 1.046$ as soon as every HyperLogLog counter has p > 128 registers. The algorithm stands out for its small memory usage and for the fact that the sizes of the nodeballs around all nodes are found simultaneously. The pseudocode of the HYPERBALL algorithm is provided in Algorithm A.2 in Appendix A.
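The iteration can be illustrated with exact Python sets standing in for the HyperLogLog counters. This swaps the probabilistic counters for exact ones, so it shows only the union step and not the memory savings; the function name `nodeball_sizes` and the adjacency-dictionary input are assumptions of this sketch, and `sizes[r][v]` is indexed directly by the radius r.

```python
def nodeball_sizes(adj, max_radius):
    """Return sizes[r][v] = |B_r(v)| for r = 0..max_radius, using exact sets."""
    balls = {v: {v} for v in adj}            # B_0(v) = {v}
    sizes = [{v: 1 for v in adj}]
    for _ in range(max_radius):
        new_balls = {}
        for v in adj:
            a = set(balls[v])                # start from the node's own counter
            for w in adj[v]:
                a |= balls[w]                # add the counter of every neighbour
            new_balls[v] = a
        balls = new_balls                    # all nodes are updated simultaneously
        sizes.append({v: len(balls[v]) for v in adj})
    return sizes


# Path graph 0 - 1 - 2 - 3 - 4 (a hypothetical toy input).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
sizes = nodeball_sizes(adj, 2)
print(sizes[1][2], sizes[2][2])              # ball sizes around the middle node
```

Replacing each set by a HyperLogLog counter and `|=` by a register-wise maximum gives the memory-efficient variant; the control flow stays the same.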

HYPEREDGEBALL algorithm
The HYPERBALL algorithm naturally extends from counting nodes to counting edges that can be reached within radius r around a node [13]. The corresponding edgeballs are defined as follows:
Definition 3 (Edgeball). The edgeball $E_0(v)$ consists of every edge incident to node v. Then, for $r \ge 1$:
$$E_r(v) = E_{r-1}(v) \cup \bigcup_{w \in N(v)} E_{r-1}(w). \qquad (5)$$
Equivalently, we can rewrite (5) as
$$E_r(v) = \{\{x, y\} \in E : x \in B_r(v) \text{ or } y \in B_r(v)\}.$$
The HYPEREDGEBALL algorithm gives an approximation of the number of edges around a node after r iterations, $|E_{r+1}(v)|$. Compared to the HYPERBALL algorithm, the only change is in the initialisation phase (Algorithm A.2), since we now count edges instead of nodes. This new initialisation is formalised in Algorithm 1.
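A minimal sketch of the changed initialisation, again with exact sets in place of HyperLogLog counters: every node starts with its incident edges, and the same union step as for nodes is then repeated. The function name `edgeballs` and the representation of an undirected edge as a `frozenset` are assumptions made for illustration.

```python
def edgeballs(adj, radius):
    """Return the edge sets E_radius(v) for every node v, using exact sets."""
    # Initialisation: E_0(v) holds every edge incident to v, as an unordered pair.
    balls = {v: {frozenset((v, w)) for w in adj[v]} for v in adj}
    for _ in range(radius):
        # Same union step as for nodes: merge the counters of all neighbours.
        balls = {v: balls[v].union(*(balls[w] for w in adj[v])) for v in adj}
    return balls


# Path graph 0 - 1 - 2 - 3 - 4 (a hypothetical toy input).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
e0 = edgeballs(adj, 0)
e1 = edgeballs(adj, 1)
print(len(e0[2]), len(e1[2]))    # edges around the middle node for r = 0 and r = 1
```

Storing an edge as an unordered pair means that it receives the same key from both of its endpoints, so the union counts it once.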

Algorithms
We will now show that HyperBall-type algorithms extend easily to counting arbitrary patterns that are more complex than edges or nodes. In particular, we will show that we can apply HyperBall-type algorithms to approximate three important clustering measures: conductance, the number of triangles, and transitivity.
Algorithm 1 Initialisation of the HYPEREDGEBALL algorithm (final steps).
    ...
        write ⟨v, c[v]⟩ to disk
        return SIZE(c[v]), which estimates $|E_0(v)|$
    end for

Conductance
The first quantity of interest that we investigate is conductance. The conductance of a subgraph is defined as follows:
Definition 4 (Conductance). For a graph G = (V, E) with n = |V| and m = |E|, the conductance of a subgraph S ⊂ G is:
$$\phi(S) = \frac{|\delta(S)|}{\min\{\mathrm{vol}(S),\, 2m - \mathrm{vol}(S)\}},$$
where $\delta(S) = \{\{x, y\} \in E \mid x \in S, y \notin S\}$ is the boundary of S, and vol(S) is the volume of S, equal to the sum of the degrees of the nodes in subgraph S.
In the rest of this paper we use a simplified definition of the conductance, as we assume that the graphs that we analyse are non-empty and that the volume of each subgraph that we analyse is smaller than the volume of its complement:
$$\phi(S) = \frac{|\delta(S)|}{\mathrm{vol}(S)}.$$
Next, we wish to provide a memory-efficient way of estimating the conductance of ball subgraphs. To this end, it seems natural to combine HYPERBALL and HYPEREDGEBALL from Section 2. However, this turns out to be insufficient, because the low memory usage implies that we cannot identify the edges of $\delta(B_r(v))$.
To overcome this problem, we transform our (undirected) graph into a directed variant in which every undirected edge becomes two directed edges, and we introduce directed edgeballs:
Definition 5 (Out-edgeball). The out-edgeball with radius r around node v is defined as follows:
Similarly, we can define an in-edgeball:
Definition 6 (In-edgeball). The in-edgeball with radius r around node v is defined as follows:
The different kinds of edgeballs are illustrated in Figure 1. For our purposes, it is important to notice that an edge on the boundary, between a node inside and a node outside the ball of radius r, belongs only to the out-edgeball, while all other edges, which have both endpoints in the nodeball $B_r(v)$, belong to both the in- and the out-edgeball. This is exactly why the directed edgeballs are helpful for estimating conductance. This is formally stated in the following theorem:
Theorem 2. For an undirected ball subgraph $S_r(v)$ it holds that:
Proof. Denote by $\mathbb{1}\{A\}$ the indicator of A, and let $\vec{E}$ be the set of directed edges obtained from the edge set E by replacing each undirected edge {x, y} by two directed edges (x, y) and (y, x). By definition of the volume of a set of nodes, we obtain
which proves (12). Next, using (12), we write
The last expression equals the right-hand side of (11) by the definition of $E_r(v)$ and the fact that in the second term each undirected edge is counted twice.
where $|\hat{E}_r(v)|$ and $|\hat{E}^-_r(v)|$ are the estimates of $|E_r(v)|$ and $|E^-_r(v)|$, respectively.
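The counting argument behind the use of directed edges can be checked exactly on a small graph. The sketch below makes no claim about the precise edgeball definitions; it only verifies the elementary identity that the volume of a node set equals its number of boundary edges plus the number of directed edges with both endpoints inside, where each internal undirected edge is counted twice. The helper names `ball` and `conductance_parts` and the toy graph are assumptions of this sketch.

```python
def ball(adj, v, r):
    """Nodes within graph distance r of v, found by breadth-first search."""
    seen, frontier = {v}, [v]
    for _ in range(r):
        nxt = []
        for u in frontier:
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return seen


def conductance_parts(adj, nodes):
    """Return (boundary edges, directed internal edges, volume) of a node set."""
    volume = sum(len(adj[u]) for u in nodes)
    boundary = sum(1 for u in nodes for w in adj[u] if w not in nodes)
    internal_directed = sum(1 for u in nodes for w in adj[u] if w in nodes)
    return boundary, internal_directed, volume


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge {2, 3} (toy input).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
nodes = ball(adj, 0, 1)
boundary, internal_directed, volume = conductance_parts(adj, nodes)
phi = boundary / volume          # simplified conductance |delta(S)| / vol(S)
print(sorted(nodes), boundary, internal_directed, volume, phi)
```

In this example the ball around node 0 is one of the two triangles, so the single bridge edge is the whole boundary.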

Algorithm 2
The directed HYPEREDGEBALL algorithm. The ADD and SIZE functions of Algorithm A.1 and the COUNTBALL function of Algorithm A.2 are used to find the out-edgeball estimators.

Triangles and wedges
We now present our algorithm that counts the number of triangles within a ball of radius r around node v. We denote this number by $\Delta_r(v)$, and its estimator by $\hat{\Delta}_r(v)$. The idea is to obtain $\hat{\Delta}_r(v)$ using the same HYPERBALL algorithm as for counting edges or nodes, but to initialise the counters with triangles instead of nodes or edges. For this initialisation, we assign a unique hash value to each triangle in the graph. This can be done using algorithms for exact triangle counting such as compact-forward [15] or edge-iterator [3]. In Algorithm 3, we give an example of how this can be implemented. Now denote by $w_r(v)$ the number of wedges in $B_r(v)$.
Since wedges are open triangles, an estimator $\hat{w}_r(v)$ of $w_r(v)$ can be found in exactly the same way as $\hat{\Delta}_r(v)$, but in the initialisation we add a wedge {i, v, j} to a counter if i, j ∈ V are neighbours of v. (Note that in line 7 of Algorithm 3 we verify that i and j are neighbours; this is needed for the initialisation of the counter of triangles.)
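The initialisation for triangles and wedges can be sketched with exact sets as follows. This is an illustration of the idea rather than Algorithm 3 itself: the function name `init_triangles_and_wedges` is hypothetical, node labels are assumed to be sortable, a triangle is keyed by its unordered node set so that all three of its nodes produce the same key, and a wedge is keyed by its centre together with its two end nodes.

```python
from itertools import combinations


def init_triangles_and_wedges(adj):
    """Per node v: the triangles containing v and the wedges centred at v."""
    neighbours = {v: set(ws) for v, ws in adj.items()}
    triangles = {v: set() for v in adj}
    wedges = {v: set() for v in adj}
    for v in adj:
        for i, j in combinations(sorted(neighbours[v]), 2):
            wedges[v].add((i, v, j))                     # i and j are neighbours of v
            if j in neighbours[i]:                       # i and j are also adjacent
                triangles[v].add(frozenset((i, v, j)))
    return triangles, wedges


# Complete graph on four nodes (a hypothetical toy input).
adj = {u: [w for w in range(4) if w != u] for u in range(4)}
triangles, wedges = init_triangles_and_wedges(adj)
all_triangles = set().union(*triangles.values())
all_wedges = set().union(*wedges.values())
print(len(triangles[0]), len(wedges[0]), len(all_triangles), len(all_wedges))
```

Because the same triangle key is produced at each of its three nodes, taking unions of these sets over a ball counts every triangle once, which is what lets the HYPERBALL union step be reused unchanged.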

Extension to counting of arbitrary induced subgraphs
The algorithms above for counting triangles and wedges easily extend to counting arbitrary induced connected subgraphs, called graphlets, within balls $B_r(v)$. For that, in the initialisation phase, we need to count the graphlets that involve node v for all v ∈ V, and assign a unique hash value to each of these graphlets. After that, we can run the HYPERBALL algorithm to count graphlets in ball subgraphs. This approach can potentially indicate parts of networks with unusual quantities of particular graphlets. However, applying HYPERBALL to general subgraph patterns also comes with important difficulties. First, HYPERBALL counts nodes, edges or graphlets in ball subgraphs, and therefore this approach cannot be easily extended to counting graphlets in subgraphs of any other form. Second, the initialisation phase presents a computational bottleneck, because assigning a unique hash value to each graphlet subsumes the computationally demanding task of exact graphlet counting.
Algorithm 3 Initialisation of the counters with triangles or wedges (surviving steps).
    ...
    5: for each (i, v) ∈ E do
    6:     for each (j, v) ∈ E do
    7:         if (i, j) ∈ E and triangle then
    ...
        write ⟨v, c[v]⟩ to disk
    15: return SIZE(c[v]), which estimates $|\Delta_0(v)|$ or $|w_0(v)|$
    16: end for
Algorithm Initialisation of the counters with graphlets (surviving steps).
    ...
        if v ∈ graphlet then
            ADD(c[v], graphlet)
        write ⟨v, c[v]⟩ to disk
        return SIZE(c[v]), which estimates the number of graphlets in $B_1(v)$.
We now introduce Theorems 3 and 5, which give lower and upper bounds for the conductance estimator $\hat{\phi}_{B_r(v)}$ from (13), based on Chebyshev's inequality and the Vysochanskij-Petunin inequality [24].
Theorem 3 (Chebyshev bound for the conductance estimator). For all v ∈ V, r ≥ 1, the conductance estimator $\hat{\phi}_{B_r(v)}$, as defined in (13), satisfies:
Theorem 4 (Vysochanskij-Petunin inequality). Assume that the random variable X has a unimodal distribution with finite mean E(X) and variance $\mathrm{Var}(X) = \sigma^2$. Then, for $\lambda > \sqrt{8/3}\,\sigma$,
$$P(|X - E(X)| \ge \lambda) \le \frac{4\sigma^2}{9\lambda^2}.$$
By using this inequality instead of Chebyshev's inequality, we can obtain tighter error bounds than the ones in Theorem 3; they are given in the next theorem.
Theorem 5 (Vysochanskij-Petunin bound for the conductance estimator). If $|\hat{E}_r(v)|$ and $|\hat{E}^-_r(v)|$ have a unimodal distribution, then for all v ∈ V, r ≥ 1, the conductance estimator $\hat{\phi}_{B_r(v)}$, as defined in (13), satisfies:
The proof of this theorem is identical to the proof of Theorem 3, but it uses the Vysochanskij-Petunin inequality instead of Chebyshev's inequality. In Section 5 we will show that the VP bound indeed holds in our numerical experiments by using a statistical test of unimodality [12]. In future research it will be interesting to identify general conditions under which the HyperLogLog counters produce estimators with a unimodal distribution.
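To see how much the unimodality assumption buys, the two tail inequalities can be compared numerically for a single counter. The sketch below uses only the generic Chebyshev and Vysochanskij-Petunin inequalities, not the bounds of Theorems 3 and 5 for the ratio estimator; the function names are hypothetical, and the relative standard error 1.046/√p with p = 128 is taken as an illustrative value based on the bound quoted for HYPERBALL counters.

```python
import math


def chebyshev_tail(sigma, lam):
    """Chebyshev: P(|X - E(X)| >= lam) <= sigma^2 / lam^2."""
    return min(1.0, sigma ** 2 / lam ** 2)


def vp_tail(sigma, lam):
    """Vysochanskij-Petunin: P(|X - E(X)| >= lam) <= 4 sigma^2 / (9 lam^2)."""
    if lam <= math.sqrt(8.0 / 3.0) * sigma:
        raise ValueError("the inequality requires lam > sqrt(8/3) * sigma")
    return 4.0 * sigma ** 2 / (9.0 * lam ** 2)


def margin(sigma, level, kind):
    """Deviation lam at which the chosen tail bound equals 1 - level.

    For the Vysochanskij-Petunin case this is valid when 1 - level < 1/6.
    """
    tail = 1.0 - level
    if kind == "chebyshev":
        return sigma / math.sqrt(tail)
    return 2.0 * sigma / (3.0 * math.sqrt(tail))


# Illustrative relative standard error of one counter with p = 128 registers.
sigma_rel = 1.046 / math.sqrt(128)
cheb = margin(sigma_rel, 0.95, "chebyshev")
vp = margin(sigma_rel, 0.95, "vp")
print(f"95% relative margin: Chebyshev {cheb:.3f}, Vysochanskij-Petunin {vp:.3f}")
```

At any fixed confidence level in the admissible range, the Vysochanskij-Petunin margin is two thirds of the Chebyshev margin, which is the source of the tighter bounds.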

Error bounds of the estimator for the triangle count
We again use Chebyshev's inequality to find lower and upper error bounds for the triangle count estimator $\hat{\Delta}_r(v)$ from Algorithm 3:
Theorem 6 (Chebyshev bound for the triangle count estimator). For all v ∈ V, r ≥ 1, the triangle estimator $\hat{\Delta}_r(v)$ satisfies:
for a > 0 and $\eta = \beta_p/\sqrt{p} + \delta_2 + o_1(\Delta_r(v))$, where $\delta_2 = 5 \cdot 10^{-4}$ and $o_1 = o(1)$ as its argument goes to infinity.
The proof of this theorem is again based on Theorem 1, and is given in Appendix B. The Vysochanskij-Petunin inequality [24], which holds whenever an estimator has a unimodal distribution, can also be used to obtain a slightly tighter error bound:
Theorem 7 (Vysochanskij-Petunin bound for the triangle estimator). If $\hat{\Delta}_r(v)$ has a unimodal distribution, then for all v ∈ V, r ≥ 1, the triangle estimator $\hat{\Delta}_r(v)$ satisfies:
for $\lambda > \sqrt{8/3 \cdot \mathrm{Var}(\hat{\Delta}_r(v))}$ and η as in Theorem 6.
The proof of Theorem 7 goes in the same way as the proof of Theorem 6, but with the Vysochanskij-Petunin inequality instead of Chebyshev's inequality.

Error bounds for transitivity
The transitivity of a graph G is defined as follows:
Definition 7 (Transitivity). The transitivity of a graph G equals
$$t(G) = \frac{3\,\Delta(G)}{w(G)},$$
where w(G) is the number of wedges and ∆(G) the number of triangles in G.
In order to find the transitivity of ball subgraphs, we need to find the number of wedges, $|w_{B_r(v)}|$, in these ball subgraphs, which we obtain by using Algorithm 3.
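For reference, the exact transitivity of the subgraph induced by a node set can be computed as in the sketch below. It assumes the usual convention that every triangle closes three wedges, so that the transitivity is three times the number of triangles divided by the number of wedges; the function name `transitivity` and the toy graph are assumptions of this sketch.

```python
from itertools import combinations


def transitivity(adj, nodes):
    """Return (transitivity, triangles, wedges) of the subgraph induced by nodes."""
    nodes = set(nodes)
    wedges = 0
    closed = 0
    for v in nodes:
        inside = [w for w in adj[v] if w in nodes]
        for i, j in combinations(inside, 2):
            wedges += 1                  # wedge i - v - j centred at v
            if j in adj[i]:
                closed += 1              # the wedge is closed by the edge {i, j}
    triangles = closed // 3              # each triangle closes three wedges
    value = 3 * triangles / wedges if wedges else 0.0
    return value, triangles, wedges


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge {2, 3} (toy input).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
whole = transitivity(adj, adj.keys())
inner = transitivity(adj, {0, 1, 2})     # the ball of radius 1 around node 0
print(whole, inner)
```

The ball of radius 1 around node 0 is a single triangle and therefore fully transitive, while the whole graph contains extra open wedges through the bridge edge.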

Error bounds of the transitivity estimator
We can find the Chebyshev and Vysochanskij-Petunin error bounds of our estimator for transitivity similarly to Theorems 6 and 7. The transitivity estimator is
$$\hat{t}_{B_r(v)} = \frac{3\,\hat{\Delta}_r(v)}{\hat{w}_r(v)}. \qquad (20)$$
The expectation and variance of our estimators $\hat{\Delta}_r(v)$ and $\hat{w}_r(v)$ can be obtained from Theorem 1, which results in the following Chebyshev error bound of the transitivity estimator:
Theorem 8 (Chebyshev bound for the transitivity estimator). For v ∈ V, r ≥ 1, the transitivity estimator $\hat{t}_{B_r(v)}$, as defined in (20), satisfies:
for $p_1, p_2 > 0$ and $\varepsilon = p_1 \Delta_r(v) + \delta_1 + o_1(\Delta_r(v))$, $\gamma = p_2 w_r(v) + \delta_1 + o_2(w_r(v))$, with $\delta_1 = 5 \cdot 10^{-5}$ and $o_1, o_2 = o(1)$ as their argument goes to infinity.
When this transitivity estimate has a unimodal distribution amongst the nodes, we can again use the Vysochanskij-Petunin inequality to obtain tighter error bounds:
Theorem 9 (Vysochanskij-Petunin bound for the transitivity estimator). If $\hat{\Delta}_r(v)$ and $\hat{w}_r(v)$ have a unimodal distribution, then for all v ∈ V, r ≥ 1, the transitivity estimator $\hat{t}_{B_r(v)}$, as defined in (20), satisfies:

Performance
To evaluate the performance of our algorithms, we first run a series of experiments on artificial LFR graphs [14]. In LFR graphs, the nodes are divided into pre-defined communities, and one can choose a mixing parameter µ ∈ [0, 1], which is the probability that an edge emanating from a node connects to a node outside of its community. The LFR model involves a number of other parameters: the minimum and maximum community size ($|C_{\min}|$, $|C_{\max}|$), the average and maximum degree ($d$, $d_{\max}$), the power-law exponent of the inverse cumulative distribution of the node degrees ($\tau_1$), and the power-law exponent of the inverse cumulative distribution of community sizes ($\tau_2$). We have used the LFR graph generator of NetworkX 2.4 in Python 3.6.9 to create LFR graphs with three different sets of parameters, as shown in Table 1. We have used graphs with 1000 and 5000 nodes because in these small graphs we can find the exact number of edges, directed edges, wedges and triangles in all ball subgraphs, and compare the performance of our algorithms to these exact results.
Table 1: Parameters for the generated LFR-graphs
For the experiments on LFR graphs, we have used a mixing parameter µ = 0.3, since this was the smallest mixing parameter with no isolated nodes in every generated graph. We have investigated the conductance, the number of triangles and the transitivity in ball subgraphs of radii 1 and 2. A ball of larger radius contains a large part of the entire graph. Figure 2 shows the exact and the estimated conductance in an LFR-3 graph in ball subgraphs of radius 1 (Figure 2a) and radius 2 (Figure 2b). On the horizontal axis, the nodes v are arranged in the order of ascending conductance of $B_r(v)$, r = 1, 2. The DIP test for unimodality [12] gives a small p-value of 0.0085, therefore we can reasonably assume that the conductance estimator is unimodal and apply the Vysochanskij-Petunin error bounds (Theorem 5).
Figure 2 shows that the Vysochanskij-Petunin bounds are tight and represent the 95% margin of the estimation well. To investigate how the precision improves as the number of registers increases, we have experimented with different numbers of registers in the LFR-1 graph. The results are shown in Table 2. As we can see from this table, when the number of registers increases, the Vysochanskij-Petunin and Chebyshev error bounds and the experimental error tighten rapidly, and the mean error decreases but keeps oscillating around 0, as expected from Theorem 1 of [10].

Triangles
In Figure 3 we show the number of triangles and their estimates in the LFR-3 graph in ball subgraphs of radii 1, 2 and 3. Since the estimate for the number of triangles again has a low p-value on the DIP test for unimodality (p < 0.01), we can again use the Vysochanskij-Petunin bounds. There is a large difference in the number of triangles between the balls of radius 2 and 3, which can be explained by the fact that ball subgraphs of radius 3 contain a large fraction of the entire graph.

Transitivity
In Figure 4, we show the transitivity in the ball subgraphs $B_1(v)$ (Figure 4a) and $B_2(v)$ (Figure 4b), together with the bounds from Theorems 8 and 9. The DIP test for unimodality gives a p-value lower than 0.01, so the Vysochanskij-Petunin error bounds can again be used. Interestingly, the transitivity in ball subgraphs of radius 2 is already very small, which means that these ball subgraphs contain significantly more wedges than triangles. This is again an indication that the best communities in this kind of graph lie between the ball subgraphs of radius 1 and the ball subgraphs of radius 2, as also shown in [11]. Note that our estimation of the transitivity based on the HYPERBALL-type algorithms is very precise: the precision is much higher than suggested by the error bounds. This high precision may be explained by the fact that the number of wedges in a graph is often very large, which improves the precision of the wedge estimator and therefore of the transitivity estimator as well.

Application to community detection
In this section we will show how our HYPERBALL-based algorithms can help to detect communities in real-world networks. As an example, we use two large networks with a community structure: COM-DBLP and COM-AMAZON [16,26]. Table 3 summarises the properties of these graphs.
Table 3: Properties of the used real-world graphs
We will identify the communities in these networks using the PAGERANK-NIBBLE algorithm [1], which detects a community by finding a set of nodes with low conductance, starting from a random seed set of nodes. We propose to enhance this algorithm by using alternative seed sets found by our HYPERBALL algorithms. For example, a seed set can consist of ball subgraphs with the smallest conductance, or the largest density of triangles, or the largest transitivity. We have implemented this approach using 5 different seed sets, each of 100 nodes, in PAGERANK-NIBBLE:
1. φ-seeds: nodes v with smallest conductance in $B_1(v)$ and $B_2(v)$;
2. ∆-seeds: nodes v with largest number of triangles of radius 0 and radius 1, $|\Delta_0(v)|$ and $|\Delta_1(v)|$;
3. t-seeds: nodes v with highest transitivity in $B_1(v)$ and $B_2(v)$;
4. degree seeds: nodes with highest degree;
5. random seeds: randomly chosen nodes.
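Selecting such seed sets amounts to ranking the nodes by a per-node score and keeping the best k. The helper below is a hypothetical illustration: the name `pick_seeds`, the dictionary of scores and the toy values are assumptions, and in the setting above the scores would be the estimated conductance, triangle count or transitivity of the ball around each node, with k = 100.

```python
import heapq


def pick_seeds(score, k, largest=False):
    """Return the k nodes with the smallest (or largest) score, best first."""
    chooser = heapq.nlargest if largest else heapq.nsmallest
    return chooser(k, score, key=score.get)


# Hypothetical estimated conductances of the balls around four nodes.
estimated_conductance = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.3}
phi_seeds = pick_seeds(estimated_conductance, 2)                   # smallest first
top_scores = pick_seeds(estimated_conductance, 2, largest=True)    # largest first
print(phi_seeds, top_scores)
```

Conductance seeds use the smallest scores, whereas triangle, transitivity and degree seeds use the largest, which is why the direction is a parameter.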
For the PAGERANK-NIBBLE algorithm, we have used a maximum cut size of 200 and a teleport probability of α = 0.85, and we calculate the ε-approximate PageRank vector with $\varepsilon = 10^{-8}$ [1]. We then compare the conductance obtained with the seed sets listed above. The results are presented in Figures 5 and 6. For both graphs, the subgraphs of lowest conductance are found with the φ-seeds. We notice that the sets of small conductance that are found are often small, nearly isolated sets of nodes. Interestingly, using the φ-seeds, PAGERANK-NIBBLE is able to find such sets, while with degree seeds it fails to do so.
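The two ingredients of this procedure, an approximate personalised PageRank vector and a sweep over its degree-normalised entries, can be sketched as follows. This is a simplified single-seed illustration in the spirit of the push-and-sweep scheme of [1], not the implementation used in the experiments: the function names, the queue-based scheduling, the lazy-walk push rule and the toy graph are assumptions of this sketch, and the maximum cut size is not enforced.

```python
from collections import deque


def approximate_pagerank(adj, seed, alpha, eps):
    """Push residual mass until every node u has r[u] < eps * degree(u)."""
    p, r = {}, {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        du, ru = len(adj[u]), r.get(u, 0.0)
        if ru < eps * du:
            continue
        p[u] = p.get(u, 0.0) + alpha * ru        # keep a fraction alpha at u
        r[u] = (1 - alpha) * ru / 2              # lazy walk: half of the rest stays
        share = (1 - alpha) * ru / (2 * du)      # the other half goes to neighbours
        for w in adj[u]:
            old = r.get(w, 0.0)
            r[w] = old + share
            if old < eps * len(adj[w]) <= r[w]:  # w has just crossed its threshold
                queue.append(w)
        if r[u] >= eps * du:
            queue.append(u)
    return p


def sweep_cut(adj, p):
    """Best-conductance prefix of the nodes ordered by p[u] / degree(u)."""
    total_volume = sum(len(ws) for ws in adj.values())
    order = sorted(p, key=lambda u: p[u] / len(adj[u]), reverse=True)
    inside, volume, cut = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for u in order:
        inside.add(u)
        volume += len(adj[u])
        to_inside = sum(1 for w in adj[u] if w in inside)
        cut += len(adj[u]) - 2 * to_inside       # new cut edges minus absorbed ones
        denom = min(volume, total_volume - volume)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_set = cut / denom, set(inside)
    return best_set, best_phi


# Two cliques on five nodes each, joined by the single edge {4, 5} (toy input).
adj = {u: [] for u in range(10)}
for block in (range(0, 5), range(5, 10)):
    for a in block:
        adj[a].extend(b for b in block if b != a)
adj[4].append(5)
adj[5].append(4)

p = approximate_pagerank(adj, seed=0, alpha=0.85, eps=1e-8)
best_set, best_phi = sweep_cut(adj, p)
print(sorted(best_set), best_phi)
```

In this toy graph the sweep recovers the clique that contains the seed. With a seed set rather than a single seed, the initial residual mass would be spread over the seed nodes.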
In the Amazon graph, t-seeds result in low conductance subgraphs after PAGERANK-NIBBLE, suggesting that high local transitivity is a good indication of a community. Notice that random seeds also yield low conductance sets, but t-seeds clearly outperform this benchmark. In the DBLP graph (Figure 6), besides the φ-seeds, both ∆-seeds and t-seeds result in sets of low conductance after PAGERANK-NIBBLE. Moreover, when seeds are chosen based on ball subgraphs of radius 2, ∆-seeds work best. Not only does this help us to detect communities, but we also gain the insight that, since communities have a high number and density of triangles, it is important to use this knowledge to detect them. Figure 6 confirms that this is an important feature for community detection in this network: the t-seeds and the ∆-seeds result in a much lower conductance after PAGERANK-NIBBLE than the degree seeds and random seeds. We conclude that, while the performance of the different seed sets differs between real-world networks, our HYPERBALL-based seed sets convincingly outperform the random and degree-based seed sets.

Discussion
In this paper, we showed new applications of the HYPERBALL algorithm to count triangles and wedges. This enables us to approximate statistics like the transitivity and the conductance in ball subgraphs. Our estimates have good precision, for which we have derived explicit bounds. In this paper we demonstrated how these algorithms can be applied to choose good seed sets for community detection and to understand in more detail the structure of the communities.
Moreover, we showed that HYPERBALL can be extended to count the number of graphlets of any form in ball subgraphs. This gives us the means to find areas in large networks with a high concentration of graphlets of a specific kind. Potentially this will yield new ways to find anomalies in networks [9,5], to distinguish the structure of networks of different nature (e.g. biological, technological or social networks) [17], and to compare real-world networks to mathematical random graph models, where results on the most likely locations of particular graphlets have been explicitly derived in recent literature [23,19]. The bottleneck of the HYPERBALL-type algorithms for local graphlet counts is the initialisation, which requires, for each node v, the exact count of the graphlets that involve v. How to resolve this bottleneck remains an interesting question for further research.
A HyperLogLog and HyperBall algorithm
Algorithm A.1 The HYPERLOGLOG algorithm as described in [10], which approximates the cardinality of a data stream M.
Algorithm A.2 The HYPERBALL algorithm as described in [8], which returns an estimate of the ball cardinality for each node. The functions ADD and SIZE of Algorithm A.1 are used.
    ...
        for each v ∈ V do
            a ← c[v]
            for each w ∈ N(v) do
                a ← UNION(c[w], a)
            end for
            write ⟨v, a⟩ to disk
        end for
        update the array c[−] with the new ⟨v, a⟩ pairs
        r ← r + 1
    until no counter changes its value
    return SIZE(c)
The conductance estimator $\hat{\phi}_{B_r(v)}$ is a fraction of two other estimators, $|\hat{E}_r(v)|$ and $|\hat{E}^-_r(v)|$. The expectation and the variance of these estimators, as the cardinality of the estimated set goes to infinity, are given in Theorem 1: The coefficient η is defined as follows: The upper bounds are derived by using the fact that $|\delta_1(x)| \le 5 \cdot 10^{-5} = \delta_1$ for all x and $|\delta_2(x)| \le 5 \cdot 10^{-4} = \delta_2$ for all x, when the number of registers is larger than or equal to $2^4$ (Theorem 1). Throughout this work we assume that the number of registers is always at least $2^4$. Our goal is to use these estimators to find lower and upper bounds for $\hat{\phi}_{B_r(v)}$. Following Chebyshev's theorem, we get the following inequalities for $p_1, p_2 > 0$ when the numbers of edges and directed edges in $B_r(v)$ tend to infinity:
The proof of this theorem is straightforward: we use Chebyshev's inequality for the triangle estimator, where we substitute the expectation and variance from Equations (37) and (38): for a > 0. Since η = (1), as defined in (27), this concludes the proof.
Theorems 8 and 9 are proved in the same way as Theorems 3 and 5, since in both cases we need to find error bounds of a ratio of two estimators.