Bounding the Bias of Tree-Like Sampling in IP Topologies

It is widely believed that the Internet's AS-graph degree distribution obeys a power-law form. Most of the evidence showing the power-law distribution is based on BGP data. However, it was recently argued that since BGP collects data in a tree-like fashion, it only produces a sample of the degree distribution, and this sample may be biased. This argument was backed by simulation data and mathematical analysis, which demonstrated that under certain conditions a tree sampling procedure can produce an artificial power-law in the degree distribution. Thus, although the observed degree distribution of the AS-graph follows a power-law, this phenomenon may be an artifact of the sampling process. In this work we provide some evidence to the contrary. We show, by analysis and simulation, that when the underlying graph degree distribution obeys a power-law with an exponent larger than 2, a tree-like sampling process produces a negligible bias in the sampled degree distribution. Furthermore, recent data collected from the DIMES project, which is not based on BGP sampling, indicates that the underlying AS-graph indeed obeys a power-law degree distribution with an exponent larger than 2. By combining this empirical data with our analysis, we conclude that the bias in the degree distribution calculated from BGP data is negligible.


Background and Motivation
The connectivity of the Internet crucially depends on the relationships between thousands of Autonomous Systems (ASes) that exchange routing information using the Border Gateway Protocol (BGP). These relationships can be modeled as a graph, called the AS-graph, in which the vertices model the ASes, and the edges model the peering arrangements between the ASes.

Contributions
Our main contribution is our analysis of the degree distribution observed in the BFS tree sample, when the underlying graph has a power-law distribution with an exponent 2 < γ < 3. Under these conditions, we prove that the bias in the power-law is negligible: with high probability, the degree distribution of the high-degree nodes in the sample also exhibits a power-law, with exactly the same exponent γ. We validate our mathematical analysis with simulation results using the DIMES-measured AS-graph as the underlying power-law graph.
Putting this result in the context of the Internet topology, we recall that the data collected from the DIMES project is not based on BGP-style tree sampling. Nevertheless, DIMES data indicates that the underlying AS-graph indeed obeys a power-law degree distribution with an exponent γ > 2. By combining this empirical data with our analysis, we conclude that the bias in the degree distribution calculated from BGP data is negligible.
Organization: In the next section we give an overview of the results of [ACKM05] we rely on. In Section 3 we show our main result, that the bias in the degree distribution of a tree-sampled power-law graph is negligible. In Section 4 we sketch an alternative, more rigorous, analysis of a weaker result, that validates some of the approximations we used in our main result. Section 5 describes the results of our simulations. We conclude with Section 6.

The General Framework
The proof of our result is based on the model, sampling process, and main results described in [ACKM05]. In this section we give a brief introduction to the main results we need.
Notation: Throughout the paper we use G = (V, E) to denote the underlying graph, and n = |V| to denote the number of vertices.
Definition 1 We say that $\{a_j\}$ is a degree distribution of G if G contains $a_j \cdot n$ nodes of degree j.
In the [ACKM05] model, the graph G is not a given graph but a random graph chosen out of a family of graphs obeying a given degree distribution $\{a_j\}$. The basic setting is the configuration model of [Bol85]: for each vertex of degree k we create k copies, and then define the edges of the graph according to a uniformly random matching on these copies.
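The configuration model described above can be sketched in a few lines. This is a minimal illustration, not the construction used in [ACKM05]: the function names and the toy degree sequence are hypothetical, and self-loops and parallel edges are kept rather than rejected.

```python
import random
from collections import Counter

def configuration_model(degrees, seed=None):
    """Return a list of edges (u, v) from a uniformly random matching on vertex copies.

    degrees[v] is the target degree of vertex v; the degree sum must be even.
    Self-loops and parallel edges may appear, as in the plain configuration model.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the degree sum must be even to match all copies")
    rng = random.Random(seed)
    # One copy per unit of degree: a vertex of degree k contributes k copies.
    copies = [v for v, d in enumerate(degrees) for _ in range(d)]
    # Shuffling and pairing consecutive copies yields a uniform random perfect matching.
    rng.shuffle(copies)
    return [(copies[i], copies[i + 1]) for i in range(0, len(copies), 2)]

def realized_degrees(n, edges):
    """Count the matched copies of each vertex (a self-loop counts twice)."""
    count = Counter()
    for u, v in edges:
        count[u] += 1
        count[v] += 1
    return [count[v] for v in range(n)]

# Hypothetical toy degree sequence, for illustration only.
degrees = [3, 3, 2, 2, 1, 1]
edges = configuration_model(degrees, seed=7)
print(edges)
print(realized_degrees(len(degrees), edges))
```

Every copy is matched exactly once, so the realized degree sequence equals the requested one by construction.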

The BFS Tree Sampling Process
The [ACKM05] model defines a randomized process that simultaneously produces a random graph G obeying the degree distribution $\{a_j\}$, and a BFS tree T that represents the sample. Note that for a given graph G, a BFS algorithm is a deterministic algorithm, but different outcomes are possible, depending on the order in which outgoing edges are traversed. In the [ACKM05] model, a random choice determines this order.
The sampling process is thought of as taking place in continuous time. However, for technical reasons the authors define a non-standard notion of time, which we denote by a capitalized word (Time). In this model, the BFS sample process starts at Time t = 1 with an empty tree T. As the sampling process evolves, Time decreases to t = 0, when the sample tree T includes all n nodes (assuming G is connected).
Before the process starts, for each vertex v there are deg(v) copies of v. Each copy is given a real-valued index chosen uniformly at random from the unit interval [0, 1]. Namely, vertex v has deg(v) indices, chosen uniformly and independently at random from the unit interval [0, 1]. At every Time step t two copies are matched: one copy is a copy of a vertex already discovered, and the other copy is a copy with index t. Such a matched pair forms an edge of the original graph. According to [ACKM05] at Time t the indices of the unmatched copies are uniformly random in [0, t). Let the maximum index of a vertex be the maximum of all its copies' indices. Then at any Time t, the vertices that have not been discovered yet are precisely those whose maximum index is less than t. An edge will be visible, namely, included in the BFS tree, if at the Time its endpoints are matched one of them is a copy of an undiscovered vertex.
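Since a vertex stays undiscovered exactly while its maximum index is below the current Time, its discovery Time equals its maximum index. The sketch below is a small Monte Carlo illustration of that observation, not part of [ACKM05]; the helper names, trial count, and seed are arbitrary choices. It estimates the expected discovery Time for a few degrees and prints it next to d/(d + 1), the mean of the maximum of d independent uniform variables.

```python
import random

def max_index(degree, rng):
    """Maximum of `degree` independent uniform indices in [0, 1)."""
    return max(rng.random() for _ in range(degree))

def mean_discovery_time(degree, trials=20000, seed=1):
    """Monte Carlo estimate of the expected Time at which a vertex is discovered."""
    rng = random.Random(seed)
    return sum(max_index(degree, rng) for _ in range(trials)) / trials

for d in (1, 3, 18):
    # The maximum of d independent uniform variables has mean d / (d + 1).
    print(d, round(mean_discovery_time(d), 3), round(d / (d + 1), 3))
```

Because Time runs from 1 down to 0, a larger maximum index means earlier discovery, so high-degree vertices tend to be found near the start of the process.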

Useful Notations and Theorems of [ACKM05]
Let $v_t$ be the vertex that has a copy with maximum index t. Denote by $P_{vis}(t)$ the probability that another edge outgoing from $v_t$ appears in the BFS tree. Namely, $v_t$ is the vertex that was discovered at Time t, and we are interested in the probability that another vertex is discovered through $v_t$.
Using the expectation of $P_{vis}(t)$ as an approximation, we get from Equation (7) and Lemma 3 of [ACKM05] the following Theorem [footnote 1]:
Theorem 1 Let G be a connected graph and let $\{a_j\}$ be a degree distribution that is upper bounded by a power-law with an exponent larger than 2. Let $\mu = \sum_j j a_j$ be the mean degree of G. Then it holds that (1)

The Degree Distribution of the Sampled Graph
Our goal in this section is to show that the bias observed in BFS tree sampling regarding the degree distribution of the sampled graph is not significant when the underlying graph degree distribution obeys a power-law with an exponent γ > 2. We show this by examining the BFS tree produced by the sampling process of [ACKM05], described in the previous section. Finding a BFS tree from a single source is an idealization of the BGP data collection process. Thus, if the bias is negligible when using such a BFS process, then we argue that it is very likely to be negligible in the more general case of BGP. Recall that we focus on a BFS process on a random graph G. Let T be the BFS tree produced by this process. Let $\deg_T(v)$ denote the degree of node v in the BFS tree T, and let $\deg_G(v)$ denote the degree of node v in the graph G.

Definition 2
We say that a vertex v has high degree if $\deg_G(v) \ge 18$. We say that an event occurs with high probability (w.h.p.) if it occurs with probability at least 1 − 0.16 = 0.84.
The main theorem we prove is the following:
Theorem 2 If the underlying graph degree distribution obeys a power-law with an exponent γ, where 2 < γ < 3, then w.h.p. the degree distribution of the high-degree vertices of the sampled graph also follows a power-law, with the same exponent value.
Footnote 1: In Lemma 3 of [ACKM05] there is a requirement that $a_j = 0$ for j < 3. This implies that with high probability the graph is connected. However, if we assume that the graph is connected, which is the case we are interested in, this requirement can be relaxed.
We prove Theorem 2 using several Lemmas and Theorems. Throughout the remainder of this paper we use the following setting:

Note that since the degree distribution is a power-law with an exponent larger than 2, in this case µ is finite. Our starting point is Theorem 1 [ACKM05]. Our first step is to approximate the following sum, which appears in Equation (1), when the degree distribution obeys the power-law of Definition 3: (2)

Lemma 1 Let µ, $a_k$, and γ be as in Definition 3. Then

Then for all i ≥ 1 we have that the ith derivative of g(t) is

Let us now evaluate g, and its derivatives, at the boundaries of [0, 1]:
• g(0) = 0,

We will approximate g(t) by an interpolation polynomial $f(t) = \sum_\ell b_\ell t^\ell$. Obviously, no polynomial has $f^{(i)}(1) = \infty$ for any i. Thus we will use g's values at t = 0, 1, and the derivatives at 0. The minimal-degree non-trivial polynomial we can use is a cubic $f(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3$. Since $g(0) = 0$ we get that $b_0 = 0$. $g'(0) = 0$ implies $b_1 = 0$, and $g''(0) = 0$ implies $b_2 = 0$. Since $g(1) = \frac{1}{\gamma - 2}$ we get that $b_3 = \frac{1}{\gamma - 2}$. Thus $f(t) = t^3/(\gamma - 2)$ and (3)

Notes:
• Lemma 1 approximates the sum of Equation (2) using a cubic polynomial. This is a somewhat arbitrary choice: one can use any polynomial of degree d ≥ 3 and obtain similar results. Using higher-degree polynomials yields a better-quality approximation around t = 0 since additional derivatives are approximated, but does not necessarily improve the accuracy of the approximation at t = 1 since we can only use g(1) itself. Since we mostly care about the early stages in the evolution, near Time t = 1, we only present the result for the special case of d = 3.
• At present we give no bound on the error of our polynomial approximation. Instead, to validate our results, in Section 4 we present a weaker result proven rigorously without using the approximation.
• No polynomial can give $g^{(i)}(1) = \infty$ for any i, so our approximation accuracy is fundamentally limited around t = 1. It would be interesting, and technically more difficult, to approximate the sum using rational functions or other functions with an asymptote at t = 1. We leave this to future work.
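As the notes say, no error bound is given for the cubic approximation, so it can help to inspect it numerically. The sketch below is only an illustration and rests on two assumptions: that the sum of Equation (2) is $\sum_k k a_k t^k$ with $a_k = C \cdot k^{-\gamma}$, and that truncating it after a fixed number of terms is acceptable. The function name, the truncation length, and the value γ = 2.5 are hypothetical choices. It prints the sum, normalized by its value at t = 1, next to $t^3$.

```python
def normalized_sum(t, gamma, terms=100_000):
    """Truncated sum_k k * k**(-gamma) * t**k, divided by its value at t = 1.

    Assumes a_k = C * k**(-gamma); the constant C cancels in the ratio.
    """
    total, total_at_one, power = 0.0, 0.0, 1.0
    for k in range(1, terms + 1):
        power *= t                    # power == t**k, underflows harmlessly to 0.0
        weight = k ** (1.0 - gamma)   # k * a_k / C
        total += weight * power
        total_at_one += weight
    return total / total_at_one

gamma = 2.5  # hypothetical exponent in the range 2 < gamma < 3
for t in (0.0, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"t={t:.2f}  sum/mu={normalized_sum(t, gamma):.4f}  t^3={t**3:.4f}")
```

The two columns agree at t = 0 and t = 1 by construction; the rows in between show how far the cubic is from the truncated sum for the chosen γ.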
In addition to being a building block in proving our main Theorem, the following Lemma 3 is important since it shows that most of the edges are detected early in the sampling process, near Time t = 1.
Recall that $P_{vis}(t)$ is the probability that the vertex discovered at Time t gives rise to another edge in the BFS tree, i.e., an edge other than the one through which it was detected.

Lemma 3 $P_{vis}(t) \approx t^3$
Proof: Recall that $a_k = C \cdot k^{-\gamma}$. Using Equation (1) and Lemma 1 we further approximate $P_{vis}$.
Let $w = t^2$. By substituting w in Equation (5) we get

Using Lemma 1 again we get that

Note that at Time t = 0 Lemma 3 gives $P_{vis} \approx 0$, which is as expected: at the end of the BFS process no new tree-edges are detected. Furthermore, at Time t = 1 we get $P_{vis} \approx 1$, again matching our intuition that at the beginning of the BFS process the edges detected are very likely to be tree edges. Moreover, observe that most edges are detected at the beginning of the BFS process.
Recall that in the BFS sampling process of [ACKM05] each copy of a vertex v is assigned a Time index t ∈ [0, 1] (Section 2.2).

Lemma 4 Let v be a vertex with graph degree i and let max-index(v) = t. Then
Proof: We follow the discussion in [ACKM05], and neglect the possibility of self-loops and parallel edges involving a vertex v and its siblings, and ignore the fact that we are choosing without replacement (i.e., that processing each copy slightly changes the number of undetected vertices and the number of unmatched copies). Under these assumptions, the events that each of v's siblings give rise to edges that will be detected in the tree are independent, and [ACKM05] shows that the number of visible edges is approximately binomially distributed as $\mathrm{Bin}(i - 1, P_{vis}(t))$. Therefore

Thus, using Lemma 3 it holds that

Theorem 3 Let v be a vertex with graph degree i. Then

Proof: Since t = max-index(v) is the maximum of i independent uniform variables in [0, 1], its probability density is $d(t^i)/dt = i t^{i-1}$. Therefore, using Lemma 4, we get

This completes the proof.
Note: For high-degree nodes $\frac{i}{i+3} \ge \frac{6}{7}$, so for high-degree nodes we have that

Theorem 4 Let $m = \frac{i(i-1)}{2(i+3)}$. Then for every node v it holds that

Proof: Recall that the events that each of v's copies give rise to a visible edge are approximately independent. Therefore we can use the Chernoff bound (cf. [MR90])

Then using Theorem 3, we get

Note: For high-degree nodes (i ≥ 18) we have that $\epsilon(i) \le 0.16$, and for i ≥ 32 we have that $\epsilon(i) \le 0.03$. Our main Theorem is now a corollary of Theorem 4.
Proof of Theorem 2: As a result of Theorem 4 we get that w.h.p. for high-degree nodes

Since for high-degree nodes $\frac{i(i-1)}{2(i+3)} \ge \frac{6}{7} \cdot \frac{i-1}{2}$, we have that w.h.p. for high-degree nodes 1 + 3
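The constants in this chain of results can be checked numerically. The sketch below is a hedged reconstruction, not the authors' computation. It assumes the expected number of visible edges is the integral of the binomial mean $(i-1)t^3$ (Lemma 3) against the density $i t^{i-1}$, which evaluates to $i(i-1)/(i+3)$. It also assumes the Chernoff bound is the standard lower-tail form $\exp(-\text{mean} \cdot \delta^2/2)$ with $\delta = 1/2$, so that the threshold is the m of Theorem 4. The function names are hypothetical.

```python
import math

def expected_visible_edges(i):
    """Closed form of the integral of (i - 1) * t**3 * i * t**(i - 1) over [0, 1]."""
    return i * (i - 1) / (i + 3)

def expected_visible_edges_numeric(i, steps=100_000):
    """Midpoint-rule check of the same integral."""
    h = 1.0 / steps
    total = 0.0
    for s in range(steps):
        t = (s + 0.5) * h
        total += (i - 1) * t**3 * i * t**(i - 1)
    return total * h

def chernoff_failure_bound(i, delta=0.5):
    """Assumed lower-tail Chernoff bound on falling below (1 - delta) times the mean."""
    return math.exp(-expected_visible_edges(i) * delta**2 / 2)

for i in (18, 32):
    print(i,
          round(expected_visible_edges(i), 3),
          round(expected_visible_edges_numeric(i), 3),
          round(chernoff_failure_bound(i), 3))
```

Under these assumptions the bound comes out at about 0.16 for i = 18 and about 0.03 for i = 32, matching the constants quoted in the note. This suggests that ǫ(i) is the probability that a node's visible degree falls below the threshold.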

A More Rigorous Analysis
Our analysis of the bias, and especially Lemma 1, used a somewhat cavalier polynomial approximation. In this section we give an alternative derivation of the conservation of the power-law tail behavior without relying on the polynomial approximation of the sum. We use a more rigorous approach, but we show a weaker result, one that validates the approximations up to multiplicative constants for large k.
Proof: All summands are positive, so the sum is larger than the first summand. Also the sum is increasing with t and equals µ for t = 1.
Proof: Recall that $a_k = C \cdot k^{-\gamma}$. Using Equation (1) and Lemma 5 we further approximate $P_{vis}$ as follows.

Theorem 5 For large enough k there exists $c_1 > 0$ such that

Proof: The upper value follows immediately from the fact that the visible degree is at most the graph degree. By Lemma 6 we have that $P_{vis}(t) \ge C^2/\mu^2$ for all 0 ≤ t ≤ 1. Therefore, for a random v it follows that

and hence,

By the Markov inequality this means that

Take $\alpha = \left(1 - \frac{C^2}{\mu^2} + \epsilon\right) \deg_G(v)$ for some constant $\epsilon$. The probability of a node v to have such a difference between its tree-degree and its graph-degree is at most some constant less than 1, and therefore, a constant fraction of the nodes have a degree proportional to the original degree. Therefore, the tail of the distribution has a power law with exponent at least γ.
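The Markov step can be spelled out as follows. This is a sketch under stated assumptions rather than the authors' exact display: it assumes the expected degree loss satisfies $\mathbb{E}[\deg_G(v) - \deg_T(v)] \le (1 - C^2/\mu^2)\deg_G(v)$, as the lower bound on $P_{vis}(t)$ suggests, and it reads the threshold as $\alpha = (1 - C^2/\mu^2 + \epsilon)\deg_G(v)$ with $\epsilon > 0$.

```latex
\Pr\bigl[\deg_G(v) - \deg_T(v) \ge \alpha\bigr]
  \le \frac{\mathbb{E}\bigl[\deg_G(v) - \deg_T(v)\bigr]}{\alpha}
  \le \frac{(1 - C^2/\mu^2)\,\deg_G(v)}{(1 - C^2/\mu^2 + \epsilon)\,\deg_G(v)}
  = \frac{1 - C^2/\mu^2}{1 - C^2/\mu^2 + \epsilon} < 1 .
```

The final ratio does not depend on $\deg_G(v)$, which is why the bound is a constant strictly below 1 for every node.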
Notice that for all γ′ < γ, there exists some large $K_*$ such that $C' K_*^{-\gamma'} > C K_*^{-\gamma}$. Therefore, the exponent of the power law cannot decrease throughout the entire degree sequence.
In fact, since the high-degree nodes are discovered almost surely at t ≈ 1, we expect to see $P_{vis} \approx 1$ for these nodes, and therefore, the behavior of the tail is almost unchanged. Giving an exact bound near t = 1 is deferred to future work.

Simulation Results
To further validate our analysis, we conducted a simulation study. We used the data collected by Shavitt and Shir [SS05] in the DIMES project as our underlying graph.
To test whether the choice of the BFS tree root has a noticeable effect on the resulting degree distribution, the graph vertices were split into the following 3 groups, based on their graph degree:
1. Low-degree nodes: $1 \le \deg_G(v) < 35$,
2. Medium-degree nodes: $36 \le \deg_G(v) < 70$,
3. High-degree nodes: $\deg_G(v) \ge 71$.
From each group we selected 10 nodes at random and constructed a BFS tree for each, where the selected node was the tree root. An average CCDF (complementary cumulative distribution function) of the degree distribution was then calculated for every group. We compared the resulting curves to the original connectivity data, collected by DIMES. Figure 1 shows the plotted CCDF curves for the three groups, and the curve for the raw DIMES data. The figure clearly shows the familiar power-law curves in all cases, and we can see that the curves are almost parallel, indicating a similar value of the power-law exponent γ. Table 1 contains the computed values of γ for each group. We can immediately see that the values of γ on the sampled trees (2.072-2.101) are very close to the true power-law exponent (2.126), thus validating our analysis that the bias is minor. Furthermore, Table 1 shows that the γ values for the three groups are all close to one another, with a minor decrease in value as the degree of the root grows. Thus, it seems that the value of the power-law exponent in the sampled tree is largely invariant to the degree of the tree root.
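The per-root measurement described above can be sketched as follows. This is an illustrative sketch, not the simulation code behind Figure 1 or Table 1: the toy adjacency list is hypothetical (it is not DIMES data), neighbors are visited in list order rather than in random order, and the CCDF is taken as the fraction of vertices with degree at least d.

```python
from collections import Counter, deque

def bfs_tree_degrees(adj, root):
    """Return {vertex: degree in a BFS tree rooted at `root`} for the root's component."""
    tree_deg = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in tree_deg:   # first visit: (u, v) becomes a tree edge
                tree_deg[v] = 1
                tree_deg[u] += 1
                queue.append(v)
    return tree_deg

def ccdf(degrees):
    """Return sorted (d, fraction of vertices with degree >= d) pairs."""
    n = len(degrees)
    counts = Counter(degrees)
    result, remaining = [], n
    for d in sorted(counts):
        result.append((d, remaining / n))
        remaining -= counts[d]
    return result

# Toy graph (hypothetical): a hub 0 joined to 1..4, plus the edges 1-2 and 3-4.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
tree = bfs_tree_degrees(adj, root=1)
print(tree)
print(ccdf(list(tree.values())))
```

In this toy example the hub has graph degree 4 but tree degree 3, because the edge 0-2 closes a cycle and is not a tree edge. Averaging such CCDFs over several roots per group would give curves comparable to those described above.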

Conclusions and Future Work
We have shown that if the underlying graph degree distribution obeys a power-law with an exponent γ > 2 (as is the case in the AS-graph) then w.h.p. the degree distribution of the high-degree vertices of the sampled graph also follows a power-law, with the same exponent value. Therefore, the bias observed in tree sampling regarding the degree distribution of the sampled graph is not significant under these conditions. Furthermore, since according to the non-tree-sampled data of [SS05] the AS-graph degree distribution does obey a power-law with an exponent γ between 2 and 3, we conclude that the bias observed in the degree distribution of the BGP data is negligible. Thus, the commonly held view of the Internet's topology as having a degree distribution of a power-law form with an exponent 2 < γ < 3 seems to be correct, and unlikely to be a by-product of the BGP data collection process.