Respondent driven sampling and sparse graph convergence

We consider a particular respondent-driven sampling procedure governed by a graphon. Using a specific clumping procedure on the sampled vertices, we construct a sequence of sparse graphs. If the sequence of vertex-sets is stationary, then the sequence of sparse graphs converges to the governing graphon in the cut-metric. The tools used are a concentration inequality for Markov chains and the Stein-Chen method.


INTRODUCTION
Respondent Driven Sampling (RDS), popularised by Heckathorn (1997), is a method to sample from hard-to-reach populations, such as drug users, men who have sex with men (MSM) and people with HIV, and it is routinely used in studies involving such populations. The sampling procedure is subject to various biases, one of which is a bias towards individuals with higher degrees, as these are more likely to appear in the sample.
How this bias affects the network as a whole has been described by Athreya and Röllin (2016) in the context of dense graph limits. The model considered there is defined in terms of a two-step procedure. First, vertices are sampled according to an ergodic process (the important point to note is that the vertices need not be sampled independently of each other). Second, edges between vertices are sampled independently of each other, where the probability of an edge is determined via a graphon representing the underlying network.
Dense graphs are at one extreme of graph sequences. These are graphs on n vertices with the number of edges being of order n^2, which is far more than what is observed in real-world networks. At the opposite end are sequences of graphs with bounded (average) degree, and consequently of order n edges. These have a separate limiting theory which is not quite applicable to many real-world networks. There is a class of graph sequences between these two extremes, called sparse graphs: these are graphs for which the average degree grows in the number of vertices, but only at sub-linear speed.
The purpose of this note is to extend the work of Athreya and Röllin (2016) to sparse graphs, and to consider more realistic models of sampling. Since RDS data typically comes in the form of trees, the actual graphs are those with average degrees remaining bounded as the number of nodes n grows. We propose a model where "close enough" participants are "clumped" together so that the average degree now grows in n. Our main result is that the random sparse graph sequence obtained through a specific respondent-driven sampling procedure converges almost surely to the graphon underlying the network in the cut-metric, provided the sequence of vertex-sets is stationary. The method of proof in this article is entirely different from that of Athreya and Röllin (2016). This is mainly due to the fact that, unlike in the dense case, subgraph counts no longer characterise graph convergence. We compare our random sparse graph sequence with an "expected" (deterministic) sparse graph via a concentration inequality. We then use the Stein-Chen method to compare this deterministic sparse graph to a sequence of graphs which are close to the graphon of the underlying network.
The rest of the article is organised as follows. In Section 2 we provide a brief introduction to sparse graph convergence. In Section 3 we describe our model and state our main result (Theorem 3.1). We present the proof of the main result in Section 4. We then conclude with some remarks on Respondent Driven Sampling and dense graph sequences in a final discussion section.
Acknowledgements: Adrian Röllin was supported by NUS Research Grant R-155-000-167-112. Siva Athreya was supported by a CPDA grant from the Indian Statistical Institute and an ISF-UGC project grant.

SPARSE GRAPH CONVERGENCE
This section is a very brief introduction to sparse graph convergence. The study of convergence of sparse graphs was initiated by Bollobás and Riordan (2009), and the L^p theory was then established in Borgs et al. (2014a) and Borgs et al. (2014b). We present the minimal amount of material necessary to formulate and prove our main result. We first define weighted graphs, followed by the definition of a graphon, and conclude with a brief discussion of a convergence result.
Weighted graphs. Consider a graph G, given by its set of vertices V(G) and set of edges E(G). An (edge-)weighted graph G is simply a graph which has, in addition, a weight function β(G) = (β_ij(G))_{i,j ∈ V(G)}, where, for each {i, j} ∈ E(G), we interpret the value β_ij(G) as the weight of that edge. By making the convention that β_ij(G) = 0 whenever there is no edge between vertices i and j, the information about E(G) is contained in β(G), so that any weighted graph is determined by V(G) and β(G). Moreover, any unweighted graph can be interpreted as a weighted graph by setting β_ij(G) = 1 whenever {i, j} ∈ E(G).
For any weighted graph G and any constant c ∈ ℝ, we define cG to be the weighted graph on the same set of vertices with edge weights β_ij(cG) = cβ_ij(G).
For any two graphons κ_1 and κ_2, we let

    d(κ_1, κ_2) = ‖κ_1 − κ_2‖_□,  where  ‖κ‖_□ = sup_{S,T ⊆ [0,1]} | ∫_{S×T} κ(x, y) dx dy |.

Since a Lebesgue-measure-preserving transformation of [0, 1] will not change the norm of a graphon, it is customary to define the cut-metric δ on graphons by

    δ(κ_1, κ_2) = inf_σ ‖κ_1 − κ_2^σ‖_□,

where the infimum ranges over all measure-preserving bijections σ : [0, 1] → [0, 1], and where the graphon κ^σ is defined as κ^σ(x, y) = κ(σ(x), σ(y)). Every weighted graph G is naturally associated with a graphon κ_G in the following way. First, divide the interval [0, 1] into intervals I_1, ..., I_{|V(G)|}, each of length 1/|V(G)|, one for each i ∈ V(G). The function κ_G is then given the constant value β_ij(G) on I_i × I_j for every i, j ∈ V(G). It is easily verified that κ_G is indeed a graphon.
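The correspondence G ↦ κ_G can be made concrete with a minimal sketch in Python; the function names here are ours, for illustration only:

```python
def step_graphon(beta):
    """Return the step-function graphon kappa_G associated with a weighted
    graph given by its symmetric weight matrix beta (a list of lists).

    The unit square is divided into |V(G)|^2 cells I_i x I_j of side
    1/|V(G)|, and kappa_G takes the constant value beta[i][j] on the cell
    I_i x I_j; the point 1 is assigned to the last interval, matching the
    convention used for A_{n,n} later in this article."""
    n = len(beta)

    def kappa_G(x, y):
        # locate the cells containing x and y
        i = min(int(n * x), n - 1)
        j = min(int(n * y), n - 1)
        return beta[i][j]

    return kappa_G

# A triangle on 3 vertices, viewed as a 0/1-weighted graph:
beta = [[0, 1, 1],
        [1, 0, 1],
        [1, 1, 0]]
kappa = step_graphon(beta)
```

For instance, kappa(0.1, 0.5) looks up the cell I_1 × I_2 and returns the weight of the edge {1, 2}.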
Thus, even if G and G′ have different sets of vertices, we can define their cut-distance through the cut-distance of their associated graphons; that is,

    δ(G, G′) = δ(κ_G, κ_{G′}).

If two weighted graphs G and G′ have the same set of vertices V(G), then it is clear that we can express their cut-distance as

    d(G, G′) = ‖κ_G − κ_{G′}‖_□.

Finally, if κ is a graphon and G is a weighted graph, then we will define

    d(G, κ) = d(κ_G, κ)  and  δ(G, κ) = δ(κ_G, κ).

Convergence to graphon. Let κ be a graphon with ‖κ‖_1 > 0. Let ρ_n > 0 satisfy ρ_n → 0 and nρ_n → ∞ as n → ∞. Let U_1, U_2, ... be independent random variables, uniformly distributed on [0, 1], and let the vertex set be [n] = {1, ..., n}. Define G_n ≡ G(n, κ, ρ_n) to be the graph obtained by connecting i and j with probability min{ρ_n κ(U_i, U_j), 1}. It is clear that G_n is a sparse graph sequence, and in Borgs et al. (2014a, Theorem 2.14 and Corollary 2.15) it is shown that, with probability 1,

    d(G_n/ρ_n, κ) → 0  and  δ(G_n/‖G_n‖_1, κ/‖κ‖_1) → 0  as n → ∞.

In this article we generalise the above result to the case where the vertex labels come from a Markov chain and the sparse graph is constructed after suitable clumping.
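The construction of G(n, κ, ρ_n) can be sketched as follows; the graphon κ(x, y) = x + y (for which ‖κ‖_1 = 1) and all helper names are illustrative choices, not part of the model:

```python
import random

def sample_sparse_graph(n, kappa, rho_n, rng):
    """Sample G(n, kappa, rho_n): each vertex i carries an i.i.d. uniform
    label U_i, and i ~ j independently with probability
    min(rho_n * kappa(U_i, U_j), 1)."""
    U = [rng.random() for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < min(rho_n * kappa(U[i], U[j]), 1.0):
                edges.add((i, j))
    return U, edges

rng = random.Random(0)
kappa = lambda x, y: x + y     # an illustrative graphon with ||kappa||_1 = 1
n = 200
rho_n = n ** -0.5              # rho_n -> 0 while n * rho_n -> infinity
U, edges = sample_sparse_graph(n, kappa, rho_n, rng)
avg_degree = 2 * len(edges) / n   # grows like n * rho_n, i.e. sublinearly
```

With this choice of ρ_n the expected average degree is of order sqrt(n), so the sequence is sparse in the sense described above.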

Constructing a random graph from RDS
We shall construct a sparse graph on the vertex set [n] driven by Respondent Driven Sampling (RDS). We will sample N individuals, labelled X_1, ..., X_N, where X_i ∈ [0, 1]. We note that the label space is chosen to be the unit interval only for the sake of mathematical convenience. After sampling, the individuals are clumped into n equally spaced bins, which we represent by the intervals A_{n,i} = [(i − 1)/n, i/n), where 1 ≤ i ≤ n (it is understood that A_{n,n} also includes the right-most point 1). We connect i and j if two successive individuals fall into bin A_{n,i} followed by bin A_{n,j}, or vice versa. We choose N in such a way that the graph constructed is sparse, and we establish an L^1 limit for the same. We begin with a precise definition of the sampling scheme via a Markov chain.
Markov Chain representing RDS. Let κ be a graphon. Let (Ω, F, ℙ) be a probability space, on which we define a Markov chain X = {X_k}_{k ≥ 0} with transition probabilities given by

    ℙ(X_{k+1} ∈ dy | X_k = x) = κ(x, y) dy / ∫_0^1 κ(x, z) dz.

Since κ is symmetric, the Markov chain is time-reversible with stationary distribution

    π(dx) = ( ∫_0^1 κ(x, y) dy / ‖κ‖_1 ) dx.

We shall assume that X_0 ∼ π, which means the chain is stationary. Then the probability of seeing a transition from dx to dy is given by

    π(dx) ℙ(X_1 ∈ dy | X_0 = x) = (κ(x, y)/‖κ‖_1) dx dy.

Sparse Random Graph from RDS. Let κ be a graphon. Let n ≥ 1 and N ≡ N(n).
• Let X_0, X_1, ..., X_N be a realisation of the stationary Markov chain defined in the previous section, up to time N.
• Equi-partition the unit interval into the intervals A_{n,1}, ..., A_{n,n}. For 1 ≤ i, j ≤ n with i ≠ j, define I_n(i, j) = 1 if there exists 0 ≤ m < N such that either X_m ∈ A_{n,i} and X_{m+1} ∈ A_{n,j}, or X_m ∈ A_{n,j} and X_{m+1} ∈ A_{n,i}, and I_n(i, j) = 0 otherwise.
• For 1 ≤ i, j ≤ n with i ≠ j, connect i and j if I_n(i, j) = 1, and leave them unconnected otherwise.
If we choose N(n) appropriately (i.e. N(n) = o(n^2)), then the above random graph will form a sparse random graph sequence.
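The whole construction can be sketched end-to-end for one concrete graphon; here κ(x, y) = x + y (for which ‖κ‖_1 = 1) is an illustrative choice, and all function names are ours. For this κ both the stationary law and the transition kernel admit exact inverse-CDF samplers, so the chain can be simulated without discretisation error:

```python
import math
import random

def sample_stationary(rng):
    # For kappa(x, y) = x + y, pi has density x + 1/2 on [0, 1]; solving
    # x^2/2 + x/2 = u gives the inverse CDF below.
    u = rng.random()
    return (math.sqrt(1.0 + 8.0 * u) - 1.0) / 2.0

def sample_transition(x, rng):
    # The kernel density y -> (x + y)/(x + 1/2) has CDF
    # (x*y + y^2/2)/(x + 1/2); solving for y gives the inverse CDF below.
    u = rng.random()
    return math.sqrt(x * x + u * (2.0 * x + 1.0)) - x

def rds_graph(n, N, rng):
    """Run the stationary chain for N steps, clump the states into the n
    bins A_{n,i} = [(i-1)/n, i/n), and connect bins i and j whenever two
    successive states fall into A_{n,i} and A_{n,j} (in either order)."""
    edges = set()
    x = sample_stationary(rng)
    for _ in range(N):
        y = sample_transition(x, rng)
        i = min(int(n * x), n - 1)
        j = min(int(n * y), n - 1)
        if i != j:                    # vertices are only joined for i != j
            edges.add((min(i, j), max(i, j)))
        x = y
    return edges

rng = random.Random(1)
n = 100
N = int(n ** 1.5)                     # N(n) = o(n^2), so the graph is sparse
edges = rds_graph(n, N, rng)
```

Each step of the chain contributes at most one edge, so the graph has at most N edges on n vertices; with N = n^{3/2} the average degree grows like sqrt(n).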

Main Result
Let κ be a given graphon, and consider the sparse graph sequence G n ≡ G(n, N, X, κ) defined as in the previous paragraph. We shall make the following assumptions on κ and N .
(3.3) We are now ready to state the main result.

PROOF OF THEOREM 3.1
To prove our result we will need to define two (deterministic, intermediate) weighted graphs. The first graph is an "averaged" version of G_n, which we shall denote by Ḡ_n; it is the weighted graph on the vertices [n] with edge weights β_ij(Ḡ_n) = ℙ(I_n(i, j) = 1). Denote by (n^2/N)Ḡ_n the weighted graph obtained by scaling the weights of Ḡ_n by n^2/N (as described in Section 2). The second graph, denoted by H_n, is the weighted graph on the vertices [n] with edge weights

    ∫_{A_{n,i}} ∫_{A_{n,j}} (κ(x, y)/‖κ‖_1) dx dy.    (4.1)

For x, y ∈ [0, 1], let i_n and j_n be such that x ∈ A_{n,i_n} and y ∈ A_{n,j_n} for all n ≥ 1, and write γ_n(i, j) = n^2 ∫_{A_{n,i}} ∫_{A_{n,j}} (κ(x, y)/‖κ‖_1) dx dy for the average of κ/‖κ‖_1 over the cell A_{n,i} × A_{n,j}. Observe that, by the Lebesgue density theorem, κ(x, y)/‖κ‖_1 = lim_{n→∞} γ_n(i_n, j_n) almost everywhere on [0, 1]^2.
Our strategy will be to show that, for large n, (n^2/N)G_n is close to (n^2/N)Ḡ_n, followed by the fact that (n^2/N)Ḡ_n is close to H_n, and finally that H_n is close to κ/‖κ‖_1. We start with the first lemma, which shows that the distance between (n^2/N)G_n and (n^2/N)Ḡ_n goes to 0 almost surely with respect to ℙ. The key ingredient of the proof is a concentration inequality of Paulin (2015).

Proof. As X is Harris recurrent by Assumption (K1), we obtain from Meyn and Tweedie (2009, Theorem 16.0.2) that the Markov chain has finite mixing time t_mix. Let ε > 0 be given. Now, changing one point of X will change f(X) by at most 2 edges; that is, f is 1/N-Hamming-Lipschitz. Therefore, Paulin (2015, Corollary 2.10) yields a Gaussian tail bound in which the constant C depends only on t_mix. Using the union bound, followed by (3.3) and Borel-Cantelli, the claim follows.
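In schematic form the concentration step reads as follows; the precise constants are those of Paulin (2015, Corollary 2.10), and the single constant C = C(t_mix) shown here is only indicative:

```latex
\mathbb{P}\bigl(\,\lvert f(X) - \mathbb{E} f(X)\rvert \ge \varepsilon \,\bigr)
  \;\le\; 2\exp\bigl(-C\,N\,\varepsilon^{2}\bigr),
\qquad C = C(t_{\mathrm{mix}}),
```

since f is 1/N-Hamming-Lipschitz, so that the sum of the squared Lipschitz coefficients over the N coordinates is of order 1/N.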
Our second lemma shows that the distance between (n^2/N)Ḡ_n and H_n goes to 0. The key ingredient of the proof is an application of the Stein-Chen method.

Proof. For 1 ≤ i, j ≤ n with i ≠ j, define E_n(i, j) = Σ_{m=1}^{N} I_m, where I_m = I[(X_{m−1}, X_m) ∈ A_{n,ij} ∪ A_{n,ji}] with A_{n,ij} = A_{n,i} × A_{n,j}, and note that

    E E_n(i, j) = 2N ∫_{A_{n,i}} ∫_{A_{n,j}} (κ(x, y)/‖κ‖_1) dx dy = (2N/n^2) µ_n(i, j).
where Z_n(i, j) ∼ Poisson((2N/n^2) µ_n(i, j)). Now, let E^s_n(i, j) be a random variable having the size-bias distribution of E_n(i, j). Then, the Stein-Chen method (see, for example, Barbour et al. (1992, Theorem 1)) bounds the total variation distance d_TV between the law of E_n(i, j) and that of Z_n(i, j) in terms of a coupling of E_n(i, j) and E^s_n(i, j). Note that E I_m = p(i, j) for all 1 ≤ m ≤ N; hence the I_m are identically distributed. Thus, we can use the standard way to construct the size-bias distribution (see, for example, Goldstein and Rinott (1996)). To this end, let M be a uniformly chosen index from 1 to N, independent of all else. It is not difficult to show that L(E_n(i, j) | I_M = 1) is the size-bias distribution of E_n(i, j). We now construct E^s_n(i, j) on the same probability space as E_n(i, j) in the following way. Consider M as given, and consider a process X′ = (X′_0, ..., X′_N) with law

    L(X′) = L(X | I_M = 1) = L(X | (X_{M−1}, X_M) ∈ A_{n,ij} ∪ A_{n,ji}).

Thus, the corresponding sum E^s_n(i, j) = Σ_{m=1}^{N} I′_m has the size-bias distribution of E_n(i, j). If I_M = 1, we can couple the two processes X and X′ perfectly. If I_M = 0, we couple the two processes as follows. Condition (3.2) implies that X is Harris recurrent. Thus, it is possible to couple X and X′ so that the two chains coalesce after a geometrically bounded number of steps, forward and backward in time. We can easily extend the processes X_m and X′_m so that I_m and I′_m are defined for all m ∈ ℤ. Now, let G_1 and G_2 be geometric random variables with success probability δ dominating the coupling time forward and backward in time from M and M − 1, respectively. Note that we can construct G_1 and G_2 such that (G_1, G_2) ⊥⊥ X and (G_1, G_2) ⊥⊥ X′ (note, however, that (G_1, G_2) is not independent of the pair (X, X′)). Applying the resulting bound to (4.5), we obtain, for each i and j, a bound on d_TV; in conjunction with (4.4), and interchanging summation with integration, we arrive at an estimate which, by (3.3), proves the claim.
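For reference, one standard size-bias formulation of the Stein-Chen bound (see, e.g., Barbour et al. (1992)) reads: if W ≥ 0 is integer-valued with E W = λ and W^s has the size-bias distribution of W, then

```latex
d_{\mathrm{TV}}\bigl(\mathscr{L}(W),\ \mathrm{Po}(\lambda)\bigr)
  \;\le\; \min(1, \lambda)\, \mathbb{E}\bigl\lvert\, W + 1 - W^{s}\,\bigr\rvert .
```

In the proof above this is applied with W = E_n(i, j) and λ = (2N/n^2) µ_n(i, j); the coupling of X and X′ is designed precisely to make the right-hand side small.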
Our third lemma shows that the distance between H_n and κ/‖κ‖_1 goes to 0. The proof is a basic exercise in real analysis.

Proof. To simplify writing, we introduce the notation κ̄ := κ/‖κ‖_1. Recall that H_n is the weighted graph on the vertices [n] with edge weights as in (4.1). Define the graphon κ̄_n by

    κ̄_n(x, y) = n^2 ∫_{A_{n,i}} ∫_{A_{n,j}} κ̄(u, v) du dv  for (x, y) ∈ A_{n,i} × A_{n,j}.

Let g_n be the graphon associated with the graph H_n. Now,

    d(H_n, κ̄) ≤ d_1(H_n, κ̄) = ‖g_n − κ̄‖_1 ≤ ‖g_n − κ̄_n‖_1 + ‖κ̄_n − κ̄‖_1.

Note that, by Taylor's approximation, 0 ≤ x − (1 − e^{−x}) ≤ x^2/2 for x ≥ 0, and hence for any x, y ≥ 0 we have |(1 − e^{−x}) − (1 − e^{−y})| ≤ |x − y|; these bounds, together with the Lebesgue density theorem, control the two terms above and complete the proof.
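The elementary bounds behind the Taylor step can be verified directly; both follow from the fact that the derivative of t ↦ e^{−t} is bounded by 1 in absolute value on [0, ∞):

```latex
\bigl|(1-e^{-x}) - (1-e^{-y})\bigr| = \bigl|e^{-y}-e^{-x}\bigr|
  \;\le\; \sup_{t\ge 0} e^{-t}\,\lvert x-y\rvert \;=\; \lvert x-y\rvert,
\qquad
0 \;\le\; x - (1-e^{-x}) \;=\; \int_0^x \bigl(1 - e^{-t}\bigr)\,dt
  \;\le\; \int_0^x t\,dt \;=\; \frac{x^2}{2}.
```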
We are now ready to prove the main result. It follows immediately from the triangle inequality and the above three lemmas.
Proof of Theorem 3.1. As indicated above, using the triangle inequality we have

    d((n^2/N)G_n, κ/‖κ‖_1) ≤ d((n^2/N)G_n, (n^2/N)Ḡ_n) + d((n^2/N)Ḡ_n, H_n) + d(H_n, κ/‖κ‖_1).

Applying Lemma 4.1, Lemma 4.2 and Lemma 4.3 completes the proof.

DISCUSSION
We conclude this note with some remarks on dense graph sequences and Respondent Driven Sampling.
Dense Graph Sequence. We have chosen 0 < α < 1 so as to ensure that the graph sequence is sparse. If α = 1, then we obtain a dense graph sequence. In this case as well, convergence in the cut-metric would hold, but to a "Poissonised" κ in the following sense. Define the graphon f_n as f_n(x, y) = λ^{-1}(1 − e^{−λκ_n(x,y)}). By Borgs et al. (2014b, Lemma 5.6), ‖κ_n − κ‖_1 → 0. Hence, using the above, this readily implies

    ‖f_n − λ^{-1}(1 − e^{−λκ})‖_1 → 0.    (5.2)

Indeed, for b ≥ 0 the map x ↦ e^{−bx} is Lipschitz with constant b on x ≥ 0, so for any x, y ≥ 0 we have |(1 − e^{−bx}) − (1 − e^{−by})| ≤ b|x − y|, and therefore |f_n − λ^{-1}(1 − e^{−λκ})| ≤ |κ_n − κ| pointwise. From this the result follows as in the proof of Theorem 3.1. We note that the Stein-Chen method plays a critical role in the proof of Lemma 4.2 when α = 1, as N/n^2 → λ > 0; that is, the mean of the Poisson random variable does not converge to 0, so that moment bounds would not suffice to prove Lemma 4.2.
Respondent Driven Sampling (RDS). One common approach in RDS to correct for the bias towards high degrees is to ask participants of the study to estimate their own degree and then weigh the participants by the inverse of their reported degree. This procedure is known as multiplicity sampling, and was first used in the context of RDS by Rothbart et al. (1982). What Theorem 3.1 implies, in essence, is that one could also clump participants together according to general characteristics (such as age, gender, etc.). If the degree of the participants is captured by these characteristics, the bias towards participants with high degrees would disappear.
It was argued by Heckathorn (2007) that multiplicity sampling cannot in general correct for the bias towards nodes with high degree due to possible differential recruitment, which means that some groups of participants are systematically able to recruit more people than others. Other methods of estimation, including the original estimators of Heckathorn (1997) as well as the clumping procedure proposed in this article, are equally susceptible to differential recruitment bias.
The mathematical reason behind this bias is that the stationary distribution of a one-referral Markov process on a set of types, which is the commonly used mathematical tool to derive RDS estimators, can be different from the stationary distribution of a multi-type branching process with the same transition probabilities if the average number of offspring depends on the types. This was described precisely by Athreya and Röllin (2016), where the two models, a one-referral Markov chain and a Poisson-offspring branching process, show substantially different over-sampling of high-degree vertices in the network. In the one-referral Markov chain case, the over-sampling is exactly proportional to the degree, but in the case of a Poisson number of referrals, it is proportional to a quantity that is harder to calculate (the eigenfunction of the mean replacement measure of the branching process). In practice, differential recruitment bias is typically reduced by limiting the number of referrals, traditionally to no more than three. Heckathorn (2007) also proposes a method, called estimation through dual-components, which is supposed to take differential recruitment into account. This is the default method used in the widely used statistical software RDSAT (see Volz et al. (2012)). The basic idea is to estimate the transition probabilities governing the referrals, calculate the proportion of different types one would expect to see under absence of both bias due to different degrees and bias due to differential recruitment, compare with the actual observed proportions, and then to work backwards to find the true proportions in the population. However, the theoretical justifications in Heckathorn (2007) for the details of the procedure are somewhat opaque.
Open Problems. We conclude the article with a couple of questions that can be explored.
(1) In Athreya and Röllin (2016) a rigorous framework was set up to handle convergence in dense graph limits. For dense graphs, the theory of graphons (whose range is [0, 1]) was used to establish convergence; graphons in the dense setting characterise the limit via convergence of subgraph counts, and this holds under several equivalent metrics. One should be able to use the RDS models of Athreya and Röllin (2016) and prove convergence in the L^1 metric, as in this article. The approach could be the one laid out in the proof of Borgs et al. (2014a, Theorem 2.14).
(2) As already mentioned before, in practice an RDS sample typically comes in the form of a tree, rather than a single chain, and hence a multi-type branching process, where the types could represent characteristics such as gender, age, etc., would constitute a more realistic mathematical model. The stationary distribution of such a branching process is difficult to solve analytically in general, but under additional assumptions, such as considering only finitely many types, a numerical approach would be feasible. In this light, it seems that a statistical theory based on branching process theory, rather than Markov chain theory, could put the framework of dual-components from Heckathorn (2007) onto solid ground, or even improve on it.