Asymptotic Seed Bias in Respondent-driven Sampling

Respondent-driven sampling (RDS) collects a sample of individuals in a networked population by incentivizing the sampled individuals to refer their contacts into the sample. This iterative process is initialized from some seed node(s). Sometimes, this selection creates a large amount of seed bias. Other times, the seed bias is small. This paper gains a deeper understanding of this bias by characterizing its effect on the limiting distribution of various RDS estimators. Using classical tools and results from multi-type branching processes (Kesten and Stigum, 1966), we show that the seed bias is negligible for the Generalized Least Squares (GLS) estimator and non-negligible for both the inverse probability weighted and Volz-Heckathorn (VH) estimators. In particular, we show that (i) above a critical threshold, the VH estimator converges to a non-trivial mixture distribution, where the mixture component depends on the seed node, and the mixture distribution is possibly multi-modal. Moreover, (ii) GLS converges to a Gaussian distribution independent of the seed node, under a certain condition on the Markov process. Numerical experiments with both simulated data and empirical social networks suggest that these results hold beyond the Markov conditions of the theorems.


Introduction
Network sampling techniques, including web crawling, snowball sampling, and respondent-driven sampling (RDS), contact individuals in hard-to-reach populations by following edges in a social network. This paper uses RDS as a motivating example (Heckathorn, 1997). It is used by the Centers for Disease Control (CDC) and the Joint United Nations Programme on HIV/AIDS (UNAIDS) to sample populations most at risk for HIV (injection drug users, sex workers, and men who have sex with men) (CDC, 2017; Johnston, 2013). In the most recent survey of the literature (White et al., 2015), RDS had been applied in over 460 different studies, in 69 different countries.
An RDS sample is initialized with one or more "seed individuals" selected by convenience from the population. These individuals participate in the survey and are incentivized to refer additional participants (often up to 3 or 5 participants) into the sample. This process iterates until the target sample size is reached or there are no further referrals. All participants are incentivized to take a survey and an HIV test. With this sample, we wish to estimate the proportion of individuals in the population that are HIV+.
The Markov model for the RDS process has provided fundamental insight into RDS sampling (Salganik and Heckathorn, 2004; Goel and Salganik, 2009; Rohe, forthcoming). For example, nodes with more connections are more likely to be sampled (Levin et al., 2009). This creates bias, and there are ways to adjust for it (Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008). While the inverse probability weighted (IPW) estimator requires a normalizing constant that is unknown in practice, the Volz-Heckathorn (VH) estimator provides a way to estimate this normalizing constant (Volz and Heckathorn, 2008). More recently, Rohe (forthcoming) studied the variability of the IPW estimators and showed that there are two regimes (low variance and high variance). Let λ_2 be the second eigenvalue of the Markov transition matrix on the social network and let m be the average number of referrals provided by each node. When m < λ_2^{-2}, the variance of the IPW estimator decays at rate n^{-1}, where n is the sample size. However, when m > λ_2^{-2}, the variance of IPW decays at a slower rate. Later, Li and Rohe (2017) showed that the VH and IPW estimators are asymptotically normal under the Markov model in the low variance regime. More recently, Roch and Rohe (forthcoming) proposed a generalized least squares (GLS) estimator for the high variance regime and showed that the variance of this estimator is O(n^{-1}), even when m > λ_2^{-2}. These previous results are summarized in Table 1. This paper studies the limit distribution of (i) the GLS estimator and (ii) the IPW estimator in the high variance regime. These results also allow for the Volz-Heckathorn adjustment.

Table 1: Summary of properties of IPW and GLS estimators. In the columns, m refers to the number of participants that the typical participant refers into the study and λ_2 is the second eigenvalue of the Markov transition matrix.

Result       | Estimator | Low variance, i.e. m < λ_2^{-2}           | High variance, i.e. m > λ_2^{-2}
Variance     | IPW       | O(n^{-1}) (Rohe, forthcoming)             | O(n^{2 log_m λ_2}) (Rohe, forthcoming)
Variance     | GLS       | O(n^{-1}) (Roch and Rohe, forthcoming)    | O(n^{-1}) (Roch and Rohe, forthcoming)
Distribution | IPW & VH  | Asymptotically normal (Li and Rohe, 2017) | Non-trivial mixture [Current paper]
Distribution | GLS       | Asymptotically normal [Current paper]     | Asymptotically normal [Current paper]

For technical reasons, our analysis of the GLS estimator is restricted to a special case of the Markov model that was first used to study RDS in Goel and Salganik (2009). These technical results make many unrealistic assumptions, which we discuss below. In particular, the Markov model allows for resampling of individuals. The results are asymptotic in the sample size, while the population size is fixed. This creates extensive resampling; in some cases, every node in the graph may need to be sampled multiple times before the asymptotics take hold. Nevertheless, this model provides fundamental insights into the properties of the estimators, and these properties continue to hold under more realistic simulation models in Sections 4 and 5.

A simple motivating example
Here we consider a model studied in Goel and Salganik (2009), which we refer to as the Blockmodel with 2 blocks. In this example, the population that we wish to sample is equally divided into two groups: HIV+ and HIV-. The seed participant is selected from one of the two groups with equal probability. Each participant refers an iid number of offspring, generated from some offspring distribution. With probability p, the referred participant matches the HIV status of the participant that referred them. With probability 1 − p, their statuses differ. Each referral is independent, conditional on the status of the referring participant. Using a sample generated in this way, we wish to estimate the proportion of the population that is HIV+ (in this case, the true proportion is 0.5). Figure 1 displays a motivating simulation from this Blockmodel with 2 blocks. Each sample consists of 1000 individuals, sampled from the Blockmodel with p = 0.95 and offspring distribution 1 + Binomial(2, 0.5). For each sample of 1000, we construct both the sample proportion (equivalent to the IPW estimator; see Section 2.3.3) and the GLS estimator. This process is repeated 10000 times. Figure 1 displays a kernel density estimate of the resulting distributions.
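A minimal sketch of one replicate of this referral process (variable names are ours, and this is not the authors' exact simulation code):

```python
import numpy as np

rng = np.random.default_rng(0)

def rds_sample_proportion(p, n_target, seed_state, rng):
    """Simulate one RDS sample from the Blockmodel with 2 blocks and return
    the sample proportion of HIV+ participants (the IPW estimator here,
    since the two blocks are of equal size)."""
    frontier = [seed_state]   # states awaiting referral: 0 = HIV-, 1 = HIV+
    states = []
    while frontier and len(states) < n_target:
        s = frontier.pop(0)
        states.append(s)
        n_offspring = 1 + rng.binomial(2, 0.5)   # offspring dist 1 + Bin(2, 1/2)
        for _ in range(n_offspring):
            # the referral matches the referrer's status with probability p
            child = s if rng.random() < p else 1 - s
            frontier.append(child)
    return float(np.mean(states[:n_target]))

# one replicate per seed type; with p = 0.95 the estimate is pulled toward
# the seed's block, which is what creates the two modes in Figure 1
est_pos = rds_sample_proportion(0.95, 1000, seed_state=1, rng=rng)
est_neg = rds_sample_proportion(0.95, 1000, seed_state=0, rng=rng)
```

Repeating this over many replicates and plotting kernel density estimates reproduces the qualitative shape of Figure 1's left panels.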
Many RDS papers discuss the "bias from seed selection". Section 3.1 shows that the IPW and VH estimators have a limit distribution and this limit distribution depends on where the process is initialized (i.e. the "seed" node). If the seed node is randomized (e.g. from the stationary distribution), then in simulations, the limit distribution of the IPW and VH estimators can have multiple modes, where each mode corresponds to a different set of initial conditions. The limit results for the IPW and VH estimators highlight how, conditioned on the seed node, the bias of these estimators decays at the same rate as the variance. So, unconditional on the seed node, this can create multiple modes in the limit distributions of the IPW and VH estimators. Similar to classical results in the reconstruction of evolutionary trees (Kesten and Stigum, 1966), the exact limit distribution does not appear to have a concise and easily interpretable closed form.
While the IPW and VH estimators are not asymptotically normal in the high variance regime, Section 3.2 shows that the GLS estimator is asymptotically normal in this regime and that this limit distribution does not depend on where the process is initialized. This pair of results provides additional insight into the notions of "bias" and "variance" for network sampling.

Figure 1: The model for this simulation is described in Section 1.1. The two left panels show the distribution of the sample proportion (i.e. the IPW estimator in this model). The two right panels show the distribution of the GLS estimator. Each panel in the top row has two curves corresponding to whether or not the seed node is HIV+. The solid line gives the distribution of the estimator when the process is initialized with an HIV+ node; the dashed line, when it is initialized with an HIV- node. In the bottom row, the seed participant is selected uniformly at random. This figure demonstrates how the limit distribution of the IPW estimator can have two modes corresponding to whether the seed is HIV+ or HIV-. Moreover, the figure shows that the GLS estimator is asymptotically normal and the dependence on the seed node is negligible.
In particular, the GLS estimator is the linear estimator with the smallest variance, and that measure of variance includes the variability that comes from selecting the seed node (i.e. from the stationary distribution of the Markov process). Because it includes this variability due to seed selection, it adjusts for the seed selection. Another way of saying this is that the GLS estimator reduces "the bias from seed selection". This blurring of the divide between "variance" and "bias from seed selection" highlights one potential problem of conditioning on the seed node in a bootstrap resampling procedure (Baraff et al., 2016); in the high variance regime, conditioning on the seed node removes a large source of variability in the VH estimator.

Background and notation
This section (i) defines the Markov model, (ii) illustrates how this model is particularly tractable when the underlying network is a Blockmodel (White et al., 1976), and (iii) defines the IPW, VH, and GLS estimators.

Markov model
The Markov model consists of (1) a social network represented as a graph, (2) a Markov transition matrix on the nodes of the graph, (3) a referral tree to index the Markov process on the graph, and finally, (4) a node feature defined for each node in the graph. Each of these is defined below. The results in this paper allow for an undirected and weighted graph. Let G = (V, E) be a graph with vertex set V = {1, . . . , N} containing the people and edge set E = {(i, j) : i, j ∈ V are connected} containing the friendships.
Let w_{ij} be the weight of the edge (i, j) ∈ E. For notational convenience, define w_{ij} = 0 if (i, j) ∉ E. If the graph is unweighted, define w_{ij} = 1 for all (i, j) ∈ E. Throughout this paper, the graph is undirected (i.e. w_{ij} = w_{ji} for all pairs (i, j)). Define the degree of node i as deg(i) = ∑_j w_{ij} and the volume of the graph as vol(G) = ∑_i deg(i). For simplicity, i ∈ G is used synonymously with i ∈ V.
Define the Markov transition matrix P ∈ R^{N×N} as

P_{ij} = w_{ij} / deg(i). (2.1)

Since G is undirected, P is a reversible Markov transition matrix with a stationary distribution π : G → R given by π_i = deg(i)/vol(G).
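In code, (2.1) and the stationary distribution follow directly from the edge weights; the small symmetric weight matrix below is an illustrative assumption:

```python
import numpy as np

# weighted, undirected toy graph on N = 4 nodes (symmetric weights)
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])

deg = W.sum(axis=1)        # deg(i) = sum_j w_ij
P = W / deg[:, None]       # P_ij = w_ij / deg(i), as in (2.1)
pi = deg / deg.sum()       # stationary distribution pi_i = deg(i) / vol(G)

assert np.allclose(P.sum(axis=1), 1.0)   # each row of P is a distribution
assert np.allclose(pi @ P, pi)           # pi is stationary: pi P = pi
# reversibility (detailed balance): pi_i P_ij = pi_j P_ji
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
```

The detailed-balance check holds because π_i P_{ij} = w_{ij}/vol(G) is symmetric in i and j.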
Assume that the nodes are sampled with a Markov process that is indexed by a rooted tree T (i.e. a connected graph with n nodes, no cycles, and a root vertex 0). T can be random (e.g. a Galton-Watson tree) or nonrandom (e.g. an m-tree). If T is randomly generated, then the Markov process is conditioned on the tree. In a special case, T can be the chain graph (0 − 1 − 2 − 3 − . . . ); this results in the model being a Markov chain. Just as a chain graph indexes a Markov chain, the graph T provides the indexing in this model. For simplicity, σ ∈ T is used synonymously with σ belonging to the vertex set of T. The seed participant is the root vertex 0 of T. For each non-root node σ ∈ T, denote by p(σ) ∈ T the parent of σ (i.e. the node one step closer to the root).
Let {X_σ^{(·)} : σ ∈ T} be a tree-indexed Markov process on the individuals from the social network G: for each non-root σ ∈ T,

P(X_σ = j | X_{p(σ)} = i, {X_τ : τ ∈ T \ D(σ)}) = P_{ij},

where D(σ) ⊂ T denotes the set of σ and all its descendants in T. The superscript (·) indicates the initial condition: if the superscript is i ∈ G, X_0 is initialized from i; if the superscript is a distribution ν : G → R (e.g. the stationary distribution π), X_0 is initialized from ν. When we do not want to specify the initial state, we leave off the superscript. Following Benjamini and Peres (1994), we call this process a (T, P)-walk on G.
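A (T, P)-walk can be simulated by drawing each node's state from the row of P indexed by its parent's state; a sketch with a hypothetical small referral tree:

```python
import numpy as np

def tp_walk(P, tree_children, root_state, rng):
    """Simulate a (T, P)-walk: each node sigma draws its state from row
    X_{p(sigma)} of P, conditionally independently of the rest of the tree.
    tree_children[sigma] lists the children of vertex sigma; 0 is the seed."""
    states = {0: root_state}
    stack = [0]
    while stack:
        sigma = stack.pop()
        for child in tree_children.get(sigma, []):
            states[child] = int(rng.choice(len(P), p=P[states[sigma]]))
            stack.append(child)
    return states

rng = np.random.default_rng(1)
P = np.array([[0.95, 0.05],
              [0.05, 0.95]])            # 2-state transition matrix
# a small referral tree: 0 refers 1 and 2; 1 refers 3 and 4
children = {0: [1, 2], 1: [3, 4]}
X = tp_walk(P, children, root_state=0, rng=rng)
```

With the chain graph as T, this reduces to simulating an ordinary Markov chain.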
For each node i ∈ G, let y(i) denote some characteristic of this node, for example whether i is HIV+ or HIV-. Sometimes we regard y as a vector in R^N, where N is the number of nodes in G. We want to estimate the population average µ_true = ∑_{i∈G} y(i)/N from the RDS sample {y(X_σ) : σ ∈ T}.

A special case: Blockmodel
Consider G as coming from a Blockmodel with k blocks (White et al., 1976). That is, each node i ∈ G is assigned to a block z(i) ∈ {1, . . . , k}, where each block contains N/k nodes. If z(i) = z(j), then w_{iℓ} = w_{jℓ} for all ℓ ∈ {1, . . . , N}. Further suppose that if z(i) = z(j), then y(i) = y(j). The Stochastic Blockmodel (Holland and Laskey, 1983) is derived from this model.
The idea behind a Blockmodel with k blocks is clear: people in the same block share the same feature and the same friendship patterns. Goel and Salganik (2009) studied RDS with this model. The motivating example in Section 1 also uses a Blockmodel with 2 blocks.
Let W ∈ R^{k×k} denote the weight matrix between blocks, where W_{z(i),z(j)} = w_{ij}. Define the Markov transition matrix between blocks P ∈ R^{k×k} from W by (2.1). Since W is symmetric, P is reversible. Let {Z_σ^{(·)} : σ ∈ T} denote a Markov process indexed by T, where the state space is the block labels {1, . . . , k} and the transition matrix is P. The superscript of Z_σ^{(·)} indicates the initial state of Z_0 and agrees with the initial state of X_0: if X_0 is initialized from i ∈ G, Z_0 is initialized from z(i) and the superscript is z(i); if X_0 is initialized from a distribution ν : G → R, Z_0 is initialized from the distribution µ : {1, . . . , k} → R with µ_j = ∑_{i∈G : z(i)=j} ν_i. Moreover, for any {σ_{i_1}, . . . , σ_{i_s}} ⊂ T and z_{i_1}, . . . , z_{i_s} ∈ {1, . . . , k}, the process {z(X_σ) : σ ∈ T} has the same finite-dimensional distributions as {Z_σ : σ ∈ T}. Instead of studying the Markov process {X_σ^{(·)} : σ ∈ T} of Section 2.1, it is therefore sufficient to study the Markov process {Z_σ^{(·)} : σ ∈ T}. This time the node feature is replaced by the group feature y ∈ R^k and the Markov transition matrix is replaced by the Markov transition matrix between blocks P ∈ R^{k×k}.
The Blockmodel is a special case of the Markov model in Section 2.1.

Estimators
Denote E_π(y) = ∑_i π_i y(i). The theoretical results in this paper study two estimators, defined in Sections 2.3.1 and 2.3.2. They are unbiased estimators of E_π(y). With the inverse probability weighting of Section 2.3.3, these estimators become unbiased estimators of µ_true. The VH adjustment provides a way to estimate the inverse probability weights.

Sample average
The RDS sample average is

μ̂ = (1/n) ∑_{σ∈T} y(X_σ).

When X_0 is initialized from π, μ̂^{(π)} is an unbiased estimator of E_π(y). When X_0 is initialized from i ∈ G, μ̂^{(i)} is an asymptotically unbiased estimator of E_π(y) (see Lemma C.1).

GLS estimator
Roch and Rohe (forthcoming) proposed generalized least squares (GLS) in RDS to reduce the variance, particularly in the high variance regime. The GLS estimator is the weighted average μ̂_GLS = ∑_{σ∈T} w*_σ y(X_σ), where w* minimizes the variance of the weighted average initialized from π:

w* = argmin_w var(∑_{σ∈T} w_σ y(X_σ^{(π)})) s.t. ∑_{σ∈T} w_σ = 1. (2.5)
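In standard GLS theory for estimating a common mean, the constrained minimizer above is w* = Σ^{-1}1/(1^T Σ^{-1}1), where Σ is the covariance matrix of the sampled responses. A minimal numerical sketch; the tree-distance covariance Σ_{στ} = λ_2^{d(σ,τ)} on a chain is an illustrative assumption, not the paper's exact Σ:

```python
import numpy as np

# toy covariance decaying in referral-tree distance on a chain 0-1-2-3
lam2 = 0.8
d = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
Sigma = lam2 ** d

ones = np.ones(4)
w = np.linalg.solve(Sigma, ones)
w = w / w.sum()              # GLS weights w* = Sigma^{-1} 1 / (1^T Sigma^{-1} 1)

u = ones / 4                 # uniform weights (the plain sample average)
var_gls = w @ Sigma @ w      # variance of the GLS-weighted average
var_avg = u @ Sigma @ u      # variance of the unweighted average
```

The comparison `var_gls <= var_avg` is exactly the defining property of w*: among all weights summing to 1, the GLS weights minimize the variance of the weighted average.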

Inverse probability weighting
In general, µ_true ≠ E_π(y), so μ̂ and μ̂_GLS are biased estimators of µ_true. Inverse probability weighting can adjust for this bias. Define y_π(i) = y(i)/(N π_i). The IPW estimator and the GLS estimator with IPW adjustment are the sample average and the GLS estimator of the y_π(X_σ)'s. When X_0 is initialized from the stationary distribution π, they are unbiased estimates of µ_true. However, computing these two estimators requires the average node degree vol(G)/N, which is typically not available in practice. The popular VH estimator replaces vol(G)/N in the IPW estimator with the harmonic mean of the degrees of the RDS samples (Volz and Heckathorn, 2008). Define Ĥ = n / ∑_{σ∈T} deg(X_σ)^{-1} and ŷ_π(i) = Ĥ y(i)/deg(i). The VH estimator is the sample average of the ŷ_π(X_σ)'s. The GLS estimator with VH adjustment uses a similar reweighting, but replaces vol(G)/N with a GLS estimate of E_π(1/deg(i)) (Roch and Rohe, forthcoming). The VH estimator and the GLS estimator with VH adjustment are two asymptotically unbiased estimators of µ_true under the (T, P)-walk on G. Theorems 3.1 and 3.2 study the limit distributions of the sample average and the GLS estimator. By a simple transformation (defining a new node function y_π(i) = (N π_i)^{-1} y(i)), these results can also be applied to the IPW estimator and the GLS estimator with IPW adjustment. Corollaries 3.2 and 3.3 extend these results to the VH estimator and the GLS estimator with VH adjustment.
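The reweighting above can be sketched as follows; the function names are ours, and the inputs are simply the sampled node features and degrees:

```python
import numpy as np

def ipw_estimate(y, degrees, N, vol_G):
    """IPW estimator: sample average of y_pi(X) = y(X)/(N * pi_X),
    with pi_i = deg(i)/vol(G). Requires the usually-unknown vol(G)/N."""
    pi = degrees / vol_G
    return float(np.mean(y / (N * pi)))

def vh_estimate(y, degrees):
    """VH estimator: vol(G)/N is replaced by the harmonic mean of the
    sampled degrees, n / sum(1/deg); algebraically this reduces to a
    ratio of degree-weighted sums."""
    return float(np.sum(y / degrees) / np.sum(1.0 / degrees))

# toy sample: three recruits with binary features y and degrees deg
y = np.array([1.0, 0.0, 1.0])
deg = np.array([2.0, 4.0, 4.0])
vh = vh_estimate(y, deg)   # (1/2 + 0 + 1/4) / (1/2 + 1/4 + 1/4) = 0.75
```

Note that the VH estimator upweights low-degree recruits, since they are sampled with probability proportional to their degree.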

Additional notation
For two sequences a_n and b_n, define the following two notations: (i) a_n = O(b_n) if and only if |a_n| is bounded above by b_n (up to a constant factor) asymptotically, i.e. ∃k > 0, ∃n_0, ∀n > n_0, |a_n| ≤ k b_n. (ii) a_n = Θ(b_n) if and only if a_n is bounded both above and below by b_n (up to constant factors) asymptotically, i.e. ∃k_1 > 0, ∃k_2 > 0, ∃n_0, ∀n > n_0, k_1 b_n ≤ a_n ≤ k_2 b_n.

Main results
This section shows that, after proper scaling, the GLS estimator and the sample average both have a limit distribution. For the GLS estimator, the limit distribution is a normal distribution. For the sample average, the limit distribution is a non-trivial mixture distribution, where the mixture component is determined by the seed node. This mixture distribution can be multi-modal, as illustrated in Figure 1. These results can be further extended to the GLS estimator with VH adjustment and the VH estimator, respectively.
The following lemma from Levin et al. (2009) provides the eigendecomposition of the Markov transition matrix P .
Lemma 3.1. (Lemma 12.2 in Levin et al. (2009)) Let P be a reversible Markov transition matrix on the nodes in G with respect to the stationary distribution π. The eigenvectors of P, denoted as f_1, . . . , f_N, are real valued functions of the nodes i ∈ G and orthonormal with respect to the inner product

⟨f, g⟩_π = ∑_{i∈G} π_i f(i) g(i).

If λ is an eigenvalue of P, then |λ| ≤ 1. The eigenfunction f_1 corresponding to the eigenvalue 1 can be taken to be the constant vector 1.
Assume that the eigenvalues of P are ordered as λ_1 ≥ λ_2 ≥ · · · ≥ λ_N. Because P is a Markov matrix, the largest eigenvalue is λ_1 = 1. Let f_j be the eigenvector corresponding to λ_j, normalized as in Lemma 3.1. The eigenvector f_1 corresponding to λ_1 is taken to be the constant vector 1. Expanding the node feature y ∈ R^N in this eigenbasis yields

y = ∑_{j=1}^N ⟨y, f_j⟩_π f_j. (3.2)
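Lemma 3.1 and the expansion (3.2) can be checked numerically by symmetrizing P; a sketch, where the toy weight matrix W is an illustrative assumption:

```python
import numpy as np

# reversible P built from a symmetric weight matrix via (2.1)
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
deg = W.sum(axis=1)
P = W / deg[:, None]
pi = deg / deg.sum()

# D^{1/2} P D^{-1/2} (D = diag(pi)) is symmetric and shares P's eigenvalues
d = np.sqrt(pi)
S = (d[:, None] * P) / d[None, :]
lam, U = np.linalg.eigh(S)
F = U / d[:, None]       # columns f_j: eigenvectors of P, pi-orthonormal

gram = F.T @ (pi[:, None] * F)   # entries <f_j, f_k>_pi, should be identity
y = np.array([1.0, 0.0, 2.0])
coef = F.T @ (pi * y)            # coefficients <y, f_j>_pi in (3.2)
```

Here `F @ coef` recovers y exactly, which is the expansion (3.2), and `P @ F` equals `F * lam` columnwise, confirming the eigenpairs.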

Results for sample average, IPW and VH estimator
This section shows that the sample mean, IPW and VH estimators have a limit distribution and this limit distribution depends on where the process is initialized (i.e. the "seed" node).
For each node σ ∈ T, let |σ| be the distance of σ from the root 0. Define {X_σ : σ ∈ T, |σ| = t} as the individuals in the t-th generation of the sample. Denote by Z_{t,j} the number of individuals in the t-th generation that equal j ∈ G, and define Z_t = (Z_{t,1}, . . . , Z_{t,N}). Denote the sample average up to generation t as μ̂_t. Superscripts on μ̂ will denote how X_0 is initialized. Let ξ be a generic draw from the offspring distribution of T; then m = Eξ.
Theorem 3.1 studies the limit distribution of the sample average μ̂_t^{(i)}.

Theorem 3.1. Assume that the eigenvalues of the transition matrix P satisfy

1 = λ_1 > λ_2 > |λ_3| ≥ · · · ≥ |λ_N|, (3.3)

that T is an m-tree, and that m > λ_2^{-2}. Then there exists a random variable X^{(i)} ∈ L^2 such that

λ_2^{-t} [μ̂_t^{(i)} − E_π(y)] → X^{(i)} (3.4)

almost surely and in L^2 as t → ∞, where the expectation of the limit satisfies

E X^{(i)} = c ⟨y, f_2⟩_π f_2(i) (3.5)

for a constant c > 0 depending only on m and λ_2. Moreover, if ⟨y, f_2⟩_π ≠ 0, then var(X^{(i)}) > 0 for any i = 1, . . . , N.
Note that the result is based on the technical condition that T is an m-tree. The simulations in Section 4 show that the result still holds when T is a Galton-Watson tree. Condition (3.3) in Theorem 3.1 can be weakened to 1 = λ_1 > λ_2 = · · · = λ_k > |λ_{k+1}| ≥ · · · ≥ |λ_N|, but the statement of the conclusion becomes more involved. See Appendix B for more details.
Using the above result, we can study how the bias and variance of the sample average decay, conditioned on the seed node.
Corollary 3.1. Assume the conditions in Theorem 3.1 hold. When ⟨y, f_2⟩_π ≠ 0 and f_2(i) ≠ 0, the bias of μ̂_t^{(i)} decays like Θ(λ_2^t). When ⟨y, f_2⟩_π ≠ 0, the variance of μ̂_t^{(i)} decays like Θ(λ_2^{2t}).

When X_0 is initialized from π, μ̂_t^{(π)} is an unbiased estimator of E_π(y). By (3.5), for i, j such that f_2(i) ≠ f_2(j), the limit distributions of λ_2^{-t}[μ̂_t^{(i)} − E_π(y)] and λ_2^{-t}[μ̂_t^{(j)} − E_π(y)] are different because X^{(i)} and X^{(j)} have different expectations. Thus the limit distribution of λ_2^{-t}[μ̂_t^{(π)} − E_π(y)] is a non-trivial mixture. The motivating example in the introduction illustrates this mixture. It is further explored with simulation in Section 4.
Theorem 3.1 studies the limit distribution of the sample average. Using the transformation discussed in Section 2.3.3, the result also applies to the IPW estimator. Denote the VH estimator up to generation t as μ̂_{VH,t}. The following corollary extends the result to the VH estimator.
Corollary 3.2. Under the conditions in Theorem 3.1, for each i there exists a random variable X̄^{(i)} such that λ_2^{-t}[μ̂_{VH,t}^{(i)} − µ_true] → X̄^{(i)} almost surely. Similarly, when X_0 is initialized from π, the limit distribution of μ̂_{VH,t}^{(π)} is a non-trivial mixture of the limit distributions of the μ̂_{VH,t}^{(i)}.

Results for GLS estimator
For the GLS estimator, the two right panels of Figure 1 suggest that the estimator is not sensitive to the initial distribution of X_0. This section shows that the GLS estimator is asymptotically normal and that this limiting distribution does not depend on the initial distribution of X_0.
Given the referral tree T, define the covariance matrix Σ ∈ R^{n×n} by Σ_{στ} = cov(y(X_σ^{(π)}), y(X_τ^{(π)})) for any σ, τ ∈ T, where n is the number of nodes in T. Then w* in (2.5) is given by w* = Σ^{-1} 1 / (1^T Σ^{-1} 1). For the Blockmodel with 2 blocks, the GLS estimator has a closed-form expression (3.8) (Roch and Rohe, forthcoming), where λ_2 is the second eigenvalue of the Markov transition matrix between blocks and deg(σ) is the degree of σ ∈ T. Let μ̂_{GLS,t} be the GLS estimator of the RDS samples up to generation t. Based on (3.8), the following theorem shows the asymptotic normality of the GLS estimator.
Theorem 3.2 shows that the GLS estimator is asymptotically normal in both the low and high variance regimes. Note that the result is based on (3.8) and the technical condition that T is an m-tree. The simulations in Section 4 suggest that the asymptotic normality of the GLS estimator still holds when T is a Galton-Watson tree, or when the model is no longer a Blockmodel with 2 blocks.
Theorem 3.2 studies the limit distribution of the GLS estimator. Using the transformation discussed in Section 2.3.3, the result also applies to the GLS estimator with IPW adjustment. Denote the GLS estimator with VH adjustment of the RDS samples up to generation t as μ̂_{GLS,VH,t}^{(·)}. The following corollary extends the result to the GLS estimator with VH adjustment.
Corollary 3.3. Under the conditions in Theorem 3.2, for any initial distribution ν of X_0, the GLS estimator with VH adjustment is asymptotically normal, where y′(i) = deg(i)^{-1} and y″(i) = y(i)/deg(i).

Simulation studies
In this section, data are simulated from a Blockmodel with 2 or 3 blocks. As stated in Section 2.2, a Blockmodel with k blocks consists of a reversible transition matrix P ∈ R^{k×k} between blocks, a block feature y ∈ R^k, and a referral tree T. In this specification, the block feature y is assumed to be centered, so that E_π(y) = 0. For a Blockmodel with 2 blocks, let

P = [ p      1 − p ]
    [ 1 − q    q   ]

denote the transition matrix between the 2 blocks. The second eigenvalue of P is λ_2 = p + q − 1.
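The eigenvalue formula λ_2 = p + q − 1 and the regime classification can be checked numerically; m = 2 is the mean of the offspring distribution 1 + Binomial(2, 1/2) used in the experiments:

```python
import numpy as np

p, q = 0.95, 0.85
P = np.array([[p, 1.0 - p],
              [1.0 - q, q]])
eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]   # descending

m = 2                      # E[1 + Binomial(2, 1/2)]
lam2 = p + q - 1.0         # second eigenvalue of P
high_variance = m > lam2 ** -2   # True here: 2 > 0.8^{-2} = 1.5625
```

With (p, q) = (0.8, 0.7), λ_2 = 0.5 and λ_2^{-2} = 4 > m, placing that model in the low variance regime instead.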
In the simulation settings below, the block feature is given prior to centering. In fact, all of the 2-Blockmodels use y = (1, 0)^T and the 3-Blockmodels use y = (0, 1, 2)^T. All of the experiments are based on 5000 simulated datasets.

Sample average
Here we consider the behavior of the sample average μ̂_t in the high variance regime m > λ_2^{-2}. In this setting, the asymptotic distribution of λ_2^{-t} μ̂_t^{(π)} is no longer normal, as it is in the low variance regime. Instead, its asymptotic distribution is a mixture of the limit distributions corresponding to the two possible seed blocks. The simulation is performed on two different Blockmodels with 2 blocks. We consider a balanced model with p = q = 0.95 and an unbalanced model with p = 0.95 and q = 0.85. For both models, T is a Galton-Watson tree with offspring distribution 1 + Binomial(2, 1/2). Under these settings, m > λ_2^{-2} for both models. Figure 2 displays the results of the experiment with t = 50.

GLS estimator
Here we consider the behavior of the GLS estimator in both the low and high variance regimes. The first experiment corroborates the result of Theorem 3.2, namely that the GLS estimator is asymptotically normal in both variance regimes. The simulation is performed on two different Blockmodels with 2 blocks. In the first model (p, q) = (0.95, 0.85); in the second model (p, q) = (0.8, 0.7). For both models, T is a 2-tree. Under these settings, m > λ_2^{-2} for the first model and m < λ_2^{-2} for the second model. The two quantile-quantile plots in Figure 3 correspond to the two models. It is clear that the distribution of the GLS estimator gets closer to the normal distribution as the sample size increases.

Figure 2: Kernel density estimates of λ_2^{-t} μ̂_t for the balanced (left panels) and unbalanced (right panels) Blockmodel with 2 blocks over 5000 replicates. For each scenario, the top panel corresponds to the case where X_0 is initialized from group 1 (solid curve) or group 2 (dashed curve); the lower panel corresponds to the case where X_0 is initialized from the stationary distribution.
The second experiment suggests that the asymptotic normality of the GLS estimator extends beyond the conditions in Theorem 3.2. We consider a two block model with (p, q) = (0.8, 0.7) and a three block model with a fixed reversible transition matrix between the blocks. For both models, T is a Galton-Watson tree with offspring distribution 1 + Binomial(2, 1/2). Results for this experiment are displayed in Figure 4.

Analysis of Adolescent Health Data
In this section, we consider numerical experiments where the RDS samples are simulated without replacement from empirically derived social networks. Specifically, we use social networks collected in the National Longitudinal Study of Adolescent Health (Add Health). In the 1994-95 school year, the Add Health study collected a nationally representative sample of adolescents in grades seven through twelve. The sample covers 84 pairs of middle and high schools in which students nominated up to five male and five female friends in their middle or high school network (Harris, 2011).
In this analysis, we consider 25 networks with at least 1000 nodes. All contacts are symmetrized and all graphs are restricted to the largest connected component. The RDS sampling process is initialized from a seed node which is selected with probability proportional to node degree (i.e. the stationary distribution). Then, each participant recruits ξ ∼ 1 + Binomial(2, 1/2) participants uniformly at random from their contacts who have not yet been recruited. If the participant has fewer than ξ contacts eligible to recruit, then the participant recruits all of their eligible contacts. The RDS process stops when there are 500 participants. If the process terminates before collecting 500 participants, then the process is restarted. For each network, we collect 500 different RDS samples. We generate 2000 such simulated data sets.
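A sketch of this without-replacement sampling design; the toy network and function name are ours, and the restart logic is omitted:

```python
import numpy as np

def rds_without_replacement(adj, n_target, rng):
    """Without-replacement RDS: the seed is drawn proportionally to degree,
    and each participant refers xi ~ 1 + Binomial(2, 1/2) not-yet-recruited
    contacts chosen uniformly at random (all eligible contacts if fewer
    than xi remain). `adj` maps each node to a list of its contacts."""
    nodes = sorted(adj)
    degrees = np.array([len(adj[v]) for v in nodes], dtype=float)
    seed = int(rng.choice(nodes, p=degrees / degrees.sum()))  # stationary start
    recruited, queue = {seed}, [seed]
    while queue and len(recruited) < n_target:
        v = queue.pop(0)
        eligible = [u for u in adj[v] if u not in recruited]
        if not eligible:
            continue
        k = min(1 + rng.binomial(2, 0.5), len(eligible))
        for u in rng.choice(eligible, size=k, replace=False):
            recruited.add(int(u))
            queue.append(int(u))
    return recruited

rng = np.random.default_rng(2)
# toy network (a 6-cycle with one chord), for illustration only
adj = {0: [1, 5, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3, 5], 5: [4, 0]}
sample = rds_without_replacement(adj, n_target=4, rng=rng)
```

Unlike the Markov model of Section 2.1, this process never resamples an individual, which is what makes it a more realistic check on the theory.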
We use school-status as the binary node feature and focus on estimating the proportion of the population in high school. We construct a sample average, a GLS estimator and a SBM-fGLS estimator for the proportion of students in high school. The GLS estimator requires an estimate of the covariance matrix Σ, which can be calculated from the Markov transition matrix of the network (typically not available in practice) and equation (6) in Rohe (forthcoming). The SBM-fGLS estimator proposed in Roch and Rohe (forthcoming) estimates Σ using the RDS samples.
Consider a measure of the network bottleneck. Let A ∈ R^{N×N} denote the adjacency matrix of the network. Define the diagonal matrix D ∈ R^{N×N} with D_ii = deg(i) and the matrix L = D^{-1/2} A D^{-1/2} ∈ R^{N×N}, and let λ̂ = ỹ^T L ỹ, where ỹ is the standardized form of the node feature y, so that ∑_{i=1}^N ỹ_i = 0 and ‖ỹ‖_2 = 1. λ̂ provides a measure of the network bottleneck along y; as long as the second eigenvalue λ_2 is not too close to 1, this quantity will not be close to 1. Table 2 displays the λ̂ of the 25 networks.
Kernel density estimates of the VH estimator and quantile-quantile plots of the GLS and SBM-fGLS estimators with VH adjustment are displayed for the 25 networks, over 2000 replicates. The 25 subplots are in order of descending λ̂ (network IDs 75, 42, 15, 28, 39, 40, 41, 50, 34, 45, 48, 36, 43, 61, 54, 59, 73, 44, 68, 60, 58, 84, 57, 49). The red solid line is x = µ_true. It is clear that the VH estimator has two modes, so these networks are all beyond the critical threshold. Except for networks with an extremely strong bottleneck (i.e. with large λ̂), the GLS estimators with VH adjustment are approximately normally distributed. For some networks, the distribution of the SBM-fGLS estimator with VH adjustment is not close enough to the normal distribution, which suggests that our results for the GLS estimator do not fully carry over to its feasible version on those networks. When the bottleneck of the network is not too strong, both estimators have only one mode.

Discussion
We prove the existence of the limit distribution of the IPW estimator and show that this limit distribution depends on the seed node; thus the limit distribution is a non-trivial mixture distribution when the seed is randomized. This result also shows that the "seed bias" of IPW is non-negligible. We also prove the asymptotic normality of the GLS estimator under a certain condition and show that this normal distribution does not depend on the seed node. This implies that the "seed bias" of GLS is negligible. Both results allow for the VH adjustment. The studies on empirical social networks as well as the simulated data illustrate that these theoretical results appear to hold beyond the technical conditions given in the theorems.
Inductive step: We prove that if P(n − 1) holds for some unspecified value of n ≥ 2, then P(n) also holds. Assume σ_n is a leaf node (i.e. σ_n has no descendants) and σ_{n−1} is the parent of σ_n. Then T \ {σ_n} is a referral tree with n − 1 vertices. By the Markov property, Additionally, the induction hypothesis that P(n − 1) holds gives The above two equations give (A.1), thereby showing that P(n) is true.
Since both the base case and the inductive step have been performed, by mathematical induction the statement P (n) holds for all n ∈ Z + .
Finally we prove (2.2) based on the above result. Assume T has n vertices. For any

B Proof of Theorem 3.1, Corollary 3.1 and 3.2
When T is a Galton-Watson tree and X_0 is initialized from i ∈ G, the generation-count process {Z_t^{(i)} : t ≥ 0} is a multitype Galton-Watson process. This serves as the key point in our proof. The following notions and conclusions about multitype Galton-Watson processes are from Athreya and Ney (2004) and Harris (2002).
Lemma B.2. Let ξ be the right eigenvector of M, and let λ be the corresponding eigenvalue. Then Y is a (complex valued) martingale adapted to F_t = σ(Z_l^{(i)} : 1 ≤ l ≤ t). For the Markov model, M = mP, so f_j is the eigenvector of M corresponding to the eigenvalue mλ_j. According to Lemma 3.1, all λ_j and f_j are real. For j = 1, . . . , N, define Corollary B.1. For any j = 1, . . . , N, The next theorem is the martingale L^p convergence theorem (see e.g. Durrett (2010)).
Theorem B.1. If X_n is a martingale with sup_n E|X_n|^p < ∞ for some p > 1, then X_n → X almost surely and in L^p.
It is essential to derive the variance of Z_t when mλ_j^2 < 1.

(B.4)
Proof of Theorem B.2. From (B.1), (B.2) and the fact that C Since f j is the eigenvector of M corresponding to the eigenvalue mλ j , for every n ≥ 1, where c = max{f T j V k f j : 1 ≤ j, k ≤ N }. This gives (B.4).
Corollary B.2. When m > λ_2^{-2}, there exists a random variable Y such that the martingale converges to Y almost surely and in L^2.
Proof of Corollary B.2. By Theorem B.1, we only need to show sup E(Y Since m > λ −2 2 , by Theorem B.2, var( Z t,j → 0 almost surely and in L 2 .
Proof of Corollary B.3. By Theorem B.2, To prove almost sure convergence, let δ = max{(λ −1 2 λ j ) 2 , mλ −2 2 } ∈ (0, 1). There exists C > 0 such that By the Borel-Cantelli lemma, (λ −1 2 λ j ) t Y (i) t,j → 0 almost surely. Let W t denote the summation of the t-th generation RDS samples, and let S t = t j=0 W j denote the summation up to generation t. Define n t as the number of nodes in T between 0 and generation t, n t = |{σ ∈ T, |σ| ≤ t}|. So the sample average up to generation t isμ t = S t /n t . Superscripts on Z, S and W will denote how X 0 is initialized.
Proof of Theorem 3.1. Since $T$ is an $m$-tree, $\langle y, f_1 \rangle_\pi = E_\pi(y)$ and $Y^{(i)}_{t,1} = 1$. By (3.2),
$$\lambda_2^{-t} m^{-t}\big[W^{(i)}_t - m^t E_\pi(y)\big] = \langle y, f_2\rangle_\pi\, Y^{(i)}_{t,2} + \sum_{j=3}^{N} \langle y, f_j\rangle_\pi\, \tilde Z^{(i)}_{t,j}.$$
From Corollaries B.2 and B.3,
$$\lambda_2^{-t} m^{-t}\big[W^{(i)}_t - m^t E_\pi(y)\big] \to \langle y, f_2\rangle_\pi\, Y^{(i)} \tag{B.7}$$
almost surely and in $L^2$. By definition,
$$\lambda_2^{-t}\big[\hat\mu^{(i)}_t - E_\pi(y)\big] = \frac{1}{n_t} \sum_{l=0}^{t} m^l \lambda_2^{l-t} \cdot \lambda_2^{-l} m^{-l}\big[W^{(i)}_l - m^l E_\pi(y)\big].$$
The number of samples between generation 0 and generation $t$ is $n_t = \sum_{l=0}^{t} m^l$. Since $\lim_{t\to\infty} m^t/n_t = (m-1)/m$ and $m\lambda_2 > 1$, from (B.7),
$$\lambda_2^{-t}\big[\hat\mu^{(i)}_t - E_\pi(y)\big] \to \frac{(m-1)\lambda_2}{m\lambda_2 - 1}\,\langle y, f_2\rangle_\pi\, Y^{(i)} =: X^{(i)} \tag{B.9}$$
almost surely as $t \to \infty$. To prove $L^2$ convergence, notice that if a sequence of random variables $X_n \to X$ in probability and $\|X_n\|_{L^2} \to \|X\|_{L^2}$, then $X_n \to X$ in $L^2$. From (B.8) and (B.10), $\|\lambda_2^{-t}[\hat\mu^{(i)}_t - E_\pi(y)]\|_{L^2} \to \|X^{(i)}\|_{L^2}$, so the convergence in (B.9) also holds in $L^2$. By Corollaries B.1 and B.2, $\mathbb{E}\,Y^{(i)} = \mathbb{E}\,Y^{(i)}_{0,2} = f_2(i)$. Notice that $V_k = \mathrm{diag}(P_{k1}, \ldots, P_{kN}) - P_k P_k^T$, where $P_k^T = (P_{k1}, \ldots, P_{kN})$ is the $k$-th row of $P$. Since $\sum_{j=1}^{N} P_{kj} = 1$, Jensen's inequality gives
$$f_2^T V_k f_2 = \sum_{j=1}^{N} P_{kj} f_2(j)^2 - \Big(\sum_{j=1}^{N} P_{kj} f_2(j)\Big)^2 \ge 0 \tag{B.11}$$
for any $k = 1, \ldots, N$. The assumptions $\langle f_1, f_2\rangle_\pi = 0$ and $f_1 = 1$ imply that $f_2$ is not a constant vector, so equality in (B.11) cannot hold for every $k$. Similar to the proof of Theorem B.2, this yields $\mathrm{var}(Y^{(i)}) > 0$ for any $i = 1, \ldots, N$. If $\langle y, f_2\rangle_\pi \ne 0$, this gives $\mathrm{var}(X^{(i)}) > 0$ for any $i = 1, \ldots, N$.
Remark B.1. If condition (3.3) in Theorem 3.1 is weakened to $1 = \lambda_1 > \lambda_2 = \cdots = \lambda_k > |\lambda_{k+1}| \ge \cdots \ge |\lambda_N|$, then (3.4) holds with $\langle y, f_2\rangle_\pi\, Y^{(i)}$ replaced by $\sum_{j=2}^{k} \langle y, f_j\rangle_\pi\, Y^{(i)}_j$, where $Y^{(i)}_j$ denotes the almost sure and $L^2$ limit of $Y^{(i)}_{t,j}$.

Proof of Corollary 3.1. The $L^2$ convergence in Theorem 3.1 implies $L^1$ convergence, and $L^1$ convergence of a sequence of random variables implies convergence of their expectations. Since $\langle y, f_2\rangle_\pi f_2(i) \ne 0$, the bias term decays like
$$\mathbb{E}\big[\hat\mu^{(i)}_t\big] - E_\pi(y) = \lambda_2^{t}\, \mathbb{E}\big[X^{(i)}\big]\,(1 + o(1)).$$
Additionally, the $L^2$ convergence in Theorem 3.1 also yields $\mathbb{E}\big(\lambda_2^{-t}[\hat\mu^{(i)}_t - E_\pi(y)]\big)^2 \to \mathbb{E}\big(X^{(i)}\big)^2$, so the variance term decays like $\mathrm{var}\big(\hat\mu^{(i)}_t\big) = \lambda_2^{2t}\, \mathrm{var}\big(X^{(i)}\big)(1 + o(1))$.

Proof of Corollary 3.2. By the definition of the VH estimator in Section 2.3.3,
$$\hat\mu^{VH,(i)}_t = \hat\mu''^{(i)}_t \big/ \hat\mu'^{(i)}_t,$$
where $\hat\mu'^{(i)}_t = H_t^{-1}$ is the sample average of the $y'(X_\sigma)$'s up to generation $t$, with $y'(j) = \deg(j)^{-1}$, and $\hat\mu''^{(i)}_t$ is the sample average of the $y''(X_\sigma)$'s up to generation $t$, with $y''(j) = y(j)/\deg(j)$. By Theorem 3.1, $H_t^{-1}$ converges to $E_\pi(y') > 0$ almost surely. Additionally, by Theorem 3.1, there exists $X''^{(i)} \in L^2$ such that $\lambda_2^{-t}\big[\hat\mu''^{(i)}_t - E_\pi(y'')\big] \to X''^{(i)}$ almost surely and in $L^2$. Notice that $E_\pi(y') = N/\mathrm{vol}(G)$ and $E_\pi(y'') = \sum_{i} y(i)/\mathrm{vol}(G)$; this gives the result that
$$\lambda_2^{-t}\Big[\hat\mu^{VH,(i)}_t - \frac{1}{N}\sum_{i \in G} y(i)\Big] \to \bar X^{(i)}$$
almost surely. The mean and variance of $\bar X^{(i)}$ come directly from Theorem 3.1.
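Concretely, the VH estimator appearing in Corollary 3.2 is just the ratio of the sample means of $y''(j) = y(j)/\deg(j)$ and $y'(j) = 1/\deg(j)$. A toy computation, with invented block degrees and a made-up sample sequence (none of these numbers come from the paper):

```python
# Hedged sketch of the Volz-Heckathorn estimator as a ratio of sample means.
deg = {0: 2, 1: 6}                      # hypothetical degrees of the two blocks
y = {0: 0.0, 1: 1.0}                    # hypothetical node feature

sample = [0, 1, 1, 0, 1, 1, 1, 0]       # made-up sequence of sampled types

num = sum(y[x] / deg[x] for x in sample) / len(sample)   # mean of y''
den = sum(1.0 / deg[x] for x in sample) / len(sample)    # mean of y'
mu_vh = num / den                        # = (5/48) / (14/48) = 5/14
print(mu_vh)
```

The division by `den` is what re-weights the degree-biased sample toward the uniform population mean $\frac{1}{N}\sum_i y(i)$.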

C Proof of Theorem 3.2 and Corollary 3.3
This section gives the proof of Theorem 3.2. In previous sections, the subscript of the estimators is $t$ or $l$, which denotes the generation. This section requires us to study each node within a generation. Accordingly, we order the nodes of the $m$-tree $T$ by scanning each level from the root down. For example, for a 2-tree, the root node is 1, its offspring are 2 and 3, the offspring of 2 are 4 and 5, the offspring of 3 are 6 and 7, etc. When no confusion arises, a subscript $n$ on $\hat\mu^{(\cdot)}_n$ denotes the sample mean up to node $n$, i.e.
$$\hat\mu^{(\cdot)}_n = \frac{1}{n} \sum_{k=1}^{n} y(X_k).$$
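Under this level ordering, the indices of a node's offspring and parent have simple closed forms: the children of node $k$ are $m(k-1)+2, \ldots, mk+1$, and the parent of node $k > 1$ is $\lfloor (k-2)/m \rfloor + 1$. A small helper illustrating the indexing (the function names are ours, purely for illustration):

```python
# Level-order indexing of the m-tree: root is node 1, children of node k
# are m*(k-1)+2, ..., m*k+1, and the parent of node k > 1 is (k-2)//m + 1.
def children(k, m):
    return list(range(m * (k - 1) + 2, m * k + 2))

def parent(k, m):
    return (k - 2) // m + 1 if k > 1 else None

m = 2
print(children(1, m), children(2, m), children(3, m))  # [2, 3] [4, 5] [6, 7]
print(parent(7, m))                                    # 3
```

For the 2-tree this reproduces exactly the example in the text: the offspring of 2 are 4 and 5, and the offspring of 3 are 6 and 7.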
It is necessary to introduce a martingale central limit theorem (see e.g. Durrett (2010)).

Theorem C.1. Let $\{X_{n,k}, \mathcal{F}_{n,k}\}$ be a martingale difference array. Suppose that (i) $\sum_{k} \mathbb{E}\big(X_{n,k}^2 \mid \mathcal{F}_{n,k-1}\big) \to \sigma^2$ in probability, and (ii) for every $\varepsilon > 0$, $\sum_{k} \mathbb{E}\big(X_{n,k}^2\, \mathbf{1}\{|X_{n,k}| > \varepsilon\} \mid \mathcal{F}_{n,k-1}\big) \to 0$ in probability. Then $\sum_{k} X_{n,k} \Rightarrow N(0, \sigma^2)$.

The following two lemmas show that although the limiting distribution of $\hat\mu^{(i)}_t$ differs between the high-variance and the low-variance regimes (under different scalings), it always converges to $E_\pi(y)$ in $L^2$. The result is elementary, but it is used in the proof of Theorem 3.2.
Lemma C.1. Assume the eigenvalues of the transition matrix $P$ satisfy $1 = \lambda_1 > |\lambda_2| \ge \cdots \ge |\lambda_N|$. Then there exist constants $C > 0$ and $\gamma \in (0, 1)$ such that for any initial distribution $\nu$ of $X_0$ and all $t$,
$$\mathbb{E}\Big[\big(\hat\mu^{(\nu)}_t - E_\pi(y)\big)^2\Big] \le C \gamma^t. \tag{C.8}$$
The constant $C$ does not depend on the initial distribution $\nu$.
Proof of Lemma C.1. The proof is similar to that of Theorem 3.1. First, for $j \ge 2$, from (B.4), for any $0 < \delta < 1$, $\mathrm{var}\big(\tilde Z^{(i)}_{t,j}\big)$ decays geometrically in $t$. By the Cauchy–Schwarz inequality, the cross terms obey the same geometric bound (C.7). So there exists $C > 0$ such that (C.8) holds when $X_0$ starts from any fixed $i \in G$. So for any initial distribution $\nu$ of $X_0$, since $\sum_{i \in G} \nu_i = 1$, averaging over $i$ gives (C.8).

Lemma C.2. Assume the conditions in Lemma C.1 hold. Then for any initial distribution $\nu$ of $X_0$, $\hat\mu^{(\nu)}_n \to E_\pi(y)$ in $L^2$ as $n \to \infty$.
Proof of Lemma C.2. For a given $n$, there exists $t$ such that $n_{t-1} \le n < n_t$; throughout this proof, $t$ is determined by the corresponding $n$ in this way.

First suppose $n < n_{t-1} + m^{t-1}$. In base $m$, $n_t - n$ is represented as
$$n_t - n = a_{t-1} m^{t-1} + \cdots + a_1 m + a_0,$$
where $a_l \in \{0, 1, \ldots, m-1\}$ for $0 \le l \le t-1$. We first determine a subtree of $T$ whose $(t-1)$-th generation consists of the last $m^{t-1}$ nodes of the $t$-th generation of $T$. Scanning the nodes from right to left in the $t$-th generation of $T$, we can determine $a_{t-1}$ such subtrees and define their generation sums $W^1_{t-1}, \ldots, W^{a_{t-1}}_{t-1}$. Next we determine a subtree of $T$ whose $(t-2)$-th generation consists of the next $m^{t-2}$ nodes of the $t$-th generation of $T$; scanning from right to left, we determine $a_{t-2}$ such subtrees and define $W^1_{t-2}, \ldots, W^{a_{t-2}}_{t-2}$, and so on. Finally, we determine $a_0$ subtrees whose $0$-th generations are single nodes of the $t$-th generation of $T$, and define $W^1_0, \ldots, W^{a_0}_0$. Accordingly, $\hat\mu^{(\nu)}_n$ can be represented in terms of $\hat\mu^{(\nu)}_{n_t}$ and these subtree sums (C.9). By (C.8) and the triangle inequality, since $a_{t-1} \ge 1$ and $a_l \le m-1$ for $0 \le l \le t-1$,
$$\frac{1}{n_t - n}\sum_{k=n+1}^{n_t} y(X_k) \xrightarrow{\;L^2\;} E_\pi(y).$$
By (C.9), the triangle inequality, and the facts that $n_t/n = O(1)$ and $(n_t - n)/n = O(1)$, it follows that $\hat\mu^{(\nu)}_n \xrightarrow{L^2} E_\pi(y)$ in this case.

Now suppose $n \ge n_{t-1} + m^{t-1}$. In base $m$, $n - n_{t-1}$ is represented as
$$n - n_{t-1} = a_{t-1} m^{t-1} + \cdots + a_1 m + a_0, \tag{C.11}$$
where $a_l \in \{0, 1, \ldots, m-1\}$ for $0 \le l \le t-1$ and $a_{t-1} \ge 1$. This time we scan the nodes from left to right in the $t$-th generation of $T$: exactly as above, we determine $a_{t-1}$ subtrees whose $(t-1)$-th generations consist of consecutive blocks of $m^{t-1}$ nodes and define $W^1_{t-1}, \ldots, W^{a_{t-1}}_{t-1}$, then $a_{t-2}$ subtrees with blocks of $m^{t-2}$ nodes and $W^1_{t-2}, \ldots, W^{a_{t-2}}_{t-2}$, and so on, down to $a_0$ subtrees whose $0$-th generations are single nodes, with sums $W^1_0, \ldots, W^{a_0}_0$. Accordingly, $\hat\mu^{(\nu)}_n$ can be represented in terms of $\hat\mu^{(\nu)}_{n_{t-1}}$ and these subtree sums. By (C.8) and the same argument as in the previous case, as $n \to \infty$,
$$\frac{1}{n - n_{t-1}}\sum_{k=n_{t-1}+1}^{n} y(X_k) \xrightarrow{\;L^2\;} E_\pi(y).$$
Since the convergence holds both for $n < n_{t-1} + m^{t-1}$ and for $n \ge n_{t-1} + m^{t-1}$ as $n \to \infty$, this gives $\hat\mu^{(\nu)}_n \xrightarrow{L^2} E_\pi(y)$.
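The base-$m$ decomposition used repeatedly above, e.g. in (C.11), can be computed with a short helper (purely illustrative; the function name is ours):

```python
# Digits a_0, ..., a_{t-1} of x in base m (least significant first), as in
# the decomposition x = a_{t-1} m^{t-1} + ... + a_1 m + a_0 of (C.11).
def base_m_digits(x, m, t):
    digits = []
    for _ in range(t):
        digits.append(x % m)
        x //= m
    return digits

# Example with m = 2, t = 4: 13 = 1*8 + 1*4 + 0*2 + 1
print(base_m_digits(13, 2, 4))  # [1, 0, 1, 1]
```

Each digit $a_l$ counts how many subtrees with $m^l$ nodes in their deepest scanned generation appear in the decomposition of the partial sample.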
Proof of Theorem 3.2. The proof consists of four main parts. The first part constructs a martingale. The next two parts verify that this martingale satisfies the two conditions in Theorem C.1. The fourth part concludes that $\hat\mu^{GLS}$ is asymptotically normally distributed. Assume the Markov transition matrix between blocks is
$$P = \begin{pmatrix} p & 1-p \\ 1-q & q \end{pmatrix}.$$
The second eigenvalue satisfies $\lambda_2 = p + q - 1$. For simplicity, assume the node feature $y$ is centered and normalized so that $E_\pi(y) = 0$ and $\mathrm{var}_\pi(y) = 1$; at the end of the proof, this assumption will be removed to reach the general conclusion. It is worth noting that the following proof allows $X_0$ to be initialized from any initial distribution $\nu$.
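The identity $\lambda_2 = p + q - 1$ for this two-block transition matrix is easy to verify numerically (the values of $p$ and $q$ below are arbitrary illustrative choices):

```python
import numpy as np

# Check lambda_2 = p + q - 1 for P = [[p, 1-p], [1-q, q]].
p, q = 0.85, 0.65
P = np.array([[p, 1 - p],
              [1 - q, q]])
lam = np.sort(np.linalg.eigvals(P).real)[::-1]
print(lam)  # [1.0, 0.5] since p + q - 1 = 0.5
```

The top eigenvalue is always 1 because the rows of $P$ sum to 1; the gap $1 - \lambda_2$ controls how quickly the referral chain mixes between the two blocks.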
When $T$ is an $m$-tree, each node from level 0 to $t-1$ is counted $m$ times as a parent. Define a new node feature $y' = (y_1^2, y_2^2)^T$. Let $\hat\mu'^{(\nu)}_n$ be the sample average of the $y'$'s up to node $n$. By Lemma C.2, $\hat\mu'^{(\nu)}_n \to E_\pi(y') = E_\pi(y^2) = 1$ in $L^2$ as $n \to \infty$. $L^2$ convergence also implies convergence in probability.