Central limit theorems for network driven sampling

Respondent-Driven Sampling is a popular technique for sampling hidden populations. This paper models Respondent-Driven Sampling as a Markov process indexed by a tree. Our main results show that the Volz-Heckathorn estimator is asymptotically normal below a critical threshold. The key technical difficulties stem from (i) the dependence between samples and (ii) the tree structure that characterizes this dependence. The theorems allow the growth rate of the tree to exceed one, but suggest that this growth rate should not be too large. To illustrate the usefulness of these results beyond their obvious uses, an example shows that in certain cases the sample average is preferable to inverse probability weighting. We provide a test statistic to distinguish between these two cases.


Introduction
Classical sampling requires a sampling frame, a list of individuals in the target population with a method to contact each individual (e.g. a phone number). For many populations, constructing a sampling frame is infeasible. Network driven sampling enables researchers to access populations of people, webpages, and proteins that are otherwise difficult to reach. These techniques go by many names: web crawling, Respondent-Driven Sampling, breadth-first search, snowball sampling, co-immunoprecipitation, and chromatin immunoprecipitation. In each application, the only way to reach the population of interest is by asking participants to refer friends.
Respondent-Driven Sampling (RDS) serves as a motivating example for this paper. The Centers for Disease Control, the World Health Organization, and the Joint United Nations Programme on HIV/AIDS have invested in RDS to reach marginalized and hard-to-reach populations [6,1]. Each individual i in the population has a corresponding feature y_i (e.g. y_i ∈ {0, 1} and y_i = 1 if i is HIV+). Using only the sampled individuals, we wish to make inferences about the average value of y_i across the entire population, denoted as μ (e.g. the proportion of the population that is HIV+). Extensive previous statistical research has proposed various estimators of μ which are approximately unbiased under various types of models for an RDS sample [16,17,4]. We note that in the papers cited above (except [4]), RDS is assumed to sample with replacement. Previous research has also explored the variance of these estimators [5,13]. This paper studies the asymptotic distribution of statistics related to these estimators.
Results on asymptotic distributions for RDS are useful for two obvious reasons. First, they allow us to construct asymptotic confidence intervals for μ. Second, they provide essential tools to test various statistical hypotheses. The only central limit theorem previously considered in the RDS literature studied the case when the tree-indexed process reduces to a Markov chain [5]; this presumes that each individual refers exactly one person. Previous research suggests that the number of referrals from each individual is fundamental in determining the variance of common estimators [13]. This paper establishes two central limit theorems in settings which allow for multiple referrals.
The main results apply to both the sample average and the Volz-Heckathorn estimator, which is an approximation of the inverse probability weighted estimator (cf Remark 1). Because the inverse probability weighted (IPW) estimator and its extensions are asymptotically unbiased, these estimators are often preferred to the sample average.

Notation
Following [5] and [13], the results below model the network sampling mechanism as a tree-indexed Markov process on a graph. There are many assumptions in this model which are incorrect in practice. However, like the i.i.d. assumption, it allows for tractable calculations. In the simulations, we show that the theory derived from this model provides a good approximation for a more realistic sampling model. [12] studies the sensitivity of the estimators to this model.
Let G = (V, E) be a finite, undirected, and simple graph with vertex set V = {1, ..., N} and edge set E. V contains the individuals in the population and E describes how they are related to one another. As discussed in the introduction, y : V → R is a fixed real-valued function on the state space V; these are the node features that are measured on the sampled nodes. The target of RDS is to estimate μ = N^{-1} Σ_{i=1}^N y(i). If each sampled node referred exactly one friend, then the Markov sampling procedure would be a Markov chain. Several classical central limit theorems exist for this model; see [8] for a review. The results herein allow for each sampled node to refer more than one node. This is a Markov process indexed not by a chain, but rather by a tree. Denote the referral tree as T. Where the node set of G indexes the population, the node set of T indexes the samples. That is, we observe a subset of the individuals in G with the sample {X_τ}_{τ∈T} ⊂ V. An edge (σ, τ) in the referral tree denotes that sampled individual X_σ referred individual X_τ into the sample. Mathematically, T is a rooted tree, that is, a connected graph with n nodes, no cycles, and a vertex 0 which indexes the seed node. To simplify notation, σ ∈ T is used synonymously with σ belonging to the vertex set of T.
For each non-root node τ ∈ T, denote p(τ) ∈ T as the parent of τ (i.e. the node one step closer to the root). This paper presumes that {X_τ}_{τ∈T} is a tree-indexed random walk on G, a model introduced by [2]. This model generalizes a Markov chain on G; each transition X_{p(τ)} → X_τ is an independent and identically distributed Markov transition with some transition matrix P that is defined below. Following [2], we will call this process a (T, P)-walk on G. Unless stated otherwise, it will be presumed throughout that the root node of the random walk X_0 is initialized from the equilibrium distribution π of P. It follows that X_σ has distribution π for all σ ∈ T.
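For concreteness, a (T, P)-walk can be simulated by drawing X_0 from π and then drawing each child's state from the parent's row of P. The sketch below is our own illustration (the function name tp_walk and the dict encoding of T are not from the paper); it makes explicit that transitions along distinct tree edges are independent.

```python
import random

def tp_walk(P, pi, tree_children, root=0):
    """Simulate a (T, P)-walk on G.

    P: row-stochastic transition matrix (list of lists).
    pi: equilibrium distribution of P (list of probabilities).
    tree_children: dict mapping each tree node to a list of its children.
    Returns a dict X mapping tree nodes to sampled graph nodes.
    """
    N = len(pi)
    X = {root: random.choices(range(N), weights=pi)[0]}  # X_0 ~ pi
    stack = [root]
    while stack:
        sigma = stack.pop()
        for tau in tree_children.get(sigma, []):
            # each referral X_{p(tau)} -> X_tau is an independent P-transition
            X[tau] = random.choices(range(N), weights=P[X[sigma]])[0]
            stack.append(tau)
    return X
```

Passing a 2-tree of height h as tree_children yields one draw of the sample {X_τ}_{τ∈T} used throughout the paper.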
Unless stated otherwise, this paper presumes throughout that the transition matrix P is constructed from a weighted graph G. Let w_ij ≥ 0 be the weight of the edge between nodes i and j, define deg(i) = Σ_j w_ij, and set P_ij = w_ij / deg(i) (Equation (2.1)). If the graph is unweighted, then deg(i) is the number of connections to node i. Throughout this paper, the graph is undirected; so, w_ij = w_ji for all pairs i, j.
We use the term simple random walk for the Markov chain constructed on the unweighted graph (i.e. w_ij ∈ {0, 1} for all i, j). The simple random walk presumes that each participant selects a friend uniformly and independently at random from their list of friends. [10] serves as this paper's key reference for Markov processes. Following the notation in that text, define E_π(y) = Σ_{i=1}^N π_i y(i) and var_π(y) = E_π(y − E_π(y))² for the function y. In order to estimate μ, we observe y(X_τ) for all τ ∈ T. Because G is undirected, P is reversible and has stationary distribution π with π_i ∝ deg(i) for all i ∈ G; this fact is helpful for creating an asymptotically unbiased estimator for μ, particularly under the simple random walk assumption [17].
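The construction of P and π from a symmetric weight matrix can be sketched as follows (a minimal illustration; the function name is ours). It uses the standard random-walk construction P_ij = w_ij/deg(i) with π_i = deg(i)/Σ_k deg(k), which makes the chain reversible with stationary distribution π.

```python
def transition_matrix(W):
    """Build the random-walk transition matrix P and its stationary
    distribution pi from a symmetric weight matrix W (w_ij = w_ji).

    P_ij = w_ij / deg(i) with deg(i) = sum_j w_ij, and pi_i = deg(i) / sum_k deg(k).
    """
    N = len(W)
    deg = [sum(row) for row in W]
    total = sum(deg)
    P = [[W[i][j] / deg[i] for j in range(N)] for i in range(N)]
    pi = [deg[i] / total for i in range(N)]
    return P, pi
```

For an unweighted graph, W is the 0/1 adjacency matrix and this reduces to the simple random walk; reversibility (π_i P_ij = π_j P_ji) follows from w_ij = w_ji.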

Remark 1. In general, the quantity of interest μ = N^{-1} Σ_{i=1}^N y(i) is not equal to E_π(y). As such, the sample average of the y(X_τ)'s is a biased estimator for μ. With inverse probability weighting, define a new function y'(i) = y(i)(N π_i)^{-1} and the respective estimator μ̂_IPW = n^{-1} Σ_{τ∈T} y'(X_τ),

where n = |T| is the sample size. Then, E_π(μ̂_IPW) = E_π(y') = μ. As such, the sample average of the y'(X_τ)'s is an unbiased estimator of μ. Unfortunately, the values π_i are unknown. In practice, RDS participants are asked various questions to measure how many friends they have in G. Under the simple random walk assumption, π_i is proportional to the number of friends of i. Therefore the Volz-Heckathorn estimator μ̂_VH is a Hájek estimator based upon deg(i) [17]. Under the simple random walk assumption, this estimator provides an asymptotically unbiased estimator of μ.
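As a sketch, the Volz-Heckathorn estimator can be computed as a Hájek-style ratio in which the unknown factor N π_i ∝ deg(i) cancels between numerator and denominator (the function name is ours; reported degrees stand in for the true ones).

```python
def vh_estimate(y_obs, deg_obs):
    """Volz-Heckathorn (Hajek) estimator: inverse-probability weighting
    with pi_i proportional to deg(i), so the unknown constant N*pi_i
    cancels in the ratio.

    y_obs[k], deg_obs[k]: feature and reported degree of the k-th sample.
    """
    num = sum(y / d for y, d in zip(y_obs, deg_obs))
    den = sum(1.0 / d for d in deg_obs)
    return num / den
```

The denominator is n divided by the harmonic mean of the observed degrees, which is how the estimator replaces the unknown average degree (cf. Section "Extension to the Volz-Heckathorn estimator").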
For each node τ ∈ T, let |τ| be the distance of the node from the root; this is also called the "wave" of τ. For every pair of nodes σ, τ ∈ T, define d(σ, τ) to be the distance between σ and τ on T (as a graph). For each non-leaf node σ ∈ T, let η(σ) be the number of offspring of σ. A tree is said to be an m-tree of height h if η(σ) = m for all σ ∈ T with |σ| < h and η(σ) = 0 for all |σ| = h. Here, both m and h are natural numbers (i.e. m, h ∈ N). T is said to be Galton-Watson if the η(σ) are i.i.d. random variables in N. While the theorems below only study 2-trees, the computational experiments in Section 5 suggest that the conclusions of the analytical results are highly robust to replacing the 2-tree with a Galton-Watson tree.
There are two primary concerns about the model and estimator used in the main results below. First, the Markov model allows for resampling. Second, the results below only apply to m-trees, not more general trees. The simulations in Section 5 suggest that the analytic results continue to hold under a more realistic setting that addresses both of these concerns.

Main results
Let T be an m-tree and λ_2 be the second largest eigenvalue of P. The variance of μ̂_IPW decays at the standard rate if and only if m < λ_2^{-2} [13]. In other words, if m > λ_2^{-2}, then, writing n_h = |{σ ∈ T : |σ| ≤ h}|, n_h · var( n_h^{-1} Σ_{σ∈T: |σ|≤h} y(X_σ) ) → ∞ as h → ∞. As such, using the traditional scaling, no central limit theorem holds above the critical threshold. Because of this, the theorems focus on the case m < λ_2^{-2}. When m > λ_2^{-2}, the simulations in Section 5 suggest that the central limit theorem does not hold for any scaling. Theorem 1 is a central limit theorem for an estimator constructed from the tree-indexed Markov chain. The theorem holds for any function y, any reversible transition matrix with second largest eigenvalue satisfying |λ_2| < 1, and any m < λ_2^{-2}.
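The critical threshold is easy to check numerically: given the (mean) number of referrals m and the second eigenvalue λ_2 of P, the theorems below require m λ_2² < 1. A small helper of our own:

```python
def below_critical_threshold(m, lam2):
    """True when m < lam2**(-2), i.e. m * lam2**2 < 1, which is the
    regime where the variance decays at the standard rate and the
    central limit theorems apply."""
    return m * lam2 ** 2 < 1
```

With m = 2 (each participant refers two others), the condition becomes |λ_2| < 1/√2 ≈ 0.707, matching the simulation regimes in Section 5.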

Theorem 1.
Suppose that P is a reversible transition matrix with respect to the equilibrium distribution π, and that the eigenvalues of P are 1 = λ_1 > |λ_2| ≥ ... ≥ |λ_N|. Without loss of generality, suppose that E_π(y) = 0. Define the statistic Y_h as the wave-reweighted sample average described below. If T is an m-tree with m < λ_2^{-2}, then Y_h converges in distribution to N(0, σ²), where σ² = var_π((√m P − I)^{-1} y) − var_π(P(√m P − I)^{-1} y). The sequence of random variables considered in Theorem 1 is not exactly a sequence of sample averages, but a reweighted form of sample average: samples in the same wave are equally weighted, while samples from different waves are not. The following theorem provides a theoretical guarantee on the distribution of the sample average for a specific class of transition matrices and node features. For a vector x, one of the conditions uses the notation ‖x‖_∞ = max_i |x_i|.
Condition (c1) is a technical condition on the symmetry of μ̂_h that is necessary in the proof. The following proposition provides a sufficient condition for (c1).
Proof. Under the conditions of the proposition, the distribution of μ̂_h is symmetric about 0. Thus E(μ̂_h^{2k+1}) = 0 for all k ∈ N, and the proposition follows.

Conditions (c2)-(c3) can be substituted by the following condition (c2'):
Condition (c2') is weaker than (c2) and (c3) combined, but is stronger than (c3) alone. To see this, let f be the eigenfunction of the second eigenvalue; it then follows that |λ_2| < 1/√2. It can be easily seen that one necessary condition for (c2') is that all the rows of P must be close to π. As previously discussed, condition (c3) is actually a necessary condition for the central limit theorem [13], in the sense that the variance of μ̂_h tends to infinity if |λ_2| ≥ 1/√2. For clarity in the exposition of the theorem and the proof, we have only proved the theorem for the 2-tree. Results for more general m-trees can be proved with a similar technique.

Extension to the Volz-Heckathorn estimator
When P is restricted to be the transition matrix of the simple random walk on G, the following corollary shows that Theorem 2 can be extended to the Volz-Heckathorn estimator [17]. Denote d̄ = N^{-1} Σ_{i∈V} deg(i) as the average node degree. Following Remark 1, the IPW estimator contains 1/(N π_i), which is equal to d̄/deg(i). The Volz-Heckathorn estimator first estimates d̄ with the harmonic mean of the observed degrees. Because this harmonic mean converges to d̄ in probability, the following corollary applies Slutsky's Theorem to give a central limit theorem for the Volz-Heckathorn estimator.

Corollary 1. Let T be a 2-tree. Suppose in particular that P is the transition matrix of the simple random walk on G.
Define a new node feature y'(i) = y(i)/deg(i). Without loss of generality, suppose that E_π(y') = 0 (this is not equivalent to E_π(y) = 0). Define μ̂_VH accordingly; the conclusion of Theorem 2 then holds for μ̂_VH.

Illustrating the conditions with a blockmodel
Consider G as coming from a blockmodel with two blocks [11]. Previously, [5] studied RDS with this model. It serves as an approximation to the Stochastic Blockmodel. In particular, suppose that each node i = 1, ..., N is assigned to a block with z(i) ∈ {1, 2}. Suppose that each block contains N/2 nodes. Further suppose that every pair i, j has w_ij = B_{z(i),z(j)} ∈ (0, 1). Thus, under the construction of P in Equation (2.1), the transition probabilities depend only on the block labels of the two nodes. Given the structural equivalence of nodes within the same block, it is sufficient to study the conditions (c2) and (c3) with a Markov chain whose state space is reduced to the block labels {1, 2} and whose 2×2 transition matrix is obtained from B. See Section C in the Appendix for a further discussion of this fact.
Notice that λ_2 = (p − r)/(p + r) is the second eigenvalue of both P and the reduced two-state chain.
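This eigenvalue is immediate to verify: row-normalizing B = [[p, r], [r, p]] gives a 2×2 stochastic matrix with eigenvalues 1 and (p − r)/(p + r). A helper (name ours) for the computation:

```python
def blockmodel_lambda2(p, r):
    """Second eigenvalue of the two-block chain. Row-normalizing
    B = [[p, r], [r, p]] gives [[a, 1 - a], [1 - a, a]] with
    a = p / (p + r); its eigenvalues are 1 and 2a - 1 = (p - r)/(p + r)."""
    return (p - r) / (p + r)
```

When p = r (no block structure) the second eigenvalue is 0, and λ_2 → 1 as the blocks become increasingly assortative (p >> r).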
Under appropriate conditions on p and r, conditions (c2) and (c3) are satisfied. This example can be expanded to study a blockmodel with 2K blocks. Suppose that the outcome y_i depends only on the block label.

Estimating the variance
For some node feature ỹ (e.g. HIV status y, or the y' in Remark 1 that leads to the IPW estimator), let μ̂ denote the sample average. Denote σ²_μ̂ = Var_{T,P}(μ̂), where the subscript T, P denotes that the data is collected via a (T, P)-walk on G. This subsection studies a simple plug-in estimator for σ²_μ̂. The following function is essential to expressing σ²_μ̂ [13]. Definition 1. Select two nodes I, J uniformly at random from the tree T. Define the random variable D = d(I, J) to be the graph distance in T between I and J. Define G as the probability generating function for D, G(z) = E(z^D). In practice, T is observed, so the function G can be computed. In many studies there are multiple seed nodes. In these cases, we suggest computing d(I, J) on a tree which has an artificial root node that connects to all of the seeds; this root node could be imagined as an individual that is responsible for finding the seed nodes. In this tree, two different seed nodes would be distance 2 apart.
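Since T is observed, G can be computed exactly by enumerating pairwise distances; a brute-force sketch (O(n²) via breadth-first search, fine for typical RDS sample sizes; names ours):

```python
from collections import deque

def pgf_of_pairwise_distance(adj, z):
    """G(z) = E[z^D], where D = d(I, J) for I, J drawn uniformly and
    independently (with replacement) from the nodes of the tree T.

    adj: dict mapping each tree node to a list of its neighbours in T.
    """
    nodes = list(adj)
    n = len(nodes)
    total = 0.0
    for src in nodes:
        # BFS from src yields d(src, j) for every node j
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(z ** dist[j] for j in nodes)
    return total / (n * n)
```

For multiple seeds, the same function applies after adding the artificial root described above to adj.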
Because the data has been sampled proportionally to π, the plug-in quantity for var_π should not explicitly adjust for π. Namely, the plug-in variance and covariance terms are computed directly from the observed ỹ(X_τ), where {T \ 0} contains all nodes except the root node 0 (because p(0) does not exist). Using these plug-in quantities, define R̂; the estimator is then σ̂²_μ̂. A popular bootstrap technique for estimating σ²_μ̂ resamples ỹ(X_τ) as a Markov process (i.e. in addition to X_τ being a Markov process, the bootstrap procedure also assumes that ỹ(X_τ) is Markov) [15]. This model is akin to the blockmodel with two blocks in Section 3.2. The following assumption is weaker than this Markov assumption:

Proposition 2. Under Assumption 1,
While Assumption 1 is weaker than the previous assumption in [15], the next proposition highlights the danger of this assumption. It relies on a different, rather weak assumption (Assumption 2).
Because G is a probability generating function, it is always convex on [0, 1]. As such, we only need to be concerned with negative arguments. Recall that the central limit theorems above only hold when |λ_min| < 1/√2 ≈ 0.7 (the smallest possible value for λ_min is −1). Some simulated trees given in the appendix suggest that when G is not convex, convexity often fails in a neighborhood of −1. As such, the assumption that |λ_min| < 1/√2 ≈ 0.7 is likely to imply Assumption 2. In practice, one observes the referral tree T; thus, one can compute the second derivative of G. Eigenvalues of P close to negative one arise in antithetic sampling, where adjacent samples are dissimilar. For example, if the population in G were heterosexuals and edges in G represented sexual contacts, then men would only refer women and vice versa. In this case, λ_min would be exactly −1. While easily imagined, such settings are not current practice for RDS. As such, large negative values are uncommon; λ_min is likely close to zero.
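Because G(z) = Σ_d P(D = d) z^d, its second derivative G''(z) = Σ_d d(d − 1) P(D = d) z^{d−2} can be computed exactly from the observed tree, so the convexity needed for Assumption 2 can be checked directly at any z of interest (a sketch with our own function names):

```python
from collections import Counter, deque

def pairwise_distance_counts(adj):
    """Counts of D = d(I, J) over all ordered pairs of tree nodes.

    adj: dict mapping each tree node to a list of its neighbours in T.
    """
    counts = Counter()
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        counts.update(dist.values())
    return counts

def pgf_second_derivative(adj, z):
    """Exact G''(z) = sum_d d(d-1) P(D = d) z^(d-2). A nonnegative value
    on the region of interest (near lambda_min) indicates the convexity
    required by Assumption 2."""
    counts = pairwise_distance_counts(adj)
    n2 = sum(counts.values())
    return sum(d * (d - 1) * (c / n2) * z ** (d - 2)
               for d, c in counts.items() if d >= 2)
```

Evaluating pgf_second_derivative near z = −1 (or at z = λ_min when an estimate is available) gives a direct diagnostic for Assumption 2.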
The following proposition follows from an application of Jensen's inequality. A proof is given in Appendix D.

Proposition 3. Under Assumption 2,
Because Assumption 2 is not very restrictive, the inequality in Proposition 3 highlights the danger in breaking Assumption 1 (and thus the Markov model in [15]); breaking Assumption 1 leads to σ̂²_μ̂ underestimating the variance.

Numerical results
In this section, we illustrate the theoretical results on simulated data. The simulations are performed on networks simulated from the Stochastic Blockmodel [7]. The four colors in Figure 1 correspond to four different networks, from four different parameterizations of the model. Each of the four networks has N = 5,000 nodes, equally balanced between group zero and group one. The probability of a connection between two nodes in different blocks is r and the probability of connection between two nodes in the same block is p. To control the eigenvalues of the 5000 × 5000 transition matrix, consider the transition matrix between classes given by 𝒫 = E(D)^{-1} E(A), where expectations are taken under the Stochastic Blockmodel. The second eigenvalue of 𝒫 is λ_2(𝒫) = (p − r)/(p + r) [14]. In our simulation, the second eigenvalue of the actual transition matrix is typically very close to λ_2(𝒫). We take p + r = 0.01 in all four Stochastic Blockmodels so that the average degree is about 25. As such, λ_2(𝒫) is controlled by p − r.
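In this parameterization, the two constraints p + r = 0.01 and (p − r)/(p + r) = λ_2 determine p and r; a small helper of our own recovers them:

```python
def sbm_rates(lam2, density=0.01):
    """Invert p + r = density and (p - r)/(p + r) = lam2 to recover the
    within-block (p) and between-block (r) connection probabilities used
    in the simulation design."""
    p = density * (1 + lam2) / 2
    r = density * (1 - lam2) / 2
    return p, r
```

For example, the λ_2 = 0.6 network uses p = 0.008 and r = 0.002, giving an expected degree of roughly (N/2)(p + r) = 25.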
For each of the four networks, we carry out four different sampling designs. Let T be either a 2-tree or a Galton-Watson tree with E(η(σ)) = 2. For the Galton-Watson tree, the distribution of η(σ) is uniform on {1, 2, 3}. For each T, we consider both with-replacement sampling (i.e. the (T, P)-walk on G) and without-replacement sampling (i.e. referrals are sampled uniformly from the friends that have not yet been sampled). Note that the conditions of Theorem 2 may be violated when either the Galton-Watson tree or without-replacement sampling is used. We take the first 8 waves of T as our sample. As such, the sample size is roughly N/10. For each social network and sampling design, we repeat the sampling process 2000 times and compute μ̂ = n^{-1} Σ_{i=1}^n y(X_i) for each sample. The Quantile-Quantile (Q-Q) plot of μ̂ is shown in the left panel of Figure 1; note that the Q-Q plot centers and scales each distribution to have mean zero and standard deviation one. In addition, we repeat the above simulation for the Volz-Heckathorn estimator, and the Q-Q plot of μ̂_VH is shown in the right panel of Figure 1. It is clear from Figure 1 that there are two patterns of distribution: when λ_2 < 1/√m ≈ 0.7 (i.e. λ_2 = 0.5 or 0.6), the Q-Q plots of μ̂ and μ̂_VH approximately lie on the line y = x for all sampling designs; when λ_2 > 1/√m ≈ 0.7 (i.e. λ_2 = 0.8 or 0.9), the Q-Q plots of μ̂ and μ̂_VH depart from the line y = x. Taken together, Figure 1 suggests that the distributions of μ̂ and μ̂_VH converge to a Gaussian distribution if and only if m < λ_2^{-2}. In fact, when m > λ_2^{-2}, the distribution of the estimators has two modes. The relationship between the expectation of the offspring distribution and the second eigenvalue of the social network determines the asymptotic distribution of RDS estimators, regardless of the node feature, the particular structure of the tree, or the way we handle replacement.

Discussion
A recent review of the RDS literature counted over 460 studies which used RDS [18]. Many of these studies seek to estimate the prevalence of HIV or other infectious diseases; for these studies, a point estimate of the prevalence is insufficient. These studies have used confidence intervals constructed from bootstrap procedures and from estimates of the standard error. These standard error intervals implicitly rely on a central limit theorem, and this paper provides a partial justification for such techniques, so long as m < 1/λ_2². Figure 1 suggests that if m is larger than 1/λ_2², then the simple estimators (μ̂ and μ̂_VH) are no longer normally distributed.
The theorems in this paper do not apply to general trees, only to m-trees. If T is a Galton-Watson tree with E(η(σ)) < λ_2^{-2}, then the simulations support the conjecture that the sample average remains asymptotically normal, where the limiting variance σ² can be computed from the results in [13]. To prove this result requires a more careful study of the structure of {X_σ}_{σ∈T}. We leave this problem to future investigation.

Appendix A: Proof of Theorem 1
In the appendix, we give proofs of the theorems and propositions in the paper. First, we give an outline of the proof of our main theorem. Consider a martingale with respect to a filtration {F_h} to be defined later. Using the Markov property and an estimate of var(Y_h), we show that the martingale difference sequence satisfies the conditions of the martingale central limit theorem. In this section, P will be a reversible transition matrix with eigenvalues 1 = λ_1 ≥ |λ_2| ≥ ... ≥ |λ_N| and corresponding eigenfunctions f_1, ..., f_N satisfying Σ_k f_i(k) f_j(k) π_k = δ_{ij} for any i, j. We refer to [10] for the existence of such an eigendecomposition. Unless stated otherwise, expectations are calculated with respect to the tree-indexed random walk on the graph. We begin with some lemmas.

Lemma 1 (Lemma 12.2 in [10]). Let P be a reversible Markov transition matrix on the nodes in G with stationary distribution π. If λ is an eigenvalue of P, then |λ| ≤ 1. The eigenfunction f_1 corresponding to the eigenvalue 1 is taken to be the constant vector 1. If X(0), ..., X(t) represent t steps of a Markov chain with transition matrix P, then the probability of a transition from i ∈ G to j ∈ G in t steps can be written as P^t(i, j) = π_j Σ_{k=1}^N f_k(i) f_k(j) λ_k^t.
From the reversibility of the Markov chain and Lemma 1, the covariance identity follows, and the lemma is proved. The next lemma gives the expression of var(Y_h).
Proof. For k = 0, 1, ..., h, denote by s_{hk} the number of ordered pairs (σ, τ) such that |σ| = |τ| = h and d(σ, τ) = 2k. Then s_{h0} = m^h and, for k ≥ 1, s_{hk} = m^h (m − 1) m^{k−1} (τ branches off from σ's ancestor at depth h − k through one of the m − 1 other children). The next lemma is a convergence argument which we will use in the proof of Theorem 1.

Lemma 4 (Slutsky's lemma). If X_n → X in distribution and Y_n → c in probability for a constant c, then X_n + Y_n → X + c and Y_n X_n → cX in distribution.
The following theorem from [3] is essential to the proof of our main theorem.

Theorem 3 (Martingale central limit theorem). Suppose that {Z_h, F_h}_{h≥1} is a martingale difference sequence satisfying (1) the conditional variances Σ_{h≤n} E(Z_h² | F_{h−1}) converge in probability to σ², and (2) a conditional Lindeberg condition, which we verify below via fourth moments. Then Σ_{h≤n} Z_h converges in distribution to N(0, σ²).
Now we are ready to prove our main theorems.

Proof of Theorem 1.
Define Y_h in the same way as in Theorem 1. Without loss of generality, suppose that E_π(y) = 0. Since m < λ_2^{-2}, the function y' = (√m P − I)^{-1} y is well defined; y' is also a function on the state space. We will first argue with the new node feature y' and then convert back to y. Define Z_h as the corresponding martingale difference; then {Z_h, F_h}_{h≥1} is a martingale difference sequence. We will verify that {Z_h, F_h}_{h≥1} satisfies conditions (1) and (2) in Theorem 3.

For any σ ∈ T, denote by p(σ) the parent node of σ; Z_h can also be expressed in terms of the increments over the edges (p(σ), σ) at wave h. It follows from the definition of V_h and the Cauchy-Schwarz inequality that, in probability, the conditional variances converge to var_π(y') − var_π(P y'), and condition (1) in Theorem 3 is satisfied. Similarly, for constants C_0, C_1, C, we have E(Z_h^4) ≤ C for any h, so condition (2) is also satisfied. From Theorem 3, then from Lemma 4 and the definition of y', convergence in distribution follows, where σ² = var_π(y') − var_π(P y') = var_π((√m P − I)^{-1} y) − var_π(P(√m P − I)^{-1} y). The proof is now complete.

B.1. Proof of moments convergence
Let X_r be the root of the 2-tree and define the conditional moments γ_{k,h}(i). Our key observation is that, given the left and right children of the seed, the left and right subtrees can be seen as i.i.d. copies of the whole tree, which makes it possible to build a relationship between γ_{k,h}(i) and γ_{k,h−1}(i). Only condition (c3) is needed throughout the proof.
We need the following Lemma.

Lemma 5. Let {a h } be a sequence satisfying
Proof. Without loss of generality, suppose that c_h = 0; the stated bound then follows, and the lemma is proved.
We use induction on k. First, we will prove that γ_1 = 0. In fact, from Lemma 1, for all i, |γ_{1,h}(i) − γ_1| = O(ρ^h) with γ_1 = 0. Now we move from k − 1 to k. Without loss of generality, suppose that γ_{2,h}(i) > 1 for all h, i (otherwise we can multiply y by a large constant). It follows that γ_{2k,h}(i) ≥ (γ_{2,h}(i))^k > 1 for all k. We can decompose γ_{k,h}(i) accordingly. If k is even, then the first inequality follows from Jensen's inequality and the second inequality follows from our assumption that γ_{2k,h}(i) > 1 for all k; likewise if k is odd. Let X_lc and X_rc be the left and right children of the root and T_l and T_r the left and right subtrees, and we have

If k = 2, Equations (B.2) and (B.4) reduce to a linear recursion. Thus, by setting δ_1 = 0, we have ν_h = P^h ν_1 + Σ_{k=1}^h P^k δ_{h−k}, and it is not hard to verify that all the components of ν_h (i.e., every γ_{k,h}(i)) converge to γ_2 = π^T ν_1 + Σ_{h=1}^∞ π^T δ_h with rate ρ^h. Now suppose that k > 2. Since k is fixed, there are a fixed number of terms in S_1 as h goes to infinity. Since |γ_{l,h}(i) − γ_l| = O(ρ^h) for all i ∈ S and l < k − 1, the remainder terms vanish, and we may let h tend to infinity in Equation (B.5). Now suppose that ξ_1 ∼ N(0, γ_2) and γ̄_k = E(ξ_1^k). Let ξ_2 be an i.i.d. copy of ξ_1. Then the sequence {γ̄_k}, k ∈ N, also follows Equation (B.7). Since γ_1 = γ̄_1 = 0 and γ_2 = γ̄_2, we have γ_k = γ̄_k for every k, and the argument is proved.

B.2. Proof of uniform sub-gaussianity
To prove that the μ̂_h are uniformly sub-Gaussian for all h, we need to show that there exists some θ such that the moment bounds below hold for all ℓ and h. Let c_1 be a large constant to be chosen later. Again we use induction, now on ℓ. Since γ_{1,h}(i) = O(|λ_2|^h), we can choose c_1 large enough such that the inequalities in Equations (B.8) and (B.9) hold for all (h, ℓ) with h = 1 or ℓ = 1. Suppose that Equations (B.8) and (B.9) are verified for all ℓ ≤ k. We will prove that they are also true for ℓ = k + 1.
By conditions (c1) and (c2), together with the induction hypothesis, we have s_{2k+2,h} ≤ (1 + M 2^{−h})^{2k+2} (I_1 + I_2), where the bound on I_1 follows from Equation (B.7) and the bound on I_2 can be directly verified for all m. Combining Equations (B.11) and (B.13) yields (B.14); therefore the inequalities hold for ℓ = k + 1, and the theorem is proved.

B.3. Proof of Corollary 1
By Theorem 2 and Slutsky's lemma, it suffices to prove that d̂ → d̄ in probability.
Hence P(|d̂ − d̄| > ε) → 0 for all ε > 0, and the corollary is proved.

Appendix C: Reducing the state space of the Markov chain
This section justifies the simplification in Section 3.2. Recall that P ∈ R^{N×N} is a Markov transition matrix on N nodes, where each node i is assigned to one of two classes z(i) ∈ {1, 2} and the transition probabilities depend only on the class labels. Let X_t ∈ {1, ..., N} for t = 0, 1, ... be a Markov chain with transition matrix P that is initialized from the stationary distribution.
One can construct a Markov chain {Z t } t on the block labels {1, 2} that is equal in distribution to {z(X t )} t . Define Z t ∈ {1, 2} for t = 0, 1, . . . as a Markov chain with transition matrix P = B and initialize Z 0 from the stationary distribution of P. Induction shows that {Z t } t is equal in distribution to {z(X t )} t .
The following is a proof of Lemma 6 from [10].

Proof of Lemma 6.
The following is a proof of Proposition 3.

Proof of Proposition 2.
In the case when ỹ(i) = μ + σ f_j(i), we have ỹ = μ f_1 + σ f_j. By the orthonormality of the eigenvectors, ⟨ỹ, f_ℓ⟩²_π / var_π(ỹ) = 1{j = ℓ} for ℓ > 1. As such, the inequality in equation (D.2) holds with equality. Proposition 3 presumes that G is convex. Figure 2 plots G for twenty different Galton-Watson trees with offspring distribution p(0) = .1, p(1) = .1, p(2) = .3, p(3) = .5. This offspring distribution has expected value 2.2. The construction of each tree was stopped when it reached 5000 nodes; if it failed to reach 5000 nodes, then the process was started over. In these simulations and in others not shown, G is often convex. When it is not convex, the second derivative of G(z) is positive when z is away from −1. This simulation was selected because it shows that even when the trees are sampled from the same distribution, even when there is nothing strange about the offspring distribution (e.g. all moments are finite), and even when the tree is very large, some of the trees have a convex G and some do not. Similar results hold when the trees have 500 nodes; the only thing that changes is that the red regions extend slightly further away from −1.