Correcting for differential recruitment in respondent-driven Correcting for differential recruitment in respondent-driven sampling data using ego-network information sampling data using ego-network information

: Respondent-Driven sampling (RDS) is a sampling method de- vised to overcome challenges with sampling hard-to-reach human populations. The sampling starts with a limited number of individuals who are asked to recruit a small number of their contacts. Every surveyed individual is subsequently given the same opportunity to recruit additional members of the target population until a pre-established sample size is achieved. The recruitment process consequently implies that the survey respondents are responsible for deciding who enters the study. Most RDS prevalence estimators assume that participants select among their contacts completely at random. The main objective of this work is to correct the inference for departure from this assumption, such as systematic recruitment based on the characteristics of the individuals or based on the nature of relationships. To accomplish this, we introduce three forms of non-random recruitment, provide estimators for these recruitment behaviors and extend three estimators and their associated variance procedures. The proposed methodology is assessed through a simulation study capturing various sampling and network features. Finally, the proposed methods are applied to a public health setting.


Introduction
Respondent-driven sampling (RDS, Heckathorn (1997)) is a link-tracing network sampling method widely used to study hard-to-reach human populations connected by a network of social ties. Despite its wide adoption, statistical inference from RDS data continues to be subject to various sources of bias. Among them, is the bias induced by the individuals behaviors, such as participants preferentially recruiting subgroups of the population and/or individuals systematically electing not to participate in the study. Those behaviors are referred to as differential recruitment and they may be of particular interest due to the many documented instances in the context of RDS studies. The main contribution of this paper is to correct the inference for differential recruitment.
RDS draws its name from the recruiting method: previous respondents choose which of their network contacts will be recruited next by passing on uniquelyidentified coupons. Researchers limit the branching of the sample, and therefore reduce the number of highly-dependent pairs in the sample, by limiting the number of coupons (Gile and Handcock, 2010;Rohe, 2015). This has proven a highly effective sampling method in many populations, and is widely used, especially in settings at high risk for HIV Malekinejad et al., 2008;Montealegre et al., 2013) The primary object of inference from RDS data is population prevalence, possibly of a disease, a demographic characteristic, or a behavior. Such inference relies on many strong assumptions which are sometimes unrealistic in practice (Gile and Handcock, 2010). In particular, despite the growing empirical evidence (Frost et al., 2006;Iguchi et al., 2009;Liu et al., 2012;Mccreesh et al., 2012) that participants choose recruits in systematic ways, most inferential methods assume respondents make recruitments completely at random among their contacts in the target population. Sensitivity analysis performed with simulated and real data demonstrate that non-random recruitment potentially yields large biases in RDS estimators when the non-random recruitment patterns are associated with the object of inference (Frost et al., 2006;Gile and Handcock, 2010;Tomas and Gile, 2011;Lu et al., 2012;Verdery et al., 2015;Shi, Cameron and Heckathorn, 2019). There are also several available methods to diagnose non-random recruitment (Wejnert and Heckathorn, 2008;Liu et al., 2012;Yamanis et al., 2013;Gile, Johnston and Salganik, 2015;McLaughlin, 2016). The contribution of this work is to propose mathematical definitions for differential recruitment (DR), present a framework to estimate sampling probabilities in presence of DR and to correct prevalence estimators for the induced bias.
No estimator for RDS data dominates the others Tomas and Gile (2011); each has strengths and weaknesses. Therefore, rather than introducing a single proposed estimator, we propose extensions of three existing estimators: those introduced by Volz and Heckathorn (2008), Salganik and Heckathorn (2004), and Lu (2013). The estimator proposed by Lu (2013) is a modification of that of Salganik and Heckathorn (2004) which is partially robust to some forms of differential recruitment. Our work improves all three estimators in the case of three types of differential recruitment. Like Lu's estimator, our estimators require additional data beyond that typically collected in RDS: data on the composition of each respondent's local network contacts. It is this information that makes the recruitment biases identifiable in the context of social network homophily, or tendency for like to be connected to like (see Crawford et al. (2018) for a discussion of this non-identifiability). In addition, and unlike Lu's estimator, we consider differential recruitment based on both the outcome variable of interest, or based on another variable. This is especially important when the outcome of interest, such as HIV status, is not visible to a participant's contacts, making it impossible to collect local network composition from respondents (UNAIDS, 2014).
In the next Section, we introduce Respondent-Driven Sampling and the current estimators. The proposed prevalence estimators along with the parametrization of the three form of differential recruitment are discussed in Section 3. A comparison of their performance under various sampling conditions and network features is assessed in a simulation study presented in Section 4. Section 5 presents an application to a public health setting. Finally, we conclude with a discussion of the proposed methods in Section 6.

Respondent-driven sampling
An RDS sample begins with a set of seeds selected by the researchers, by some (strategic) convenience mechanism. Beginning with these seeds, each respondent is given a small number of uniquely-identified coupons to pass to contacts in the target population, making them eligible for enrollment. Respondents are compensated for their time completing the survey, and also given a small recruitment incentive for each successful recruitment. A key feature of data collection is that each respondent is asked to report her or his number of contacts in the target population. For the methods proposed in this paper, it is also necessary to collect data on the composition of the local contact network in terms of any variable which is expected to influence recruitment choices, as well as, for some estimators, on the outcome of interest.

Notation and sampling methodology
Consider a population of N individuals, also the nodes of the network, with labels 1, 2, ..., N . These are connected by social ties represented by a sociomatrix Y ∈ {0, 1} N ×N , such that y ij = 1 if i connects to j, and y ij = 0 otherwise. We assume an undirected network such that y ij = y ji ∀ i, j ∈ {1, 2, ..., N }.
We use the vector z ∈ {0, 1} N to represent the outcome of interest. We refer to z as "infection status," z i = 1 for infected persons in reference to the wide use of RDS in HIV applications, and for ease of discussion, although clearly it may represent any binary outcome of interest. For some of our proposed extensions, we also consider another binary nodal variable, denoted x ∈ {0, 1} N , a non-outcome variable which may influence sampling decisions. Each node also has a degree, or number of contacts in the target population. We denote this d i = N j=1 y ij , and assume it is accurately reported by respondents. Recall that our inferential goal is to estimate the prevalence of infection status. We denote the true value as μ = 1 N N i=1 z i . We estimate μ using a sample of size n. The vector S ∈ {0, 1} N denotes sampling such that: We sometimes refer to sets of nodes based on their variable values. We use calligraphic letters with superscripts to refer to sets of nodes with common values of a variable. For example S 1 = {i : S i = 1}. S 0 , X 1 , X 0 , Z 1 , and Z 0 are defined similarly. We also sometimes refer to sums. In particular, we sometimes sum degrees over categories of nodal covariates. These are denoted: We also refer to counts of sampled nodes in each category:

Prevalence estimators
In this Section, we briefly describe three of the prevalence estimators we later extend in Section 3: the estimators developed by Volz and Heckathorn (2008), by Salganik and Heckathorn (2004) and by Lu (2013). They are respectively denotedμ V H ,μ SH andμ Lu and referred to as the VH, the SH and the Lu estimators.
All three estimators are based on an adapted version of the design-based estimator introduced by Hansen and Hurwitz (1943), the HH estimator. The HH estimator depends on p i , that is, the probability that node i is selected at a given draw. The complexity of RDS and the lack of knowledge about the social network structure however prevent the exact determination of the sampling probabilities. For this reason, the described RDS estimators rely on a modified HH estimator, the HH-style estimator, where the sampling probabilities are instead estimated and denotedp i .
All RDS estimators discussed in this section assume that RDS may be well approximated by a discrete Markov chain (MC) on the state space of the network nodes to estimate the sampling probabilitiesp i . Conceptually, the transitions of the MC from one state (e.g. node i) to another (e.g. node j) represents the peer recruitment (e.g. j was recruited by i) as if nodes were constrained to recruit exactly one node by transition. Furthermore, even though individuals may only participate once in an RDS study, the MC representation allows sampling with replacement. The transitions of the MC are also assumed completely at random (Salganik and Heckathorn, 2004;Gile and Handcock, 2010). This is equivalent to presuming that participants recruit completely at random among all their contacts, or alters, in the target population. Finally, the recruitment process is assumed to occur on a single component network solely constituted of reciprocated ties. In summary, RDS is represented by a random walk (RW) on the nodes of a fully connected undirected network.
The described RW is mathematically fully specified by a transition probability matrix P , where p ij , is the probability that node j is selected conditional on the process' current state i. This probability is equal to p ij = y ij /d i ∀ i, j ∈ {1, 2, ..., N }, since node i is constrained to recruit completely at random among its alters. Panel 1b of Figure 1 illustrates a simple case of a transition matrix P for the undirected network shown in panel 1a.
Under the presumed network structure, the with replacement MC is irreducible and all states are positive recurrent. The combination of these properties results in the existence of a unique stationary distribution denoted π (Ross, 2014). As demonstrated by Salganik and Heckathorn (2004), under random recruitment, the unique stationary distribution of this RW on the network nodes is given by: This stationary distribution may be interpreted as the proportion of time the process visits each state in the long run. The RDS estimators developed under this framework assume the sampling starts at stationarity. If this holds, participants' estimated sampling probabilities are proportional to their degree The resulting estimated sampling probabilities are then used in the prevalence estimators proposed by Volz and Heckathorn (2008), Salganik and Heckathorn (2004) and Lu (2013). For instance, the VH estimator takes the form of a generalized HH-style estimator: (2.2) As for the SH and Lu estimators, they have been derived based on the idea that, for an undirected network, the number of cross ties from infected to uninfected individuals should be equal to number of cross ties from uninfected to infected individuals. This observation leads to the following relation: where C 10 (C 01 ) is the proportion of cross ties from infected (uninfected) to uninfected (infected) individuals and where D 1 (D 0 ) is the average degree of infected (uninfected) individuals. These quantities are not directly observed from the RDS sample. Therefore, the SH and Lu estimators replace these quantities with their estimated counterparts. The estimation of the average degree for each infection status for both the SH and Lu estimators is determined through a generalized HH-style estimator. Although SH and Lu estimators rely on the same average degree estimators, they differ in the estimation of the proportion of cross ties. The SH methodology estimates the proportion of cross ties based on the observed unweighted recruitment patterns, whereas Lu's estimators for C 10 and C 01 are based on a generalized HH-style estimator. The latter rely on the degrees partitioned with respect to the outcome variable (d z i0 and d z i0 ), which have not traditionally been collected in RDS surveys. However, incorporating this information has the potential of greatly reducing the DR bias (Lu, 2013;Verdery et al., 2015), provided this information is reported accurately. Expressions for the SH and Lu estimators may be found in equations (2.4) and (2.5). As pointed out by Beaudry, Gile and Mehta (2017), estimators of the SH form may be expressed as a function of the VH estimator: where r uv = i,j S ij 1 [zi=u,zj =v] , u, v ∈ {0, 1} and S ij = 1 if node i recruited node j and S ij = 0 otherwise. In short, r uv represents the observed number of recruitments from nodes with infection status "u" to nodes with status "v".

Variance estimators
In this section, we describe the bootstrap procedure introduced by Salganik (2006), referred to as the SH bootstrap, as well as the extension proposed by Lu (2013), to estimate the variance of RDS prevalence estimators. In both cases, the bootstrap procedure consists of three steps. First, resamples of size n are drawn from the observed RDS samples. This step is repeated a large number of times, denoted B. Second, an RDS prevalence estimate is calculated for each of the B replicates. Third, a confidence interval is calculated based on the B prevalence estimates.
One of the main differences between the two bootstrap procedures lies in their resampling step, which is designed to preserve the transition probabilities of the estimated matrix P partitioned on the outcome variable. The bootstrap procedures are derived to be consistent with the respective estimated C 01 and C 10 . In the case of the SH bootstrap, the probability of sampling node j given that the RW process is at node i is equal to: where k refers to the individual who recruited node j in the actual RDS survey. Lu (2013) proposed two bootstrap procedures. We consider the one which produces transition probabilities corresponding to the methodology employed inμ Lu . The bootstrap procedure sequentially samples nodes in a RW fashion. The RW probabilities of transitions depend on the nodes membership to the sets Z 1 and Z 0 . For instance, if the recruiting node is in Z 1 , then the probability of transitioning to a node in Z 0 (Z 1 ) is equal toĈ Lu 10 (1−Ĉ Lu 10 ). By construction, the bootstrap transition probabilities are consistent with the estimated proportion of ties between the infection groups in the network.

Differential recruitment
In this section, we extend the RDS estimators presented in Sections 2.2 and 2.3 to improve inference in the presence of differential recruitment (DR). We begin by operationalizing the concept of DR. Then, we specify new transition matrices reflecting three distinct forms of DR and derive the associated random walk stationary distributions. A maximum likelihood estimator is subsequently proposed to estimate the DR parameters required to estimate the sampling probabilities. Lastly, we present the extended versions ofμ V H ,μ SH andμ Lu , and corresponding measures of uncertainty.

Specification of differential recruitment
Under random recruitment, participants are assumed to recruit among their alters completely at random. Because recruitment is a social act, it is naive to assume this is always the case. Systematic violations of the random recruitment assumption are referred to as DR.
DR may arise in a variety of ways. Participants may favor the recruitment of individuals based on their characteristics (nodal attributes), or based on the nature of their relationship (tie attributes). The nodal characteristic inducing DR is represented by the indicator vector x ∈ {0, 1} N whereas the tie characteristic is represented by the symmetric indicator matrix W ∈ {0, 1} N ×N . Although we treat the binary case for the vector x and matrix W , these objects do not necessarily need to be binary. We provide some results for the general case where x is a categorical variable with G ∈ {2, 3, ...} categories in the appendix.
Consistent with Tomas and Gile (2011), DR on the nodal attributes may be classified into two categories: within group and between group DR. Within group DR occurs when participants preferentially select alters similar to themselves, such as contacts of the same ethnic group, whereas between group DR results from all classes of respondents preferentially recruiting their contacts with a given characteristic. Gile, Johnston and Salganik (2015) find, for example, that respondents in four studies of injecting drug users in the Dominican Republic seem to systematically recruit their employed contacts more often than their unemployed contacts, perhaps due to the recruiters elevated confidence that these more reliable contacts would follow through in participating in the study. DR on the tie attribute is the result of participants preferably selecting individuals on the basis of their relationship with them. For example, Wang et al. (2005) found that 78.9% of respondents in an MDMA users study reported being recruited by a friend as opposed to 14.9% by an acquaintance and 3.4% by a relative. Participants' actual tie composition differing from those proportions would be evidence of recruitment based on tie characteristic. In this section, we address these three forms of DR.
The magnitude of differential recruitment behavior is quantified by parameter φ. In each case, this parameter represents the ratio of the probability of selecting a member of the target population with the nodal or tie preferred attribute to the probability of recruiting a member without it. For example, survey participants systematically recruiting males with a probability twice as high as other genders is denoted φ = 2. Also, a recruitment regime completely at random implies that φ = 1.
Definitions for the three parameters are presented in Table 1. The subscripts b, w, t indicate the form of DR, that is, between groups, within groups, and on tie attribute, respectively.

Sampling probabilities
To estimate the sampling probabilities, we derive the stationary distributions of the random walks with DR. We define in this Section the transition matrices Table 1 Specification of three forms of DR under the RW scheme.
Rt indicates the state of the RW at step t.

DR Form Parametrization
Between groups characterizing the three Markov chains and prove the existence and uniqueness of their stationary distributions contingent on some network features. The transition matrices, P , specify the conditional probabilities of getting to any states given the previous state visited. For instance, p ij is the probability of next selecting node j given that node i is the recruiting node. Figure 2 shows simple examples of transition matrices for each of the three cases of DR of magnitude two (φ = 2). Let us suppose that the size of the nodes in figure 2a and 2b represents a nodal attribute inducing DR. For instance, let the large nodes indicate that the individual resides in neighborhood N 1 as opposed to living in neighborhood N 2 depicted by the smaller size nodes. Under the previously introduced notation, x = {0, 0, 1, 0, 1, 0} where the nodes are arranged in alphabetical order. Figure 2a illustrates the case of between group DR so that all classes of participants favor the recruitment of nodes in N 1 . This represents a hypothetical situation where every participant systematically favors the recruitment of their contacts living in the neighborhood where the study is conducted, for instance. As the left hand side of Figure 2a suggests, when the RW is in state B, the probability of selecting node C or E (p BC = p BE = 2/6) is twice as high as the probability of selecting node A or D (p BA = p BD = 1/6), that is, φ b = 2. The entire transition matrix P for this small network may be found in the right hand side of Figure 2a. More generally, P is given by: in the random walk (RW) with between group DR. For φ b = 1, that is, for a recruitment regime completely at random, di as expected. The transition probabilities for the general case where x has multiple categories is given in the appendix.
The derivation of within group DR transition probabilities is similar. However, instead of always favoring individuals residing in N 1 , participants recruit more heavily alters living in the same neighborhood as themselves. Node B in Figure 2b for instance recruits A or D with a probability twice as large as the probability of selecting node C or E (φ w = 2). Consequently, we obtain the following expression for the within group transition probability between node i and j: An illustration for transition probabilities for tie attribute DR is provided in Figure 2c. Thicker ties in the plot on the left panel signify that the relationship type induces DR. Participants may exhibit the tendency to recruit more frequently close friends than acquaintances for example. According to this figure, only six entries in the underlying matrix of tie attributes W are equal to one, w AD , w AE , w BD , and since W is assumed symmetric, the corresponding reciprocal relationships w DA , w EA and w DB . Under this RW, B is twice as likely to select D over the other incident nodes. The entire matrix P for this example is provided in the right panel of Figure 2c, but the expression for any entry p t ij is given by: We now consider the stationary distributions which are used as sampling weights in the extended versions ofμ V H ,μ SH andμ Lu . To ensure the Markov chains (MC) are irreducible, we strictly consider random walks on fully connected undirected networks where self ties are not permitted, a standard assumption for RDS, and assume φ > 0. Finally, we assume a finite network to ensure the MC is positive recurrent. If those conditions are met, then there exists a unique stationary distribution for each of those stochastic processes (Ross, 2014).
Result 3.1. Let R t denote the state at step t of a MC on the nodes of a fully connected undirected network without self ties. Assume that this MC has the following transition probabilities: Then the stationary distribution of this random walk is such that: Result 3.2. Let R t denote the state at step t of a MC on the nodes of a fully connected undirected network without self ties. Assume that this MC has the following transition probabilities: where φ w > 0. Then the stationary distribution of this random walk is: Result 3.3. Let R t denote the state at step t of a MC on the nodes of a fully connected undirected network without self ties. Assume that the MC has the following transition probabilities: where φ t > 0. Then the stationary distribution of this random walk is: The proofs for Results 3.1, 3.2 and 3.3 as well as the stationary distribution for the general case of between-group DR may be found in the appendix.
The resulting stationary distributions all involve the φ parameters which are generally unknown since the sampling is driven by the respondents. However, these parameters may be estimated by maximizing the following likelihood functions: where R is an n-dimensional random vector specifying the state of the RW, S 0 is the set of seeds in the RDS study, and p * ij is the DR transition probabilities from node i to node j where * ∈ {b, w, t} identifies the form of DR.
The resulting estimate for φ's may be replaced in the stationary distributions so that the estimated stationary distributions for node i ∈ {1, 2, ..., N } in equations (3.4), (3.5) and (3.6) respectively become proportional to: which are consistent estimators under the RW and network models assumed in this manuscript.

Extended prevalence estimators
In this section, we discuss the adjustments to the prevalence estimators to address DR. With the exception of theμ SH , the extension consists in replacing the random recruitment sampling probabilities with the appropriate DR ones.
The extended prevalence estimator based on the VH estimator is: ( 3.11) where * ∈ {b, w, t} identifies the form of DR. Correspondingly, every generalized HH-style estimator entering theμ Lu estimator, that is,Ĉ Lu 01 ,Ĉ Lu 10 ,D 0 andD 1 , is modified in the same fashion, thus leading to: ( 3.12) Under non-random recruitment, the observed recruitment patterns fail to provide an unbiased estimate of the socio-matrix cross ties. The systematic over recruitment of infected individuals for instance, translates into larger r 01 and r 11 than expected at random and therefore, compromises the estimation of the cross tie proportions. Consequently, in addition to replacing the selection probabilities with the DR stationary distribution, modifications to the estimatorŝ C SH 01 andĈ SH 10 need also to be considered to adapt the SH methodology for DR. We propose the following estimators: (3.14) where φ z * is an estimate of the relative probability of being recruited for infected individuals when compared to uninfected individuals and where: for k, l ∈ {0, 1} such that r w kl (rw kl ) represents the observed number of recruitments from nodes with infection status "k" to nodes with infection status "l" which are (not) preferentially recruited. The resulting extended SH estimator is:μ * (3.17) The estimators shown in equations (3.11), (3.12), and (3.17) are consistent estimators for μ under the sampling and network assumptions discussed in this manuscript. A detailed discussion on the asymptotic properties of the estimators is provided in the appendix.
We note that, through the sampling probabilities, all extended prevalence estimators require ego-network composition on the DR characteristic, that is, This information has not traditionally been collected in RDS surveys (Lu, 2013), however an increasing number of studies now include this information (Liu et al., 2009(Liu et al., , 2012Crawford et al., 2018). We also note thatμ b V H.dr is the only estimator not requiring the ego network composition on the outcome variable, that is, d z i1 and d z i0 , ∀ i ∈ S 1 . This is especially relevant in the context of RDS as important outcome variables may not be visible to network contacts. Not relying on such information to estimate the prevalence while alleviating the effect of DR therefore constitutes a substantial contribution to the current RDS prevalence estimation methodology.
For simplicity purposes, sections 3.2 and 3.3 presented results for the specific case of binary DR variables. Estimators for the general case of between group DR along with a simulation study are discussed in the appendix. This illustrates that our methodology may be adapted to treat various specifications of DR by simply carefully defining the recruitment process and deriving the associated stationary distribution.

Uncertainty of the estimators
So far, we have discussed methodology to estimate the prevalence of an outcome variable z with RDS data when participants preferentially recruit individuals based on their characteristics or relationships. In this section we develop methodology to assess the uncertainty of the proposed estimatorsμ * V H.dr , μ * SH.dr and ofμ * Lu.dr . We focus on the between group DR. Specifications for the other forms of DR are included in the appendix. Our proposed variance estimators extend the bootstrap procedure proposed by Lu (2013) summarized in Section 2.3.
Under the random recruitment assumption, uniform edge sampling assures the proportion of cross recruitment from one infection group to the other is an unbiased estimator for the proportion of cross ties in Y , the network adjacency matrix. However, in the presence of DR, the observed proportion may be severely biased. Since the objective of the bootstrap procedure is to reflect the recruitment process of RDS as closely as possible through the RW representation, we emphasize that the transition probabilities in the variance estimator are designed to model the recruitment process and not necessarily the proportion of cross ties. Therefore, one of the fundamental differences between our proposed methodology and the SH and Lu's procedures is that the transition probabilities are driven by the DR characteristic and its associated parameter φ rather than by the outcome variable. More specifically, we propose the following uncertainty estimation algorithm: 1.
Step 1 -Re-sampling: After selecting the first individual completely at random, the next individuals are selected sequentially with probability varying based on their membership in the sets X 1 or X 0 and also, based on the membership of the recruiting individual. Table 2 summarizes the transition probabilities for each of the four possible cases.

Table 2
Transition probabilities used in the bootstrap procedure under between group DR, where i is index of the recruiting node and j = i is the index of any available node for recruitment.
Step 2 -Estimation: Prevalence estimates are computed for all replicates. In the present case, the prevalence is determined based on the extended prevalence estimator for which the variability is estimated. For example, if the objective is to estimate the variability ofμ b V H.dr , then the prevalence estimates are calculated with equation (3.11). 3.
Step 3 -Confidence Interval: The first two steps are repeated a large number of times B. The standard deviation calculated from the obtained prevalence estimates is used to construct a studentized confidence interval.
We propose two variations of the bootstrap procedure. Both approaches begin by generating replicates (step 1) based on the transition probabilities calculated with the originalφ b , that is,φ b used in the DR prevalence estimate. For each of these replicates, a newφ b is calculated from the re-sampled data. The resulting set ofφ b 's is referred to as the bootstrappedφ b 's and are used to determined the prevalence estimates in step 2. The only difference between the two variations of the procedures is that under the second version, step 1 is repeated twice. Once with the transition probabilities determined based on the originalφ b and again, based on the bootstrappedφ b 's.
The resulting bootstrap procedures are intended to capture the uncertainty pertaining to the sampling process assuming a random walk approximation to RDS. Also, by recalculatingφ b for each replicate, we adjust for the variability of this parameter. However, neither variability due to a super-population model nor variability induced by other RDS-specific characteristics is reflected in these bootstrap estimators.
To estimate the variance ofμ * SH.dr , we have also extended the original SH bootstrap discussed in Section 2.3. New transition probabilities are derived based on equations (3.13), (3.14) and (3.15) to capture the various forms of DR.

Simulation study design
The complexity of the RDS sampling method prevents an analytical assessment of the performance of the proposed prevalence estimators. Therefore, we present a simulation study to compare their performance under a variety of sampling conditions and network features. In this section, we present the design and results from the simulation study. The simulation study was performed with the statistical software R (R Core Team, 2018) and the packages statnet  and RDS (Handcock, Fellows and Gile, 2015).

Network features
There is a vast body of literature studying the tendency of people to form ties with individuals with whom they share common attributes (Kandel, 1978;McPherson, Smith-Lovin and Cook, 2001;Currarini, Jackson and Pin, 2009). Therefore, it is critical for this simulation study to evaluate the sensitivity of the proposed methodology to this social behavior, known as homophily.
Exponential-family random graph models (ERGMs) (Frank and Strauss, 1986;Hunter, Goodreau and Handcock, 2008;Hunter and Handcock, 2006) provides the flexibility to incorporate this feature. This may be done by parametrizing the model so that the rate of ties among a certain group of individuals, η 11 , differs from the rate of ties among members belonging to different groups, η 10 . We use the following ERGM parametrization for undirected networks: where H T g(y, x) is equal to: and where where Y(x) is the space of all binary undirected networks of N nodes consistent with x. This allows to control the overall propensity of forming a tie as well as the density of ties among alike members and across groups. In summary, this model formulation allows for homophily, which we define as: The simulated networks were generated using τ = 1 (no homophily) and τ = 5 (elevated homophily) with respect to the DR variable x. The rate of ties was also chosen so to produce an average degree of ten. The vector x contained 35% of individuals with the DR characteristic, μ x = 0.35, and the outcome variable was positive 20% of the time, μ z = 0.20. Infected individuals were selected among the individuals with the DR characteristics. A total of one thousand networks of size N = 1000 nodes were generated, for each of the values for τ using the R packages statnet .

Sampling
The simulated RDS process in this study is intended to exhibit features approaching those of actual RDS studies. For instance, the nodes are sampled without replacement. Also, ten seeds initiate the sample instead of one as assumed by the estimators. Such seeds are selected according to the stationary distribution. Each sampled node subsequently recruits a maximum of two participants. A smaller number of recruits is allowed when there are less than two unsampled alters connected to the recruiting node. Nodes are presumed to recruit under one of the three recruitment regimes: • recruitment completely at random (φ = 1), • moderate differential recruitment (φ = 2), or • elevated differential recruitment (φ = 4) with respect to the DR variable x or to the tie attribute matrix W . Nodes receiving an invitation to participate into the survey are presumed to systematically accept the invitation. The sampling process stops when the target sample size of two hundred is attained. One RDS sample is drawn from each network. In summary, the six basic scenarios correspond to all possible permutations of the three levels of DR and the two levels of network homophily.

Results: Point estimates
Results from the simulation study for between group DR are presented in Figure 3 (and in Table 6 of the appendix). Figure 3 displays results for the six scenarios described in the previous section. The two levels of network homophily are shown on the horizontal panels, τ ∈ {1, 5} and the three levels of DR on x are shown on the vertical panels, φ ∈ {1, 2, 4}. Estimates for the prevalence of z obtained from the six estimators are summarized by box plots which appear in the following order within each scenario: Lu.dr . Estimators are grouped into three categoriesμ V H ,μ SH , andμ Lu on the x-axis of each scenario and the box plot color within each category indicates the specific version of the estimator: original estimators (purple), and the extended estimators for DR proposed in this paper (yellow). The true population parameter μ is represented by the horizontal blue line.

The blue horizontal line represents the true population prevalence for the variable z.
We first observe that all estimators have virtually no bias in the scenarios where no DR is simulated. In addition, in comparison withμ V H andμ SH , Lu and the extended DR estimators display a reduced and similar variability. The reduction in the uncertainty is partly attributable to the fact that although φ is approximately equal to one on average, it slightly varies from this value in any particular simulated sample. These small departures from recruitment completely at random are corrected for in the extended estimators and therefore, produce estimates with smaller errors. Forμ V H.dr , the variability introduced by the network homophily offsets this effect. For Lu and the extended SH and Lu estimators, the decrease in variability is also explained by the improved estimation of the cross-recruitments.
Our simulation also corroborates the findings discussed in various studies that DR induces strong biases inμ V H andμ SH (Frost et al., 2006;Gile and Handcock, 2010;Tomas and Gile, 2011;Lu et al., 2012;Verdery et al., 2015;Shi, Cameron and Heckathorn, 2019). This holds even for a moderate value for φ. A between group DR of magnitude two, for instance, yields a relative bias of approximately 34% in the original VH and SH estimators in scenarios without homophily and a relative bias of 61% with homophily. As pointed out by Lu (2013) and Verdery et al. (2015), the estimator proposed by Lu, which incorporates information about ego-network composition, is far more robust to DR than the original estimators under all assessed scenarios. Without network homophily, a relatively small bias remains when φ = 1. The remaining bias is explained by the fact that the estimator relies on sampling probabilities which do not account for DR. Conversely, our simulation study suggests that, when the outcome variable differs from the DR characteristic, a significant relative bias (33% for φ = 2) remains in presence of network homophily.
As observed in Figure 3 all discussed extended estimators reduce, in some cases substantially, the DR bias when compared to their original counterpart. To assess the performance of the extended estimators more formally, we tested how significantly different the mean square error of the estimators were using a pairwise Bonferroni procedure at a family-wise error rate of 5% for each of the six scenarios. Firstly, we assumes that only ego-network composition on the DR characteristic could reasonably be trusted (d x i0 and d x i1 ). Under this assumption, onlyμ V H ,μ b V H.dr andμ SH may be computed asμ b SH.drμ Lu andμ b Lu.dr also require ego network composition on the outcome variable (d z i0 and d z i1 ). The estimators included in the best set of estimators identified by this procedure are indicated by blue crosses in Figure 3. We observe that onlyμ b V H.dr systematically appears in the best set of estimators. Secondly, we presumed that participants may also accurately report on the ego-network composition with respect to the outcome variable. The estimators included in the best set of estimators are labelled with the red X's. Those results suggest that onlyμ b Lu.dr consistently appears in the best set of estimators.
These conclusions also hold when the outcome variable coincides with the DR characteristic (z = x), with the exception that Lu estimator is less sensitive to this class of DR. Results are presented in the appendix as well as results for the other two forms of DR.
The analysis presented above supposes that the outcome variable z is closely related to the DR characteristic. A simulation study has also been performed to assess the performance of the estimators when the variable inducing DR is unrelated to z. The results indicated that the DR bias is smaller in instances where the variable x is not closely related to the outcome variable z.
Additional scenarios under between group DR have also been simulated to heuristically evaluate the magnitude of seed selection bias, such as when seeds are selected strictly among infected individuals. We found that, except when there is no DR and the network is homophilous, the DR methodology still provides substantial bias reduction compared to the alternative estimators.
Finally, a simulation study where DR takes place on a variable with multiple groups is also presented in the appendix. We discuss results under scenarios under which the DR model is correctly specified as well as misspecified.

Results: Variance estimates
In this section, we assess the performance of the proposed bootstrap variance estimators described in Section 2.3 and Section 3.4 at various levels of between group DR and network homophily. We also evaluate the impact of DR on the overall inference by comparing coverage rates of the 95% confidence intervals for the traditional RDS estimators and their extended versions.
The performance of the uncertainty estimators is evaluated by comparing the estimated standard deviation (σ) to our best estimates of the true variability which consists of the standard deviation of the simulated prevalence estimates under each scenario (s's). Figure 4 presents the coverage rate of the 95% confidence intervals (upper plot) along with the difference between σ and s (lower plot). The results are shown for the nine bootstrap procedures under between group DR on x while making inference about the z. These figures are organized in the same way as Figure 3, that is, the two horizontal panels display the results for the two levels of network homophily (τ ∈ {1, 5}) and the vertical panels are divided according to the DR parameter φ ∈ {1, 2, 4}. The estimators are presented in the following order within each scenario: The first row of the upper plot shows that, without homophily, the 95% confidence intervals' coverage rates for the extended bootstrap procedures are either similar (no DR) or significantly better (DR > 1) than the original VH and SH or slightly better than Lu's variance estimators. This improvement is mostly attributable to the reduction in bias in the point estimation. However, in the case of the VH estimators, the higher coverage is also explained by the improved estimation of the variance as it may be seen in the bottom plot of the same figure.
The results displayed in the second row of the two plots were produced with homophilous networks. Although the coverage rates are significantly lower than for non homophilous networks, we may conclude that, in the presence DR, the proposed inferential procedures provide higher coverage rates than the presented alternative methods. However, the poor coverage of the confidence intervals produced withσ(μ SH.drv1 ), andσ(μ SH.drv2 ), is mostly due to the remaining bias in the point estimates.
The comparison of the coverage rates further demonstrates that when the ego network composition on the DR characteristic is available, then inference may be improved by using the extended VH methodology. In the presence of strong network homophily however, it is preferable to also incorporate ego network data on the outcome variable through the Lu extended methodology.
The results presented in Figure 4 only suggest a slight outperformance of the bootstrap procedures where the estimated φ are resampled (v2) over the versions without this resampling step (v1).
The analysis presented in this section corresponds to the case in which the DR is related to the variable x, but the inference is performed about the outcome variable z. The appendix includes results for the case where the outcome variable and the DR characteristic coincide. The conclusions are similar, with the exceptions that the extended SH and Lu's methodologies perform better than the cases presented here whereas the results for the extended VH procedures deteriorate. Overall though, the proposed methods still outperform the studied alternatives in presence of DR.

Application
The US Centers for Disease Control and Prevention (CDC) use RDS for behavioral surveillance of people who inject drugs (PWID) and high risk heterosexuals (HRH) in 25 US cities every 3 years (Gallagher et al., 2007). While the resulting data are highly sensitive human subjects data, and also do not include known true values to which to compare our results, we conduct our analyses on simulated datasets created to match the characteristics of fourteen CDC studies of PWID . These simulated datasets are data-matched on: • Estimated prevalence of and mean degree by one high-homophily binary variable denoted z • Estimated prevalence of and mean degree by one moderate-homophily binary variable denoted x • The joint prevalence (therefore relationship between) z and x.
Simulated networks of size N = 10, 000 nodes were generated as simulations from exponential random graph models (ERGMs) using the statnet R package R Core Team, 2018). These networks are ideal for applying the proposed methodology because they reflect realistic characteristics of two related variables which may influence sampling, in a setting in which the true prevalence of the outcome variable z is known. One hundred network replicates for each of the fourteen realistic networks were simulated. In addtion, RDS samples were drawn for each of the simulated networks starting with ten seeds. Everyone recruited up to 2 individuals, without replacement, until a sample size of 500 was attained. Recruitment was assumed to follow a between group differential recruitment scheme, where φ = 2. The simulated distributions of the prevalence estimates for this application are shown in Figure 5. Every rectangle represents one of the fourteen simulated population conditions. The populations are ordered based on the prevalence of the outcome variable, which are represented by the blue horizontal lines. Within each area, the distribution of the six prevalence estimators discussed in this manuscript are displayed, that is, the three original estimators in purple paired with their respective extended version in yellow. For the purpose of the illustration, z is assumed to be equal to x, that is, the DR occurs on the outcome variable. As it may be observed from Figure 5, whileμ Lu is considerably more robust to DR than the VH and SH, all three versions of the DR estimators reduce the DR bias. The root mean square error (RMSE) is smaller for the extended estimators and is minimized with theμ b Lu.dr under every scenario. Figure 6 shows the coverage rates of the 95% confidence intervals constructed based on the original methodology and the two extensions of the bootstrap variance estimators. The results are consistent with those obtained in the simulation study. The highest coverage rates are produced from the Lu extended procedures. Also, similar to previous results, the original VH and SH procedures produce very low coverage rates, sometimes none of the intervals contains the true value even for a relatively small value of DR (φ = 2). Although not presented in this section, inference on an outcome variable z different than x has also been performed. The induced DR on z resulting from DR on x was however minimal thus suggesting a weak relation between the two variables. In most populations, the RMSE was also minimized byμ b Lu.dr . However, in some cases,μ Lu produced a similar but lower RMSE thanμ b Lu.dr . As for the coverage rates of the 95% confidence intervals, Lu DR procedures appeared to outperform or perform similarly to other methods in the majority of the populations thus highlighting the advantage of utilizing both ego network information when available.

Discussion
Sampling hard-to-reach populations is a challenging problem. RDS has provided ways to circumvent some of the issues specific to those populations that make the use of traditional sampling methods impractical. However, the sampling process under RDS is out of the control of the researchers conducting the studies and therefore, this sampling method is highly susceptible to biases induced by participants' behaviors. The main contribution of this work is to introduce inferential methodologies correcting existing RDS prevalence estimators and their uncertainty estimators for biases induced by various forms of differential recruitment (DR), where DR is understood as a systematic departure from recruitment completely at random. The methodology addresses biases induced by DR arising from participants preferentially recruiting a sub-group of their contacts or from a systematic nonresponse of a sub-group of their contacts. However, it does not explicitly distinguish the source of the DR bias.
The estimators presented in this work suppose participants' sampling probabilities may be estimated from the stationary distribution of a random walk (RW) on the state space of the network nodes. The derivation of the stationary distribution for the original estimators assumes that participants recruit completely at random among their contacts in the target population. Our approach modifies this assumption and instead proposes three sampling schemes under which participants systematically recruit individuals based on one of their nodal characteristics or based on their type of relationship with them. By explicitly defining those sampling schemes we were able to derive the RW characterizing those behaviors and their associated stationary distributions. The revised estimators rely on the stationary distributions of the modified RW. Results from the simulation study show that this methodology greatly reduces biases induced by the various forms of DR. However, these methods require additional data about participants' ego-network compositions.
The comparison of the estimators' bias in our simulation study suggests that μ b Lu.dr generally outperforms the discussed alternative estimators. However, this estimator requires participants to correctly report the ego-network composition on both the outcome variable and on the characteristic driving DR. In the context of RDS, it might be unrealistic to assume participants will be able to provide the outcome variable ego-network composition, as in the case of disease, this information is generally not visible to other members of the target population. Under such circumstances, it is recommendable to instead useμ b V H.dr to estimate the prevalence, as this estimator does not require this information and produces smaller biases thanμ V H andμ SH in presence of DR.
We have also proposed uncertainty estimators. The uncertainty is estimated through a bootstrap procedure capturing the variability associated with the RW sampling as well as with the estimation of the magnitude of the DR parameter. Results from the simulation study show that the variance estimators perform relatively well with non homophilous networks. Combined with the lower bias of the point estimate, this significantly improves the coverage rates of the 95% confidence intervals from the original version of the VH and SH estimators in presence of DR. The coverage rates under those assumptions are also greater than under Lu's methodology, but to a lesser extent. With homophilous networks however, the proposed bootstrap procedures tend to underestimate variability. This may be explained by the fact that those procedures do not reflect some of the RDS specific features. Although the underestimation of the variance affects the width of the 95% confidence intervals, the coverage rates forμ b V H.dr and μ b Lu.dr are significantly better than those produced by the conventional estimators when with φ = 2 or 4. We conclude that the proposed extended methods improve the inference in the presence of DR despite the underestimation of the variance.
We have applied the proposed methods to data simulated to match some features of data collected by the US Centers for Disease Control and Prevention (CDC) for behavioral surveillance among people who inject drugs (PWID). The findings from this analysis are similar to the conclusions obtained from the simulation study. In presence of DR, there is a significant reduction in the bias for the extended versions of the VH and SH estimators and a moderate reduction in the case of the extended version of the Lu estimator. The example also highlights that in cases where the outcome variable differs from the DR characteristic, the improvement is not as significant when the DR is weakly related to the outcome variable.
Although the methodology presented in this manuscript addresses specific forms of DR, we believe it provides a general framework where one could assess different forms of DR by deriving new stationary distributions for alternative random walks. For instance, one might consider treating nonuniform DR. Consequently, we believe that the proposed methodologies are promising and could significantly improve traditional estimators when participants do not recruit at random.

A.1. Stationary distributions with DR
In this section, we prove Results 3.1-3.3, which correspond to the stationary distributions of the random walk with three forms of differential recruitment.
Therefore, we have that: where K is a normalizing constant such that N i=1 π b i = 1. Therefore, π b i satisfies the global balance equations for all i ∈ {1, 2, ..., N } and π b = {π b 1 , π b 2 , ..., π b N } is the stationary distribution for this RW.

Proof Result 3.2.
By assumption, Therefore, we have that: where K is a normalizing constant such that N i=1 π w i = 1. Therefore, π w i satisfies the global balance equations for all i ∈ {1, 2, ..., N } and π w = {π w 1 , π w 2 , ..., π w N } is the stationary distribution for this RW.

Proof Result 3.3.
By assumption, Therefore, we have that: where K is a normalizing constant such that N i=1 π t i = 1. Therefore, π t i satisfies the global balance equations for all i ∈ {1, 2, ..., N } and π t = {π t 1 , π t 2 , ..., π t N } is the stationary distribution for this RW.

A.2. Consistency of the prevalence estimators
This section discusses the consistency of the estimatorsμ * V H.dr ,μ * Lu.dr and μ * SH.dr and the necessary conditions for them to be consistent. All derivations are based on the assumptions stated in the manuscript, that is, that the sampling is performed through a random walk over a fully connected undirected network. This asymptotic framework is consistent with Volz and Heckathorn (2008) and Goel and Salganik (2009), who assume a RW on the state space of the nodes of a fully connected network, but contrasts with the framework used by Li et al. (2017), who consider asymptotic unbiasedness based on RDS being represented as a tree indexed Markov process.
These proofs addresses consistency as the length of a Markov chain (MC) sample drawn from a fixed network increases. We therefore condition on the fixed network structure, and these proofs are agnostic to the network distribution from which the network was drawn. Finally, since we let the MC sample size go to infinity, the estimatorsμ * V H.dr ,μ * Lu.dr andμ * SH.dr are expressed in a slightly different form than shown in the manuscript. In particular, we now assume that i represents the indexing of the MC sample order instead of the indexing of the nodes in the network. Proof. By the properties of the maximum likelihood estimators,φ * is a consistent estimator for φ * . As such, equations (3.8)-(3.10) are consistent estimators for equations (3.4)-(3.6), respectively. Therefore, the estimators for the sampling probabilities are consistent for the true sampling probabilities. By Hastings (1970) is a consistent estimator for 1. This is due to the fact that, as per equation (3.12) when c → 1, thenμ * Lu.dr →μ * V H.dr and as such, using Result A.1,μ * Lu.dr is consistent for μ. Since the estimators for the sampling probabilities are consistent for the true sampling probabilities, then: is a consistent estimator for T 01 , where, T 10 is the number of ties from nodes in Z 1 to nodes in Z 0 and T 01 is the number of ties from nodes in Z 0 to nodes in Z 1 . Since the network is assumed undirected, we have that T 10 = T 01 . We therefore conclude that c is a consistent estimator for 1. Consequently,μ * Lu.dr is a consistent estimator for μ provided SH.dr is a consistent estimator for μ, it suffices to show that is a consistent estimator for 1. This is due to the fact that, as per equation (3.17) when c → 1, thenμ b SH.dr →μ b V H.dr and as such, using Result A.1,μ b SH.dr is consistent for μ.
Using similar arguments as in the ones used in the proof of Results A.1 and A.2, we note that does not converge to 0 for large n, where T 1· and T 0· are the total number of ties for nodes in Z 1 and Z 0 , respectively. Therefore, for c to be consistent for 1, we need to show thatĈ is a consistent estimator for T0· T1· . We first observe thatĈ SH.b , where s kl = r kl /(n − 1) for k, l ∈ {0, 1} is the proportion of recruitments from nodes in Z k to nodes in Z l in the MC of n steps. Secondly, let us define the corresponding random variable S kl,n = 1 n−1 n−1 g=1 1 [zg =k∩zg+1=l] . Then, E[S kl,n ] = i∈Z k j∈Z l p b ij π b i , which, for fixed φ b , leads to: where K is the normalizing constant of the stationary distribution and where T kl is the number of ties from nodes in Z k to nodes in Z l for k, l ∈ {0, 1}. Also, the variance for these random variables is: where q = {q 1 , ..., q n } is a possible sequence of n states of the MC, Ω is the sample space of q, and where π q1 n−1 t=1 p b qtqt+1 is the probability of observing the sequence of states q. Consequently, the variance tends to zero as the sample size increases since Chebychev's inequality and the asymptotic properties of the MLE, we conclude that s kl is a consistent estimator for E [S kl,n ], and therefore,Ĉ SH.b 10 is a consistent estimator for Similarly,Ĉ SH.b 01 is a consistent estimator for T01 T0· . We recall that under the assumption that the network is undirected, T 01 = T 10 and therefore, this shows that is a consistent estimator for T0· T1· , which also proves that c is consistent for 1.

A.3. Multilevel between group DR
Suppose that the between group DR variable x is such that x ∈ {1, ..., G} N . Therefore, there are G possible values for x i , for i ∈ {1, ..., N }. Without loss of generality, also suppose that the G-th group is the reference group for the DR, which we define as the group serving as the comparison to measure the magnitude of the DR: Then, assuming a RW on the nodes of a single component undirected network with transition probabilities: it is possible to demonstrate that the stationary distribution of this RW is such that Since the φ xi 's are unknown, they are estimated by maximizing the following likelihood function: where p ri−1ri are the transition probabilities according to equation A.1 for each of the observed recruitment. Finally, the extended VH and Lu prevalence estimators remain as shown in equations 3.11 and 3.12, withd * i now proportional toφ xi G g=1 d g iφ g to take into account the multilevel between group DR. We have performed an additional simulation study to evaluate the estimators under between group DR with multiple groups. In this simulation, we created a DR variable x with 3 groups such that: φ 1 = 4, φ 2 = 2, and φ 3 = 1. We assumed that 35% of the population belonged to the first group, 35% to the second group, and 30% to the last group. Also, we assumed the prevalence of the outcome of interest to be 20%, equally split between the first and second group. We applied the multilevel methodology for 3-group between group DR, first assuming that the model specification was correct (with g1) and then assuming that the researcher was unaware of the first group (without g1). Therefore, the model misspecification scenario ignores the groups with the highest level of DR. We calculatedμ V H ,μ V H.dr ,μ Lu andμ Lu.dr . The results are shown in Table 3. We note the good performance ofμ V H.dr andμ Lu.dr under the scenario "with g1" when making inference for either the prevalence of g2 (35%) or the outcome variable (20%). Model misspecification does not seem to alter significantly the performance ofμ V H.dr norμ Lu.dr under the simulated scenario when estimating the prevalence of g2. However, we note thatμ V H.dr is biased when estimating the prevalence of the outcome variable. Therefore, we would recommend usinĝ μ Lu.dr to minimize the impact of model misspecification.

A.4. Transition probabilities in uncertainty estimators
In this section, we present the transition probabilities used in our extension of Lu's bootstrap procedure for the cases of within group and tie differential recruitment. In both Table 4 and Table 5, i represents the index of the recruiting node and j = i the index of any available node for recruitment.

A.5. Simulation study results
In this Section, we present results from the simulation study, which correspond to the results presented in Figure 3. Also, we present additional results. Figure 7 and 8 correspond to the point estimate and variance estimation simulation study results when the object of inference is the prevalence of the DR characteristic x. This variable is also assumed to be the one inducing DR. Figure 9 compares the point estimate results for the three forms of differential recruitment for a moderate level of DR (φ = 2) and for two levels of network homophily.