Sampling and Estimation in Hidden Population Using Social Network

Characteristics of hidden populations (e.g. the population of injection drug users) cannot be studied using standard sampling and estimation procedures. This article considers methods for estimating the population proportion of a hidden population using its social network. We compare the sampling and estimation technique of respondent-driven sampling with the simplified sampling procedure based on a Markov-chain model, and discuss the equivalence of these procedures. Neither procedure provides formulae for estimating the variances of its estimators, owing to the complexity of the methods. We describe a simplified sampling procedure for collecting data on both the population and its social network, and provide a simple formula to estimate the population proportion efficiently. We further derive a formula for an estimate of the variance of the proposed estimator using the delta method. A simulation study is provided to illustrate the new sampling and estimation method.
Citation: Zhao Y (2017) Sampling and Estimation in Hidden Population Using Social Network. J AIDS Clin Res 8: 667. doi: 10.4172/2155-6113.1000667


Introduction
Special populations that cannot be studied using standard sampling and estimation procedures are called hidden populations. Examples include injection drug users, men who have sex with men, illegal immigrants, and the homeless. Consistent estimation of the size of these populations is crucial for researchers and policy makers.
Salganik and Heckathorn [1] provide a comprehensive review of sampling and estimation methods for studying hidden populations, including targeted sampling (Watters and Biernacki [2]) and time-space sampling (Muhir et al. [3]). They mention that these methods often fail to provide accurate estimates of the true values. They further point out that a common drawback of most methods is that they fail to use the social network relationships present in many hidden populations, that is, the network of relationships among the real people in the population, for example the network of friendships. They propose a sampling and estimation method based on a snowball-type sampling design (Coleman [4]), called respondent-driven sampling, which uses the social network relationships in a hidden population to collect information from the population of interest such that unbiased estimates of the population characteristics are possible. However, the consistency of their estimator depends on the assumption that individuals are randomly recruited into the study. Another problem with multi-wave or snowball-type sampling designs is their cost in both time and money.
To avoid the multi-wave sampling procedure, Zhao [5] introduces a Markov-chain model for estimating the social network relationships. The model computes the long-run transition probabilities based on Markov theory, and then estimates the population proportion using the result of Salganik and Heckathorn [1]. However, neither Salganik and Heckathorn [1] nor Zhao [5] provides formulae for estimating the variances of their estimators, owing to the complexity of their sampling and estimation procedures.
In this article we describe a simplified sampling procedure that collects information on both the population and its social network relationships simultaneously. We derive a consistent estimator of the population proportion of a hidden population based on the simplified sampling design. The rest of the article is organized as follows. In Section 2 we briefly review the respondent-driven sampling method of Salganik and Heckathorn [1] and the Markov-chain model of Zhao [5], and discuss the equivalence of these two approaches. A simplified sampling and estimation procedure is described in Section 3, where a formula for estimating the variance of the proposed estimator is also derived. Section 4 provides a simulation study to examine the small-sample performance of the proposed method. Section 5 concludes with a brief discussion.

Respondent-driven sampling
Assume that a population is divided into two groups, A and B, whose members are connected by social network relationships, say friendships. Let $N_A$ and $N_B$ be the total numbers of people in groups A and B respectively, and let $P_A = N_A/(N_A + N_B)$ and $P_B = 1 - P_A$ be the finite population proportions of groups A and B. The objective is to estimate $P_A$ and $P_B$. Let $D_{Ai}$ be the number of friendships of the $i$th individual in group A, also called the degree of the $i$th individual. The total number of friendships radiating from individuals in group A is

$$R_A = \sum_{i=1}^{N_A} D_{Ai}, \tag{1}$$

with $R_B$ defined analogously for group B. Let $T_{AB} = T_{BA}$ denote the total number of friendships radiating from members of group A to group B, or vice versa.

Let $D_A = R_A/N_A$ and $D_B = R_B/N_B$ denote the average degrees in the two groups, where $R_A$ and $R_B$ are the total numbers of friendships radiating from groups A and B, and let $C_{AB} = T_{AB}/R_A$ and $C_{BA} = T_{BA}/R_B$ denote the proportions of those friendships that lead to the other group. Since $T_{AB} = T_{BA}$, we have $N_A D_A C_{AB} = N_B D_B C_{BA}$, and Salganik and Heckathorn [1] show that

$$P_A = \frac{D_B C_{BA}}{D_A C_{AB} + D_B C_{BA}}, \qquad P_B = \frac{D_A C_{AB}}{D_A C_{AB} + D_B C_{BA}}. \tag{2}$$

In practice, respondents are selected from the social network using the respondent-driven sampling design of Heckathorn [6]: a small number of initial seeds is selected first, the current seeds then randomly recruit other friends into the sample, and the recruiting process continues until the required sample size is reached. Let $r_{AB}$ be the total number of recruitments from individuals in group A to individuals in group B, let $r_{AA}$ be the total number of recruitments from individuals in group A to other individuals in the same group, and define $r_{BA}$ and $r_{BB}$ similarly. Under the random recruitment assumption, $C_{AB}$, $C_{BA}$, $D_A$ and $D_B$ can be consistently estimated by

$$\hat C_{AB} = \frac{r_{AB}}{r_{AA} + r_{AB}}, \qquad \hat C_{BA} = \frac{r_{BA}}{r_{BA} + r_{BB}}, \qquad \hat D_A = \frac{n_A}{\sum_{i=1}^{n_A} 1/D_{Ai}}, \qquad \hat D_B = \frac{n_B}{\sum_{i=1}^{n_B} 1/D_{Bi}}, \tag{3}$$

respectively. Here $n_A$ and $n_B$ are the total numbers of individuals selected from groups A and B respectively. The population proportions can then be estimated by substituting (3) into (2).
Salganik and Heckathorn [1] show that these estimators are asymptotically unbiased regardless of how the initial seeds are selected.
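The plug-in step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the estimator has the form $P_A = D_B C_{BA} / (D_A C_{AB} + D_B C_{BA})$, that the cross-group proportions are estimated by recruitment shares, and that the average degrees are estimated by harmonic means of the reported degrees. The function name `rds_proportions` and its arguments are hypothetical.

```python
def rds_proportions(r_AA, r_AB, r_BA, r_BB, degrees_A, degrees_B):
    """Plug-in estimate of (P_A, P_B) from recruitment counts and reported degrees.

    r_XY       : number of recruitments from group X to group Y
    degrees_X  : reported degrees of the sampled members of group X (all > 0)
    """
    # Estimated share of recruitments that cross to the other group.
    c_AB = r_AB / (r_AA + r_AB)
    c_BA = r_BA / (r_BA + r_BB)
    # Assumed degree estimator: harmonic mean, n / sum(1 / D_i).
    d_A = len(degrees_A) / sum(1.0 / d for d in degrees_A)
    d_B = len(degrees_B) / sum(1.0 / d for d in degrees_B)
    # Substitute the four estimates into the proportion formula.
    p_A = d_B * c_BA / (d_A * c_AB + d_B * c_BA)
    return p_A, 1.0 - p_A


# Toy input: every sampled A member reports degree 10, every B member degree 5.
p_A, p_B = rds_proportions(6, 4, 2, 8, [10, 10, 10], [5, 5, 5, 5])
print(round(p_A, 4), round(p_B, 4))
```

In the toy input the estimated cross-group shares are 0.4 and 0.2, so the estimate of $P_A$ is $5 \times 0.2 / (10 \times 0.4 + 5 \times 0.2) = 0.2$ up to floating-point error.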

The Markov-chain model
An important contribution of Zhao [5] is a one-wave sampling design for collecting information about the population and its social network relationships. In this design, selected individuals are required to recruit all their friends into the study, the numbers of friendships they have in group A and in group B are recorded separately, and the random recruitment assumption is not required. Zhao [5] further describes a Markov-chain model for the social network relationships. Instead of groups A and B, the model defines two states, A and B, and assumes that each individual is in exactly one of them. Suppose individuals are selected using the respondent-driven sampling design. Let $P_{AB}$ be the probability that a randomly selected individual in state A recruits an individual in state B, and let $P_{AA} = 1 - P_{AB}$ be the probability that a randomly selected individual in state A recruits an individual in state A. $P_{BA}$ and $P_{BB}$ are defined similarly. The transition probability matrix of the first-order Markov chain is then

$$P = \begin{pmatrix} P_{AA} & P_{AB} \\ P_{BA} & P_{BB} \end{pmatrix}.$$

Under the condition that $P$ is an ergodic, irreducible transition matrix, the long-run probability that an individual in state A is selected is

$$\pi_A = \frac{P_{BA}}{P_{AB} + P_{BA}}, \tag{4}$$

and $\pi_B = 1 - \pi_A$. To estimate the population proportions, Zhao [5] recommends using the results of Salganik and Heckathorn [1] as

$$P_A = \frac{\pi_A D_B}{\pi_A D_B + \pi_B D_A}, \qquad P_B = 1 - P_A. \tag{5}$$

In practice, computing estimates of the population proportions $P_A$ and $P_B$ from these formulae requires estimates of $D_A$, $D_B$, $P_{AB}$ and $P_{BA}$. However, if we substitute (4) into (5) directly, we get

$$P_A = \frac{D_B P_{BA}}{D_A P_{AB} + D_B P_{BA}}, \qquad P_B = \frac{D_A P_{AB}}{D_A P_{AB} + D_B P_{BA}}. \tag{6}$$

Comparing the estimators in (2) and (6), it is easy to see that $P_{AB}$, $P_{BA}$ and $C_{AB}$, $C_{BA}$ measure essentially the same quantities in the two models, and the two methods are therefore equivalent.
Neither Salganik and Heckathorn [1] nor Zhao [5] provides variance estimators, because of the complexity of their sampling and estimation techniques. Next we describe a simplified sampling and estimation procedure for estimating $P_A$ and $P_B$, together with the corresponding variances of the estimators.
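The equivalence argument can be checked numerically. The sketch below assumes the two-state chain has long-run probability $\pi_A = P_{BA}/(P_{AB} + P_{BA})$ and that the proportion is obtained by weighting each long-run probability by the inverse of the group's average degree; both forms are assumptions made for this illustration, and the function names are hypothetical. It iterates the chain to its limit rather than using the closed form, then compares the result with the direct formula $D_B P_{BA} / (D_A P_{AB} + D_B P_{BA})$.

```python
def stationary_A(p_AB, p_BA, steps=200):
    """Long-run probability of state A, found by iterating pi <- pi P."""
    pi_A = 1.0  # start in state A; for an ergodic chain the limit does not depend on this
    for _ in range(steps):
        # P(next = A) = P(now = A) * P_AA + P(now = B) * P_BA
        pi_A = pi_A * (1.0 - p_AB) + (1.0 - pi_A) * p_BA
    return pi_A


def proportion_from_chain(p_AB, p_BA, d_A, d_B):
    """P_A obtained by degree-weighting the long-run state probabilities."""
    pi_A = stationary_A(p_AB, p_BA)
    pi_B = 1.0 - pi_A
    return (pi_A / d_A) / (pi_A / d_A + pi_B / d_B)


def proportion_direct(p_AB, p_BA, d_A, d_B):
    """P_A from the closed form in terms of transition probabilities."""
    return d_B * p_BA / (d_A * p_AB + d_B * p_BA)


via_chain = proportion_from_chain(0.4, 0.2, 10.0, 5.0)
direct = proportion_direct(0.4, 0.2, 10.0, 5.0)
print(abs(via_chain - direct) < 1e-9)
```

Replacing `p_AB` and `p_BA` by the cross-group friendship proportions gives the same number as the respondent-driven sampling formula, which is the sense in which the two approaches coincide.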

A Simplified Sampling and Estimating Method
We consider the one-wave sampling design of Zhao [5]. Let A and B denote the two groups, in the same setting as Salganik and Heckathorn [1]. We define new random variables $Z_{Ai}$ and $Z_{Bi}$, which represent the total number of friendships radiating from the $i$th individual in group A to individuals in group B, and from the $i$th individual in group B to individuals in group A, respectively. Within-group friendships are ignored. We define

$$Z_A = \frac{1}{N_A}\sum_{i=1}^{N_A} Z_{Ai} = \frac{T_{AB}}{N_A}, \qquad Z_B = \frac{1}{N_B}\sum_{i=1}^{N_B} Z_{Bi} = \frac{T_{BA}}{N_B}, \tag{7}$$

which represent the average degrees of association from group A to group B and from group B to group A respectively. If we treat $\{Z_{Ai} : i = 1, \dots, N_A\}$ and $\{Z_{Bi} : i = 1, \dots, N_B\}$ as two sub-populations, then $Z_A$ and $Z_B$ are the corresponding sub-population means.
Since $T_{AB} = T_{BA}$, from (7) we can derive that

$$N_A Z_A = N_B Z_B. \tag{8}$$

Substituting (8) into

$$P_A = \frac{N_A}{N_A + N_B}, \qquad P_B = \frac{N_B}{N_A + N_B}, \tag{9}$$

we obtain

$$P_A = \frac{Z_B}{Z_A + Z_B}, \qquad P_B = \frac{Z_A}{Z_A + Z_B}. \tag{10}$$

Therefore consistent estimates of $P_A$ and $P_B$ can be obtained if both $Z_A$ and $Z_B$ can be estimated consistently. Note that $Z_A$ and $Z_B$ contain only the between-group friendships; the within-group friendships are completely ignored. The above result indicates that (i) consistent estimation of the proportions $P_A$ and $P_B$ can be achieved using only the information on between-group friendships; and (ii) the one-wave (or two-wave) sampling design of Zhao [5] can be further simplified, since for the individuals selected in the sample we only need to record how many friendships they have in the other group.
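A small worked example may make the identity concrete. The toy network below is invented for illustration: it lists only between-group friendships, and the group sizes are chosen so that the true proportion of group A is 0.3. Because every cross-group friendship is counted once from each side, $N_A Z_A = N_B Z_B$, and the proportion can be recovered as $Z_B / (Z_A + Z_B)$ without knowing any within-group ties.

```python
# Each pair (a, b) is one friendship between member a of group A and member b of group B.
cross_ties = [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3), (2, 4)]
N_A, N_B = 3, 7  # hypothetical group sizes; B members 5 and 6 have no friends in A


def mean_cross_degree(ties, side, group_size):
    """Average number of cross-group friendships per member of one group."""
    counts = [0] * group_size
    for tie in ties:
        counts[tie[side]] += 1  # side 0 counts per A member, side 1 per B member
    return sum(counts) / group_size


Z_A = mean_cross_degree(cross_ties, 0, N_A)  # 6 ties over 3 members
Z_B = mean_cross_degree(cross_ties, 1, N_B)  # 6 ties over 7 members
P_A = Z_B / (Z_A + Z_B)

print(P_A, N_A / (N_A + N_B))
```

Both printed values agree up to floating-point error, even though two members of group B contribute zero cross-group friendships.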
In practice, assume that a sample is drawn from a target population with two groups, A and B. For each individual selected from group A or B, we record the total number of friendships radiating to the other group, $Z_{Ai}$ or $Z_{Bi}$. Let $z_A$ and $z_B$ be the corresponding estimators of the sub-population means $Z_A$ and $Z_B$. The proportions $P_A$ and $P_B$ can then be estimated as

$$\hat P_A = \frac{z_B}{z_A + z_B}, \qquad \hat P_B = \frac{z_A}{z_A + z_B}, \tag{11}$$

and, when $z_A$ and $z_B$ are computed from independent samples, the variances can be estimated using the delta method as

$$\widehat{\mathrm{Var}}(\hat P_A) = \widehat{\mathrm{Var}}(\hat P_B) = \frac{z_B^2\,\widehat{\mathrm{Var}}(z_A) + z_A^2\,\widehat{\mathrm{Var}}(z_B)}{(z_A + z_B)^4}. \tag{13}$$

In the appendix (Appendix 1) we show that our proposed estimators for $P_A$ and $P_B$ are equivalent to the estimators of Salganik and Heckathorn [1]. They are, however, much simpler, which allows us to construct a formula to estimate their variances analytically.
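As a sketch of how the estimator and its delta-method standard error could be computed from data, the function below takes the recorded cross-group friendship counts for the sampled members of each group. It rests on assumptions not stated in the text: the two samples are independent, the sub-population means are estimated by sample means, and the variance of each sample mean is estimated by $s^2/n$ with no finite population correction. The name `estimate_proportion` is hypothetical. The variance expression comes from the gradient of $g(a, b) = b/(a+b)$, namely $(-b, a)/(a+b)^2$.

```python
import math
from statistics import mean, variance


def estimate_proportion(z_A, z_B):
    """Return (estimate of P_A, delta-method standard error).

    z_A : cross-group friendship counts of sampled group-A members
    z_B : cross-group friendship counts of sampled group-B members
    """
    zbar_A, zbar_B = mean(z_A), mean(z_B)
    p_A = zbar_B / (zbar_A + zbar_B)
    # Assumed variance of each sample mean: s^2 / n (no finite population correction).
    var_A = variance(z_A) / len(z_A)
    var_B = variance(z_B) / len(z_B)
    # Delta method: squared partial derivatives of b / (a + b) times the variances.
    var_p = (zbar_B ** 2 * var_A + zbar_A ** 2 * var_B) / (zbar_A + zbar_B) ** 4
    return p_A, math.sqrt(var_p)


p_A, se = estimate_proportion([10, 14, 12, 12], [4, 6, 5, 5])
lower, upper = p_A - 1.96 * se, p_A + 1.96 * se  # approximate 95% interval
print(p_A, se, (lower, upper))
```

Since $\hat P_B = 1 - \hat P_A$, the same standard error applies to the estimate of $P_B$.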

Simulation Study
In this section we use a simulation study to examine the small-sample performance of the proposed sampling and estimation method. We consider a setting similar to that of Salganik and Heckathorn [1].
The numbers of friendships $D_{Ai}$ and $D_{Bi}$ are generated from exponential distributions with means $\mu_A$ and $\mu_B$ for groups A and B respectively, and are rounded to the nearest integer. Let $I$ denote the interconnectedness, and set $T_{AB} = T_{BA} = I \times \min(R_A, R_B)$, where $R_A$ and $R_B$ are the total numbers of friendships radiating from groups A and B. We generate data for $N_A = 3{,}000$, $N_B = 7{,}000$, $\mu_A = 20$, $\mu_B = 10$, and $I = 0.6$. We select simple random samples of sizes $n_A$ and $n_B$ from groups A and B independently. Equations (11) and (13) are used to estimate $P_A$ and $P_B$ and the corresponding standard errors (se). Table 1 shows the results for estimating $P_A$ based on 10,000 replications for different sample sizes $(n_A, n_B)$. All the biases are close to 0, the means of the se are close to the empirical standard deviations (sd), and the 95% coverage probabilities are close to the nominal value. The results indicate that the overall performance of the proposed method is acceptable for practical implementation.
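A scaled-down version of this experiment can be sketched as follows. Several details are assumptions made for illustration, since the text does not specify them: cross-group friendships are allocated by choosing $T_{AB}$ friendship "stubs" at random within each group, the sample sizes are 200 and 200, only 300 replications are run, and the standard error uses $s^2/n$ without a finite population correction. The function names are hypothetical and the output is not meant to reproduce Table 1.

```python
import random
from statistics import mean, variance


def draw_degrees(n_members, mean_degree, rng):
    """Exponential degrees rounded to the nearest integer."""
    return [round(rng.expovariate(1.0 / mean_degree)) for _ in range(n_members)]


def allocate_cross_ties(degrees, n_cross, rng):
    """Assumed allocation: pick n_cross of a group's friendship stubs at random."""
    stubs = [i for i, d in enumerate(degrees) for _ in range(d)]
    counts = [0] * len(degrees)
    for i in rng.sample(stubs, n_cross):
        counts[i] += 1
    return counts


def simulate(n_A=200, n_B=200, reps=300, seed=1):
    """Return (mean estimate of P_A, empirical coverage of the 95% interval)."""
    rng = random.Random(seed)
    deg_A = draw_degrees(3000, 20, rng)
    deg_B = draw_degrees(7000, 10, rng)
    T = int(0.6 * min(sum(deg_A), sum(deg_B)))  # I * min(R_A, R_B)
    Z_A = allocate_cross_ties(deg_A, T, rng)
    Z_B = allocate_cross_ties(deg_B, T, rng)
    true_P_A = 3000 / (3000 + 7000)
    estimates, covered = [], 0
    for _ in range(reps):
        # Independent simple random samples without replacement.
        z_A, z_B = rng.sample(Z_A, n_A), rng.sample(Z_B, n_B)
        a, b = mean(z_A), mean(z_B)
        p = b / (a + b)
        var_p = (b * b * variance(z_A) / n_A + a * a * variance(z_B) / n_B) / (a + b) ** 4
        covered += abs(p - true_P_A) <= 1.96 * var_p ** 0.5
        estimates.append(p)
    return mean(estimates), covered / reps
```

Calling `simulate()` returns the average estimate and the share of replications whose approximate 95% interval contains the true value 0.3; larger values of `reps` give steadier summaries at the cost of run time.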

Discussion
This research describes a simplified sampling and estimation procedure for estimating the population proportion of a hidden population. The new method improves on the methodology of Salganik and Heckathorn [1] by simplifying the formula of their estimator and by providing an analytic formula for estimating the variance of the proposed estimator. The simplified estimator shows that consistent estimation of the population proportion does not depend on information about within-group social network relationships, which allows us to further simplify the one-wave sampling procedure of Zhao [5], where the random recruitment assumption is not required.
The new sampling and estimation method is motivated by the initial idea in Zhao [5] of simplifying the sampling procedure of respondent-driven sampling. Zhao [5] proposes the one-wave sampling design, in which information on the social network relationships is observed completely for each individual selected in the sample and random recruitment is not required. We would expect the social network relationships to be estimated more efficiently under this sampling design. However, Zhao [5] does not supply a new estimator for computing consistent estimates of the population proportions, and eventually uses the estimator of Salganik and Heckathorn [1], which is functionally complicated and for which analytic variance estimation is not available.
In applied statistics, simple and efficient methods are always valued. We hope that the proposed methods can be used to improve studies in epidemiology and of social problems.