A Greedy Algorithm for Representative Sampling : repsample in Stata

Quantitative empirical analyses of a population of interest usually aim to estimate the causal effect of one or more independent variables on a dependent variable. However, only in rare instances is the whole population available for analysis. Researchers tend to estimate causal effects on a selected sample and generalize their conclusions to the whole population. The validity of this approach rests on the assumption that the sample is representative of the population on certain key characteristics. A study using a non-representative sample is lacking in external validity by failing to minimize population choice bias. When the sample is large and non-response bias is not an issue, a random selection process is adequate to ensure external validity. If that is not the case, however, researchers could follow a more deterministic approach to ensure representativeness on the selected characteristics, provided these are known, or can be estimated, in the parent population. Although such approaches exist for matched sampling designs, research on representative sampling and the similarity between the sample and the parent population seems to be lacking. In this article we propose a greedy algorithm for obtaining a representative sample and quantifying representativeness in Stata.


Introduction
Randomization is a simple procedure that can often ensure that selection bias is not incurred, when attempting to estimate a causal effect.In the randomized controlled trial, arguably the best experimental design, randomly selected groups can only be randomly different on all observed and unobserved background covariates.When the group sizes are large, the simple randomization is expected to achieve balance in group sizes and covariates, since the random error is small.However, when group sizes are not large and balance is required on many covariates, alternative approaches are often considered.In stratified randomization, the covariates define strata of 'similar' subjects and a simple randomization is performed within each stratum.Practically, a small number of covariates can be factored in this design since the number of strata increases exponentially with covariates.A widely used deterministic alternative is the minimization method, which attempts to minimize the total imbalance between groups in selected covariates (Pocock and Simon 1975;Begg and Iglewicz 1980).In observational studies, propensity score matching is widely used to minimize covariate bias.Subjects are matched across groups on their propensity scores, the conditional probabilities of assignment given a group of observed covariates (Rosenbaum and Rubin 1983).More recently, a 'fine balance' method was proposed, which attempts to provide identical marginal distributions of the covariates of interest within each group, without a one-to-one matching of subjects (Rosenbaum, Ross, and Silber 2007).An alternative to propensity score matching is coarsened exact matching, under which the covariates of interest are reduced (or 'coarsened') into acceptable categories (bins) that define a range of strata.Each observation is then assigned to a stratum and one-to-one matching is performed within strata with 2 or more observations (Blackwell, Iacus, King, and Porro 2009;Iacus, King, and Porro 2009).
In representative sampling the aim is to obtain a sample that is as representative of a population as possible on selected available characteristics, possibly even before the 'main' data collection.A study using a sample that is a close match to its parent population on key covariates would be considered to possess external validity, allowing for valid generalizations from the sample to the population.Although the aim is different, representative sampling shares some of the challenges of randomization/minimization and matching.A simple random sampling algorithm is easy to implement but might fail to deliver a representative sample.Besides the potential inability to provide balance on key covariates, a simple random process would be affected by the possible presence of non-participation bias if the whole population was not available.Another issue can be logistical constraints, which might limit the selection of the sample from a population subset with different characteristics, in which case a random process would not provide a representative sample of the population.Therefore, a more deterministic approach to the sampling process may be more suitable.Such a framework does not seem to exist for representative sampling, although the problem shares some of the features of minimization, propensity score matching, coarsened exact matching and fine balance.However, it also faces unique challenges.In all methods outlined above either one-to-one matching is performed or a distance statistic is selected to quantify the distance between subjects (e.g., Mahalanobis distance) and is then used in the group allocation process -or both.For representative sampling, one-to-one matching is not relevant and even in its absence, as in 'fine balance', methods are not applicable to this context.In addition, the use of the term 'representative' is rather arbitrary and sometimes studies do not provide a comparison between the sample and the population characteristics on which the sample is deemed to be representative.Even when a comparison is presented the representativeness of the sample is not quantified.
We propose a new greedy algorithm in Stata (StataCorp.2011b) that uses common discrete and continuous distribution comparison methods to provide a sample that is as representative as possible of (i) a measured and available population, or (ii) one or more theoretical distributions.The algorithm can be fully or partly deterministic and at each step it selects the case that minimizes the overall difference between the distribution(s) of the sample and the population.It also reports an overall measure of representativeness that can offer standardization and transparency and acts as a criterion on the external validity of a study.

Methods
Let us assume a population (or pool) of size N from which we wish to draw a sample of k cases, which is the most representative in terms of the distribution of n 1 continuous variables X 1 , X 2 , ..., X n 1 and n 2 discrete variables Y 1 , Y 2 , ..., Y n 2 .We define the most representative sample as the one whose overall distance from the population or the theoretical distributions is the smallest.A simple description of the representative sampling algorithm is: Step 1: A percentage of the sample is randomly selected.
Step 2: Each case in the eligible pool is temporarily added to the current sample in turn and the distance to the population or a theoretical distribution, for each of the n 1 plus n 2 variables, is estimated.Since the process can be computationally taxing, an early stop rule can be employed that stops the search when a sample that provides an overall distance below an arbitrarily set threshold is identified.
Step 3: An overall measure of distance is estimated across all variables for each of the possible inclusions and the case whose inclusion leads to the smallest overall distance is added to the sample.
Steps 2 and 3 are repeated until k cases are selected.In other words, the algorithm is of O(N • k) complexity and attempts to arrive at a solution for the overall distance minimization problem by selecting the local minimum for each case added to the sample.The methods used for Step 2 vary by sampling type (population or theoretical) and test type (asymptotic or exact) and are presented in Sections 2.1 and 2.2.Methods for Step 3 are consistent across sampling type and are presented in Section 2.3.

Population sampling
When the user requests the current dataset to be used as the whole population the second step involves two-way Kolmogorov-Smirnov tests for continuous variables and χ 2 tests (or Fisher's exact) for discrete variables.
For each continuous variable X i , at each case selection stage, two-way Kolmogorov-Smirnov tests are performed across all eligible cases.Assuming we have already sampled m cases (randomly and/or deterministically) from population N , the Kolmogorov-Smirnov statistic calculated for the inclusion of the m + 1 case is: In (1), F 1,m+1 and F 2,N are the empirical cumulative distribution functions of the sample and population respectively, for variable X i .The expression converges to the Kolmogorov distribution and using the respective table we can test the hypothesis of common sample and population distributions for the sample and obtain an asymptotic p value.If the user requests asymptotic tests a correction is applied to the p value by using a numerical approximation technique (StataCorp.2011a).If exact tests are requested the exact p value is calculated by a counting algorithm (Gibbons and Chakraborti 2011).In all cases, the p value obtained gives the probability of the two distributions being as different as observed (or more), under the assumption of a common distribution.
For each binary or categorical variable Y i , at each case selection stage, χ 2 or Fisher's exact tests are performed across all eligible cases.Assuming we have already sampled m cases (randomly and/or deterministically) from population N , we evaluate the homogeneity between population and sample (including case m + 1 in the latter) through the χ 2 statistic: where K is the number of categories for variable Y i , O ij is the number of observations for the ith row and jth column of the K by K table and The statistic described in (2) asymptotically follows a χ 2 distribution with (K − 1) 2 degrees of freedom and a p value is obtained from the relevant table.At the user's request, Fisher's exact test and its extension for r ×c tables is alternatively used.The test is a permutation test that uses every possible table to calculate a probability of observing a table that gives at least as much evidence of heterogeneity as the one observed under the homogeneity assumption.The version implemented in Stata is the algorithm proposed by Mehta and Patel (1983).

Theoretical sampling
Alternatively, the user might not wish to treat the dataset as the whole population but as a pool from which to draw a sample that is representative in terms of one or more theoretical distributions.In that case, the second step of the algorithm involves one-way Kolmogorov-Smirnov tests for continuous variables and one-sample tests of proportion (or binomial probability tests) for discrete variables.
Similarly to the population sampling scenario, for each continuous variable X i , at each case selection stage, one-way Kolmogorov-Smirnov tests are performed across all eligible cases.The Kolmogorov-Smirnov statistic, after the inclusion of the m + 1 case would be: In (3), F m+1 is the empirical cumulative distribution function of the sample and F the theoretic cumulative distribution function, e.g., a normal cumulative distirbution function with mean and standard deviation provided by the user, for variable X i .In this case, expression √ m + 1D m+1 converges to the Kolmogorov distribution and we test the hypothesis the sample is normally distributed with the specified parameters and obtain a corrected asymptotic p value (StataCorp.2011a).
For each binary variable Y i , at each case selection stage, one-sample tests of proportions or binomial probability tests are performed across all eligible cases.For the inclusion of the m+1 case, for example, we compare the proportion observed in the sample to the hypothesized proportion through the statistic: where p and p 0 are the observed and hypothesized proportions respectively.The statistic described in (4) asymptotically follows a normal distribution, which can be used to obtain an asymptotic p value.Alternatively, an exact p value is calculated using the binomial distribution.In that case the upper and lower one-sided p values are provided by ( 5) and ( 6): where κ is the observed number of successes.The two-sided p value is The p value obtained from all tests in this section gives the probability of the sample distribution being as different to the theoretical as observed (or more), under the assumption the sample is drawn from that reference distribution.

Overall distance measure
Assuming the performed tests, n 1 for continuous and n 2 for discrete variables, are independent, we can combine their results using Fisher's combined probability test (Fisher 1932).The test uses the p values from independent tests of the same null hypothesis and calculates the test statistic described in (7): When all null hypotheses are true, χ 2 F will have a χ 2 distribution with 2(n 1 + n 2 ) degrees of freedom.Following from that, a p value can be determined to inform the decision of rejection or not of the overall null hypothesis: in our case, common sample-population distributions (population sampling) or the sample being drawn from hypothesized distributions (theoretical sampling).Although the role of the procedure as a composite test has been criticized due to the fact that results consistent with the alternative hypothesis influence the test statistic disproportionately more than the results consistent with the null (Rice 1990;Whitlock 2005), this property is arguably desirable in representative sampling.Rice (1990) provides an example to illustrate the issue; if two studies with p values of 0.999 and 0.001 were combined the resulting p value under Fisher's method would be 0.008.However, this asymmetric sensitivity to small p values can be beneficial when we wish to maximize representativeness across all provided distributions and not 'on average'.In other words, we wish for the algorithm to select the path that leads as far from rejection of any of the null hypotheses as possible: at each selection step it includes in the sample the case that leads to the lowest χ 2 F and the highest p value.Therefore, the algorithm, at each selection step, selects the sample with the highest probability of being the least different to the population (observed or hypothesized), under the assumption of common distributions.The formula in (7) can easily be edited to account for unequal weights across the performed tests.bincat(varlist) Binary and categorical variable(s) on which 'representativeness' will be based.For theoretical sampling only binary variables are allowed.mean(numlist) List of means for continuous variables.Order must correspond to order in option cont.Only required for sampling using one or more theoretical distributions.sd(numlist) List of standard deviations for continuous variables.Order must correspond to order in option cont.Only required for sampling using one or more theoretical distributions.perc(numlist) List of percentages for continuous variables.Order must correspond to order in option bincat and percentages need to be in the (0, 100) range.Only required for sampling using one or more theoretical distributions.seednum(#) Random seed number; the default is 7.

randomperc(#)
The percentage of cases that will be randomly selected at the start of the algorithm.The percentage must be in the [0, 100] range and the default value is 10.Setting to zero will provide a completely deterministic sample and to 100 a completely random sample.srule(#) Early stopping rule that speeds up the process, using the p value for Fisher's combined probability (χ 2 ) test.It must be in the [0.5,1) range and once a sub-sample is identified for which the p value is above the one specified, that sub-sample is selected and the search is stopped early (without going through all cases).Then the algorithm proceeds to select the next case in the sample using the same decision rule.This option is a compromise and the smaller the threshold value, the less likely that the resulting sample will be a close match.
wght(numlist) Variable weighting, to give greater importance to some variables in the process.User needs to provide as many numbers as there are variables and the total weight needs to add up to 100.Assigning order is fixed and must correspond to the variables provided under the cont and bincat options, with all continuous variables (if any) prioritized, followed by binary and categorical variables (if any).Note that the overall matching measures reported in r(chi2) and r(p) are using the weighted scores and therefore quantify the matching under the provided weights assumption.But the individual variable matching measures are unaffected.retain(varname) If a sampling variable to be retained is provided the program will sample on top of the current sample.This option is provided for batched sampling (running time can be very long for large samples and populations), and replacing cases that have withdrawn or become unavailable.For example, if the idea is to sample 100 representative patients to be enrolled in a study the researcher might wish to replace patients who did not agree to participate in the first instance.Cases that are not eligible for selection and cannot be dropped since they define the population should be set to missing in variable varname prior to executing the repsample command.
exact Use exact tests instead of asymptotic approximations.For population sampling, this option increases computation time considerably since it calculates exact p values in the twosample Kolmogorov-Smirnov tests (for continuous variables) and Fisher's exact test (for categorical and binary variables).For theoretical sampling, the increase in computation time is not as dramatic since only binary variables are affected with the use of the binomial probability test.
force Force replace sample information variable repsample, if present in the dataset.Cannot be used along with option retain(repsample), but can be used with that option if another variable name is specified.

Saved results
The command creates binary variable repsample which contains the sample information.In addition it saves the results for the tests used (for the final sample only) in scalars in r() (Table 1).

Motivational example
In England, primary care is delivered by over 8,200 family practices (called general practices).
Although they share many contract characteristics and they uniformly participate in the Quality and Outcomes Framework, a remuneration scheme that rewards practitioners for good clinical practice, they vary greatly in practice list size and characteristics of the area they serve.Therefore, primary care studies that wish to investigate practices attempt to recruit a sample that is as representative as possible of all English practices on these key characteristics.This is usually a two step process: (i) a large pool of practices (five to ten times the size of the required sample) is asked to be considered for participation and (ii) a sample representative of all English practices is selected from those that agreed to participate.However, considering that sample sizes are usually small, stratified random sampling approaches are limited in what they can deliver.Usually, two variables are the practical limit on which to stratify and variables need to be reduced to two or three categories with loss of information.On occasion, practices that withdraw from a study might need to be replaced, an issue which leads to complications if the respective stratified random sampling cell from which to replace is empty.
As Let us assume we wish to recruit practices only from the North-East of England for logistical reasons.However, the distributions for the area differ somewhat from the distributions for the whole of the North of England.Although, on average, practices are located in areas of similar deprivation, they tend to be larger and are more often located in rural area compared to all Northern English practices: . sum listsize soaimd07 if shaname=="North East", detail If we wish to recruit a sample of 20 practices from the North-East with distributions that match those of all Northern English practices in terms of deprivation, list size and rurality, as closely as possible, we can do so with the following code: . qui gen repsample=.

Performance
A small simulation was performed to investigate the performance of the algorithm.We generated data for 200 observations on two normally distributed continuous variables (Mean = 0; SD = 1) and one binary variable (p = 0.15).Performance of repsample under the default options (asymptotic, randomperc(0) and population sampling) was compared to a computationally inexpensive random sampling approach, for samples of 10, 30 and 50 cases and matching on one continuous, two continuous or two continuous & one binary variable.The measure of comparison was the mean absolute difference between sample and population means and (for the two continuous variables) standard deviations, over 1000 repetitions (Table 2).As expected, samples with the purely random approach match the population better as the size of sample increases.The trend is the same with the repsample algorithm but the absolute differences in all scenarios are much lower, compared to the random approach.
Unfortunately the command can be computationally very expensive, especially when the pool from which to sample is large.The command was executed fully deterministically (i.e., with randomperc(0)) in various scenarios on an Intel Core i5-2500 CPU (3.30GHz) computer with 8GB of RAM, running Windows 7 64bit and Stata 12.1 SE and the completion times are presented in Table 3.More specifically, we varied the sampling approach (population or theoretical; exact or asymptotic), the pool size from which to sample (100, 1000 or 10000) and the number of variables used in the process (1 to 3).Reported times are approximate since they can be greatly affected by other processes executed in parallel.Note that we did not execute a population-exact sampling for a pool of size 10000 since the running time would likely be a few months.

Conclusions
We have presented a simple greedy algorithm for representative sampling from a population or theoretical distributions.Although computational time can be very large for large samples and populations, the algorithm was created having small samples in mind.For small samples, stratified random sampling has practical limitations that are very difficult to overcome and the suggested deterministic method might be a better alternative in terms of: Complexity: Stratified random sampling might involve numerous trial and error steps to identify the practical limits of the method and choose the best possible approach.
Controlling for more variables: Stratified random sampling rarely involves more than two variables, while there is no limit under the repsample algorithm (of course, the more variables are included the worse the matching will be for each variable individually).
Outcome, i.e., 'representativeness': For small samples a random selection often fails to deliver a representative sample on all important variables, while the more deterministic repsample algorithm is more likely to find a better solutions for the problem and provides a sample that is closer to the population or hypothesized distributions.
The algorithm has certain limitations.First, users must realize that repsample does not guarantee a sample that is representative but one that is as representative as possible under the parameters specified.Second, the algorithm only indirectly takes into account relationships between the included variables, through closely matching each univariate distribution.Third, when quantifying overall 'representativeness' under Fisher's combined probability test the combined tests are assumed to be independent.Although, dependence can introduce anticonservative bias to the test (i.e., p values for χ 2 F are biased towards zero) and methods have been developed to deal with the issue (Brown 1975;Kost and McDermott 2002), this is less of a problem in the context of a minimization algorithm.In repsample we are mainly interested in the relative ranking of the p values and not their absolute values; assuming the introduced bias is uniform across cases, rankings should be unaffected.Fisher's method does not even seem to be the best approach for combining independent p values (Davidov 2011) but its simplicity and computational speed make it attractive in this ranking context.

Table 2 :
Mean absolute difference between population and sample for means (standard deviations) when sampling using one continuous, two continuous, or two continuous & one binary variable.

Table 3 :
Approximate running time (in seconds) for sampling using one continuous, two continuous, or two & one binary variable.