Empirical likelihood inference with public-use survey data

Public-use survey data are an important source of information for researchers in social science and health studies to build statistical models and make inferences on the target finite population. This paper presents two general inferential tools through the pseudo empirical likelihood and the sample empirical likelihood methods. Theoretical results on point estimation and linear or nonlinear hypothesis tests involving parameters defined through estimating equations are established, and practical issues with the implementation of the proposed methods are discussed. Results from simulation studies and an application to the 2016 General Social Survey dataset of Statistics Canada show that the proposed methods work well under different scenarios. The inferential procedures and theoretical results presented in the paper make the empirical likelihood a practically useful tool for users of complex survey data.


Introduction
Owen (1988) proposed the empirical likelihood approach for making inferences from independent and identically distributed random samples. He showed that the empirical likelihood ratio statistic for the population mean has a standard limiting chi-squared distribution, and used this result to obtain confidence intervals for the population mean similar to the classic parametric method. Qin and Lawless (1994) demonstrated that empirical likelihood can be combined with estimating equations for statistical inference on more general parameters. The development of empirical likelihood as a general inferential tool has been one of the major advances in statistics in the past three decades.
Empirical likelihood was in fact first introduced in the sample survey context by Hartley and Rao (1968) as the scale-load likelihood, but their focus was on point estimation of a finite population mean under simple random sampling and stratified simple random sampling. Chen and Qin (1993) studied empirical likelihood under simple random sampling using the formulation of Owen (1988), and Zhong and Rao (2000) studied empirical likelihood confidence intervals on the finite population mean under stratified simple random sampling. For general sampling designs involving unequal probability sampling with or without stratification, several approaches to empirical likelihood for complex surveys have been proposed, including the pseudo empirical likelihood method of Chen and Sitter (1999) and Wu and Rao (2006), the population empirical likelihood method of Chen and Kim (2014), and the empirical likelihood method of Berger and De La Riva Torres (2016) and Oguz-Alper and Berger (2016). However, all existing methods require the first order inclusion probabilities from the initial survey design and are developed under the setting that detailed design information is available. In addition, the use of calibration constraints for inference with existing approaches requires that auxiliary information, such as known population means or totals, be available to survey data users.
In practice, public-use survey data are released to users, and such data sets often report only the variables of interest and the final survey weights {w_i, i ∈ S} obtained by adjusting for unit nonresponse and calibration on auxiliary variables selected by the producer of the data, where S denotes the set of units included in the released data file. Furthermore, the data file provides B columns of final replication weights {w_i^(b), b = 1, …, B, i ∈ S} designed for variance estimation. The following table shows a typical format of public-use survey data files as seen by the users.
Unit   y_1    y_2    x_1    x_2    x_3    Weight   Replication weights
i      y_i1   y_i2   x_i1   x_i2   x_i3   w_i      w_i^(1) · · · w_i^(B)
1      y_11   y_12   x_11   x_12   x_13   w_1      w_1^(1) · · · w_1^(B)
2      y_21   y_22   x_21   x_22   x_23   w_2      w_2^(1) · · · w_2^(B)
3      y_31   y_32   x_31   x_32   x_33   w_3      w_3^(1) · · · w_3^(B)
·      ·      ·      ·      ·      ·      ·        ·       · · · ·

The final replication weights {w_i^(b), i ∈ S} are one of the most crucial parts in creating public-use survey data files. Different versions of bootstrap replication weights, such as those developed by Rao and Wu (1988) and Rao, Wu and Yue (1992) for stratified multi-stage designs, are commonly reported with the data file. Final replication weights are typically obtained by subjecting the basic replication weights (such as the bootstrap weights) to the same unit nonresponse adjustment and calibration procedures. None of the existing empirical likelihood methods is applicable for statistical inference with public-use data files because the first order inclusion probabilities, the calibration variables and the associated known population means or totals are not reported on the data file and are not available to users.
The main purpose of this article is to develop empirical likelihood methods for statistical analysis with public-use survey data files. We consider two general approaches: the first is based on the pseudo empirical likelihood and the second uses the sample empirical likelihood. We present design-based inferential procedures and theoretical results on two general statistical inference problems with the vector of finite population parameters defined through the census estimating equations: the maximum empirical likelihood estimators and the empirical likelihood ratio test on a general linear or nonlinear hypothesis. Design-based variable selection through a penalized pseudo or sample empirical likelihood is discussed. We also present a bootstrap procedure under single stage survey designs for creating valid replication weights with theoretical justifications. Simulation results and an application to the General Social Survey 2016 public-use data file released by Statistics Canada are included.
The basic settings are described in Section 2. Main theoretical results are presented in Section 3. A bootstrap procedure under single stage survey designs to create valid replication weights is described in Section 4 with theoretical justification given in the Appendix. Results from simulation studies are reported in Section 5. The application to the General Social Survey 2016 public-use data file is presented in Section 6. We conclude with some additional remarks in Section 7. Our presentation on sample empirical likelihood follows Zhao, Haziza and Wu (2018), and the discussion on pseudo empirical likelihood follows Zhao and Wu (2019). Proofs and technical details of several main theoretical results have similarities to Qin and Lawless (1994), Zhao, Haziza and Wu (2018) and Zhao and Wu (2019), and are presented in the Appendix.

Empirical Likelihood and Estimating Equations for Complex Surveys
Let U = {1, 2, …, N} be the set of units in the finite population, where N is the population size. Let (y_i, x_i) be the measurements of the study variable y and the auxiliary variables x for unit i. Let F_N = {(y_i, x_i), i = 1, …, N} represent the survey population and let {(y_i, x_i), i ∈ S} be the survey sample data. Let π_i = P(i ∈ S), i = 1, …, N, be the first order inclusion probabilities.
Survey data are a major source of information for official statistics, where the focus is often on descriptive population quantities such as population means or quantiles. Complex surveys are also frequently used by researchers in social sciences and medical and health studies for statistical modelling. Under both scenarios, the finite population parameters θ_N of dimension p can be defined as the solution to the census estimating equations

(2.1)    Σ_{i=1}^N g(x_i, y_i, θ) = 0,

where g(x, y, θ) is an estimating function of dimension r, and θ ∈ Θ, a compact subset of R^p with 1 ≤ p ≤ r. Under normal circumstances we have r = p, but over-identified scenarios with r > p do arise in practice due to additional calibration constraints or known moment conditions on certain variables.
Standard empirical likelihood inference with independent observations, as introduced by Owen (1988), and with parameters defined by estimating equations, as discussed by Qin and Lawless (1994), consists of three ingredients:

(2.2)    ℓ(p) = Σ_{i∈S} log(p_i),
(2.3)    Σ_{i∈S} p_i = 1,
(2.4)    Σ_{i∈S} p_i g(x_i, y_i, θ) = 0,

where ℓ(p) given by (2.2) is the empirical log-likelihood function and p = (p_1, …, p_n) is the probability measure over the n sampled units, equation (2.3) is the normalization constraint to ensure that p is a discrete probability measure, and equations (2.4) are the constraints induced by the parameters θ. The use of log(p_i) implicitly requires that p_i > 0.
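For readers who want to experiment with these three ingredients, the following Python sketch profiles the standard empirical likelihood for a fixed θ by solving for the Lagrange multiplier with Newton's method. The function name `el_weights` and the step-halving safeguard are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def el_weights(g, tol=1e-10, max_iter=200):
    """Maximize sum_i log(p_i) subject to sum_i p_i = 1 and
    sum_i p_i g_i = 0, for an n x r array g of estimating-function
    values evaluated at a fixed theta.  The solution has the form
    p_i = 1 / (n * (1 + lam' g_i)), with lam solving
    sum_i g_i / (1 + lam' g_i) = 0 (found here by Newton's method)."""
    n, r = g.shape
    lam = np.zeros(r)
    for _ in range(max_iter):
        denom = 1.0 + g @ lam
        grad = (g / denom[:, None]).sum(axis=0)    # constraint residual in lam
        if np.linalg.norm(grad) < tol:
            break
        hess = -(g / denom[:, None] ** 2).T @ g    # Jacobian of grad
        step = np.linalg.solve(hess, grad)
        t = 1.0                                    # step-halving keeps all p_i > 0
        while np.any(1.0 + g @ (lam - t * step) <= 0):
            t *= 0.5
        lam = lam - t * step
    p = 1.0 / (n * (1.0 + g @ lam))
    return p, lam
```

For g_i = y_i − θ this profiles the empirical likelihood for a mean; a known property of the solution is that the normalization constraint (sum of p_i equal to one) is satisfied automatically once the multiplier equation is solved.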
Naive applications of the standard empirical likelihood methods to complex survey data do not produce valid results under the design-based framework. There have been three major modified approaches in the survey sampling literature on using the empirical likelihood method for complex survey data, and their relations to the standard empirical likelihood ingredients (2.2), (2.3) and (2.4) can be described as follows.
(1) The pseudo empirical likelihood approach (PEL): Chen and Sitter (1999) replaced the empirical log-likelihood function (2.2) by ℓ_PEL0(p) = Σ_{i∈S} d_i log(p_i), where the d_i = 1/π_i are the basic design weights, while constraints (2.3) and (2.4) remain unchanged. The use of ℓ_PEL0(p) is motivated by the fact that ℓ_PEL0(p) is the Horvitz-Thompson estimator of the conceptual census empirical log-likelihood function Σ_{i=1}^N log(p_i). Wu and Rao (2006) used a modified version ℓ_PEL1(p) = n Σ_{i∈S} d̃_i(S) log(p_i), where d̃_i(S) = d_i / Σ_{j∈S} d_j, which facilitates the construction of pseudo empirical likelihood ratio confidence intervals for population parameters. Rao and Wu (2010a) extended the method to multiple frame surveys and Rao and Wu (2010b) developed a Bayesian pseudo empirical likelihood approach to survey data analysis. However, all the existing results on pseudo empirical likelihood methods focus primarily on inferences for a scalar parameter. General statistical tools involving a vector of parameters with the pseudo empirical likelihood are not available.
(2) The population empirical likelihood approach (POEL): Chen and Kim (2014) defined an empirical log-likelihood function formulated at the population level. The survey data and parameters are forced into the "population system" through the constraints Σ_{i∈S} ω_i π_i^{-1} = 1 and Σ_{i∈S} ω_i {g(x_i, y_i, θ) π_i^{-1}} = 0. Chen and Kim (2014) focused on Poisson sampling and rejective sampling, and the method has not been developed for general unequal probability sampling designs or for general inferential problems in analytic uses of survey data.
(3) The sample empirical likelihood approach (SEL): The method was first mentioned very briefly by Chen and Kim (2014) in a remark, but a detailed exploration was not pursued in their paper. The idea is to use the standard empirical log-likelihood function ℓ_SEL0(p) = Σ_{i∈S} log(p_i) from (2.2) and the standard normalization constraint (2.3), but to modify the constraints induced by the parameters as Σ_{i∈S} p_i {g(x_i, y_i, θ) π_i^{-1}} = 0. A related formulation was presented by Berger and De La Riva Torres (2016) and Oguz-Alper and Berger (2016). They used ℓ(m) = Σ_{i∈S} log(m_i), where the m_i satisfy the so-called design constraint Σ_{i∈S} m_i π_i = n. The constraints for the parameters are specified as Σ_{i∈S} m_i g(x_i, y_i, θ) = 0. It can be seen that, if we let p_i = m_i π_i / n, the formulation is equivalent to the one proposed by Chen and Kim (2014). The sample empirical likelihood method has been further developed in a recent paper by Zhao, Haziza and Wu (2018) as a general inference tool for survey data analysis, under the assumption that the first order inclusion probabilities π_i and other related design and population information are available.
Unfortunately, none of the existing empirical likelihood methods can be used directly for statistical analysis with public-use survey data files since the initial inclusion probabilities π i are not available, and calibration variables along with their known population totals are typically not given to the end users of the data files. On the other hand, the availability of final survey weights and replication weights for public-use data sets provides a unique opportunity to develop empirical likelihood as a general statistical tool for survey data analysis.

Empirical Likelihood Inference with Public-Use Survey Data
3.1 Public-use survey data and basic assumptions

Consider the following version of a micro survey data file released by the survey agency for public use: {(y_i, x_i, w_i, w_i^(1), …, w_i^(B)), i ∈ S}, where the y_i and x_i are possibly vector-valued survey variables included in the data set, w_i is the final survey weight for unit i after unit nonresponse adjustment and/or calibration weighting, and n is the final sample size. Also included in the data file are the B final replication weights w_i^(1), …, w_i^(B) associated with unit i. Detailed survey design information, such as the original design weights d_i = 1/π_i, and the known auxiliary population information are assumed to be unavailable to the users of the data file. It is also assumed that the finite population size N is unknown.
The survey weighted estimating equations for the vector of parameters θ_N defined by the census estimating equations (2.1) are given by

(3.1)    Û_n(θ) = Σ_{i∈S} w_i g(x_i, y_i, θ) = 0.

For standard scenarios where r = p, i.e., the number of equations is the same as the number of parameters, the survey weighted estimator θ̂_N of θ_N is the solution to (3.1). Let g_i(θ) = g(x_i, y_i, θ) and assume that g_i(θ) is a smooth function of θ. The approximate design-based variance of θ̂_N has the well-known sandwich form (Binder, 1983). There have been attempts to address hypothesis testing problems involving a single component of the vector of parameters θ_N under the estimating equations framework; see, for instance, Binder and Patak (1994). However, general hypothesis testing procedures are not available in the literature.
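As a concrete illustration (our own, not taken from the paper), for linear regression estimating functions g_i(θ) = x_i(y_i − x_i'θ) with r = p, the survey weighted estimating equations have a closed-form solution:

```python
import numpy as np

def survey_weighted_ee(x, y, w):
    """Solve sum_i w_i x_i (y_i - x_i' theta) = 0 for theta: the survey
    weighted estimating equations estimator for linear regression (r = p).
    x is an n x p design matrix, y a length-n response, w the final weights."""
    xtwx = (x * w[:, None]).T @ x    # sum_i w_i x_i x_i'
    xtwy = (x * w[:, None]).T @ y    # sum_i w_i x_i y_i
    return np.linalg.solve(xtwx, xtwy)
```

For nonlinear estimating functions (e.g., logistic regression scores), the same equations would instead be solved iteratively by Newton-Raphson.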
We consider smooth estimating functions and allow an over-identified estimating equations system with r ≥ p. Practically useful results for the special case r = p and for a scalar parameter (i.e., p = 1) will also be spelled out. For asymptotic development, we assume that there is a sequence of finite populations and a sequence of survey designs, with both the population size N and the sample size n going to infinity; see Isaki and Fuller (1982) for further details. We use N → ∞ to denote the limiting process. Note that θ_N refers to the true vector of finite population parameters. Throughout the paper, we use ‖·‖ to denote the Euclidean norm and →_L to denote convergence in distribution under the design-based framework. Let O_p(·) and o_p(·) be the stochastic orders under the same framework.
We consider the following basic assumptions for the public-use survey data file and the estimating functions g i (θ).
Assumption 1. The final survey weights (w_1, w_2, …, w_n) and the finite population values F_N = {(y_i, x_i), i = 1, …, N} satisfy conditions that ensure Û_n(θ_N) = Σ_{i∈S} w_i g_i(θ_N) is asymptotically normally distributed with mean zero and variance-covariance matrix of the order O(N²/n).
Assumption 2. Let Û_n^(b)(θ_N) = Σ_{i∈S} w_i^(b) g_i(θ_N) be the replicate version of Û_n(θ_N) = Σ_{i∈S} w_i g_i(θ_N) using the bth set of replication weights (w_1^(b), …, w_n^(b)), b = 1, …, B. The replication variance estimator v{Û_n(θ_N)} computed from the replicates {Û_n^(b)(θ_N), b = 1, …, B} is a design-consistent estimator of the variance-covariance matrix Var{Û_n(θ_N) | F_N}.
The original design weights, the nonresponse adjusted weights and the calibrated weights usually satisfy Assumption 1, which is part of the foundation for design-based inference. Assumption 2 is the guiding principle both for producers of public-use data files on how to create replication weights and for research on replication methods for variance estimation in surveys. Note that Assumption 2 does not necessarily require a large B for a given data set, as shown by the results presented in Kim and Wu (2013). Most survey organizations, including Statistics Canada, currently use B = 500 when producing public-use survey data files; see the General Social Survey example presented in Section 6.
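A minimal sketch of a bootstrap-type replication variance estimator in the spirit of Assumption 2. The 1/B divisor follows the with-replacement bootstrap convention and is an assumption on our part; the exact form depends on how the replication weights were created.

```python
import numpy as np

def replication_variance(est_full, est_reps):
    """Bootstrap-type replication variance estimator:
    v = (1/B) sum_b (est_b - est_full)(est_b - est_full)',
    where est_reps is a B x r array of replicate estimates computed with the
    B sets of replication weights and est_full uses the final weights."""
    dev = np.asarray(est_reps) - np.asarray(est_full)
    return dev.T @ dev / dev.shape[0]
```

Applying this to the replicates of Û_n(θ) gives the plug-in estimate of Var{Û_n(θ_N) | F_N} used throughout Section 3.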
Assumptions 3-5 are standard regularity conditions for asymptotic development with complex survey data from finite populations. The inclusion of the factors N⁻¹ or N⁻² in the quantities presented in Assumption 5 is for convenience with asymptotic orders; they are not required for computational purposes, as they all cancel out in the main results presented in the next two subsections. The pseudo empirical likelihood approach of Section 3.2 and the sample empirical likelihood approach of Section 3.3 are formulated using the final weights w_i. The empirical likelihood ratio statistics for both approaches do not have standard χ² asymptotic distributions, and estimation of the design-based variances requires information from the additional columns of replication weights in the data set.

The pseudo empirical likelihood approach
Let w̃_i(S) = w_i / Σ_{k∈S} w_k, i ∈ S, be the normalized final survey weights. The pseudo empirical log-likelihood function is defined as ℓ_PEL(p) = n Σ_{i∈S} w̃_i(S) log(p_i). For the special case of equal final survey weights, we have w̃_i(S) = 1/n and ℓ_PEL(p) = Σ_{i∈S} log(p_i). Maximizing ℓ_PEL(p) subject to the normalization constraint (2.3), i.e., Σ_{i∈S} p_i = 1, gives p̂ = (p̂_1, …, p̂_n), where p̂_i = w̃_i(S). Let p̂(θ) = (p̂_1(θ), …, p̂_n(θ)) be the maximizer of ℓ_PEL(p) under the normalization constraint (2.3) and the parameter constraint (2.4), i.e., Σ_{i∈S} p_i g_i(θ) = 0, for a fixed value of θ. It can be shown that p̂_i(θ) = w̃_i(S){1 + λ'g_i(θ)}⁻¹, where the Lagrange multiplier λ solves Σ_{i∈S} w̃_i(S) g_i(θ){1 + λ'g_i(θ)}⁻¹ = 0, which can be solved using the modified Newton-Raphson method presented in Chen, Sitter and Wu (2002) and the R code described in Wu (2005). The maximum pseudo empirical likelihood estimator θ̂_PEL is the maximizer of ℓ_PEL{p̂(θ)} = n Σ_{i∈S} w̃_i(S) log p̂_i(θ) with respect to θ. For the special case r = p, the estimator θ̂_PEL is the solution to Σ_{i∈S} w_i g_i(θ) = 0, which is the same as the customary survey weighted estimating equations estimator θ̂_N. The pseudo empirical log-likelihood ratio statistic for θ is given by r_PEL(θ) = ℓ_PEL{p̂(θ)} − ℓ_PEL(p̂). We can re-write the maximum pseudo empirical likelihood estimator of θ_N as θ̂_PEL = arg max_{θ∈Θ} r_PEL(θ). The following theorem presents asymptotic properties of the estimator θ̂_PEL. Note that the quantities W_1(θ_N), Γ(θ_N) and Ω(θ_N) are defined in Assumption 5.
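The inner maximization over p for a fixed θ can be sketched as follows, following the modified Newton-Raphson idea of Chen, Sitter and Wu (2002); the function below is our own simplified illustration, not their code.

```python
import numpy as np

def pel_profile(g, w, tol=1e-10, max_iter=200):
    """Profile the pseudo empirical likelihood for a fixed theta:
    maximize sum_i wt_i log(p_i) subject to sum_i p_i = 1 and
    sum_i p_i g_i = 0, where wt_i = w_i / sum_k w_k are the normalized
    final weights.  The solution is p_i = wt_i / (1 + lam' g_i), with
    lam solving sum_i wt_i g_i / (1 + lam' g_i) = 0."""
    wt = w / w.sum()
    n, r = g.shape
    lam = np.zeros(r)
    for _ in range(max_iter):
        denom = 1.0 + g @ lam
        grad = (wt[:, None] * g / denom[:, None]).sum(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(wt[:, None] * g / denom[:, None] ** 2).T @ g
        step = np.linalg.solve(hess, grad)
        t = 1.0                       # step-halving keeps all p_i > 0
        while np.any(1.0 + g @ (lam - t * step) <= 0):
            t *= 0.5
        lam = lam - t * step
    return wt / (1.0 + g @ lam)
```

Profiling this routine over a grid of θ values (or within an outer optimizer) yields the maximum pseudo empirical likelihood estimate.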
Proofs of Theorem 1 and Theorems 2-6 presented below resemble the proofs in Zhao et al. (2018). Details are presented in the Appendix. The proof of Theorem 1 is also similar to the proof of Theorem 1 in Qin and Lawless (1994).
Corollary 1. Under the assumptions in Theorem 1 and r = p (i.e., the number of equations is the same as the number of parameters), the asymptotic variance-covariance matrix V_1 for θ̂_PEL reduces to V_1 = Γ⁻¹ Ω (Γ')⁻¹.
Suppose we want to test the simple hypothesis H_0: θ = θ_N0 against H_1: θ ≠ θ_N0. The pseudo empirical log-likelihood ratio statistic for testing H_0 is given by LR_PEL(θ_N0) = −2 r_PEL(θ_N0). The asymptotic distribution of LR_PEL(θ_N0) is given by the following theorem.
Theorem 2. Suppose that Assumptions 1, 3, 4 and 5 hold. Then LR_PEL(θ_N0) →_L Q'∆_1 Q, where Q ∼ N(0, I_r), I_r is the r × r identity matrix, and r is the dimension of the estimating functions g_i(θ).

Corollary 2. Under the assumptions in Theorem 2 and r = p, we have ∆_1 = Ω^{1/2} W_1⁻¹ Ω^{1/2}. In particular, if r = p = 1, then LR_PEL(θ_N0) →_L (Ω/W_1) χ²(1), where χ²(1) denotes the standard χ² random variable with one degree of freedom.
We further consider the pseudo empirical log-likelihood ratio test for a general linear or nonlinear hypothesis H_0: R(θ_N) = 0 against H_1: R(θ_N) ≠ 0, where R(θ_N) is a k × 1 vector-valued function with k ≤ p, and R(θ_N) = 0 imposes k constraints on the vector of parameters θ_N. Let Θ_R = {θ | θ ∈ Θ and R(θ) = 0} be the restricted parameter space under H_0. The restricted maximum pseudo empirical likelihood estimator of θ under H_0 is defined as θ̂_PEL^R = arg max_{θ∈Θ_R} r_PEL(θ). The pseudo empirical log-likelihood ratio statistic for testing H_0 versus H_1 is given by −2{r_PEL(θ̂_PEL^R) − r_PEL(θ̂_PEL)}.

Theorem 3. Suppose that Assumptions 1, 3, 4 and 5 hold. If the function R(θ) is twice continuously differentiable and Φ(θ_N) = ∂R(θ)/∂θ|_{θ=θ_N} has rank k, then the test statistic converges in distribution to the quadratic form Q'∆_1^R Q.

Let δ_j, j = 1, …, p, be the non-zero eigenvalues of the r × r matrix ∆_1. The asymptotic distribution of LR_PEL(θ_N) given in Theorem 2 can alternatively be represented by Σ_{j=1}^p δ_j χ_j²(1), where the χ_j²(1), j = 1, …, p, are independent random variables, each following the same distribution as χ²(1). Similarly, the distribution of the quadratic form Q'∆_1^R Q given in Theorem 3 can alternatively be represented by Σ_{j=1}^k δ_j^R χ_j²(1), where the δ_j^R, j = 1, …, k, are the non-zero eigenvalues of the matrix ∆_1^R. Practical implementation of the theoretical results generally requires estimation of the asymptotic variance V_1 for Theorem 1, the matrix ∆_1 for Theorem 2 and the matrix ∆_1^R for Theorem 3. This amounts to estimating the component matrices W_1, Γ, Ω and Φ. By the simple "plug-in" method, the matrix Ω can be estimated through v{Û_n(θ̂_PEL)}, the replication variance estimator outlined in Assumption 2 computed with the replication weights from the survey data file.
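The weighted chi-squared representation suggests a simple Monte Carlo route to critical values once the eigenvalues are estimated (Method I of the simulations in Section 5). A sketch, where the number of draws and the symmetrization step are our own choices:

```python
import numpy as np

def quadratic_form_quantile(delta, alpha=0.05, n_mc=200_000, seed=1):
    """Approximate the upper-alpha critical value of Q' Delta Q with
    Q ~ N(0, I_r): simulate sum_j delta_j * chi2_j(1) using the nonzero
    eigenvalues delta_j of the (estimated) Delta matrix."""
    rng = np.random.default_rng(seed)
    eig = np.linalg.eigvalsh((delta + delta.T) / 2)   # symmetrize for safety
    eig = eig[np.abs(eig) > 1e-10]                    # keep nonzero eigenvalues
    draws = rng.chisquare(1, size=(n_mc, eig.size)) @ eig
    return np.quantile(draws, 1 - alpha)
```

With ∆ equal to the identity this reproduces the usual χ² critical value, which provides a quick sanity check of the simulation.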
The distributions of the quadratic forms Q'∆_1 Q and Q'∆_1^R Q may also be approximated by the Rao-Scott (RS) correction method (Rao and Scott, 1981, 1984). For instance, the first-order RS correction rescales the test statistic by the average of the estimated non-zero eigenvalues, so that the adjusted statistic can be referred to a standard χ² distribution.
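One common form of the first-order RS correction divides the statistic by the average of the estimated non-zero eigenvalues and refers the result to a χ² distribution with matching degrees of freedom; the sketch below assumes this form, which may differ in detail from the paper's implementation.

```python
import numpy as np

def rao_scott_first_order(lr_stat, delta_hat):
    """First-order Rao-Scott correction (a sketch): divide the EL ratio
    statistic by the mean of the nonzero eigenvalues of the estimated
    Delta matrix; refer the result to chi-squared with k degrees of
    freedom, where k is the number of nonzero eigenvalues."""
    eig = np.linalg.eigvalsh((delta_hat + delta_hat.T) / 2)
    eig = eig[np.abs(eig) > 1e-10]
    k = eig.size
    return lr_stat / eig.mean(), k
```

The second-order correction additionally matches the variance of the eigenvalue distribution and adjusts the degrees of freedom accordingly.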

The sample empirical likelihood approach
The sample empirical likelihood approach described in Section 2 can be adapted for public-use survey data. We start with the standard empirical log-likelihood function ℓ_SEL(p) = Σ_{i∈S} log(p_i).
Maximizing ℓ_SEL(p) under the normalization constraint (2.3), i.e., Σ_{i∈S} p_i = 1, gives p̂_i = n⁻¹, i ∈ S. The constraints for the parameters θ defined through (2.1) are formed using the weighted estimating functions w_i g_i(θ) and are given by

(3.4)    Σ_{i∈S} p_i {w_i g_i(θ)} = 0.

Let p̂(θ) = (p̂_1(θ), …, p̂_n(θ)) be the maximizer of ℓ_SEL(p) under the normalization constraint (2.3) and the parameter constraints (3.4) for a fixed θ. It follows from standard empirical likelihood arguments that p̂_i(θ) = n⁻¹{1 + λ'w_i g_i(θ)}⁻¹, where the Lagrange multiplier λ solves Σ_{i∈S} w_i g_i(θ){1 + λ'w_i g_i(θ)}⁻¹ = 0. The sample empirical log-likelihood ratio statistic for θ under the current setting is given by r_SEL(θ) = ℓ_SEL{p̂(θ)} − ℓ_SEL(p̂). Let θ̂_SEL = arg max_{θ∈Θ} r_SEL(θ) be the maximum sample empirical likelihood estimator of θ_N. We have the following major results on the asymptotic properties of θ̂_SEL.
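The inner maximization parallels the pseudo empirical likelihood case, except that the constraint uses the weighted estimating functions w_i g_i(θ); a hypothetical sketch:

```python
import numpy as np

def sel_profile(g, w, tol=1e-10, max_iter=200):
    """Profile the sample empirical likelihood for a fixed theta: the
    parameter constraint uses the weighted estimating functions
    h_i = w_i g_i, so p_i = n^{-1} / (1 + lam' h_i), with lam solving
    sum_i h_i / (1 + lam' h_i) = 0 (Newton's method, a sketch)."""
    h = w[:, None] * g                 # weighted estimating functions
    n, r = h.shape
    lam = np.zeros(r)
    for _ in range(max_iter):
        denom = 1.0 + h @ lam
        grad = (h / denom[:, None]).sum(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(h / denom[:, None] ** 2).T @ h
        step = np.linalg.solve(hess, grad)
        t = 1.0                        # step-halving keeps all p_i > 0
        while np.any(1.0 + h @ (lam - t * step) <= 0):
            t *= 0.5
        lam = lam - t * step
    return 1.0 / (n * (1.0 + h @ lam))
```

With equal final weights this reduces exactly to the standard empirical likelihood profile.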
Theorem 4. Suppose that Assumptions 1, 3, 4 and 5 hold. Then θ̂_SEL has asymptotic properties parallel to those of θ̂_PEL given in Theorem 1. The results presented in Theorem 4 under the sample empirical likelihood are similar to those in Theorem 1 for the pseudo empirical likelihood, with the crucial difference lying in the definition of W_1 for Theorem 1 and W_2 for Theorem 4. For the special case r = p, the estimator θ̂_SEL is attained as the global maximum with p̂_i = n⁻¹ and is the solution to Σ_{i∈S} w_i g_i(θ) = 0, which coincides with the survey weighted estimating equations estimator.
The sample empirical log-likelihood ratio statistic for testing H_0: θ = θ_N is similarly defined as LR_SEL(θ_N) = −2 r_SEL(θ_N) for the given θ. We have the following results, parallel to Theorem 2 and Corollary 2. Once again, the difference lies between the W_1 and W_2 involved in the asymptotic distributions.
Theorem 5. Suppose that Assumptions 1, 3, 4 and 5 hold. Then LR_SEL(θ_N) →_L Q'∆_2 Q.

Corollary 4. Suppose that the assumptions of Theorem 5 hold. If r = p, then ∆_2 = Ω^{1/2} W_2⁻¹ Ω^{1/2}.

For a general linear or nonlinear hypothesis H_0: R(θ_N) = 0 against H_1: R(θ_N) ≠ 0, the sample empirical log-likelihood ratio statistic for testing H_0 against H_1 is defined in parallel to the pseudo empirical likelihood case.

Theorem 6. Suppose that the assumptions of Theorem 3 hold. If the function R(θ) is twice continuously differentiable and Φ(θ_N) = ∂R(θ)/∂θ|_{θ=θ_N} has rank k, then the test statistic converges in distribution to Q'∆_2^R Q, where Φ = Φ(θ_N). The term W_2 for the sample empirical likelihood is different from W_1 for the pseudo empirical likelihood and can be estimated by a plug-in estimator. The other two components, Γ and Ω, can be estimated in the same way, with v{Û_n(·)} given in Assumption 2 used for Ω.

Design-based variable selection
Public-use survey data may contain observations on many variables. Variable selection is a useful technique when fitting a statistical model involving many covariates. The pseudo empirical likelihood and the sample empirical likelihood provide design-based approaches to variable selection through a penalized pseudo or sample empirical likelihood method.
The tuning parameter τ_n for the penalized pseudo empirical likelihood or the penalized sample empirical likelihood needs to be appropriately selected by a data-driven method. Various techniques have been proposed in the literature, including the generalized cross-validation method and the BIC method. Further details can be found in Fan and Li (2001) and Wang et al. (2007).
Let θ_N = (θ_N1, …, θ_Np)' be defined by (2.1). The maximum penalized pseudo empirical likelihood estimator of θ_N is defined as θ̂_PPEL = arg max_θ ℓ_PPEL(θ), and the maximum penalized sample empirical likelihood estimator of θ_N is defined as θ̂_PSEL = arg max_θ ℓ_PSEL(θ). Both estimators enjoy the design-based oracle property for variable selection, in the sense that the zero components of θ_N are estimated as zero with probability approaching one.

Bootstrap Calibrated Empirical Likelihood Methods
One of the most crucial features of public-use survey data files is the inclusion of replication weights. The guiding principle for the creation of replication weights is that they provide valid results on variance estimation as outlined in Assumption 2. The major results presented in Section 3 involve the estimation of the design-based variance Ω using the replication weights, and inferential procedures are developed based on the limiting distributions presented in the theorems and corollaries.
A highly attractive approach to practical implementation of the EL-based tests is the bootstrap calibration method: the asymptotic distributions are approximated by the empirical distribution of replicate copies of the empirical likelihood ratio statistic computed with the bootstrap weights. However, theoretical justification of the bootstrap calibration method can be a challenging task and needs to be developed case by case. In this section, we describe a bootstrap procedure for scenarios where the survey design is single-stage PPS sampling with small sampling fractions and the final survey weights are calibration weights based on known population totals of auxiliary variables. Theoretical justifications of the procedure are given in the Appendix.
Let T_x be the known population totals for the vector x of auxiliary variables used in the calibration. Let d_i = 1/π_i be the original design weights and let {(y_i, x_i, d_i), i ∈ S} be the preliminary survey dataset. The calibration weights w_i are obtained by minimizing a distance measure D(w, d) between w = (w_1, …, w_n) and d = (d_1, …, d_n) subject to the calibration constraints Σ_{i∈S} w_i x_i = T_x. There are different distance measures available for calibration weighting; Wu and Lu (2016) contains an overview of computational algorithms and finite sample behaviours of weights obtained from alternative calibration weighting methods. We consider the simple chi-square distance D(w, d) = Σ_{i∈S} (w_i − d_i)²/d_i, which leads to closed-form expressions for the final calibrated weights w_i. Let {(y_i, x_i, w_i), i ∈ S} be the final survey dataset without replication weights.
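Under the chi-square distance the calibrated weights have the well-known closed form w_i = d_i(1 + x_i'λ), with λ solved from a single linear system; a sketch (GREG-type weights):

```python
import numpy as np

def chisq_calibrate(d, x, Tx):
    """Calibration weights minimizing the chi-square distance
    D(w, d) = sum_i (w_i - d_i)^2 / d_i subject to sum_i w_i x_i = Tx.
    Closed form: w_i = d_i (1 + x_i' lam), with
    lam = (sum_i d_i x_i x_i')^{-1} (Tx - sum_i d_i x_i)."""
    xtdx = (x * d[:, None]).T @ x          # sum_i d_i x_i x_i'
    lam = np.linalg.solve(xtdx, Tx - d @ x)
    return d * (1.0 + x @ lam)
```

Note the chi-square distance does not enforce w_i > 0; other distance measures (e.g., the logit or raking distance) guarantee positive weights at the cost of an iterative solution.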
We present bootstrap procedures for the sample empirical likelihood ratio tests; the procedures are also valid for the pseudo empirical likelihood method. The proposed bootstrap procedures consist of the following steps.
1. Select a bootstrap sample S* of size n from the original sample S using simple random sampling with replacement. Denote the bootstrap sample data by {(y_i, x_i, w_i), i ∈ S*}. Note that S* may contain duplicated units from S.

2. Compute the set of bootstrap weights {w_i^*, i ∈ S*} by applying the same calibration procedure to the bootstrap sample, with the calibration totals replaced by T̂_x, the Horvitz-Thompson estimator of the population totals T_x computed from the initial dataset.
3. Define the bootstrap version r*_SEL(θ) of the sample empirical likelihood ratio function using the bootstrap sample data and the bootstrap weights. Compute the bootstrap version of the estimator θ̂*_SEL = arg max_{θ∈Θ} r*_SEL(θ) and the bootstrap version LR*_SEL of the SEL ratio statistic.

4. Repeat Steps 1-3 a large number of times, B, independently, to obtain B values of the bootstrap version of the SEL ratio statistic, {LR*(1), …, LR*(B)}. Let b_α be the upper α quantile of the empirical distribution of these B values, and use b_α as the critical value for the test or for constructing a confidence region. It is shown in the Appendix that this confidence region has the correct asymptotic coverage probability.
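Steps 1-4 can be sketched as follows. Here `lr_func` is a hypothetical user-supplied routine returning the EL ratio statistic from a weighted sample, and the chi-square-distance re-calibration inside the loop is our simplifying assumption:

```python
import numpy as np

def bootstrap_critical_value(y, x, w, Tx_hat, lr_func, B=500, alpha=0.05, seed=7):
    """Bootstrap calibration sketch: resample units with replacement
    (Step 1), re-calibrate the resampled weights to the estimated totals
    Tx_hat (Step 2), recompute the EL ratio statistic (Step 3), and take
    the upper-alpha quantile over B repetitions (Step 4)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    lr_boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # SRSWR bootstrap sample
        yb, xb, wb = y[idx], x[idx], w[idx]
        # chi-square-distance calibration of the bootstrap weights
        lam = np.linalg.solve((xb * wb[:, None]).T @ xb, Tx_hat - wb @ xb)
        wstar = wb * (1.0 + xb @ lam)
        lr_boot[b] = lr_func(yb, xb, wstar)         # bootstrap LR statistic
    return np.quantile(lr_boot, 1 - alpha)          # upper-alpha quantile
```

In practice the same loop is run once by the data producer to generate the B columns of replication weights, so that users never need the design weights themselves.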
The bootstrap procedures described above can be implemented through additional columns of replication weights to produce a public-use data file. Let {w_i^*, i ∈ S*} be a set of bootstrap weights described in Step 2. Let h_i be the number of times that unit i ∈ S is selected into S*. Note that 0 ≤ h_i ≤ n and Σ_{i∈S} h_i = n. The bth set of replication weights {w_i^(b), i ∈ S} is constructed from the bootstrap weights and the multiplicities h_i. Repeat the process for b = 1, …, B, independently, to create B sets of replication weights. The bootstrap version LR*_SEL(θ̂_SEL) of the SEL ratio statistic can then be computed using the (x, y) values from the data file in conjunction with each set of replication weights.
Simulation Studies

The finite population parameters θ_N = (θ_N0, θ_N1, θ_N2, θ_N3)' under the linear regression model are defined as the solution to the census estimating equations (2.1). With a large N, the values of θ_N are almost identical to the model parameters of the superpopulation. Our simulation studies focus on examining the size and power of the proposed pseudo and sample empirical likelihood ratio tests. We consider α-level tests for the two hypotheses H_0: θ_N1 = 1.0 and H_0: θ_N1 = θ_N2.

In survey practice, the process of creating the final survey weights w_i and the final replication weights w_i^(b), b = 1, 2, …, B, can be very complicated. It depends on the original survey design, the patterns of nonresponse, and the amount of known auxiliary information for calibration weighting. The replication weights often involve ad hoc approximations, since many complex survey designs do not have exact bootstrap procedures or other resampling methods for producing final replication weights for general inferences; Rao and Wu (1988) and Rao, Wu and Yue (1992) contain further details on the topic. To make repeated simulation runs feasible, we consider single stage unequal probability sampling for the initial survey design, with the inclusion probabilities π_i proportional to x_i3. The final survey weights and the final replication weights are created under two scenarios:

A. The final survey weights are calibrated over the known population totals of the x_1 and x_2 variables, but unit nonresponse is not involved.
B. The final survey weights are adjusted for uniform unit nonresponse and calibrated over the known population totals of the x 1 and x 2 variables.
For each of the two scenarios, there are two major tasks for each simulated sample: compute the final survey weights w_i and create valid final replication weights w_i^(b). For single stage PPS sampling without replacement with a negligible sampling fraction, the with-replacement bootstrap procedures described in Section 4 produce final replication weights that satisfy Assumption 2 and are also valid for the bootstrap calibration method described in Section 4. Let S_0 be the set of initially sampled units and n_0 be the initial sample size under the original survey design, and let S be the set of units included in the final sample and n be the final sample size.
Under Scenario A, we have S = S_0 and n = n_0 in the absence of unit nonresponse. The final weights are calibrated over the known population totals of x_1 and x_2, and the replication weights are created by the method described in Section 4. Under Scenario B, let d_i = 1/π_i, i ∈ S_0, be the initial design weights. With uniform unit nonresponse, each unit in S_0 has a constant probability of being a respondent, and the final set S of respondents has a random sample size. The unit nonresponse adjusted survey weights are computed as d_0i = d_i (Σ_{j∈S_0} d_j)/(Σ_{j∈S} d_j), i ∈ S. This is the so-called ratio adjustment for uniform unit nonresponse, and the adjusted survey weights satisfy Σ_{i∈S} d_0i = Σ_{j∈S_0} d_j. Treating the set of adjusted weights {d_0i, i ∈ S} as the "original" design weights, the final survey weights and replication weights under the calibration constraints are created by following the same procedures used in Scenario A.
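The ratio adjustment for uniform unit nonresponse can be written in two lines; a sketch:

```python
import numpy as np

def ratio_nonresponse_adjust(d_full, respondent_mask):
    """Ratio adjustment for uniform unit nonresponse: rescale the design
    weights of the respondents so that they sum to the initial weighted
    total, i.e., d0_i = d_i * (sum over S0 of d_j) / (sum over S of d_j)."""
    d_resp = d_full[respondent_mask]
    return d_resp * d_full.sum() / d_resp.sum()
```

The preservation of the weighted total is exactly the identity stated above for the adjusted weights.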
Simulation samples of size n = 400 are selected under Scenario A from the population by the randomized systematic PPS sampling method (Goodman and Kish, 1950; Hartley and Rao, 1962). For Scenario B, initial samples of size n_0 = 571 are selected by the same PPS sampling method. The unit response probabilities are set to the uniform value 0.7, resulting in final samples with expected sample size E(n) = 400. For both scenarios, we choose the finite population sizes N = 20,000 and N = 4,000, giving sampling fractions n/N = 2% and 10%; the first case represents negligible sampling fractions and the second case non-negligible sampling fractions. The final survey weights and the B = 500 sets of final replication weights are created for the given scenario.
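A sketch of randomized systematic PPS selection in the style of Goodman and Kish (1950): randomize the order of the units, cumulate the inclusion probabilities π_i = n z_i / Σ z, and select with a systematic skip from a random start (assuming all π_i < 1):

```python
import numpy as np

def randomized_systematic_pps(z, n, seed=None):
    """Randomized systematic PPS sampling: units are selected with
    inclusion probabilities pi_i = n * z_i / sum(z), assumed < 1."""
    rng = np.random.default_rng(seed)
    N = len(z)
    order = rng.permutation(N)                    # randomize the unit order
    pi = n * np.asarray(z, float)[order] / np.sum(z)
    cum = np.concatenate(([0.0], np.cumsum(pi)))  # cumulated inclusion probs
    u = rng.uniform(0, 1)                         # random start
    picks = u + np.arange(n)                      # systematic skip of 1
    # unit j is selected when a pick falls in (cum[j], cum[j+1]]
    sel = np.searchsorted(cum, picks, side='left') - 1
    return np.sort(order[sel])
```

Because the cumulated probabilities span exactly n units' worth of mass and every π_i < 1, the procedure returns n distinct units with the intended first order inclusion probabilities.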
We compute the power of the PEL and SEL ratio tests for H 0 : θ N1 = 1.0 versus H 1 : θ N1 = b and for H 0 : θ N1 = θ N2 versus H 1 : (θ N1 , θ N2 ) = (b 1 , b 2 ) for selected values of b and (b 1 , b 2 ). The power for b = 1.0 and (b 1 , b 2 ) = (1.0, 1.0) represents the size of the test, which is set at the level 0.05. Results are based on 2, 000 simulation runs. As a warning against possible misuse of the PEL and SEL based tests, we first show that naively assuming standard chi-square limiting distributions for the PEL and the SEL ratio tests with public-use survey data files leads to invalid results. The sizes of the tests under Scenario A with different settings are presented in Table 1. It is apparent from Table 1 that the test sizes are off by a large margin relative to the nominal value 0.05 for all cases, ranging from 0.141 to 0.194 for the first test and from 0.186 to 0.264 for the second test.
The limiting distributions of the PEL and the SEL ratio tests generally follow the distribution of a quadratic form presented in Section 3. We consider four methods to determine the critical region for each test: I. Monte Carlo approximation to the distribution of the quadratic form using the estimated eigenvalues and the weighted χ 2 distribution; II. The first-order Rao-Scott correction method; III. The second-order Rao-Scott correction method; IV. The bootstrap calibration method as described in Section 4. We also included a fifth method for comparison: V. The Wald test based on the point estimator θ̂ and the variance estimator v(θ̂) for θ = θ N1 or θ = θ N1 − θ N2 , using the standard normal approximation to (θ̂ − θ)/{v(θ̂)} 1/2 . Method I uses the limiting distributions presented in Section 3. Methods I, II and III all require the estimation of eigenvalues of the matrix ∆ 1 , ∆ R 1 , ∆ 2 or ∆ R 2 . The bootstrap calibration method IV is extremely time consuming for repeated simulations, and its results are only included for Scenario A with 500 simulation runs. The results in Tables 2-5 can be summarized as follows.
(1) All three approaches (i.e., PEL, SEL and Wald) have test sizes close to the nominal value 0.05 for almost all cases. The PEL based tests perform the best in terms of valid test size, while the SEL based tests have a few cases with sizes bigger than 0.065. (2) The tests are generally more powerful when the error variance σ 2 is smaller (the cases with σ 1 and σ 2 ), where the auxiliary variables used for calibration weighting are more strongly correlated with the response variable. (3) Both the first and the second order Rao-Scott corrections (entries under II and III) provide results similar to those obtained using the actual limiting distributions (entries under I). (4) The validity of the replication weights is justified for cases with small sampling fractions, but the results based on the estimated eigenvalues (entries under I, II and III) seem to work well even if n/N = 10%. (5) The bootstrap calibration method (entries under IV) works very well for n/N = 2% in all cases. For cases with the large sampling fraction n/N = 10%, the size of the test for H 0 : θ N1 = θ N2 with σ = σ 1 is around 0.02 for both PEL and SEL, showing the sensitivity of the bootstrap calibrated tests to the replication weights. (6) The Wald test has performance similar to the SEL based tests in some cases but is less powerful in others.
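Method I above can be sketched in a few lines: given the estimated eigenvalues, the critical value of the weighted χ 2 limiting distribution is approximated by Monte Carlo simulation. A minimal Python illustration (the function name is hypothetical):

```python
import numpy as np

def weighted_chisq_critical_value(eigenvalues, alpha=0.05,
                                  n_mc=200_000, rng=None):
    """Monte Carlo critical value for Q = sum_j delta_j * chi^2_1,
    the weighted chi-square limit of the EL ratio statistic (Method I).

    eigenvalues: estimated eigenvalues delta_j of the relevant matrix
    (Delta_1, Delta_1^R, Delta_2 or Delta_2^R in the paper's notation).
    """
    rng = np.random.default_rng(rng)
    deltas = np.asarray(eigenvalues, dtype=float)
    z = rng.standard_normal((n_mc, deltas.size))
    q = (z ** 2) @ deltas        # draws from the weighted chi-square
    return np.quantile(q, 1 - alpha)
```

With a single eigenvalue equal to 1, the critical value reduces to the usual upper 5% point of χ 2 with one degree of freedom, about 3.84.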
Further investigation on the performance of the empirical likelihood methods for parameters defined through nonsmooth estimating functions is reported in the Appendix.

An Application to the GSS 2016 Dataset
The General Social Survey (GSS) is an annual cross-sectional survey conducted by Statistics Canada since 1985. The survey gathers data on social trends in order to monitor changes in the living conditions and the well-being of Canadians, and to provide information on specific social policy issues. The 2016 GSS focused on Canadians at Work and Home, and collected information on the lifestyle behaviour of Canadians that affects their health and well-being, both in the workplace and at home. The survey covered individuals aged 15 years and older living in private households in the 10 provinces of Canada. Public-use GSS micro data files, which include the final survey weights and 500 sets of bootstrap weights, can be accessed through Statistics Canada's Research Data Centre (RDC) or the Data Liberation Initiative (DLI) at major Canadian universities.
We analyzed a subset of the GSS 2016 data file using the pseudo empirical likelihood and the sample empirical likelihood methods developed in this paper. We explored the relationships between the response variable y on job satisfaction and a set of 14 covariates through logistic regression analysis. The y variable is dichotomized from the original 5-point Likert scale, i.e., y = 1 if the response is either "Very satisfied" or "Satisfied" and y = 0 otherwise. The set of covariates includes x 1 : Gender, with the full list of variables given in Section A.10 of the Appendix. The final survey weights and the bootstrap weights are rescaled such that i∈S w i = n and i∈S w (b) i = n, b = 1, · · · , 500. Note that the rescaling does not change the validity of the bootstrap weights for variance estimation as specified in Assumption 2.
The value exp(θ j ) represents the odds ratio (OR) for job satisfaction when x j changes from 0 to 1 given other covariates.
The estimating function for defining θ is given by g(x, y, θ) = x{y − µ(x ′ θ)}, where µ(x ′ θ) = exp(x ′ θ)/{1 + exp(x ′ θ)}. Let θ N = (θ N0 , θ N1 , · · · , θ N14 ) ′ be the finite population parameters defined by the census estimating equations. We computed the point estimates, the standard errors (SE), the odds ratios (OR) and the p-values for testing H 0 : θ Nj = 0 versus H 1 : θ Nj ≠ 0, j = 0, 1, · · · , 14, using the pseudo empirical likelihood (PEL) and the sample empirical likelihood (SEL) methods. Note that we have r = p in this case, and the point estimates, the SE and the OR are the same under the two methods. The p-values for hypothesis tests were computed using the first-order Rao-Scott correction as described at the end of Section 3.2. The SCAD penalty function proposed by Fan and Li (2001) was used for variable selection.
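The survey-weighted estimating equations above can be solved by Newton iterations on the weighted score. The following Python sketch is an illustration only, not the estimation routine used in the paper, and the function names are hypothetical:

```python
import numpy as np

def mu(eta):
    """Logistic mean function mu(x'theta) = exp(x'theta)/{1 + exp(x'theta)}."""
    return 1.0 / (1.0 + np.exp(-eta))

def weighted_score(theta, X, y, w):
    """Survey-weighted estimating function sum_i w_i x_i {y_i - mu(x_i'theta)}."""
    resid = y - mu(X @ theta)
    return X.T @ (w * resid)

def solve_theta(X, y, w, n_iter=25):
    """Newton iterations for the weighted estimating equations (sketch)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = mu(X @ theta)
        W = w * p * (1 - p)
        H = X.T @ (X * W[:, None])   # negative Jacobian of the score
        theta += np.linalg.solve(H, weighted_score(theta, X, y, w))
    return theta
```

At the solution, the weighted score is zero in every coordinate, which is the sample analogue of the census estimating equations defining θ N .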
Results of estimation, hypothesis testing and variable selection are presented in Table 6. The first major observation is that the pseudo empirical likelihood and the sample empirical likelihood provide similar results for almost all cases, with only one noticeable exception on the p-value for testing H 0 : θ N12 = 0. The second observation is that only three covariates, x 8 : Being Happy When Working Hard, x 10 : Employment Benefits - Paid Sick Leave, and x 12 : Unfair Treatment/Discrimination - Past 12 Months, show a significant association with the response variable on job satisfaction in individual tests given all other covariates in the model. The variable selection results, however, indicate that x 8 is the most significant factor for job satisfaction.

Additional Remarks
Public-use survey data files might be utilized by researchers with diverse backgrounds and for different scientific objectives. Descriptive population parameters such as means and proportions, especially at the level of user-defined domains, are often of interest. However, complex survey data have also been used for analytic purposes. One important application is hypothesis tests in the presence of nuisance parameters. Binder and Patak (1994) discussed an estimating equation based test on one parameter in the presence of another nuisance parameter. Oguz-Alper and Berger (2016) presented a profile empirical likelihood test with nuisance parameters under the setting that detailed design information, such as the first order inclusion probabilities and the population auxiliary information, is available. They showed that the limiting distribution of the empirical likelihood ratio statistic follows a standard chi-square distribution for certain sampling designs. General results for public-use survey data, such as Theorems 1-6 presented in Section 3, are not available in the existing literature. More importantly, naively assuming standard chi-square limiting distributions for the empirical likelihood ratio test statistics with public-use survey data files leads to invalid results, as shown by the simulation results presented in Table 1. A very important practical problem is variable selection when the survey dataset is used to fit a model involving a large number of covariates. The design-based variable selection techniques described in Section 3.4 are a major contribution of the current paper. Another topic of interest is to test the correctness of the specified model, which is equivalent to testing the unbiasedness of the estimating functions used in the constraints. A pseudo empirical likelihood or a sample empirical likelihood ratio test following Corollary 4 of Qin and Lawless (1994) seems to be possible. Detailed procedures are currently under investigation.
Empirical likelihood methods have been an active research topic during the past three decades, with many new developments covering different areas. Rao and Wu (2009) provided an overview of empirical likelihood for complex surveys up to 2009. There have been several advances in recent years on empirical likelihood for complex surveys, as evidenced by the additional references cited in this paper. Reid (2012) provided an overview of likelihood inference in complex settings, and the development of empirical likelihood methods for complex survey data featured prominently on her list. Our paper addresses a topic of both theoretical and practical importance in the analysis of public-use survey data files. Our proposed methods are valid for any public-use survey data files regardless of the original survey design. However, the bootstrap calibrated tests described in Section 4 put restrictions on how the final replication weights should be produced. Creating final replication weights for valid variance estimation (Assumption 2) has been known to be a challenging task at the data file production stage for complex surveys involving stratification and multi-stage unequal probability sampling. Our simulation results show that constructing replication weights to satisfy the requirements for the bootstrap calibration method is even harder. Another important topic is how to handle item nonresponse for public-use data files. Single imputation methods are a popular approach among statistical agencies for producing a single complete data file for public users. How to create replication weights for data files in the presence of imputation for missing values is a topic that deserves high attention in future research.
The following three lemmas are required for establishing the asymptotic normality of our proposed maximum pseudo and sample empirical likelihood estimators. Proofs of the lemmas follow similar arguments used in Zhao, Haziza and Wu (2018). Details are omitted.

A.2 Proof of Theorem 1
The proof has similarities to the proof of Theorem 1 in Qin and Lawless (1994). Define Q n1 (θ, λ) and Q n2 (θ, λ) as the partial derivatives of the pseudo empirical log-likelihood with respect to θ and λ, respectively. Taking the Taylor expansion of Q n1 (θ̂ PEL , λ̂ PEL ) and Q n2 (θ̂ PEL , λ̂ PEL ) around (θ N , 0) yields expansions in terms of σ n = ∥θ̂ PEL − θ N ∥ + ∥λ̂ PEL ∥, and the four terms involved in the expansions can be computed explicitly. Noting that Q n1 (θ N , 0) = i∈S w̃ i (S)g i (θ N ) = O p (n −1/2 ), it can be shown that σ n = O p (n −1/2 ). Combining the above arguments with Assumption 1, the asymptotic normality of the estimator θ̂ PEL is established. This completes the proof of Theorem 1.

A.3 Proof of Theorem 2
Denote λ N = λ(θ N ), the solution to the constraint equations at θ N . Applying the Taylor series expansion to g PEL (λ N ) around λ N = 0, together with Lemmas 1-3, leads to an asymptotic expansion of the pseudo empirical log-likelihood ratio statistic in terms of Û n (θ) = i∈S w i g(x i , y i , θ). Note that P 1 W 1 P 1 = P 1 . This, coupled with the proof of Theorem 1, yields the limiting quadratic form. By Assumption 1, it can be shown that ( √ n/N)Û n (θ N ) is asymptotically normally distributed with mean zero and a variance-covariance matrix of order O(1). Combining the above arguments, we can show that the statistic converges in distribution to a quadratic form in Q ∼ N(0, I r ), where I r is the r × r identity matrix and r is the dimension of the population estimating equations. This completes the proof of Theorem 2.

A.4 Proof of Theorem 3
Define Φ(θ) = ∂R(θ)/∂θ ′ , which is a k × p matrix. We first derive the asymptotic distribution of θ̂ R PEL . Note that finding the maximizer θ̂ R PEL = arg max θ∈Θ R r PEL (θ) is equivalent to optimizing an objective function with respect to (θ, λ, τ ), where τ is another k × 1 vector of Lagrange multipliers for the constrained maximization. The optimizer (θ̂ R PEL , λ̂ R PEL , τ̂ R PEL ) of r R PEL (θ, λ, τ ) satisfies the first order conditions. Using a multivariate Taylor series expansion to Q nj (θ̂ R PEL , λ̂ R PEL , τ̂ R PEL ) at (θ N , 0, 0), we have an expansion with H 11 = −W 1 , H 12 = (Γ, 0) and H 21 = H ′ 12 . Applying the theory of block matrix inversions, we obtain asymptotic representations for the estimators.

This further leads to the asymptotic distribution of θ̂ R PEL . We now derive the asymptotic distribution of the empirical log-likelihood ratio statistic. Noting that P R 2 W 1 P R 2 = P R 2 , and using the proof of Theorem 1 together with Assumption 1, it can be shown that the statistic converges in distribution to a quadratic form in Q ∼ N(0, I r ) with matrix ∆ R 1 , where Ω = (n/N 2 )V ar{ i∈S w i g i (θ N ) | F N }. The proof of Theorem 3 is then completed.

A.5 Proof of Theorem 4
Major steps of the proof are similar to the proof of Theorem 1. Define M n1 (θ, λ) and M n2 (θ, λ) analogously to Q n1 and Q n2 . Taking the Taylor series expansion of M n1 (θ̂ SEL , λ̂ SEL ) and M n2 (θ̂ SEL , λ̂ SEL ) at (θ N , 0) yields expansions in terms of σ n = ∥θ̂ SEL − θ N ∥ + ∥λ̂ SEL ∥, and the terms involved can be obtained by direct calculation.

This leads to an asymptotic expansion for λ̂ SEL . The proof of Theorem 4 is then completed by combining the above arguments with Assumption 1.

A.6 Proof of Theorem 5
The proof is similar to the proof of Theorem 2. Let λ N = λ(θ N ) be the solution to the constraint equations at θ N . Applying the Taylor series expansion to g SEL (λ N ) around λ N = 0, together with Lemmas 1-3, and then expanding −2nr SEL (θ N , λ N ) around λ N = 0, it follows from the proof of Theorem 4 that the statistic has the stated limiting quadratic form. The last step holds since P 2 W 2 P 2 = P 2 . Combining the above arguments completes the proof of Theorem 5.

A.7 Proof of Theorem 6
The proof is similar to the proof of Theorem 3. We first derive the asymptotic distribution of θ̂ R SEL . Finding the maximizer θ̂ R SEL = arg max θ∈Θ R r SEL (θ) is equivalent to optimizing an objective function with respect to (θ, λ, τ ), where τ is another k × 1 vector of Lagrange multipliers for the constrained maximization. The optimizer (θ̂ R SEL , λ̂ R SEL , τ̂ R SEL ) of r R SEL (θ, λ, τ ) satisfies the first order conditions. It can be shown through direct calculations, with Φ = Φ(θ N ), and by arguments similar to the proof of Theorem 3, that the estimator admits an asymptotic representation. It is easy to see that P R 4 W 2 P R 4 = P R 4 . Applying the Taylor series expansion and the proof of Theorem 5, together with Assumption 1, it can be shown that the statistic converges in distribution to the stated quadratic form, where Ω = (n/N 2 )V ar{ i∈S w i g i (θ N ) | F N }. This completes the proof.

A.8 Theoretical justification of the bootstrap method
The justification of the bootstrap method essentially involves establishing the bootstrap version of Theorem 5. We consider cases where the final survey weights w i are calibrated over the known population totals of the x variables using the chi-square distance D(w, d). Under regularity conditions similar to Assumptions 3-5 on the original survey design, we have B̂(θ) − B(θ) = O p (n −1/2 ) uniformly for all θ ∈ Θ, which gives the required asymptotic expansion. The bootstrap weights {w * i , i ∈ S * } are created by the same calibration procedure with T x replaced by T̂ xHT . Using similar arguments for the asymptotic expansion of LR SEL (θ N ), and conditional on the original sample, we have a similar expansion for the bootstrap version of the SEL ratio statistic. To justify the proposed bootstrap calibration method, it suffices to show that the design-based and bootstrap variances agree asymptotically as n → ∞, where V ar(· | F N ) represents the design-based variance and V ar(· | S) denotes the variance under the bootstrap sampling procedure, conditional on the original survey sample S.
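Calibration under the chi-square distance has the familiar closed form w i = d i (1 + x ′ i λ̂), with λ̂ solving the calibration constraints. A minimal Python sketch (the function name is hypothetical):

```python
import numpy as np

def calibrate_chisq(d, X, totals):
    """GREG-type calibrated weights under the chi-square distance
    D(w, d) = sum_i (w_i - d_i)^2 / d_i, subject to sum_i w_i x_i = T_x.

    d: design weights; X: n x p matrix of auxiliary variables x_i;
    totals: known population totals T_x.
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    T = np.asarray(totals, dtype=float)
    M = X.T @ (X * d[:, None])              # sum_i d_i x_i x_i'
    lam = np.linalg.solve(M, T - X.T @ d)   # Lagrange multipliers
    return d * (1.0 + X @ lam)
```

By construction the calibrated weights reproduce the benchmark totals exactly: X ′ w = T x .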
Let η̂ = i∈S d i e i (θ N ) and η̂ * = i∈S * d * i e * i (θ̂ SEL ), and let z i = (nd i ) −1 and z * i = (nd * i ) −1 . We can rewrite η̂ and η̂ * as η̂ = n −1 i∈S r̂ i and η̂ * = n −1 i∈S * r̂ * i , respectively, where r̂ i = e i (θ N )/z i and r̂ * i = e * i (θ̂ SEL )/z * i . Under the proposed with-replacement bootstrap procedure, we have V ar(η̂ * | S) = S 2 r /n, where S 2 r = n −1 i∈S (r̂ i − η̂)(r̂ i − η̂) ′ . If the original survey sample is selected by the single-stage PPS sampling with replacement method, then η̂ = n −1 i∈S e i (θ N )/z i is the standard Hansen-Hurwitz estimator, and the design-based variance V ar(η̂ | F N ) can be unbiasedly estimated by {n(n − 1)} −1 i∈S (r̂ i − η̂)(r̂ i − η̂) ′ . It follows that V ar(η̂ | F N )/V ar(η̂ * | S) → 1 as n → ∞. The result also applies to single-stage PPS sampling without replacement with the small sampling fractions commonly encountered in survey practice.
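The variance identity in the last step can be checked numerically: under with-replacement resampling of the r̂ i , the bootstrap variance of the resampled mean matches S 2 r /n up to Monte Carlo error. A small Python check (illustrative only; the function name is hypothetical):

```python
import numpy as np

def bootstrap_variance_check(r, B=4000, rng=None):
    """Compare the with-replacement bootstrap variance of the mean
    eta* = n^{-1} sum_i r_i* with the analytic value S_r^2 / n."""
    rng = np.random.default_rng(rng)
    r = np.asarray(r, dtype=float)
    n = r.size
    boot_means = np.array([rng.choice(r, size=n, replace=True).mean()
                           for _ in range(B)])
    v_boot = boot_means.var()
    v_analytic = r.var() / n   # S_r^2 uses divisor n, matching Section A.8
    return v_boot, v_analytic
```

The two values agree as B grows, which is the content of V ar(η̂ * | S) = S 2 r /n for the scalar case.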

A.9 Additional simulation results
Tables 7 and 8 summarize the results on the size and power of the tests for H 0 : θ N1 = 1.0 versus H 1 : θ N1 = b for PEL and SEL, respectively, with n/N = 10%. The results for b = 1.0 correspond to the size of the test with nominal value 0.05, and the results for b ≠ 1.0 represent the actual power of the test. Tables 9 and 10 summarize the results on the size and power of the tests for H 0 : θ N1 = θ N2 versus H 1 : (θ N1 , θ N2 ) = (b 1 , b 2 ), again with n/N = 10%. The results for (b 1 , b 2 ) = (1.0, 1.0) correspond to the size of the test and the results for other values of (b 1 , b 2 ) represent the power of the test.
We further investigate the performance of the empirical likelihood methods for parameters defined through nonsmooth estimating functions. We consider the finite population quantiles and construct empirical likelihood ratio confidence intervals using the proposed methods. The finite population values {(x 1i , x 2i , y i ), i = 1, · · · , N} for the simulation study are generated from the superpopulation model: y i = 0.5 + x 1i + x 2i + ε i , where x 1i ∼ Bernoulli(0.5), x 2i ∼ Exponential(1) and ε i ∼ χ 2 (3). The 100τ th finite population quantile θ N (τ ) is given by the solution to N i=1 g τ (y i , θ) = 0, where g τ (y i , θ) = I(y i ≤ θ) − τ and I(·) is the indicator function. We consider Scenario A discussed in the previous simulation for creating the final survey weights. We examine five different quantile levels at τ = 0.10, 0.25, 0.50, 0.75 and 0.90. Three methods are employed to construct the 95% confidence interval for θ N (τ ): the PEL approach, the SEL approach, and the normal approximation (NA) approach.
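The quantile estimating equation i w i {I(y i ≤ θ) − τ } = 0 corresponds to the design-weighted sample quantile; a minimal Python sketch (the function name is hypothetical):

```python
import numpy as np

def weighted_quantile(y, w, tau):
    """Design-weighted quantile: the smallest theta such that
    sum_i w_i {I(y_i <= theta) - tau} >= 0."""
    order = np.argsort(y)
    y_s = np.asarray(y, dtype=float)[order]
    w_s = np.asarray(w, dtype=float)[order]
    cum = np.cumsum(w_s) / w_s.sum()   # weighted empirical CDF at y_s
    return y_s[np.searchsorted(cum, tau, side='left')]
```

With equal weights this reduces to the ordinary sample quantile; with survey weights it gives the point estimator whose confidence intervals are compared in Table 11.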
The simulated average length (AL), the coverage probability (CP), the lower tail error (LE) and the upper tail error (UE) rates for the confidence interval (θ̂ L , θ̂ U ) of the parameter θ N (τ ) are computed over the simulation runs, where (θ̂ L , θ̂ U ) (k) is the confidence interval computed from the kth simulation sample and K is the total number of simulation runs.
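These summary measures are simple averages over the K simulated intervals; a Python sketch (the function name is hypothetical):

```python
import numpy as np

def interval_summary(lower, upper, theta):
    """AL, CP, LE and UE over K simulated intervals for a true value theta.

    lower, upper: arrays of the K lower and upper confidence bounds.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    AL = np.mean(upper - lower)                       # average length
    CP = np.mean((lower <= theta) & (theta <= upper)) # coverage probability
    LE = np.mean(theta < lower)                       # miss on the lower side
    UE = np.mean(theta > upper)                       # miss on the upper side
    return AL, CP, LE, UE
```

Note that CP + LE + UE = 1 by construction, so balanced tail errors mean LE and UE are both close to 0.025 for a 95% interval.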
Simulation results based on B = 500 sets of bootstrap replication weights and K = 1, 000 simulation runs are presented in Table 11. We have the following major observations: (1) Both the pseudo and the sample empirical likelihood approaches lead to excellent confidence intervals for quantiles in terms of coverage probabilities. (2) The SEL approach gives more balanced tail error rates than the PEL approach in most cases. (3) The Wald-type confidence intervals have lower coverages, especially for small or large quantiles (i.e., τ = 0.10 and 0.90).
(4) The pseudo and sample empirical likelihood ratio confidence intervals are slightly wider than the Wald-type intervals for small or large quantiles.

A.10 Further details of the General Social Survey
The 2016 General Social Survey of Statistics Canada focused on Canadians at Work and Home. The survey questionnaire contained more than 200 questions. The 15 variables used in the application reported in the main paper are derived from the original questions listed below.
• Job Satisfaction (JSR-02) In general, how satisfied are you with your job?