Unconditional empirical likelihood approach for analytic use of public survey data

Modeling survey data often requires knowledge of the design and weighting variables. With public-use survey data, some of these variables may not be available for confidentiality reasons. The proposed approach can be used in this situation, as long as calibrated weights and variables specifying the strata and primary sampling units are available. It gives consistent point estimation and a pivotal statistic for testing and confidence intervals. The proposed approach does not rely on with-replacement sampling, single-stage designs, negligible sampling fractions, or noninformative sampling. Adjustments based on design effects, eigenvalues, joint-inclusion probabilities or bootstrap are not needed. The inclusion probabilities and auxiliary variables do not have to be known. Multistage designs with unequal probability selection of primary sampling units are considered. Nonresponse can easily be accommodated if the calibrated weights include a reweighting adjustment for nonresponse. We use an unconditional approach, where the variables and the sample are random variables. The design can be informative.


INTRODUCTION
Fitting models on complex public-use survey data for secondary data analysis is common practice in many areas of the social sciences and economics. Survey data are rarely composed of independent and identically distributed (i.i.d.) observations, because the data collection process, called the sampling design, often involves clustering, stratification, and unequal-probability selection (e.g. Chambers & Skinner, 2003; Skinner et al., 1989). Estimation often relies on adjustments for the sampling design. However, users of public-use survey data often have limited design information, given by calibrated weights and stratification/clustering variables. With secondary data analysis, the inclusion probabilities and the auxiliary variables used for weighting are often not available because of confidentiality.
Suppose that the data are selected with a multistage unequal-probability design, with small or large sampling fractions. The proposed unconditional approach has the advantage of taking into account both the sampling design and the model defining the parameter. It does not need the auxiliary variables and inclusion probabilities. The point estimator is consistent. The empirical log-likelihood ratio function follows a χ²-distribution asymptotically under the null hypothesis, without adjustments involving eigenvalues, design effects, variance estimates, finite population corrections, or bootstrap.
We assume that the public-use survey data contain some calibration weights derived from auxiliary variables and inclusion probabilities (e.g. Deville & Särndal, 1992). These weights may also include some nonresponse adjustments. There are numerous situations when the inclusion probabilities and auxiliary variables are unavailable to data users, because they contain sensitive information which may identify some units. For example, the European Union Statistics on Income and Living Conditions (Eurostat, 2012) users' databases do not contain auxiliary variables. Another example is when the values of nonsurveyed auxiliary variables are linked to sampled units for weighting purposes. These auxiliary variables cannot be released by the producer of the data when interviewees have not given their consent to make them publicly available.
The weights allow consistent point estimation. However, design-based variance estimation implicitly relies on (first-order and joint) inclusion probabilities and auxiliary variables (Deville & Särndal, 1992) or on some bootstrap weights. Consistent design-based variance estimation may not be possible if this information is not available. The aim of this paper is to show that, under the proposed approach, this information is not required.
When the sampling design is noninformative (or ignorable), we can use a parametric approach based on a conditional "sample-likelihood," given the sample labels (e.g. Chambers et al., 2012). We may also use a disaggregated model with random effects to control for clustering and dichotomous variables for stratification. However, the inclusion probabilities can be associated with some variables of interest; that is, the design could be informative or nonignorable. In this case, the sample-likelihood needs to be adjusted for the sampling design, using Bayes' theorem (e.g. Krieger & Pfeffermann, 1992; Pfeffermann et al., 1998; Pfeffermann & Sverchkov, 1999). In other words, the information about the design is incorporated within a likelihood-based framework by specifying a model for the relationship between the inclusion probabilities and some variables of interest. This implicitly assumes that these probabilities are known, which is not the case for public-use data files. The adjusted sample-likelihood accounts for informativeness, but relies on restrictive assumptions about the design, such as "asymptotic independence," which is achieved under with-replacement sampling, Poisson sampling or negligible sampling fractions. The distribution of a sample-likelihood ratio test statistic cannot be easily derived, and variance estimation and tests are often based on the bootstrap.
Informative sampling is challenging within a conditional "sample-likelihood" framework. If a matrix of rescaled bootstrap weights (Rao et al., 1992) is made available with a public-use survey dataset, it can be used for point and variance estimation. This idea has been exploited to derive an empirical likelihood approach for public-use survey data that contain bootstrap weights. It has also been proposed to adjust the pseudo-empirical or empirical likelihood ratio function with eigenvalues computed from a bootstrap variance estimate, as in Wu and Rao (2006), Berger (2018) and Zhao and Wu (2018), together with a "bootstrap calibration method" which consists of computing the bootstrap quantiles of the distribution of the empirical log-likelihood ratio statistic under with-replacement sampling. This method can be extremely computer intensive, since computing a single value of an empirical log-likelihood ratio function is itself intensive, because it involves optimization. For the bootstrap calibration method, the bootstrap weights need to be produced in a very specific way and may differ from the rescaled bootstrap weights available. The information about the design and the auxiliary information is embedded within the matrix of bootstrap weights. This is an elegant and sensible technique for inference. However, there are several issues. Organizations releasing public-use data do not always provide this matrix of bootstrap weights, which cannot be constructed by users if the inclusion probabilities and auxiliary variables are not available. The bootstrap approach also assumes that the bootstrap variance is design-consistent. However, the bootstrap is valid under with-replacement sampling or negligible sampling fractions (Rao et al., 1992), which is not always the case with social data. For instance, nonnegligible sampling fractions are used with the "National Health and Nutrition Examination Survey" (National Center for Health Statistics, 2016). In the real data example of Section 9, the sampling fraction is approximately 10%. The bootstrap approach targets the finite population predictor of θ_0 under a conditional design-based approach. In this paper, we consider an unconditional model-based approach, with θ_0 being the parameter of interest. When modeling survey data, it is more natural to target the model parameter θ_0, and to base the inference on the model and on the design when it is informative. The bootstrap approach has the advantage of taking the effect of the auxiliary variables into account, when bootstrap weights are available and the bootstrap variance is consistent, that is, when the sampling fractions are small or under with-replacement sampling. This effect is not taken into account with the approach proposed in this paper. Nevertheless, we allow for large sampling fractions and sampling without replacement, without the need for design effects, eigenvalues, joint-inclusion probabilities or bootstrap weights.
In Section 2, we define the class of models considered. In Sections 3 and 4, we define the class of sampling designs and the survey weights. The asymptotic framework and some of the regularity conditions are outlined in Section 5. The empirical likelihood approach is introduced in Section 6. The main contribution can be found in Section 7, which shows the consistency of the variance within the self-normalised profile empirical log-likelihood ratio function. In Section 8, a simulation study supports our findings. The proposed approach is illustrated using the "Programme for International Student Assessment" (PISA) survey data in Section 9. It shows that we may fit very different models depending on whether we use the proposed empirical likelihood approach or a random effect method. Additional regularity conditions are provided in Appendix A. Computational algorithms and proofs can be found in Data S1.

CLASS OF MODELS SPECIFIED BY MOMENT CONDITIONS
Let Y ∈ ℝ^{d_y} denote a random vector, which usually contains response variables, some covariates and some additional side variables. We consider a class of models specified by "moment conditions," that is, let θ_0 ∈ ℝ^d be a parameter specified by

E_ξ{g(Y, θ_0)} = 0_{d_g},   (1)

where g(Y, θ) ∈ ℝ^{d_g} is an estimating function, with d_g ⩾ d.
Here, E_ξ is the expectation with respect to the model, and 0_d denotes a d-vector of zeros. Equation (1) specifies a wide class of models, which includes generalized linear models and models that are nonlinear in the parameter. It has the advantage of not relying on the distribution of error terms. Sample data will be used for estimating θ_0. With endogenous covariates, the proposed approach can be used by incorporating a suitable instrument matrix within the definition of g(Y, θ) (e.g. Donald et al., 2009), implicitly assuming that the instrument identifies the parameter (Newey, 1993). However, Domínguez and Lobato (2004) pointed out that this may not always be the case. Berger and Patilea (2022) proposed an empirical likelihood approach that deals with this identification problem with survey data.
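For illustration, and these are standard textbook forms rather than expressions reproduced from this paper, two estimating functions satisfying (1) with d_g = d are sketched below; the logistic form is also the one used in the simulation study of Section 8.

```latex
% Two standard examples of estimating functions satisfying (1), with d_g = d,
% where Y = (y, x^{\top})^{\top} collects the response and the covariates:
\begin{align*}
\text{linear model:}   \quad & g(Y,\theta) = x\,(y - x^{\top}\theta),
      && \mathrm{E}_{\xi}(y \mid x) = x^{\top}\theta_0,\\[2pt]
\text{logistic model:} \quad & g(Y,\theta) = x\,\{\,y - \operatorname{expit}(x^{\top}\theta)\,\},
      && \Pr\nolimits_{\xi}(y = 1 \mid x) = \operatorname{expit}(x^{\top}\theta_0).
\end{align*}
```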
We may also have some optional "side information." Suppose that we know a vector d_0 which is the solution to another moment condition, that is,

E_ξ{f(Y, d_0)} = 0_{d_f},   (2)

where f(Y, d) ∈ ℝ^{d_f}. Here, d_0 is treated as known, rather than as a parameter to estimate. The vector d_0, called "side information," usually takes the form of descriptive statistics of some of the variables within Y (e.g. Chaudhuri et al., 2008). For example, d_0 could be aggregate information from a census or from large surveys, such as a vector of means, totals, ratios or quantiles, so that it can be assumed that d_0 is a vector of constants. Side information can also be used in micro-econometrics (Imbens & Lancaster, 1994) when combining micro and macro data. Taking (2) into account may improve the estimation of θ_0 (e.g. Chaudhuri et al., 2008; Deville & Särndal, 1992; Imbens & Lancaster, 1994). The distributions of g(Y, θ_0) and f(Y, d_0) do not need to be specified. We simply assume that their second-order moments exist, that is, E_ξ{‖g(Y, θ_0)‖²} < ∞ and E_ξ{‖f(Y, d_0)‖²} < ∞. Similar fourth-order moment conditions will also be needed, such as condition (A4) in Appendix A.

CLUSTERED POPULATION, SAMPLING DESIGN, AND WEIGHTING
Suppose that we have a population of M units clustered into N primary sampling units (psus), denoted 𝒰_1, … , 𝒰_i, … , 𝒰_N. Let U := {1, … , N} be the set of labels of the N psus. Suppose that a sample S of n psus is selected without replacement with unequal probabilities π_i. The randomly selected sample S is a random variable. We assume that S is a "stratified sample" of psus from H strata. The quantity n_h denotes the number of psus sampled from stratum h, where h = 1, … , H and ∑_{h=1}^{H} n_h = n. Within each sampled psu 𝒰_i (i ∈ S), a sample S_i of n_i units is selected. The overall number of units selected is m = ∑_{i∈S} n_i. We have a two-stage design when S_i is a single-stage design. If the S_i involve more than one stage, we have a multistage design. We have a single-stage cluster design when S_i = 𝒰_i. Furthermore, if the psus 𝒰_i are made up of a single unit, we have a single-stage design. Without loss of generality, hereafter we shall consider two-stage designs. Nonresponse can be treated as an ultimate phase.
We impose no restriction on the sampling fractions n_h/N_h, which may be large or small. We assume that the first-stage sample sizes n_h are not random but fixed by design.
We consider an unconditional framework, where both the sample and the population values {Y_1, … , Y_M} are random variables. The first-order inclusion probabilities are also assumed random. Inference will not be based on the conditional sampling distribution given the sample's labels or {Y_1, … , Y_M}, as with model-based or design-based approaches. Let g_{j|i}(θ) := g(Y_j, θ), where Y_j is the observed value of Y for unit j ∈ 𝒰_i. We shall assume that the random variables g_{j|i}(θ) are independent between psus, that is, g_{j|i}(θ_0) and g_{ℓ|k}(θ_0) are independent whenever i ≠ k (condition (5)). This assumption allows for possible correlation within psus, that is, g_{j|i}(θ_0) and g_{ℓ|i}(θ_0) can be dependent.
For example, suppose that the response variable follows a random effect model given by Y_j = x_{j|i}^⊤ θ_0 + ε_{j|i}, for j ∈ 𝒰_i, with ε_{j|i} = u_i + e_{j|i}, where the u_i and the e_{j|i} are both i.i.d. Here, x_{j|i} represents some exogenous covariates. Now the estimating function is g_{j|i}(θ) = x_{j|i}(Y_j − x_{j|i}^⊤ θ). We see that we have an intra-psu correlation between g_{j|i}(θ_0) and g_{ℓ|i}(θ_0), because of the random effect u_i. Nonetheless, g_{j|i}(θ_0) and g_{ℓ|k}(θ_0) are independent for i ≠ k, as in (5). With the proposed approach, it will not be necessary to specify a random effect and its distribution.
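As a short check (a sketch introduced here for illustration, with σ_u² and σ_e² denoting the variances of u_i and e_{j|i}, and the covariates treated as fixed), the within-psu dependence induced by the random intercept is explicit:

```latex
% Within-psu dependence of the estimating functions under the random-intercept
% model, with Var(u_i) = \sigma_u^2, Var(e_{j|i}) = \sigma_e^2:
\begin{align*}
g_{j|i}(\theta_0) &= x_{j|i}\,(u_i + e_{j|i}),\\
\operatorname{Cov}_{\xi}\{g_{j|i}(\theta_0),\, g_{\ell|i}(\theta_0)\}
    &= \sigma_u^{2}\, x_{j|i}\, x_{\ell|i}^{\top} \neq 0, && j \neq \ell,\\
\operatorname{Cov}_{\xi}\{g_{j|i}(\theta_0),\, g_{\ell|k}(\theta_0)\}
    &= 0, && i \neq k .
\end{align*}
```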

SURVEY WEIGHTS
Let w_{j|i} denote the survey weight of a unit j ∈ S_i. We assume that there exist some quantities γ_{j|i} such that

w_{j|i} = γ_{j|i} (π_i π_{j|i})^{-1},   (6)

where π_{j|i} is the probability of selecting a unit j ∈ S_i and π_i is the probability of selecting psu i. The γ_{j|i} can be g-weights (Särndal et al., 1992, §6.5) or some calibration adjustments (Deville & Särndal, 1992) derived from auxiliary variables. Assumption (6) means that the weights w_{j|i} have been derived by adjusting the sampling weights (π_i π_{j|i})^{-1} by some reweighting technique. The quantities γ_{j|i} depend on the units in the entire sample. We suppose that the w_{j|i} are known for j ∈ S_i and i ∈ S. However, γ_{j|i}, π_i and π_{j|i} are not available to secondary users (or data analysts). This corresponds to the usual situation where some weights w_{j|i} are provided as part of public-use data files, while the more sensitive information given by the inclusion probabilities and calibration variables is not revealed. Hereafter, we shall use Greek letters for unknown quantities and Latin letters for known quantities. We treat γ_{j|i}, π_i, and π_{j|i} as random variables. The probabilities π_i and π_{j|i} may be associated with Y when the design is informative. The function f(Y, d) used within (2) could be, but does not have to be, based upon the auxiliary variables used for computing the weights w_{j|i}.
Consider two "weighted estimating functions" for psu i: the known function

g̃_i(θ) := ∑_{j∈S_i} w_{j|i} g_{j|i}(θ),   (7)

and the function ĝ_i(θ) of (8), which involves the unknown probabilities π_i. The key feature is the fact that the weighted functions g̃_i(θ) are known and can be computed for a given value of θ. On the other hand, the ĝ_i(θ) cannot be computed by secondary users (or data analysts), because the π_i are not available to them.
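As a minimal sketch (the column names and data frame layout are hypothetical, and the estimating function used is the linear-regression example given earlier), the known functions g̃_i(θ) of (7) can be computed from a public-use file containing only the weights, stratum and psu identifiers:

```python
import numpy as np
import pandas as pd

def psu_estimating_functions(df, theta, y_col="y", x_cols=("x1", "x2"),
                             w_col="weight", psu_col="psu", stratum_col="stratum"):
    """Compute the known psu-level weighted estimating functions
    g_tilde_i(theta) = sum_{j in S_i} w_{j|i} * g_{j|i}(theta)
    for the linear-regression estimating function g = x * (y - x'theta).
    Only public-use information (weights, psu and stratum identifiers) is used."""
    X = np.column_stack([np.ones(len(df))] + [df[c].to_numpy() for c in x_cols])
    resid = df[y_col].to_numpy() - X @ theta            # y_j - x_j' theta
    g = X * (df[w_col].to_numpy() * resid)[:, None]     # w_{j|i} * g_{j|i}(theta)
    out = pd.DataFrame(g, index=df.index).groupby(df[psu_col]).sum()
    out[stratum_col] = df.groupby(psu_col)[stratum_col].first()
    return out  # one row per sampled psu i in S

# Example with fabricated public-use data:
# rng = np.random.default_rng(0)
# df = pd.DataFrame({"y": rng.normal(size=8), "x1": rng.normal(size=8),
#                    "x2": rng.normal(size=8), "weight": rng.uniform(1, 3, 8),
#                    "psu": [1, 1, 2, 2, 3, 3, 4, 4], "stratum": [1, 1, 1, 1, 2, 2, 2, 2]})
# print(psu_estimating_functions(df, theta=np.zeros(3)))
```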

ASYMPTOTIC FRAMEWORK
The asymptotic framework is outlined as follows. We consider a sequence of nested populations of N^{[t]} psus, and a sequence of samples of n^{[t]} psus, with 0 < n^{[t]} < n^{[t+1]} and n^{[t]} < N^{[t]} for all t, where t → ∞ (e.g. Hájek, 1964; Isaki & Fuller, 1982). To simplify notation, the index t will be dropped in what follows. Thus, t → ∞ implies that N → ∞, n → ∞, and m → ∞. We assume that n/N, n_h/n, N_h/N, and H are constants free of the limiting process, where N_h denotes the number of psus within stratum h. Hence, n_h → ∞ and N_h → ∞. We also consider that N_h − n_h → ∞. This excludes heavily stratified designs where n_h is bounded and H tends to infinity. The PSU sizes are considered bounded asymptotically, that is, m/n < ∞ and M/N < ∞. The within-PSU sizes are assumed finite. Thus, the proposed approach is valid asymptotically even when the cluster sample sizes are small. Let o_p(⋅) and O_p(⋅) denote orders of convergence in probability with respect to the model and the sampling design. We assume the convergence conditions (9) and (10), where ‖⋅‖ denotes the Frobenius norm. Here, E_d^{(2)} is the expectation with respect to the second-stage design. Analogous assumptions about the side information are given by (A1) and (A2) in Appendix A. Conditions (9) and (10) state that the law of large numbers holds. These are standard assumptions which are often made for deriving asymptotic properties of survey estimators (e.g. Fuller, 2009, §6).
Let ζ_{j|i} denote the unknown vector of auxiliary variables used for producing w_{j|i}, with j ∈ S_i. By result 5 of Deville and Särndal (1992), the weights w_{j|i} admit an expansion (11) involving a regression coefficient B̂_g, given by (12), and some factors q_{j|i} used for deriving w_{j|i}. Let us assume that conditions (13) and (14) hold. The key assumption is (13), which holds under normal circumstances when the auxiliary variables ζ_{j|i} are uncorrelated with g_{j|i}(θ_0). This is a situation usually met in practice, when g_{j|i}(θ_0) is an estimating function of a generalized linear model. Since B̂_g is the regression coefficient between ζ_{j|i} and g_{j|i}(θ_0), and ζ_{j|i} is unknown, a user cannot compute the vector B̂_g and would not know whether B_g is indeed negligible. Therefore, it will not be possible for a user to verify that (13) holds. If (13) does not hold, we expect more conservative variance estimates and confidence intervals.
When making inference about θ_0, it is impossible to take the effect of the unknown variables ζ_{j|i} into account if no proxies or bootstrap weights are available. The only solution is to conjecture that the ζ_{j|i} are not correlated with g_{j|i}(θ_0); in other words, to assume (13). If proxies of some auxiliary variables are available, they can always be used as side variables. Now, (11), (13), and (14) imply that the effect of the calibration adjustments on the weighted estimating functions is asymptotically negligible.

PSU-LEVEL EMPIRICAL LIKELIHOOD
The "maximum empirical likelihood estimator" is defined by

θ̂ := arg max_{θ∈Θ} ℓ(θ),   (17)

where Θ is the parameter space and ℓ(θ) is the PSU-level empirical log-likelihood function given by (18). Here, z_ih = 1 if PSU i belongs to stratum h and z_ih = 0 otherwise. Note that the p_i within (18) are defined at PSU level. This does not mean that we use a PSU-level model, because the model (1) is specified at unit level. The function ℓ(θ) is not a parametric likelihood, but we shall see that it behaves like a likelihood. The Lagrange function associated with the optimization in (18) is given by (20), where the known vector t is a scaling factor for the Lagrange multiplier t + η*. Here, 1_H is an H-vector of ones. Since t^⊤ c*_i(θ) = 1, maximizing (20) with respect to the p_i and η* reduces to a dual optimization problem. Note that η* ∈ Λ*(θ) ensures that 0 < p_i ⩽ 1, as in Qin and Lawless (1994). The computation of η*(θ) is discussed in Appendix B of Data S1. The function (18) reduces to the empirical likelihood function of Berger and Torres (2012, 2014, 2016) when γ_{j|i} = 1 ∀i, j, where γ_{j|i} is given by (6).
Furthermore, if we have a single stratum and a single-stage design, (18) becomes the sample empirical likelihood function of Chen and Kim (2014). Note that we implicitly consider γ_{j|i} ≠ 1, ∀i, j, because γ_{j|i} = 1 would mean that the π_i are known.
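For intuition, the sketch below shows the dual computation in its simplest special case: a single stratum, no side information, and generic centred psu-level functions u_i, that is, the classical empirical likelihood dual of Qin and Lawless (1994). The actual optimization for (18), with stratification and the constraint structure described above, is the one given in Appendix B of Data S1.

```python
import numpy as np

def el_dual(u, max_iter=50, tol=1e-10):
    """Classical empirical likelihood dual (Qin & Lawless, 1994, style):
    maximize sum_i log(n * p_i) subject to sum_i p_i = 1 and sum_i p_i u_i = 0.
    Returns the Lagrange multiplier lam, the probabilities p_i and
    the log empirical likelihood ratio  -2 * sum_i log(n p_i).
    u: (n, d) array of centred psu-level (weighted) estimating functions."""
    u = np.atleast_2d(np.asarray(u, dtype=float))
    n, d = u.shape
    lam = np.zeros(d)
    for _ in range(max_iter):
        denom = 1.0 + u @ lam
        if np.any(denom <= 1.0 / n):       # keep all p_i in (0, 1]
            lam *= 0.5                     # pull back towards the feasible lam = 0
            continue
        score = (u / denom[:, None]).sum(axis=0)            # root condition on lam
        hess = -(u[:, :, None] * u[:, None, :]
                 / (denom ** 2)[:, None, None]).sum(axis=0)  # Jacobian of the score
        step = np.linalg.solve(hess, -score)                 # Newton step
        lam = lam + step
        if np.linalg.norm(step) < tol:
            break
    denom = 1.0 + u @ lam
    p = 1.0 / (n * denom)
    log_ratio = 2.0 * np.log(denom).sum()  # equals -2 * sum_i log(n p_i)
    return lam, p, log_ratio
```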

Empirical likelihood test
Testing will be based on profiling, as in Qin and Lawless (1994). The main contribution of the paper is to show that the profile empirical log-likelihood ratio function has a χ²-distribution under the null (see (29)).
Here, θ̃ denotes a vector within the parameter space of θ_0, and Υ is the parameter space over which the remaining (nuisance) parameters are profiled. We shall show that r(θ̃) follows an asymptotic χ²-distribution under H_0. This result has only been shown in a more restrictive design-based framework, involving negligible sampling fractions, weighting auxiliary variables and known π_i (Berger, 2018; Oǧuz-Alper & Berger, 2016a).
Here, ĝ(θ_0) is a regression estimator: as shown in (24), it is a weighted sum of residuals ê_i(θ_0), which involve a regression coefficient B̂(θ_0). Note that we obtain the residuals ê_i(θ_0) because of the side-information constraint ∑_{i∈S} p_i f̃_i = 0_{d_f} within (18). The vector B̂(θ_0) should not be confused with (12).
The crucial term of (23) is the variance estimator V̂(θ_0), defined by (25). It is possible to generalize (23) to nondifferentiable functions, by assuming that ĝ(θ_0) converges to a differentiable function, as in Zhao and Wu (2018). However, this is beyond the scope of this paper, which aims to show that the empirical likelihood-based inference captures the effect of the design and of the model, even with large sampling fractions.
Since ĝ(θ_0) is a regression estimator, the design-consistency property (27) holds (Robinson & Särndal, 1983), where E_d is the expectation with respect to the sampling design. Furthermore, we assume that the central limit theorem (28) holds for ĝ(θ_0), where Σ is the "model-design variance" of the estimator N^{-1} ĝ(θ_0), defined by (32) in Section 7. We shall treat (27) and (28) as regularity conditions. Justification for (28) can be found in Fuller (2009, §1.3) and Bertail et al. (2017).
The key result of this paper is to show that (25) is a consistent estimator of Σ (see Theorem 2). In this case, (23) and (28) imply that, under H_0: θ_0 = θ̃,

r(θ̃) → χ²_{df=d},   (29)

in distribution with respect to the model and the design. Here, χ²_{df=d} denotes a χ²-distribution with d degrees of freedom, where d is the trace of the symmetric idempotent matrix (26). Note that Equation (24) is a weighted sum of residuals ê_i(θ_0), because of the constraint ∑_{i∈S} p_i f̃_i = 0_{d_f} within (18). Thus, the f̃_i may reduce the variance of (24) and increase the power, because (25) is a residual variance. Property (29) means that r(θ̃) is an ancillary test statistic, which behaves like the usual likelihood ratio statistic. It can be used for testing the relative fit of two nested models, with θ_0 being the additional parameters of the full model and θ̃ = 0. A goodness-of-fit test can also be based on (29); for example, when θ_0 is an intercept. Confidence intervals can be constructed when θ_0 is scalar (d = 1); that is, the confidence interval with nominal level 1 − α is

{θ : r(θ) ⩽ χ²_1(α)},   (30)

where χ²_1(α) is the upper α-quantile of the χ²-distribution with 1 degree of freedom.

Lemma 1 shows that the test on θ_0 based on (22) is asymptotically consistent against alternatives H_a: θ_0 = θ̃, where θ̃ := θ_0 + b_n δ and δ is such that ‖δ‖ = 1; that is, the local power tends to 1 as n → ∞.

Lemma 1. Under (9), (10), and the conditions (A1)-(A5), (A7)-(A8) and (A11) from Appendix A, there exists a sequence r_min → ∞ such that P{r(θ̃) ⩾ r_min} → 1, as n → ∞, where θ̃ := θ_0 + b_n δ, for some δ ∈ ℝ^d such that ‖δ‖ = 1. Here, b_n denotes an arbitrary sequence tending to zero and such that n b_n² → ∞.

Lemma 1 implies Theorem 1 below, which establishes the √m-consistency of θ̂.
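To illustrate the Wilks-type calibration used in (30), here is the textbook single-stage construction of an empirical likelihood interval for a scalar mean (not the psu-level construction of this paper): the interval endpoints are the parameter values at which the log-likelihood ratio crosses the χ²_1 quantile.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_log_ratio_mean(x, mu):
    """Empirical log-likelihood ratio r(mu) = 2 * sum_i log(1 + lam * (x_i - mu))
    for a scalar mean mu (textbook single-stage empirical likelihood)."""
    u = np.asarray(x, dtype=float) - mu
    if u.min() >= 0.0 or u.max() <= 0.0:          # mu outside the convex hull
        return np.inf
    lo = (-1.0 + 1e-10) / u.max()                 # feasible range for lam keeps
    hi = (-1.0 + 1e-10) / u.min()                 # all 1 + lam * u_i > 0
    lam = brentq(lambda l: np.sum(u / (1.0 + l * u)), lo, hi)
    return 2.0 * np.sum(np.log(1.0 + lam * u))

def el_confidence_interval(x, level=0.95):
    """Wilks-type interval {mu : r(mu) <= chi-square_1 quantile}, cf. (30)."""
    cutoff = chi2.ppf(level, df=1)
    xbar = float(np.mean(x))
    lower = brentq(lambda m: el_log_ratio_mean(x, m) - cutoff, np.min(x) + 1e-8, xbar)
    upper = brentq(lambda m: el_log_ratio_mean(x, m) - cutoff, xbar, np.max(x) - 1e-8)
    return lower, upper

# rng = np.random.default_rng(1)
# print(el_confidence_interval(rng.exponential(size=200)))
```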

Theorem 1. Under the conditions of Lemma 1, we have √m (θ̂ − θ_0) = O_p(1).
The proofs of Lemma 1 and Theorem 1 can be found in Appendix D of Data S1. Note that the asymptotic normality of θ̂ can also be established from (28).
Variance estimation is not needed for testing, since tests and confidence intervals can be based on (29). Nonetheless, an estimate of the variance matrix of θ̂ can easily be computed. Since θ̂ is the solution to an estimating equation and V̂(θ_0) is a consistent estimator of Σ (see Section 7), the usual "sandwich" variance estimator of θ̂, based on Taylor's theorem, is given by (31). This estimator resembles the variance estimator used with pseudo-likelihood (see Appendix C in Data S1), but there is a crucial difference. Here, we use V̂(θ̂) instead of the two-stage variance used with pseudo-likelihood (for more details, see Appendix C in Data S1). The matrix V̂(θ̂) can be viewed as a between-PSU stratified variance estimator (Hansen & Hurwitz, 1943), which overestimates the variance when the sampling fraction is large. Nevertheless, it turns out that V̂(θ̂) overestimates the design variance by an amount which estimates the model variance component (see the proof of Theorem 2 in Appendix D of Data S1).
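A plausible numerical sketch of a sandwich computation of this type follows; it is written from the description above rather than copied from (31), scaling constants are omitted, and the inputs (a Jacobian matrix and the psu-level weighted estimating functions) are assumed to be supplied by the user.

```python
import numpy as np

def sandwich_variance(g_psu, strata, G_hat):
    """Sketch of a sandwich estimator of the form G^{-1} V G^{-T}.
    g_psu : (n, d) array whose rows are the psu-level weighted estimating
            functions evaluated at theta_hat.
    strata: (n,) array of stratum labels for the sampled psus.
    G_hat : (d, d) Jacobian of the weighted estimating equation w.r.t. theta.
    V is the between-psu stratified ("with-replacement") variance, with the
    usual n_h / (n_h - 1) factor."""
    g_psu = np.asarray(g_psu, dtype=float)
    strata = np.asarray(strata)
    d = g_psu.shape[1]
    V = np.zeros((d, d))
    for h in np.unique(strata):
        gh = g_psu[strata == h]
        n_h = gh.shape[0]
        dev = gh - gh.mean(axis=0)                 # deviations from the stratum mean
        V += n_h / (n_h - 1) * dev.T @ dev         # stratum contribution
    G_inv = np.linalg.inv(G_hat)
    return G_inv @ V @ G_inv.T
```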

MAIN RESULT: PROOF OF THE ASYMPTOTIC CONSISTENCY OF THE VARIANCE (25)
This section contains the main contribution of this paper. For (29) to hold under (28), V̂(θ_0) needs to be a consistent estimator of the variance of N^{-1} ĝ(θ_0). In this section, we present the key assumptions needed for the asymptotic unbiasedness and consistency of V̂(θ_0). Additional regularity conditions are given in Appendix A. The proofs can be found in Appendix D of Data S1. Using Taylor's theorem, we obtain the asymptotic unconditional "model-design variance" (32) of N^{-1} ĝ(θ_0) by replacing B̂(θ_0) by B̄ := E_ξ E_d{B̂(θ_0)} within (24) (e.g. Fuller, 2009, §2.2.2). The expectation and variance under the model are denoted by E_ξ and V_ξ. Here, E_d and V_d are the expectation and variance with respect to the design. The variance V_d{g̃(θ_0)} is the two-stage variance given by (33). The operators E_d^{(1)} and E_d^{(2)} are the first- and second-stage expectations. The first stage refers to the selection of PSUs. The second stage is the selection of units within PSUs, which may contain more than one stage. The first- and second-stage variance operators are V_d^{(1)} and V_d^{(2)}. The operators E_d^{(2)} and V_d^{(2)} are conditional expectations and variances given the first-stage sample S. To simplify the notation, we shall use E_d^{(2)}(⋅) and V_d^{(2)}(⋅) instead of E_d^{(2)}(⋅|S) and V_d^{(2)}(⋅|S). Since n → ∞, we need an asymptotic expression for the PSU-level joint-inclusion probabilities. We consider that the joint-inclusion probabilities π_{ik} between PSUs i and k are given by the asymptotic expression (35) of Hájek (1964), where π_{ik} = π_i when i = k, and the remainder tends to 0 uniformly as d_h := ∑_{k∈U_h} π_k(1 − π_k) → ∞, where U_h denotes the set of psu labels within stratum h. Note that π_{ik} = π_i π_k when i and k belong to different strata, because z_{ih} z_{kh} = 0, ∀h. Equation (35) can be justified under weak conditions given by Hájek (1964). Regularity conditions can be found in Berger (1998, 2011). Note that d_h → ∞ implies that n_h → ∞ and N_h − n_h → ∞, because of Chebyshev's sum inequality. Equation (35) has the advantage of allowing large sampling fractions n_h/N_h. There is ample evidence that (35) is suitable for common designs (e.g. Berger, 1998; Berger, 2011; Haziza et al., 2008; Matei & Tillé, 2005). It turns out that the variance (25) implicitly includes a Horvitz and Thompson (1952) variance based on (35) (for more details, see the proof of Lemma 2 in Data S1).
We do not need the exact joint-inclusion probabilities for given n_h and N_h, because this would involve a nonasymptotic setup where n does not tend to ∞. Since n → ∞, we have to use an asymptotic expression for these probabilities. Exact joint-inclusion probabilities would be useless for deriving asymptotic properties; only an asymptotic expression is needed. Furthermore, exact joint-inclusion probabilities rely on the π_i, which are unknown.
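For reference, the stratified version of Hájek's (1964) approximation referred to in (35) is usually written as follows; this reproduces Hájek's classical formula, and the exact remainder term used in the paper may differ.

```latex
% Stratified form of Hajek's (1964) approximation, as referred to in (35);
% i and k are distinct psus, z_{ih} is the stratum indicator, and U_h is the
% set of psu labels in stratum h:
\begin{align*}
\pi_{ik} &= \pi_i \pi_k
   \Bigl\{ 1 - \sum_{h=1}^{H} z_{ih} z_{kh}\,
           \frac{(1-\pi_i)(1-\pi_k)}{d_h} \Bigr\}\,\{1 + o(1)\},
   \qquad i \neq k,\\
d_h &:= \sum_{\ell \in U_h} \pi_{\ell} (1 - \pi_{\ell}),
\end{align*}
% so that \pi_{ik} = \pi_i \pi_k for psus in different strata, and the
% correction vanishes as d_h \to \infty.
```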
In order to derive the asymptotic expectation of V̂(θ_0), we need an asymptotic expression for the PSU-level "anticipated variance" of (33), under informative sampling. This expression is given by Lemma 2, whose proof can be found in Appendix D of Data S1.

Theorem 2. Under the conditions of Lemma 2 and conditions (A10), (A11), (A25), and (A26) from Appendix A, the estimator V̂(θ_0) defined by (25) is a consistent estimator of the model-design variance Σ.
The proof of Theorem 2 can be found in Appendix D of Data S1. Theorem 2 and Slutsky's theorem imply (29).

SIMULATION STUDY
We show that the proposed approach has similar performance to the customary pseudo-likelihood approach based on full information about the design, given by the stratification/clustering variables, inclusion probabilities and auxiliary variables used for deriving the weights (6). The proposed empirical likelihood-based inference has the advantage of not relying on these probabilities and auxiliary variables. The pseudo-likelihood approach is described in Appendix C of Data S1. We also consider a parametric model-based approach with a random intercept. The aim of the simulation studies is not to show that the proposed approach is more accurate than pseudo-likelihood. We use pseudo-likelihood as the benchmark. We want to show that the proposed approach, which does not involve the inclusion probabilities and auxiliary variables, is as accurate and has similar confidence interval coverages as pseudo-likelihood. We consider randomized stratified systematic samples of n = 400 PSUs selected from populations of different sizes containing N = 800, 1333, 4000, and 8000 PSUs, in order to obtain different sampling fractions: n/N = 0.05, 0.10, 0.30, and 0.50. The PSUs are randomly allocated to three equal-sized strata. Equal allocation is used. The number N_i of units within PSU 𝒰_i is generated randomly using a positively skewed beta distribution, that is, N_i ∼ ⌊Beta(1, 20) × 280 + 20⌋. This gives 20 ⩽ N_i ⩽ 300. The PSU-level inclusion probabilities π_i are proportional to N_i. Within a selected PSU 𝒰_i, a simple random sample of size n_i = min{5, ⌊N_i/5⌋} is selected.
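A sketch of the first stage of this simulation design follows (population generation and a randomized systematic PPS selection of PSUs). The routine below is illustrative; the regression model (44) generated within each PSU is omitted, and a PSU hit more than once by the systematic selection is simply kept once.

```python
import numpy as np

rng = np.random.default_rng(2024)

def make_population(N):
    """PSU sizes N_i = floor(Beta(1, 20) * 280 + 20)  (so 20 <= N_i <= 300),
    randomly allocated to three roughly equal-sized strata."""
    sizes = np.floor(rng.beta(1, 20, size=N) * 280 + 20).astype(int)
    strata = rng.permutation(np.arange(N) % 3 + 1)
    return sizes, strata

def systematic_pps(sizes, n):
    """Randomized systematic PPS: psus are listed in random order and n equally
    spaced points are laid over the cumulative sizes, so that pi_i is
    approximately n * N_i / sum(N_i).  Duplicate hits are dropped."""
    order = rng.permutation(len(sizes))
    cum = np.cumsum(sizes[order]).astype(float)
    step = cum[-1] / n
    picks = rng.uniform(0.0, step) + step * np.arange(n)
    return np.unique(order[np.searchsorted(cum, picks)])

N, n = 800, 400                      # PSU-level sampling fraction n/N = 0.5
sizes, strata = make_population(N)
sample = []
for h in (1, 2, 3):                  # equal allocation (up to rounding)
    in_h = np.where(strata == h)[0]
    sel = systematic_pps(sizes[in_h], n // 3)
    sample.extend(in_h[sel].tolist())
# Second stage: within each selected PSU i, an SRS of n_i = min(5, N_i // 5) units.
second_stage_sizes = {int(i): int(min(5, sizes[i] // 5)) for i in sample}
```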
The known parameter d_0 contains the proportions of subgroups, created by sorting the unit-level data according to the variable x̃_{2j} = x_{2j} + ε̃_j, which is correlated with x_{2j}, where ε̃_j ∼ N(0, 1) and corr(x_{2j}, x̃_{2j}) = 0.7. The first 30% of units are allocated to the first group, the next 20% to the second group, and the remaining units belong to the third group. The known parameter d_0 is given by the proportions of units within the first two groups, that is, d_0 := (0.3, 0.2)^⊤. Thus, f(Y_j, d_0) = {𝟙(j ∈ group 1) − 0.3, 𝟙(j ∈ group 2) − 0.2}^⊤, where 𝟙(j ∈ group g) = 1 if unit j belongs to group g, and 𝟙(j ∈ group g) = 0 otherwise. The proportion of the third group is redundant and should not be included within d_0 and f(Y_j, d_0). Here, the variables 𝟙(j ∈ group 1) and 𝟙(j ∈ group 2) are included within Y_j.

In Table A1, we report the relative efficiency, defined as the MSE of the empirical likelihood estimator (17) divided by the MSE of the pseudo-likelihood estimator. We notice a smaller MSE for the empirical likelihood estimator of the intercept. This difference is more pronounced with larger intra-psu correlation. For the other estimators, there are negligible differences between the MSEs of empirical likelihood and pseudo-likelihood.
The observed relative biases (RB) are given in Table A2. The empirical likelihood and pseudo-likelihood approaches give similar biases. Not surprisingly, with large intra-PSU correlation, the estimators based on a random effect model can be biased. When the intra-PSU correlation is zero, we observe no significant differences between the RB of the estimators based on empirical likelihood, pseudo-likelihood, and a random effect model.
In Table A3, we report the observed coverages of 95% confidence intervals. We consider two empirical likelihood confidence intervals: the Wilks-type interval (30) (EL) and the Wald-type interval (ELW) based on the variance estimator (31). We also consider the pseudo-likelihood interval (PL) derived from the variance estimator which can be found in Appendix C of Data S1. We observe similar coverages for EL and ELW. For the intercept, pseudo-likelihood gives significantly low coverages, which decrease with larger intra-PSU correlation. The EL intervals can have better coverages than the Wald-type intervals (ELW or PL). For example, with an intra-PSU correlation of 0.05 and a sampling fraction of 0.5, we observe low coverages (89.0% and 89.4%) for the Wald-type intervals of the coefficient of the negatively skewed variable x_4. For EL, the observed coverage is 95.1%. For the coefficient of the variable x_4, the coverages of ELW tend to be slightly lower than those of EL (see the last row of Table A3, which contains the column means).
In Table A4, we report the relative biases (RB) of the empirical likelihood variance estimator (columns EL), given by (31), and of the pseudo-likelihood variance estimator (columns PL), defined in Appendix C of Data S1. For the intercept, the RB of PL can be larger than the RB of EL. For the other parameters, similar RBs are observed for EL and PL. The EL variance estimator is less biased, but its RB can still be larger than 10%.

Now, we investigate the effect of the side information (2) on efficiency. Suppose that we use a different side information given by the means of y_j and of x̃_{3j} = y_j + ε̃_{3j}, which is correlated with y_j, where ε̃_{3j} ∼ N(0, 2). Thus, f(Y_j, d_0) = (y_j − μ_y, x̃_{3j} − μ_y)^⊤, where μ_y is the expectation of the variable y_j. In Table A5, we report the observed relative efficiencies for the intercept, defined as the MSE of the EL estimator of the intercept divided by the MSE obtained without f(Y_j, d_0) within (18). We notice a gain in efficiency with a small intra-PSU correlation and a large sampling fraction. We only report the relative efficiency of the intercept, because f(Y_j, d_0) did not affect the efficiency of the estimators of θ_1, θ_2, θ_3, and θ_4, as f(Y_j, d_0) is not correlated with the residuals.

Consider the logistic model (45), where y_j ∼ Bern(P_j), P_j = expit(x_j^⊤ θ_0), x_j := (1, x_{1j}, x_{2j}, x_{3j}, x_{4j})^⊤ and θ_0 = (θ_0, θ_1, θ_2, θ_3, θ_4)^⊤ = (−2, 1, 1, 1, 1)^⊤. The covariates follow the same distributions as in (44). The function g(Y_j, θ) is the estimating function of a logistic model, that is, g(Y_j, θ) = x_j{y_j − expit(x_j^⊤ θ)}. The auxiliary variables considered are ζ_{1j} = (P_j − P̄) σ_P^{-1} + e_{1j}, ζ_{2j} = P_j + e_{2j}, ζ_{3j} = P_j + e_{3j}, ζ_{4j} ∼ Bern(0.2) − 0.2, ζ_{5j} ∼ Bern(0.5) − 0.5, and ζ_{6j} = M^{-1} − 1, where e_{1j} ∼ N(0, sd = 1.7), e_{2j} ∼ N(0, sd = 0.75) and e_{3j} ∼ 7 × {Beta(3, 1) − 0.75}. Here, P̄ := M^{-1} ∑_{j=1}^{M} P_j and σ²_P denotes the corresponding variance of the P_j. The side information d_0 := (0.3, 0.2)^⊤ contains the proportions of subgroups, created by sorting the unit-level data according to the variable x̃_{2j} = x_{2j} + ε̃_j correlated with x_{2j}, where ε̃_j ∼ N(0, 1).
In Table A6, we report the observed coverages for the regression coefficients of the logistic model (45), for the proposed approach (EL) and for the Wald-type interval based on pseudo-likelihood (PL), which uses the full information about the design and the auxiliary variables. We observe similar coverages, although EL gives slightly smaller coverages than PL.

AN EXAMPLE OF A REAL DATA APPLICATION: UK PISA SURVEY (2006)
We use the 2006 PISA survey data for the United Kingdom. We consider a linear regression model, where "mathematics achievement score on average" is the response and the covariates are:
- City: 1 for city-located schools, 0 otherwise
- Parent-tertiary: 1 if parents have tertiary education, 0 otherwise
- Large-class: 1 for class sizes over 25, 0 otherwise
- Male: 1 for males and 0 for females
The data contain 13,152 students clustered into 502 schools. The school-level sampling fraction is nonnegligible and approximately equal to 10%. There are nine strata. The dataset contains weights adjusted for the design and for unknown auxiliary information. The inclusion probabilities and auxiliary information are not provided. Nevertheless, the minimal information needed for empirical likelihood is available: PSUs (schools), stratification, and weights.
Since we have missing observations for some covariates, it is sensible to adjust the weights with additional variables which may explain nonresponse. The adjusted weights considered are the student-level weights provided in the dataset, divided by fitted response probabilities computed from a logistic model with the following dichotomous covariates: Male (1 for males and 0 for females), Scotland (1 for students in Scottish schools and 0 otherwise), and Public (1 for students in public schools and 0 otherwise). The resulting weights are the w_{j|i} used within (7). The dataset considered contains the units with no missing values for City, Parent-tertiary, and Large-class. The empirical likelihood approach is valid under a unit-level missing-at-random response mechanism, because we have a multistage design and the weights are adjusted for nonresponse at unit level.
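As a sketch of this adjustment (the column names are hypothetical, a generic logistic-regression routine is used, and it is assumed that the covariates of the response model are recorded for respondents and nonrespondents alike):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def nonresponse_adjusted_weights(df, weight_col="student_weight",
                                 respondent_col="respondent",
                                 covariates=("male", "scotland", "public")):
    """Divide the released student-level weights by fitted response
    probabilities from a logistic model with dichotomous covariates
    (Male, Scotland, Public).  Adjusted weights are returned for the
    respondents only; these play the role of the w_{j|i} used within (7)."""
    X = sm.add_constant(df[list(covariates)].astype(float))
    fit = sm.GLM(df[respondent_col].astype(float), X,
                 family=sm.families.Binomial()).fit()
    p_resp = fit.predict(X)                      # fitted response probabilities
    resp = df[respondent_col].astype(bool)
    return df.loc[resp, weight_col] / p_resp[resp]
```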
We consider three approaches: the empirical likelihood (EL) approach of Section 6, a parametric approach with a random effect (RE) for the intercept, and ordinary least squares (OLS). The pseudo-likelihood approach used in Section 8 cannot be used here, because we do not know the inclusion probabilities. OLS is expected to be unreliable, because it does not take the design into account. The random effect approach takes the clustering into account, by allowing the intercept to vary across schools, but it ignores the weights. It also relies on parametric assumptions (normality of the error term and of the random effect) which may not hold. The estimates can also be biased, as shown in Table A2.
The results are given in Table A7. The three approaches give very different point estimates, SDs, p-values and confidence intervals. EL should have the largest and most reliable SDs and confidence intervals, because it takes the design into account. SDs should be underestimated with the other methods. The variable City is significant with OLS, because of its underestimated standard deviation. However, it is not significant with EL and RE. The variable Large-class is significant with EL and OLS, but not significant with RE. The variable Male is significant in all cases, but the confidence interval of EL is shifted upwards and only just overlaps with the other intervals. EL tends to give wider confidence intervals, because it accommodates the randomness of the design and of the model. Some EL confidence intervals do not overlap with the OLS intervals and barely overlap with the RE intervals.
This brief example shows that we may fit very different models depending on whether we use empirical likelihood or random effect models. Empirical likelihood is more reliable and reveals that Large-class is significant. With a random effect model, this variable is not significant.

DISCUSSION
The proposed approach has the advantage of only requiring calibration weights and variables specifying the stratification and identifying the PSUs. We do not rely on inclusion probabilities and auxiliary variables, which are usually unknown with public-use survey data. The PSU-level profile empirical log-likelihood ratio statistic follows a χ²-distribution asymptotically under the null, even when the sampling fraction is large. It does not need a Rao and Scott (1981) adjustment or correction factors based on bootstrap, eigenvalues, design effects, or finite population corrections. Therefore, it can be used like a standard likelihood ratio for testing, model building, and confidence intervals. At the end of Section 7, we propose a variance estimator which does not rely on resampling, linearization, inclusion probabilities or negligible sampling fractions. Note that the proposed approach relies on condition (13), which is reasonable when fitting models on survey data. The empirical log-likelihood ratio function implicitly recovers some asymptotic joint-inclusion probabilities between the PSUs. Indeed, this function can be approximated by a quadratic form with a consistent variance estimator which implicitly contains the asymptotic joint-inclusion probabilities (35) of Hájek (1964) within a Horvitz and Thompson (1952) variance. These probabilities do not need to be known, computed, or be part of any adjustment or correction factor.
With noninformative sampling, the sampling design can be ignored under conditional model-based parametric approaches, but with informative sampling, adjustments involving inclusion probabilities are often needed. These probabilities are not needed for the proposed unconditional approach, even under informative sampling with large sampling fractions. Calibrated weights, stratification and clustering are the only information required.
The model is specified with moment conditions, without the need for a distribution of the error term. This allows for unknown heteroscedasticity, because assumptions about the residual variance are not necessary. This is an advantage over model-based parametric likelihood with skewed data.
Parametric likelihood approaches are not necessarily robust when the assumptions about the distribution of the error terms do not hold.
The simulation study shows that the empirical likelihood approach based on minimal information (calibrated weights, stratification and clustering) gives similar (and sometimes more efficient) point estimators, confidence intervals and variance estimates compared to pseudo-likelihood, which is based on full information about the design, given by joint-inclusion probabilities, auxiliary variables, calibrated weights, stratification, and clustering. Pseudo-likelihood testing is based on Wald-type tests, which can be less powerful than tests based on an empirical log-likelihood ratio function.
Optional side information can be used for empirical likelihood inference. This information is given by some variables with descriptive statistics known without error. It can be combined with empirical likelihood to improve the accuracy of estimates. Side information should not be confused with the auxiliary information used for deriving the calibrated weights. Nevertheless, it is recommended to use side information which may be correlated with some of the auxiliary variables.
Unit-level nonresponse can be taken into account when the weights have been adjusted for nonresponse, under the usual "missing at random" response mechanism. In this case, nonresponse is embedded within the design as an ultimate phase. The nonresponse variables used for adjusting the calibration weights are not needed. However, there are some limitations. If nonresponse occurs at PSU level, or if we have a single-stage design, the point estimator is still consistent, but property (29) would not hold, because we assumed that a fixed number of PSUs is selected. In this case, we would need to know the variables used for nonresponse weighting, because they would need to be part of the side information, as in Berger (2018).

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.

APPENDIX A. REGULARITY CONDITIONS
We assume that the following regularity conditions hold.
We assume that there exist constants such that the conditions below hold, where λ_min{A} denotes the smallest eigenvalue of A when A is symmetric. Here, ĉ*_i(θ) is defined by (A6).