Integrating endogeneity in survey sampling using instrumental-variable calibration estimator

The endogeneity problem arises when the auxiliary variables are correlated with the error terms. In such cases, appropriate instrumental variables ensure efficient estimation. Calibration has established itself as an important and widely used methodological tool for estimating the population total in survey sampling, but it does not offer efficient estimation in the presence of endogeneity. When endogeneity is present in the auxiliary variables, calibration using endogenous auxiliary variables may produce bias and increase variance due to inappropriate model assumptions. In this article, we propose instrumental-variable calibration estimators based on the classical instrumental-variables approach for the case of exact identification; they are more efficient than conventional calibration estimators when some auxiliary variables are endogenous. The necessary properties of the proposed estimators are presented. A simulation study and a real data example are used to check the performance of the proposed estimators.


Introduction
Estimating a population total or mean is a central task in the analysis of survey data. Various researchers have proposed estimators of the population total and mean under different sampling designs and for different problems in survey data. Liu and Arslan [1] proposed estimators of the population mean using auxiliary proportions. Ahmad et al. [2] suggested generalized estimators of the population mean. Wang et al. [3] derived estimators of the population mean under simple and double sampling in the presence of extreme values. The calibration technique was introduced by Deville and Särndal [4] to obtain an estimator of the population total using sample weights called calibrated weights. These weights are obtained by minimizing their distance to the Horvitz-Thompson weights subject to the calibration equations. The resulting weights are a function of the auxiliary variables.
Suppose we wish to estimate the total of the variable of interest Y in a finite population $U = \{Y_1, \ldots, Y_k, \ldots, Y_N\}$. A probability sample s is selected from the population with sampling design $p(s)$, and $y_k$ is the value of the study variable Y for the k-th unit, observed for all $k \in s$ (complete response), with a known inclusion probability $\pi_k > 0$ for each element k and the corresponding sampling design weight $d_k = 1/\pi_k$.
Let $X_k^T = (x_{k1}, \ldots, x_{kj}, \ldots, x_{kp})$ be the transposed vector of the values of the p auxiliary variables for the k-th unit, associated with $y_k$. We observe $(y_k, X_k)$ for the elements $k \in s$. The population total of X, $t_x = \sum_U X_k$, is known, and its Horvitz-Thompson estimator is $\hat{t}_{x\pi} = \sum_s d_k X_k$. Deville and Särndal [4] suggested the calibration estimator $\hat{t}_{yw} = \sum_s w_k y_k$, where the weights $w_k$ are selected to satisfy the calibration constraints $\sum_s w_k X_k = t_x$. To minimize the distance between the calibrated weights $w_k$ and the initial design weights $d_k$, any distance function $G_k$ suggested by Deville and Särndal [4] can be minimized under some basic conditions, subject to the constraints given in eq. (1.2). Thus, the calibration weights are linear functions of the design weights and the available auxiliary information. Let $\lambda = (\lambda_1, \ldots, \lambda_j, \ldots, \lambda_p)^T$ be the vector of Lagrange multipliers. Then the Lagrangian can be written as in equation (1.2).
So $\phi_s(\lambda) = t_x - \hat{t}_{x\pi}$, and $\lambda$ can be found by Newton's method, as given in equation (1.3). Hence we get the calibrated weights in equation (1.4) as $w_k = d_k F(u)$, where $u = q_k X_k^T \lambda$.
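For reference, the derivation sketched above follows the standard formulation in Deville and Särndal [4], which can be written out as below. This is a hedged reconstruction: the symbols $g_k = \partial G_k / \partial w_k$, the function $F$ (for which $d_k F(q_k \cdot)$ inverts $g_k(\cdot, d_k)$) and the iteration index $m$ are notation assumed here, and the sign convention of the Lagrangian is one common choice rather than necessarily the one used in the original displays.

```latex
\begin{aligned}
&\min_{w_k}\ \sum_{s} G_k(w_k, d_k)
  \quad \text{subject to} \quad \sum_{s} w_k X_k = t_x, \\
&L = \sum_{s} G_k(w_k, d_k) - \lambda^{T}\Big(\sum_{s} w_k X_k - t_x\Big), \\
&\frac{\partial L}{\partial w_k} = g_k(w_k, d_k) - X_k^{T}\lambda = 0
  \;\Longrightarrow\; w_k = d_k F\big(q_k X_k^{T}\lambda\big), \\
&\phi_s(\lambda) = \sum_{s} d_k \big[F\big(q_k X_k^{T}\lambda\big) - 1\big] X_k
  = t_x - \hat{t}_{x\pi}, \\
&\lambda_{m+1} = \lambda_m + \big[\phi_s'(\lambda_m)\big]^{-1}
  \big[t_x - \hat{t}_{x\pi} - \phi_s(\lambda_m)\big].
\end{aligned}
```

In the linear case $F(u) = 1 + u$, the function $\phi_s$ is linear in $\lambda$, so the Newton iteration reaches the solution $\lambda = (\sum_s d_k q_k X_k X_k^T)^{-1}(t_x - \hat{t}_{x\pi})$ in a single step.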
The calibrated weights differ across distance functions, several of which were suggested by Deville and Särndal [4]. The chi-square distance function gives the class of calibrated weights $w_k = d_k (1 + q_k X_k^T \lambda)$, where $q_k$ in equation (1.5) is a parameter that can be chosen to improve the calibrated weights and the relative efficiency. Estevao and Särndal [5] used an arbitrary positive value of $q_k$ to improve the calibrated estimator. The resulting estimator is the same as the generalized regression (GREG) estimator proposed by Cassel et al. [6], and it can be derived as both a model-based and a design-based estimator (Cardot et al. [7]): $\hat{t}_{yreg} = \hat{t}_{y\pi} + (t_x - \hat{t}_{x\pi})^T \hat{b}_s$, where $\hat{b}_s = (\sum_s d_k q_k X_k X_k^T)^{-1} \sum_s d_k q_k X_k y_k$. However, this minimum-distance technique in calibration offers almost identical estimators for different distance functions. For studying the properties of calibrated estimators, Estevao and Särndal [5] suggested calibration estimators under two-phase sampling. Shehzad [8] and Goga and Shehzad [9] produced penalized calibrated estimators. Shehzad et al. [10] and Brirah et al. [11] proposed modified calibration methods for estimating the population total. Alam and Hanif [12] proposed cosmetic calibration estimators. Kott [13], Kott [14], Särndal [15], and Kim (2010) also used the calibration technique under different conditions to derive calibrated estimators. Park and Kim [16] proposed model-based instrumental-variable calibrated estimators to minimize the anticipated variance of the calibration estimator; these were also used under two-phase sampling.
Endogeneity is a classical problem that arises from correlation between the independent variables and the error terms. To deal with it, Wooldridge [17] suggested using instrumental variables $Z_k$ that are highly correlated with each endogenous component of X but independent of e. In survey data, the problem of endogeneity also arises when we model the data to estimate the population total. When endogeneity is present in the auxiliary variables, calibration using endogenous auxiliary variables may produce bias and increase variance due to inappropriate model assumptions. This estimation problem has not been addressed in the literature. In this paper, we propose the instrumental-variable calibration estimator using model-assisted and model-based approaches when some auxiliary variables are endogenous. The mathematical properties of the proposed estimator are verified, and its performance is evaluated using a simulation study and real data. In Sections 3 and 4, properties of the proposed estimators are presented. In Section 5, the performance of the estimators is evaluated by a simulation study and a real data example.
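The linear (chi-square distance) calibration weights described in this section, and their equivalence with the GREG estimator, can be checked numerically. The following is a minimal sketch, assuming a single auxiliary variable, $q_k = 1$, and SRSWOR design weights $d_k = N/n$; the function name, the toy population, and the seed are illustrative assumptions, not taken from the paper.

```python
import random


def linear_calibration_weights(d, x, t_x):
    """Chi-square-distance calibration weights w_k = d_k * (1 + x_k * lam)
    for one auxiliary variable, with q_k = 1 (assumed)."""
    t_x_ht = sum(dk * xk for dk, xk in zip(d, x))           # HT estimate of t_x
    lam = (t_x - t_x_ht) / sum(dk * xk * xk for dk, xk in zip(d, x))
    return [dk * (1.0 + xk * lam) for dk, xk in zip(d, x)]


# Hypothetical toy population (illustration only).
rng = random.Random(1)
N, n = 1000, 100
x_pop = [rng.gauss(10.0, 2.0) for _ in range(N)]
y_pop = [2.0 * xk + rng.gauss(0.0, 1.0) for xk in x_pop]
t_x = sum(x_pop)                                            # known auxiliary total

idx = rng.sample(range(N), n)                               # SRSWOR sample
d = [N / n] * n                                             # design weights 1 / pi_k
x_s = [x_pop[k] for k in idx]
y_s = [y_pop[k] for k in idx]

# Calibration estimator: weighted sum with calibrated weights.
w = linear_calibration_weights(d, x_s, t_x)
t_y_cal = sum(wk * yk for wk, yk in zip(w, y_s))

# GREG form: HT estimator plus a regression adjustment.
t_x_ht = sum(dk * xk for dk, xk in zip(d, x_s))
t_y_ht = sum(dk * yk for dk, yk in zip(d, y_s))
b_s = (sum(dk * xk * yk for dk, xk, yk in zip(d, x_s, y_s))
       / sum(dk * xk * xk for dk, xk in zip(d, x_s)))
t_y_greg = t_y_ht + (t_x - t_x_ht) * b_s

# The calibrated weights reproduce t_x exactly (up to rounding).
calibrated_t_x = sum(wk * xk for wk, xk in zip(w, x_s))
```

Here `calibrated_t_x` matches `t_x`, and `t_y_cal` coincides with `t_y_greg`, which is the equivalence between linear calibration and GREG noted in the text.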

Instrumental variables (IV) regression
One of the most important assumptions of the Classical Linear Regression Model (CLRM) is that the regressors are exogenous. The violation of this assumption, $Cov(X_i, e_i) \neq 0$, that is, correlation between the regressors and the error term, is called endogeneity. The solution to this violation is the method of instrumental variables (IV). An estimator for which the number of instrumental variables equals the number of endogenous variables is referred to as just-identified or exactly identified. An estimator for which the instrumental variables outnumber the endogenous variables is called over-identified [18]. Wright [19] first introduced instrumental variables and used them to estimate supply and demand elasticities for butter and flaxseed. Reiersøl [20] applied the same method in the context of errors-in-variables models in his dissertation. Let $X = (X_1, X_2, \ldots, X_p)$ be an $n \times p$ matrix of known regressors, and suppose the following superpopulation regression model: $Y = X\beta + e$ (2.1).
Here Y is an $(n \times 1)$ vector of the dependent variable, and X is an $(n \times p)$ non-random matrix of independent variables. Also, $X^T X$ is a full-rank matrix and e is an $(n \times 1)$ vector of errors; it is assumed that the expected value of e is zero and that the elements of e are uncorrelated. The variance of e is constant (homoscedastic), i.e., $var(e) = \sigma^2 I$. It is also assumed that X and e are independent, i.e., $cov(X, e) = 0$, which means that the explanatory variables are exogenous, and $\beta$ is a $(p \times 1)$ vector of unknown parameters. Then the ordinary least squares (OLS) estimator is $\hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y$. The OLS estimator $\hat{\beta}_{OLS}$ is unbiased and has minimum variance, $Var(\hat{\beta}_{OLS}) = \sigma^2 (X^T X)^{-1}$. Hence $\hat{\beta}_{OLS}$ is an unbiased and consistent estimator of $\beta$. On the other hand, when X and e are correlated, that is, $cov(X, e) \neq 0$, the explanatory variable X is endogenous, and the OLS estimator is biased and inconsistent. In this situation, the estimates can still be used to predict the value of the dependent variable given the value of X; however, they do not recover the causal effect of X on y. So, to estimate the parameter $\beta$, consider a set of variables Z (instrumental variables) that are highly correlated with each endogenous component of X but independent of e [17]. If the relationship between each endogenous component of $x_i$ and the instruments is as defined in equation (2.2), then the instrumental-variable (IV) estimator is $\hat{\beta}_{IV} = (Z^T X)^{-1} Z^T Y$ (2.3). The instrumental-variable estimator $\hat{\beta}_{IV}$ in equation (2.3) is unbiased and consistent under certain regularity conditions.
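The contrast between the OLS and IV estimators under endogeneity can be illustrated with a small simulation. This is a minimal sketch, assuming one regressor, no intercept, and exact identification (one instrument); the data-generating coefficients (0.8 for the endogenous part), the sample size, and the seed are assumptions chosen only for illustration.

```python
import random


def ols_slope(x, y):
    """OLS estimate (X'X)^{-1} X'Y for a single regressor without intercept."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def iv_slope(z, x, y):
    """IV estimate (Z'X)^{-1} Z'Y for one regressor and one instrument."""
    return (sum(zi * yi for zi, yi in zip(z, y))
            / sum(zi * xi for zi, xi in zip(z, x)))


rng = random.Random(7)
n = 5000
beta = 1.0

z = [rng.gauss(0.0, 1.0) for _ in range(n)]      # instrument, independent of e
e = [rng.gauss(0.0, 1.0) for _ in range(n)]      # error term
x = [zi + 0.8 * ei for zi, ei in zip(z, e)]      # endogenous: cov(x, e) != 0
y = [beta * xi + ei for xi, ei in zip(x, e)]     # model Y = X beta + e

b_ols = ols_slope(x, y)   # inconsistent when x is correlated with e
b_iv = iv_slope(z, x, y)  # consistent, because z is uncorrelated with e
```

Under this design the OLS slope converges to $\beta + cov(x, e)/var(x) = 1 + 0.8/1.64 \approx 1.49$ in large samples, whereas the IV slope converges to the true value $\beta = 1$.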

Instrumental-variable calibration approach
The calibration approach is usually used without assuming a superpopulation model [4]. The calibration technique consists of estimating the population total $t_y$ by a weighted sum of the sampled values of y. With the chi-square distance function, taking derivatives of the Lagrangian L in equation (3.1) with respect to $w_k$ gives the value of $\lambda$. Substituting this value of $\lambda$, we finally obtain the weights in equation (3.2), from which the calibration estimator of $t_y$ follows. We propose the instrumental-variable calibration estimator, following the instrumental-variable calibration approach of Ref. [5] without using the minimum-distance approach, as $\hat{t}_{IVC} = \sum_s W_k y_k$, where $W_k$ is the calibrated weight obtained by the instrumental-variable approach subject to the calibration constraint $\sum_s W_k X_k = t_x$. In the weight with unknown $\lambda$, equation (3.3), $q_k$ is a positive integer (in the present study, we take $q_k = 1$), and $Z_s$ is the sample restriction of Z, the classical instrumental variable used instead of the endogenous auxiliary variable. Plugging these weights into the calibration constraint gives the value of $\lambda$. Substituting this value of $\lambda$ into the weights in equation (3.3), we finally get the required weights in equation (3.4), and the instrumental-variable calibration estimator of the total $t_y$ based on equation (3.4) is given in equation (3.5). The estimator $\hat{t}_{IVC}$ defined in equation (3.5) is a model-assisted (design-based) instrumental-variable estimator.
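The instrumental-variable calibration weights can be sketched numerically for a single endogenous auxiliary variable. The sketch below assumes the instrument-vector form associated with Estevao and Särndal [5], $W_k = d_k (1 + \lambda z_k)$ with $q_k = 1$, which is an assumption about the weight equations referred to above rather than a confirmed transcription of them; the function name and the five-unit toy sample are hypothetical.

```python
def iv_calibration_weights(d, x, z, t_x):
    """Assumed IV calibration weights W_k = d_k * (1 + lam * z_k), q_k = 1.

    lam is chosen so that the calibration constraint on the (endogenous)
    auxiliary variable x holds: sum_s W_k * x_k = t_x.
    """
    t_x_ht = sum(dk * xk for dk, xk in zip(d, x))
    lam = (t_x - t_x_ht) / sum(dk * zk * xk for dk, zk, xk in zip(d, z, x))
    return [dk * (1.0 + lam * zk) for dk, zk in zip(d, z)]


# Hypothetical toy sample (illustration only).
d = [4.0] * 5                       # design weights
x_s = [2.0, 3.0, 5.0, 4.0, 1.0]     # endogenous auxiliary variable
z_s = [1.0, 2.0, 4.0, 3.0, 1.5]     # instrument
y_s = [3.1, 4.2, 6.8, 5.5, 2.0]     # study variable
t_x = 64.0                          # known population total of x

W = iv_calibration_weights(d, x_s, z_s, t_x)
t_y_ivc = sum(wk * yk for wk, yk in zip(W, y_s))

# Equivalent regression-type form with an IV slope in place of the GREG slope.
t_y_ht = sum(dk * yk for dk, yk in zip(d, y_s))
t_x_ht = sum(dk * xk for dk, xk in zip(d, x_s))
b_iv = (sum(dk * zk * yk for dk, zk, yk in zip(d, z_s, y_s))
        / sum(dk * zk * xk for dk, zk, xk in zip(d, z_s, x_s)))
t_y_alt = t_y_ht + (t_x - t_x_ht) * b_iv

calibrated_t_x = sum(wk * xk for wk, xk in zip(W, x_s))
```

Under this assumed form the weights still reproduce the known total of x, and the estimator can be read as the Horvitz-Thompson estimator plus an adjustment whose slope is a design-weighted IV coefficient.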

Properties of model-assisted instrumental-variable calibration estimator
Some properties of the model-assisted instrumental-variable calibration estimator $\hat{t}_{IVC}$ are presented, and their proofs are given in the appendix.

Model-based instrumental-variable calibration approach
Usually, without auxiliary information, $t_y$ is estimated by the Horvitz-Thompson [21] estimator, defined as $\hat{t}_{y\pi} = \sum_s d_k y_k$ (4.1). The estimator in equation (4.1) may be improved by using the auxiliary variables through model-based estimation. A model identifies the set of conditions that describe a class of distributions of $Y = \{y_1, y_2, \ldots, y_N\}$ [22]. Kumar et al. [23] proposed a model-based calibration estimator for the case where the study and auxiliary variables are inversely related. We propose a model-based instrumental-variable calibration estimator of Y, following the instrumental-variable calibration approach of Ref. [5], under the model given in equation (2.1) when it does not satisfy the assumption of exogeneity, that is, $E(x_i e_i) \neq 0$. The estimator uses calibrated weights $W_s$ obtained by the instrumental-variable calibration technique, subject to the calibration constraint. Since $X_s$ is endogenous, we use the instrumental variable $Z_s$ instead of the endogenous auxiliary variables. Using the approach of Ref. [5], the weights in equation (4.2) are written in terms of an unknown $\lambda$, equation (4.3). Plugging these weights into the calibration constraint, we get $(1_s + \lambda Z_s) X_s = 1_U X$, and solving it gives the value of $\lambda$. Substituting this value of $\lambda$ into equation (4.3) gives the final weights in equation (4.4); the proposed instrumental-variable model-based calibration estimator of $t_y$ then follows from equation (4.4). The model-based instrumental-variable calibration weights $W_s$ thus behave similarly to the calibrated weights under certain conditions.
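The steps of this derivation can be written out as a worked sketch. It assumes row-vector weights with unit starting weights and exact identification, as the relation $(1_s + \lambda Z_s) X_s = 1_U X$ suggests; the transposes, the sample vector $y_s$, and the final regression-type form are assumed notation and may differ from the original displays.

```latex
\begin{aligned}
W_s &= 1_s^{T} + \lambda^{T} Z_s^{T}, \\
W_s X_s &= 1_U^{T} X
  \;\Longrightarrow\;
  \lambda^{T} = \big(1_U^{T} X - 1_s^{T} X_s\big)\big(Z_s^{T} X_s\big)^{-1}, \\
W_s &= 1_s^{T} + \big(1_U^{T} X - 1_s^{T} X_s\big)\big(Z_s^{T} X_s\big)^{-1} Z_s^{T}, \\
\hat{t}_{IVMBC} &= W_s\, y_s
  = 1_s^{T} y_s + \big(1_U^{T} X - 1_s^{T} X_s\big)\,\hat{\beta}_{IV},
  \qquad \hat{\beta}_{IV} = \big(Z_s^{T} X_s\big)^{-1} Z_s^{T} y_s .
\end{aligned}
```

Read this way, the estimator is the sample sum of y plus a prediction for the non-sampled part of the auxiliary totals, with the slope given by the sample version of the IV estimator in equation (2.3).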

Properties of model-based instrumental-variable calibration estimators
Some properties of the model-based instrumental-variable calibration estimator $\hat{t}_{IVMBC}$ are presented as theorems.

Theorem 3. The model-based instrumental-variable calibration estimator $\hat{t}_{IVMBC}$ is biased, and its bias is given as
Theorem 4. The Mean Square Error of the model-based instrumental-variable calibration estimator $\hat{t}_{IVMBC}$ is given as

Simulation scheme
In this section, we obtain empirical results to check the efficiency of the estimators by Monte Carlo simulation. The simulation study generates a finite population of size N = 1000. For this population, 20 variables $X = (X_1, X_2, \ldots, X_{20})$ of size 1000 (so the X matrix has dimension 1000 × 20) were generated from the normal distribution; some of them were adjusted, using a linear function, to be correlated with the error terms, which makes them endogenous. The finite population consists of the pairs $(y_k, x_k)$, where $x_k$ and $y_k$ are linearly related and the variable of interest Y is obtained from the relation defined in equation (2.1). The value of $\beta$ is taken as 1. The total of Y, $t_y$, is taken to be the true population total. Instrumental variables were also generated from the normal distribution, under the assumption that they are correlated with the auxiliary variables but unrelated to the error terms. Samples of size n = 25, 50, 75, 100, 150, 200, 250, 300, and 350 were drawn using Simple Random Sampling without Replacement (SRSWOR). Different numbers of endogenous variables (E), E = 1, 2, and 3, were considered, using a linear model so that the error terms are related to the corresponding auxiliary variables. Each endogenous variable was then replaced by its instrumental variable Z, generated to be independent of the error term and correlated with its endogenous auxiliary variable. The number of simulations was R = 1000, and the generated data were kept fixed across simulations. All the computational work was done in R.
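The population-generation step can be sketched as follows. The paper's computations were done in R; this Python sketch is only an illustration of the scheme, and the specific coefficients (0.5 on the error term and on the extra noise), the standard normal draws, the seed, and the function names are assumptions, since the text does not report them.

```python
import random


def corr(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5


def make_population(N=1000, p=20, n_endog=1, seed=123):
    """Generate (X, Z, y, e): the first n_endog columns of X are endogenous."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 1.0) for _ in range(N)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(N)]
    Z = [row[:] for row in X]        # exogenous columns act as their own instruments
    for j in range(n_endog):
        for k in range(N):
            # Endogenous x: instrument + part of the error + extra noise (assumed form).
            X[k][j] = Z[k][j] + 0.5 * e[k] + 0.5 * rng.gauss(0.0, 1.0)
    y = [sum(X[k]) + e[k] for k in range(N)]   # model (2.1) with every beta_j = 1
    return X, Z, y, e


X, Z, y, e = make_population(n_endog=1)
t_y = sum(y)                                   # true population total

x1 = [row[0] for row in X]
z1 = [row[0] for row in Z]

# One SRSWOR draw of size n = 100 (unit indices, without replacement).
sample = random.Random(1).sample(range(len(y)), 100)
```

With this construction `x1` is correlated with `e`, while `z1` is correlated with `x1` but not with `e`, which are the two conditions the simulation scheme requires of an instrument.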

Performance evaluation
The performance of the proposed estimators is compared with that of the conventional estimators using the following measures, each calculated for the estimated total $\hat{t}_y$ over the simulation replications: the Bias and the Mean Square Error (MSE).
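The two measures can be computed from the replicated estimates as sketched below. This assumes the usual Monte Carlo definitions, with the bias taken as the mean of the estimates minus the true total and the MSE as the mean squared deviation from the true total; whether the paper reports these absolute versions or a relative variant is not recoverable from the text, and the function name and toy numbers are hypothetical.

```python
def mc_bias_mse(estimates, true_total):
    """Monte Carlo bias and MSE of an estimator over R replications."""
    R = len(estimates)
    bias = sum(estimates) / R - true_total
    mse = sum((t_hat - true_total) ** 2 for t_hat in estimates) / R
    return bias, mse


# Toy illustration with four replicated estimates of a true total of 100.
bias, mse = mc_bias_mse([98.0, 102.0, 101.0, 99.0], 100.0)
```

In the simulation, `estimates` would hold the R = 1000 values of one estimator (HT, GREG, or IVC) for a given sample size, and `true_total` would be $t_y$.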

Simulation result
The results are presented in Tables 1-5. They show the behaviour of all the considered estimators, namely the HT estimator, the GREG (conventional calibration) estimator, and the Instrumental-Variable Calibration (IVC) estimator, for different numbers of endogenous auxiliary variables out of 20 auxiliary variables in total and for different sample sizes drawn by SRSWOR. In every table, the performance of each estimator is examined through two properties, Bias and Mean Square Error (MSE).
Table 1 shows the results of HT, GREG, and IVC in terms of Bias and MSE for n = 25, 50, 75, 100, 150, 200, 250, 300, and 350. For all the sample sizes and E = 1, the MSE of the proposed IVC estimator is smaller than that of the HT and GREG estimators. Table 2 shows the results obtained under similar conditions for two endogenous variables, E = 2, for different sample sizes. The MSE of HT and GREG is larger than that of the proposed IVC estimator. Table 3 shows the results obtained under similar conditions for E = 2 with a different pair of endogenous variables; for the different sample sizes, the MSE of HT and GREG is again larger than that of the proposed IVC estimator. Tables 4 and 5 show the results for three endogenous variables, E = 3, for different sample sizes in both cases. The MSE of the proposed IVC estimator is smaller than that of the HT and GREG estimators.
The results show that the proposed IVC estimator gives the smaller MSE for both small and large sample sizes. The IVC estimator therefore improves efficiency over conventional calibration.

Real data example
To compare the proposed estimators with the Horvitz-Thompson and conventional calibration (GREG) estimators, we used a real data example. The data given by Singh et al. [24] are used to evaluate the model performance. The data are freely and publicly accessible at: http://www.kiran.nic.in/pdf/Social_Science/elearning/How_to_Test_Endogeneity_or_Exogeneity_using_SAS-1.pdf. The dataset contains eight variables of size N = 376: Min_Tem (Minimum Temperature), Rain (Average Rainfall), Foodgrain_Yield (Yield of food grain), Latitude (Latitude of a particular location), Longitude (Longitude of a particular location), Foodgrain_yld_FD (First difference of Foodgrain_Yield), Min_Tem_FD (First difference of Min_Tem), and Rain_FD (First difference of Rain), where the yield of food grain is the dependent variable. The auxiliary variables are Minimum Temperature and Rain, and the other five variables, Latitude, Longitude, Foodgrain_yld_FD, Min_Tem_FD, and Rain_FD, are selected as instrumental variables. Endogeneity has already been reported for the auxiliary variable, so we use the instrumental variables instead of the endogenous auxiliary variables to evaluate the model performance. We treated these data as the population and drew samples of size n = 25, 50, 75, 100, 150, 200, and 250 using SRSWOR.

Real data results
Table 6 presents the results of the three estimators, with their Bias and Mean Square Error (MSE), for different sample sizes. When the auxiliary variable Minimum Temperature is endogenous, the variable Longitude of a particular location is used as an instrumental variable. The results show that the proposed Instrumental-Variable Calibrated (IVC) estimator has a smaller MSE than the HT and GREG estimators when endogeneity is present in the dataset, in the case of exact identification. This shows that the proposed estimator is more efficient than HT and GREG.

Conclusion
In survey sampling, the calibration restrictions are significant. In this paper, the instrumental-variable calibration technique is used to find the optimum estimators in the presence of endogeneity. In a Monte Carlo simulation study and a real data example, we examined the performance of the proposed estimator for different sample sizes drawn by simple random sampling without replacement from a finite population. The proposed Instrumental-Variable Calibrated (IVC) estimator, in terms of Mean Square Error (MSE), is more efficient than the HT and GREG estimators under different sample sizes and varying numbers of endogenous variables. The proposed estimator is more efficient as the sample size increases. The present study is limited to exact identification, meaning that the number of instrumental variables equals the number of endogenous variables. Further investigation of the over-identification problem is a topic for future research.

analysis, Data curation. Abdul Rauf Kashif: Writing - original draft, Resources, Investigation, Formal analysis.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 1
Monte Carlo Bias and Mean Square Error (MSE) with one endogenous variable (E = 1), i.e., $X_1$ is endogenous.

Table 2
Monte Carlo Bias and Mean Square Error (MSE) with two endogenous variables (E = 2), i.e., $X_1$ and $X_2$ are endogenous.

Table 3
Monte Carlo Bias and Mean Square Error (MSE) with two endogenous variables (E = 2), i.e., $X_9$ and $X_{10}$ are endogenous.

Table 4
Monte Carlo Bias and Mean Square Error (MSE) with three endogenous variables (E = 3), i.e., $X_1$, $X_2$ and $X_3$ are endogenous.

Table 5
Monte Carlo Bias and Mean Square Error (MSE) with three endogenous variables (E = 3), i.e., $X_{13}$, $X_{14}$ and $X_{15}$ are endogenous.

Table 6
Real data: average Bias and average Mean Square Error (MSE) with one endogenous variable.