New estimators for estimating population total: an application to water demand in Thailand under unequal probability sampling without replacement for missing data

Environmental Science

Introduction

Rising demand for water is a serious concern because the water supply is shrinking. Water demand is increasing for many reasons, including rapid human population growth and climate change. The world's water resources consist mostly of seawater rather than available fresh water or rainwater, and pollution further reduces the amount of clean water. Many developing countries face water scarcity and flooding due to climate change, which can undermine their economic sustainability and lead to unsafe conditions and poor population health. Freshwater is used for many purposes, including household use, business and industry, and agriculture. Thailand is one of the developing countries that mainly uses freshwater in agriculture, a sector that accounts for the majority of the world's freshwater use. The Metropolitan Waterworks and the Provincial Waterworks are the organizations responsible for producing, delivering, and distributing the water supply to all provinces in Thailand, while also providing water resources. The former is responsible for Bangkok, Nonthaburi, and Samut Prakan and the latter for the rest of the country. Some water consumption data are missing from the database system, which could lead to incorrect interpretation. Missing data, or nonresponse, should therefore be taken into account before further analysis so that the conclusions are more reliable.

If this issue is not addressed, water shortage could lead to repercussions in the future and it would be harmful for human life because of a lack of clean water to use. The management of water resources to avoid facing water scarcity needs to be taken into consideration. Knowledge of the gap between the demand and supply of water could accommodate the strategies and policy planning for the world to be prepared for sustainable water management in order to provide sufficient water according to the demand. Estimating the water demand can benefit future planning to avoid water shortage. Bakker et al. (2014) investigated three models to forecast water demand in both cases with the model using weather input and not using it. The simulation results found that the model using weather input gave a maximum 11 percent of the errors which is essential in water supply system control and detecting irregularity. Huang & Lin (2017) proposed a system dynamics model for studying demand and supply for water resources to avoid water shortage in China. The model had been used to estimate demand and supply for Shandong in China for the next 15 years. Rainwater is also one of the sources of water usage. Boretti & Rosa (2019) examined the correlation between water demand, population growth, and economic growth to estimate water scarcity in the world by 2050. They found that the demand for water is growing even more than the growth of the population and economy along with a low quality of resources and water to use. Kaewprasert, Niwitpong & Niwitpong (2022) proposed confidence intervals for estimation of the mean of delta-gamma distribution using the Bayesian method and applied it to rainfall data in Chiang Mai Thailand.

The ratio estimator, a biased estimator, is a popular method for estimating the population total ($Y$) or population mean ($\bar{Y}$) of a study variable ($y$) when information on an auxiliary variable ($x$) exists and is highly positively correlated with $y$. The ratio estimator was introduced by Cochran (1940) under simple random sampling without replacement (SRSWOR). Its mean square error and bias are investigated by using the first-order Taylor linearization approach to transform the ratio estimator into a linear estimator, from which the properties of the ratio estimator can be approximated. Sisodia & Dwivedi (1981) proposed a ratio estimator for when the population coefficient of variation ($C_x$) of $x$ is known. A ratio estimator for when the kurtosis ($\beta_2(x)$) is known was proposed by H. P. Singh & M. S. Kakran (1993, unpublished data). Upadhyaya & Singh (1999) suggested ratio-type estimators of the population mean when both $C_x$ and $\beta_2(x)$ are known. Bhushan & Kumar (2022) suggested some classes of population mean estimators based on the optimum value of a constant to improve the efficiency of the estimators under ranked set sampling. The ratio estimators of Cochran (1940), Sisodia & Dwivedi (1981), H. P. Singh & M. S. Kakran (1993, unpublished data) and Upadhyaya & Singh (1999) require the population mean of $x$, $\bar{X}$, in order to estimate $\bar{Y}$. Therefore, Perri (2004) proposed an alternative ratio estimator, the regression-in-ratio estimator, for estimating $\bar{Y}$. The estimator of Perri (2004) does not require $\bar{X}$ because it estimates this value with a regression estimator. In other words, if the auxiliary variable $x$ is correlated with another auxiliary variable $u$, then $\bar{X}$ can be estimated from $u$ using a regression estimator. The Perri (2004) estimator is thus a function of two estimators, those of $\bar{Y}$ and $\bar{X}$. 
In the context of unequal probability sampling without replacement (UPWOR), Bacanli & Kadilar (2008) modified the ratio estimators under SRSWOR by estimating the population mean of y and x under SRSWOR using the Horvitz & Thompson (1952) type estimators. The variance and associated estimators of Bacanli & Kadilar’s (2008) estimator can be obtained by using a Taylor linearization approach and method from Horvitz & Thompson (1952). Lawson (2021) suggested a general class of ratio estimators for population mean in the form of a combined estimator making use of known auxiliary variables such as the coefficient of variation, coefficient of skewness, coefficient of kurtosis and so on. The Lawson (2021) estimator performed well giving a smaller mean square error especially for a small sample size.

The ratio estimators for the full response case cannot be used to estimate the population mean or population total of $y$ when some sampled units do not respond. Cochran (1977) considered the ratio estimator under SRSWOR for estimating $\bar{Y}$ when information on $x$ is available for all sample units and $\bar{X}$ is known, but some values of $y$ in the sample are missing.

Later, ratio estimators and their properties when nonresponse occurs in both $y$ and $x$ but $\bar{X}$ is known under SRSWOR were proposed by Rao (1986, 1987), Khare & Srivastava (1997), Okafor & Lee (2000), Särndal & Lundström (2005) and Kumar (2015). Lawson (2017) introduced estimators of the population total and population mean, together with their variance estimators, under probability proportional to size with replacement sampling in the presence of nonresponse. The Lawson estimators are approximately unbiased and do not require the response propensity when nonresponse is uniform and the sampling fraction is small. Under UPWOR, when information on $x$ is available for all sample units and $\bar{X}$ is known, Ponkaew & Lawson (2018) proposed a ratio estimator for the population total of $y$ under uniform nonresponse. The variance and associated estimators are also discussed under the reverse framework when the sampling fraction is negligible. In the same year, Ponkaew (2018) proposed a linear generalized regression (GREG) estimator for the population total when information about calibration variables $u_1,u_2,\ldots,u_q$ exists. The estimator of Ponkaew (2018) is nonlinear, so an automated linearization approach was used to transform it to a linear form. Consequently, its variance and variance estimators can be approximated from the linear estimator. The ratio estimators in the presence of nonresponse require the value of $\bar{X}$ both when nonresponse occurs in $y$ and $x$ and when it occurs only in $y$. Ponkaew (2018) considered the missing completely at random (MCAR) mechanism, which is unlikely to occur in practice. Lawson & Ponkaew (2019) suggested a new GREG estimator using the idea of Lawson (2017) under unequal probability sampling without replacement when nonresponse is missing completely at random and the sampling fraction is small and can therefore be omitted. 
However, their estimator requires joint inclusion probability which sometimes can be difficult to find. Lawson & Siripanich (2020) improved a new GREG estimator based on the idea of Lawson & Ponkaew (2019) for more flexible situations with the non-uniform nonresponse mechanism or missing at random (MAR) and where the sampling fractions are both large and small. Ponkaew & Lawson (2023) proposed a new approximately unbiased GREG estimator in the form of a ratio estimator following Ponkaew & Lawson (2018) and Lawson & Ponkaew (2019) under the same situation where nonresponse occurs under MCAR but extended it when the sampling fractions are both large and small which is in a general form.

Some researchers have suggested estimating the missing values before further analysis. For example, Shahzad et al. (2020) proposed population mean estimators for data with missing observations under SRSWOR, using robust regression for the regression coefficient estimator when outliers are present. They considered nonresponse in the study variable only and in both the study and auxiliary variables, with the population mean of the auxiliary variable either known or unknown. Anas et al. (2022) also suggested ratio-type regression estimators in the presence of nonresponse in three situations similar to those of Shahzad et al. (2020), but used quantile regression in the mean estimator when outliers are present. Chodjuntug & Lawson (2022a) suggested a new imputation method to create a population mean estimator when data are missing in the study variable and applied it to estimate fine particulate matter in Bangkok, Thailand. They suggested applying two constants to minimize the mean square error of the population mean estimator. Chodjuntug & Lawson (2022b) developed a new estimator by adjusting that of Chodjuntug & Lawson (2022a) using the response rate and the constant that minimizes the mean square error (MSE) of their proposed estimator. Their estimator with the MSE-minimizing constant performed the best. Bhushan et al. (2022) proposed logarithmic imputation methods for estimating the population mean under SRSWOR with missing data.

In this article, we propose new ratio estimators by extending the Ponkaew & Lawson (2018) estimator to situations where $\bar{X}$ is known or unknown and nonresponse occurs in both variables $y$ and $x$. When $\bar{X}$ is unknown, we use the concept from Perri (2004) to estimate its value from the calibration variables $u_1,u_2,\ldots,u_q$ using the linear GREG estimator of Ponkaew (2018). The variance and associated estimators of the proposed estimators are investigated under the reverse framework. Furthermore, the proposed ratio estimators are considered under both the missing at random (MAR) mechanism, which is more likely to occur in practice, and the MCAR nonresponse mechanism. Finally, we compare the efficiency of the proposed estimators and their variance estimators between the MAR and MCAR mechanisms through a simulation study and an application to water demand data in Thailand.

Materials and Methods

Basic setup

In this section, we introduce notation and basic notions for the population total estimator and its variance estimators under the reverse framework. Let $y$ be a study variable with population total $Y=\sum_{i\in U}y_i$, where $U=\{1,2,\ldots,N\}$ and $N$ is the population size. Suppose the auxiliary variables $x$, $w$ and the size variable $k$ are available and highly positively correlated with the study variable. The calibration variables $u_1,u_2,\ldots,u_q$, where $q\ge 1$, are also available and are correlated with the auxiliary variable $x$. Let $\mathbf{u}_i=(1,u_{i1},\ldots,u_{iq})'$ and let $\mathbf{U}_N=(\mathbf{u}_1,\mathbf{u}_2,\ldots,\mathbf{u}_N)'$ be the $N\times(q+1)$ matrix of the values of $\mathbf{u}_i$. We use the GREG estimator model from Särndal, Swensson & Wretman (1992) and Särndal (2007) with the linear assisting model $\xi$: $E_\xi(x_i)=\boldsymbol{\beta}'\mathbf{u}_i$ and $V_\xi(x_i)=\sigma_i^2$. The linear assisting model $\xi$ describes the relationship between the study variable and the auxiliary variable. Let $q_i$ be determined by the linear assisting model $\xi$, that is, $q_i=1/\sigma_i^2$. The standard choice is $q_i=1$, which corresponds to the linear assisting model $\xi$: $E_\xi(x_i)=\boldsymbol{\beta}'\mathbf{u}_i$ and $V_\xi(x_i)=\sigma^2$.

Let $F$ be the set of all possible subsets of $U$, and let the sample $s$ of size $n$ be selected from the population $U$ under UPWOR. A sampling design $p(\cdot)$ is a probability distribution on $F$, i.e., $p(s)\ge 0$ for all $s\in F$ and $\sum_{s\in F}p(s)=1$. Let $\pi_i=P(i\in s)=\sum_{s\ni i}p(s)$ be the first-order inclusion probability and $\pi_{ij}=P(i\wedge j\in s)=\sum_{s\supset\{i,j\}}p(s)$ be the second-order inclusion probability. We also define $E_S(\cdot)$ and $V_S(\cdot)$ as the expectation and variance operators with respect to the UPWOR sampling design.

In the presence of nonresponse, let the subscript $R$ denote the nonresponse mechanism and let $r_i$ be the response indicator of $y_i$, where $r_i=1$ if unit $i$ responds to item $y$ and $r_i=0$ otherwise. Let $\mathbf{R}=(r_1,r_2,\ldots,r_N)'$ be the vector of response indicators and $p_i=P(r_i=1)$ be the response probability under MAR nonresponse. Let $E_R(\cdot)$ and $V_R(\cdot)$ be the expectation and variance operators with respect to the nonresponse mechanism. Three assumptions are made: (A1) the response mechanism is uniform; (A2) $\hat{\boldsymbol{\beta}}_r-\boldsymbol{\beta}=O_p(n_r^{-1/2})$; and (A3) $V\left(\sum_{i\in s}b_i/\pi_i\right)\to 0$ as $n\to\infty$, where $b_i=w_i$ or $r_i$. We also consider three conditions for investigating the estimator of $Y=\sum_{i\in U}y_i$: (B1) nonresponse occurs only in $y$, the information on $x_i$ is available for all $i\in s$ and $\bar{X}$ is known; (B2) nonresponse occurs in both $y$ and $x$ and $\bar{X}$ is known; and (B3) nonresponse occurs in both $y$ and $x$ and $\bar{X}$ is unknown, but information on $u_1,u_2,\ldots,u_q$ is available for all $i\in s$ and $\bar{U}_j=\frac{1}{N}\sum_{i\in U}u_{ij}$, $j=1,2,\ldots,q$, are known.

Throughout this article, we consider variance estimation of the population total estimator in the presence of nonresponse under the reverse framework. We therefore describe three steps for deriving the variance, and its estimator, of a nonlinear estimator such as the ratio estimator when nonresponse occurs in the study variable. Assume that we have $K$ variables consisting of $t_1$, the study variable, and $t_2,t_3,\ldots,t_K$, the auxiliary variables. Let $\hat{Y}_s$ be a nonlinear estimator defined by

$$\hat{Y}_s=\psi(\hat{\mathbf{T}}),$$

where $\psi$ is a known smooth function, $\hat{\mathbf{T}}=[\hat{t}_1\ \hat{t}_2\ \cdots\ \hat{t}_K]$, $K\ge 2$, and $\hat{t}_k=\sum_{i\in s}\frac{r_it_{ki}}{\pi_ip_i}$ if the variable $t_k$ exhibits nonresponse, otherwise $\hat{t}_k=\sum_{i\in s}\frac{t_{ki}}{\pi_i}$. Under the reverse framework, the variance of $\hat{Y}_s$ can be obtained by

$$V(\hat{Y}_s)=E_RV_S(\hat{Y}_s\mid\mathbf{R})+V_RE_S(\hat{Y}_s\mid\mathbf{R})=V_1+V_2,$$

where $V_1=E_RV_S(\hat{Y}_s\mid\mathbf{R})$ and $V_2=V_RE_S(\hat{Y}_s\mid\mathbf{R})$. The formula for $V(\hat{Y}_s)$ is derived in three steps, as follows.

Step 1: Investigate the formula for $V_1=E_RV_S(\hat{Y}_s\mid\mathbf{R})$. Since $\hat{Y}_s$ is a nonlinear estimator, $V_1$ can be approximated by

$$V_1\approx E_RV_S(\hat{Y}_s^{(1)}\mid\mathbf{R}),$$

where $\hat{Y}_s^{(1)}$ is the linear estimator of $\hat{Y}_s$ obtained with the Taylor linearization approach.

Step 2: Investigate the formula for $V_2=V_RE_S(\hat{Y}_s\mid\mathbf{R})$, which can be approximated by

$$V_2\approx V_R(\tilde{Y}_s\mid\mathbf{R}),$$

where $\tilde{Y}_s=E_S(\hat{Y}_s\mid\mathbf{R})$.

Step 3: Approximate the value of $V(\hat{Y}_s)$ and its estimator. The value of $V(\hat{Y}_s)$ can be obtained by

$$V(\hat{Y}_s)=V_1+V_2.$$

The estimator of $V(\hat{Y}_s)$ can be obtained by substituting estimators for the unknown parameters in (5). The estimator of $V(\hat{Y}_s)$ is then defined by

$$\hat{V}(\hat{Y}_s)=\hat{V}_1+\hat{V}_2,$$

where $\hat{V}_1$ and $\hat{V}_2$ are the estimators of $V_1$ and $V_2$, respectively.
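As a small worked illustration of Step 2 (a sketch, not part of the original derivation), consider the simplest linear case, a single response-adjusted expansion estimator of a total. The independence of the response indicators assumed below is our own simplifying assumption for this example; under it, the second variance component reduces to a sum of terms of the form $(1-p_i)y_i^2/p_i$, which is the shape of the nonresponse term that appears in the variance formulas later in the article.

```latex
% Assumption (ours, for illustration): r_1, ..., r_N are independent Bernoulli(p_i).
\hat{t} = \sum_{i \in s} \frac{r_i y_i}{\pi_i p_i}
\quad\Longrightarrow\quad
E_S(\hat{t} \mid \mathbf{R}) = \sum_{i \in U} \frac{r_i y_i}{p_i},
\qquad
V_R E_S(\hat{t} \mid \mathbf{R})
  = \sum_{i \in U} \frac{y_i^2}{p_i^2}\, p_i (1 - p_i)
  = \sum_{i \in U} \frac{1 - p_i}{p_i}\, y_i^2 .
```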

Existing estimators under uniform nonresponse

Uniform nonresponse or missing completely at random (MCAR) is a nonresponse mechanism in which the probability of response of the study variable y neither depends on itself nor another variable such as x,k or w. In this section, we discuss two estimators for estimating population total in the presence of uniform nonresponse namely ratio and GREG estimators proposed by Ponkaew & Lawson (2018) and Ponkaew (2018), respectively. The variance estimation of both ratio and GREG estimators are considered under the reverse framework and the sampling fraction is negligible with the UPWOR sampling design.

The ratio estimator

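When nonresponse occurs only in $y$ but the population mean of $x$ and its estimator are available, Ponkaew & Lawson (2018) proposed ratio estimators of the population mean and total of $y$ under unequal probability sampling without replacement with an MCAR nonresponse mechanism. The Ponkaew & Lawson (2018) estimator of the population mean is

$$\hat{\bar{Y}}_R=\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip}}{\frac{1}{N}\sum_{i\in s}\frac{x_i}{\pi_i}}\bar{X}=\frac{\hat{\bar{Y}}_r}{\hat{\bar{X}}_{HT}}\bar{X}=\hat{R}_r\bar{X},$$

where $\hat{\bar{Y}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip}$, $\hat{\bar{X}}_{HT}=\frac{1}{N}\sum_{i\in s}\frac{x_i}{\pi_i}$ and $\hat{R}_r=\hat{\bar{Y}}_r(\hat{\bar{X}}_{HT})^{-1}$. Ponkaew & Lawson's (2018) estimator of the population total is

$$\hat{Y}_R=N\hat{\bar{Y}}_R=N\bar{X}\hat{R}_r.$$

We note that if $p$ is unknown, it is estimated by $\hat{p}=\left(\sum_{i\in s}\frac{r_i}{\pi_i}\right)\left(\sum_{i\in s}\frac{1}{\pi_i}\right)^{-1}$. The variance of the estimator in (8) is given in (9),

$$V(\hat{Y}_R)=\sum_{i\in U}D_i(y_i-Rx_i)^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}(y_i-Rx_i)(y_j-Rx_j)+\sum_{i\in U}\frac{1-p_i}{p_i}y_i^2,$$

where $R=\bar{Y}\bar{X}^{-1}$. The estimator of $V(\hat{Y}_R)$ is given in (10),

$$\hat{V}(\hat{Y}_R)=\sum_{i\in s}\hat{D}_i(y_i-\hat{R}_rx_i)^2+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}(y_i-\hat{R}_rx_i)(y_j-\hat{R}_rx_j)+\sum_{i\in s}\hat{E}_iy_i^2,$$

where $\hat{R}_r=\sum_{i\in s}\frac{r_iy_i}{\pi_ip}\left(\sum_{i\in s}\frac{x_i}{\pi_i}\right)^{-1}$, $\hat{E}_i=\frac{r_iE_i}{\pi_i}$, $\hat{D}_i=\frac{r_iD_i}{p}$ and $\hat{D}_{ij}=\frac{r_ir_jD_{ij}}{p^2}$.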
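The point estimator above can be sketched in a few lines of Python. This is a minimal illustration under the stated MCAR setting, not the authors' implementation: the function name `mcar_ratio_total`, its argument names and the toy data are hypothetical, and $x$ is assumed to be observed for every sampled unit.

```python
def mcar_ratio_total(y, x, pi, r, x_bar, N, p=None):
    """Ratio estimator of the population total of y under uniform nonresponse.

    y[i] may be None when r[i] == 0; x is assumed observed for all sampled units.
    """
    if p is None:
        # p_hat = (sum_s r_i / pi_i) / (sum_s 1 / pi_i)
        p = sum(ri / w for ri, w in zip(r, pi)) / sum(1.0 / w for w in pi)
    # (1/N) * sum over respondents of y_i / (pi_i * p)
    y_bar_r = sum(yi / (w * p) for yi, w, ri in zip(y, pi, r) if ri) / N
    # Horvitz-Thompson-type estimator of the mean of x
    x_bar_ht = sum(xi / w for xi, w in zip(x, pi)) / N
    return N * x_bar * y_bar_r / x_bar_ht


# Toy data (hypothetical): the second sampled unit did not respond on y.
pi = [0.5, 0.5, 0.5, 0.5]
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, None, 6.0, 8.0]
r = [1, 0, 1, 1]
total = mcar_ratio_total(y, x, pi, r, x_bar=2.5, N=8)

# With full response and y exactly proportional to x, the estimator returns 2 * N * x_bar.
full = mcar_ratio_total([2.0, 4.0, 6.0, 8.0], x, pi, [1, 1, 1, 1], x_bar=2.5, N=8)
```

Passing `p=None` triggers the estimator $\hat{p}$ from the text; a known response probability can be supplied instead.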

The GREG estimator

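The GREG estimator is a powerful method for estimating the population mean or population total of the study variable when the calibration variables $u_1,u_2,\ldots,u_q$, where $q\ge 1$, are available. In full response, Särndal, Swensson & Wretman (1992) and Särndal (2007) proposed a GREG estimator under the linear assisting model $\xi$,

$$E_\xi(x_i)=\boldsymbol{\beta}'\mathbf{u}_i\quad\text{and}\quad V_\xi(x_i)=\sigma_i^2.$$

Let $\mathbf{Q}_s=\mathrm{diag}(q_i)_{s\times s}$, where $q_i$ is determined by the linear assisting model $\xi$ in (5.1), i.e., $q_i=\sigma_i^{-2}$. In the presence of nonresponse, Särndal & Lundström (2005) proposed a linear GREG estimator of the population total. They investigated its variance and associated estimators under the two-phase framework. Ponkaew (2018) proposed a linear GREG estimator of the population mean of $x$ under the MCAR mechanism, defined by

$$\hat{\bar{X}}_{GREG}^{(1)}=\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip}+\left[\bar{\mathbf{U}}-\frac{1}{N}\sum_{i\in s}\frac{r_i\mathbf{u}_i}{\pi_ip}\right]'\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_i\mathbf{u}_i'}{\pi_ip}\right)^{-1}\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_ix_i}{\pi_ip}\right)=\hat{\bar{X}}_r^{(1)}+\left[\bar{\mathbf{U}}-\hat{\bar{\mathbf{U}}}_r^{(1)}\right]'\hat{\boldsymbol{\beta}}_r,$$

where $\hat{\bar{X}}_r^{(1)}=\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip}$, $\hat{\bar{\mathbf{U}}}_r^{(1)}=\frac{1}{N}\sum_{i\in s}\frac{r_i\mathbf{u}_i}{\pi_ip}$, and $\hat{\boldsymbol{\beta}}_r=\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_i\mathbf{u}_i'}{\pi_ip}\right)^{-1}\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_ix_i}{\pi_ip}\right)=\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_i\mathbf{u}_i'}{\pi_i}\right)^{-1}\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_ix_i}{\pi_i}\right)$.

Then, the GREG estimator of the population total of $x$ is

$$\hat{X}_{GREG}^{(1)}=N\hat{\bar{X}}_{GREG}^{(1)}=\sum_{i\in s}\frac{r_ix_i}{\pi_ip}+\left[\mathbf{U}-\sum_{i\in s}\frac{r_i\mathbf{u}_i}{\pi_ip}\right]'\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_i\mathbf{u}_i'}{\pi_ip}\right)^{-1}\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_ix_i}{\pi_ip}\right)=\hat{X}_r^{(1)}+\left[\mathbf{U}-\hat{\mathbf{U}}_r^{(1)}\right]'\hat{\boldsymbol{\beta}}_r,$$

where $\hat{X}_r^{(1)}=N\hat{\bar{X}}_r^{(1)}=\sum_{i\in s}\frac{r_ix_i}{\pi_ip}$ and $\hat{\mathbf{U}}_r^{(1)}=N\hat{\bar{\mathbf{U}}}_r^{(1)}=\sum_{i\in s}\frac{r_i\mathbf{u}_i}{\pi_ip}$.

Under the reverse framework and when the sampling fraction is negligible, the variance of $\hat{X}_{GREG}^{(1)}$ is

$$V(\hat{X}_{GREG}^{(1)})\approx\sum_{i\in U}D_{1i}e_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}e_ie_j,$$

where $D_{1i}=\frac{1-\pi_i}{\pi_ip}$, $D_{ij}=\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}$ and $e_i=x_i-\mathbf{u}_i'\boldsymbol{\beta}$.

The estimator of $V(\hat{X}_{GREG}^{(1)})$ is

$$\hat{V}(\hat{X}_{GREG}^{(1)})\approx\frac{1}{p^2}\left[\sum_{i\in s}\hat{D}_ir_i\hat{e}_i^2+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}r_ir_j\hat{e}_i\hat{e}_j\right],$$

where $\hat{e}_i=x_i-\mathbf{u}_i'\hat{\boldsymbol{\beta}}_r$, $\hat{D}_i=\frac{1-\pi_i}{\pi_i^2}$ and $\hat{D}_{ij}=\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}\pi_i\pi_j}$. If $p$ is unknown, it is replaced by its estimator $\hat{p}=\left(\sum_{i\in s}\frac{r_i}{\pi_i}\right)\left(\sum_{i\in s}\frac{1}{\pi_i}\right)^{-1}$.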
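To make the GREG point estimator concrete, the sketch below computes the total estimator for the special case of a single calibration variable plus an intercept, so that the matrix inverse reduces to a 2 x 2 system. This restriction, the function name `greg_total_x` and the toy data are our own assumptions for illustration; the general case would use a matrix solver.

```python
def greg_total_x(x, u, pi, r, u_total, N, p=None, q=None):
    """GREG estimator of the total of x under uniform nonresponse.

    Assumes one calibration variable u with an intercept, i.e. u_i = (1, u_i1)'.
    x[i] may be None when r[i] == 0.
    """
    n = len(pi)
    q = [1.0] * n if q is None else q
    if p is None:
        p = sum(ri / w for ri, w in zip(r, pi)) / sum(1.0 / w for w in pi)
    resp = [i for i in range(n) if r[i]]
    # Weighted normal equations for beta_hat = (b0, b1); the constant p cancels.
    s00 = sum(q[i] / pi[i] for i in resp)
    s01 = sum(q[i] * u[i] / pi[i] for i in resp)
    s11 = sum(q[i] * u[i] ** 2 / pi[i] for i in resp)
    t0 = sum(q[i] * x[i] / pi[i] for i in resp)
    t1 = sum(q[i] * u[i] * x[i] / pi[i] for i in resp)
    det = s00 * s11 - s01 ** 2
    b0 = (s11 * t0 - s01 * t1) / det
    b1 = (s00 * t1 - s01 * t0) / det
    # Response-adjusted expansion estimators of N, the total of u and the total of x
    n_hat = sum(1.0 / (pi[i] * p) for i in resp)
    u_hat = sum(u[i] / (pi[i] * p) for i in resp)
    x_hat = sum(x[i] / (pi[i] * p) for i in resp)
    return x_hat + (N - n_hat) * b0 + (u_total - u_hat) * b1


# Toy data (hypothetical): x = 3 + 2u exactly, third unit is a nonrespondent.
u = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [5.0, 7.0, None, 11.0, 13.0]
r = [1, 1, 0, 1, 1]
pi = [0.2, 0.4, 0.5, 0.4, 0.25]
x_total_hat = greg_total_x(x, u, pi, r, u_total=55.0, N=20)
```

When the assisting model fits perfectly, as in this toy example, the estimator reproduces $3N+2U$ regardless of the inclusion and response probabilities, which is a convenient sanity check for an implementation.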

Results and discussion

The proposed new ratio estimators

In the previous section, we introduced two estimators of the population total, the ratio and GREG estimators, in the presence of uniform nonresponse. Variance estimation for both estimators is considered under the UPWOR sampling design when the sampling fraction is negligible. However, the ratio estimators in (7) and (8) apply only when nonresponse occurs in $y$ alone, and they require the population mean of $x$. In this section, we therefore propose new ratio estimators for when nonresponse occurs in both variables $y$ and $x$. We also consider two distinct situations: $\bar{X}$ known and $\bar{X}$ unknown. When $\bar{X}$ is unknown, we estimate it from the calibration variables $u_1,u_2,\ldots,u_q$ using the GREG estimator. We investigate the proposed ratio estimators under the MAR mechanism because it relies on weaker assumptions and tends to occur in real life more often than the MCAR mechanism. However, we still consider the new ratio estimators under the MCAR mechanism in order to compare the efficiency of the proposed estimators. First, we extend the Ponkaew & Lawson (2018) estimators to the MAR mechanism. The ratio estimator of Ponkaew & Lawson (2018) for the population mean under the MAR mechanism is

$$\hat{\bar{Y}}_R^{(1)}=\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}}{\frac{1}{N}\sum_{i\in s}\frac{x_i}{\pi_i}}\bar{X}=\frac{\hat{\bar{Y}}_r^{(1)}}{\hat{\bar{X}}_{HT}}\bar{X}=\hat{R}_r^{(1)}\bar{X},$$

where $\hat{\bar{Y}}_r^{(1)}=\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}$, $\hat{\bar{X}}_{HT}=\frac{1}{N}\sum_{i\in s}\frac{x_i}{\pi_i}$ and $\hat{R}_r^{(1)}=\hat{\bar{Y}}_r^{(1)}(\hat{\bar{X}}_{HT})^{-1}$.

Then, the ratio estimator of the population total under the MAR mechanism is

$$\hat{Y}_R^{(1)}=N\hat{\bar{Y}}_R^{(1)}=N\bar{X}\hat{R}_r^{(1)}.$$

Under the MAR mechanism, if $p_i$ is unknown, it is estimated using a probit or logistic regression model. The variance and associated estimators of $\hat{Y}_R^{(1)}$ are discussed in Theorem 1.
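The response probabilities under MAR can be estimated in several ways; the sketch below shows one possibility, an unweighted logistic regression of the response indicator on a single covariate, fitted by Newton-Raphson. The choice of an unweighted fit with one covariate, the function names and the toy data are assumptions made for illustration; the article does not specify these details.

```python
import math


def fit_logistic(w, r, iters=25):
    """Newton-Raphson fit of P(r_i = 1) = 1 / (1 + exp(-(b0 + b1 * w_i)))."""
    b0 = b1 = 0.0
    for _ in range(iters):
        ph = [1.0 / (1.0 + math.exp(-(b0 + b1 * wi))) for wi in w]
        # Score vector of the Bernoulli log-likelihood
        g0 = sum(ri - pi_ for ri, pi_ in zip(r, ph))
        g1 = sum(wi * (ri - pi_) for wi, ri, pi_ in zip(w, r, ph))
        # Entries of the 2 x 2 information matrix
        v = [pi_ * (1.0 - pi_) for pi_ in ph]
        h00 = sum(v)
        h01 = sum(vi * wi for vi, wi in zip(v, w))
        h11 = sum(vi * wi * wi for vi, wi in zip(v, w))
        det = h00 * h11 - h01 ** 2
        # Newton step: beta <- beta + H^{-1} g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def response_propensities(w, b0, b1):
    """Fitted response probabilities p_hat_i for each sampled unit."""
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * wi))) for wi in w]


# Toy data (hypothetical): covariate w observed for all units, r is the response flag.
w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
r = [0, 1, 0, 1, 1, 0, 1, 1]
b0, b1 = fit_logistic(w, r)
p_hat = response_propensities(w, b0, b1)
```

The fitted `p_hat` values can then take the place of $p_i$ in the estimators of this section. A production analysis would normally rely on a statistical package rather than a hand-written fit and would check for separation in the data.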

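Theorem 1. Under condition (B1) with the reverse framework, and when the nonresponse mechanism is MAR:

(1) The variance of $\hat{Y}_R^{(1)}$ is

$$V(\hat{Y}_R^{(1)})\approx\sum_{i\in U}D_iA_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}A_iA_j+\sum_{i\in U}E_iy_i^2,$$

where $A_i=y_i-Rx_i$, $R=\bar{Y}\bar{X}^{-1}$ and $E_i=\frac{1-p_i}{p_i}$.

(2) The estimator of $V(\hat{Y}_R^{(1)})$ is

$$\hat{V}(\hat{Y}_R^{(1)})\approx\sum_{i\in s}\hat{D}_i\hat{A}_i^{(1)2}+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{A}_i^{(1)}\hat{A}_j^{(1)}+\sum_{i\in s}\hat{E}_iy_i^2,$$

where $\hat{A}_i^{(1)}=y_i-\hat{R}_r^{(1)}x_i$, $\hat{R}_r^{(1)}=\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}\left(\sum_{i\in s}\frac{x_i}{\pi_i}\right)^{-1}$, $\hat{E}_i=\frac{r_iE_i}{\pi_i}$, $\hat{D}_i=\frac{r_iD_i}{p_i}$ and $\hat{D}_{ij}=\frac{r_ir_jD_{ij}}{p_ip_j}$.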

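Proof. Let $\hat{Y}_R^{(1)}$ be defined as in (17). The variance of $\hat{Y}_R^{(1)}$ is

$$V(\hat{Y}_R^{(1)})=V(N\bar{X}\hat{R}_r^{(1)})=N^2\bar{X}^2V(\hat{R}_r^{(1)}).$$

Furthermore, the estimator of $V(\hat{Y}_R^{(1)})$ can be obtained by

$$\hat{V}(\hat{Y}_R^{(1)})=N^2\bar{X}^2\hat{V}(\hat{R}_r^{(1)}).$$

Since $\hat{R}_r^{(1)}$ is a nonlinear estimator, its variance is

$$V(\hat{R}_r^{(1)})=E_RV_S(\hat{R}_r^{(1)}\mid\mathbf{R})+V_RE_S(\hat{R}_r^{(1)}\mid\mathbf{R})=V_1+V_2,$$

where $V_1=E_RV_S(\hat{R}_r^{(1)}\mid\mathbf{R})$ and $V_2=V_RE_S(\hat{R}_r^{(1)}\mid\mathbf{R})$.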

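Step 1: Investigate the formula for $V_1=E_RV_S(\hat{R}_r^{(1)}\mid\mathbf{R})$.

Using the Taylor linearization approach, the linear estimator of $\hat{R}_r^{(1)}$ is

$$\hat{R}_r^{(1)}\approx\text{Constant}+\frac{1}{N\bar{X}}\sum_{i\in s}\frac{\tilde{A}_i}{\pi_i},$$

where $\tilde{A}_i=\frac{r_iy_i}{p_i}-\tilde{R}_r^{(1)}x_i$. Then $V_1=E_RV_S(\hat{R}_r^{(1)}\mid\mathbf{R})$ can be approximated by

$$\begin{aligned}V_1&\approx E_RV_S(\hat{R}_r^{(1)}\mid\mathbf{R})=E_RV_S\left(\text{Constant}+\frac{1}{N\bar{X}}\sum_{i\in s}\frac{\tilde{A}_i}{\pi_i}\,\Big|\,\mathbf{R}\right)\\&=\frac{1}{N^2\bar{X}^2}E_R\left(\sum_{i\in U}D_i\tilde{A}_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}\tilde{A}_i\tilde{A}_j\,\Big|\,\mathbf{R}\right)\\&=\frac{1}{N^2\bar{X}^2}\left(\sum_{i\in U}D_iA_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}A_iA_j\right),\end{aligned}$$

where $A_i=E_RV_S(\tilde{A}_i\mid\mathbf{R})=y_i-Rx_i$ and $R=\bar{Y}\bar{X}^{-1}$.

Therefore,

$$V_1\approx\frac{1}{N^2\bar{X}^2}\left(\sum_{i\in U}D_iA_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}A_iA_j\right).$$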

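Step 2: Investigate the formula for $V_2=V_RE_S(\hat{R}_r^{(1)}\mid\mathbf{R})$.

The formula for $V_2=V_RE_S(\hat{R}_r^{(1)}\mid\mathbf{R})$ can be approximated by

$$V_2\approx V_RE_S(\hat{R}_r^{(1)}\mid\mathbf{R})=V_RE_S\left(\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}}{\frac{1}{N}\sum_{i\in s}\frac{x_i}{\pi_i}}\,\Big|\,\mathbf{R}\right)=V_R\left(\frac{\frac{1}{N}\sum_{i\in U}\frac{r_iy_i}{p_i}}{\bar{X}}\,\Big|\,\mathbf{R}\right)=\frac{1}{N^2\bar{X}^2}\sum_{i\in U}\frac{(1-p_i)y_i^2}{p_i}=\frac{1}{N^2\bar{X}^2}\sum_{i\in U}E_iy_i^2,$$

where $E_i=\frac{1-p_i}{p_i}$.

Then,

$$V_2\approx\frac{1}{N^2\bar{X}^2}\sum_{i\in U}E_iy_i^2.$$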

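Step 3: Approximate the value of $V(\hat{R}_r^{(1)})$ and its estimator.

The value of $V(\hat{R}_r^{(1)})$ can be approximated by

$$V(\hat{R}_r^{(1)})\approx V_1+V_2=\frac{1}{N^2\bar{X}^2}\left(\sum_{i\in U}D_iA_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}A_iA_j+\sum_{i\in U}E_iy_i^2\right).$$

The estimator of $V(\hat{R}_r^{(1)})$ is

$$\hat{V}(\hat{R}_r^{(1)})=\frac{1}{N^2\bar{X}^2}\left(\sum_{i\in s}\hat{D}_i\hat{A}_i^{(1)2}+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{A}_i^{(1)}\hat{A}_j^{(1)}+\sum_{i\in s}\hat{E}_iy_i^2\right).$$

Substituting (25) into (18), the variance of $\hat{Y}_R^{(1)}$ is

$$V(\hat{Y}_R^{(1)})\approx\sum_{i\in U}D_iA_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}A_iA_j+\sum_{i\in U}E_iy_i^2.$$

Furthermore, the estimator of $V(\hat{Y}_R^{(1)})$ is obtained by substituting (26) into (19):

$$\hat{V}(\hat{Y}_R^{(1)})\approx\sum_{i\in s}\hat{D}_i\hat{A}_i^{(1)2}+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{A}_i^{(1)}\hat{A}_j^{(1)}+\sum_{i\in s}\hat{E}_iy_i^2.$$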

In (16) and (17), we extended the ratio estimators of Ponkaew & Lawson (2018) to the MAR mechanism, and their variance and variance estimators were discussed in Theorem 1. However, the ratio estimators of the population mean in (16) and the population total in (17) can be used only under condition (B1), that is, when nonresponse occurs only in $y$, information on $x_i$ is available for all $i\in s$ and $\bar{X}$ is known. Next, we propose new ratio estimators under condition (B2), where nonresponse occurs in both $y$ and $x$ but $\bar{X}$ is known, and under condition (B3), where nonresponse occurs in both $y$ and $x$ and $\bar{X}$ is unknown, but information on $u_1,u_2,\ldots,u_q$ is available for all $i\in s$ and the population means of $u_1,u_2,\ldots,u_q$ are known.

The new ratio estimator when X¯ is known

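Assume that condition (B2) is satisfied, that is, nonresponse occurs in both variables $y$ and $x$ but $\bar{X}$ is known. The new ratio estimator of the population mean is

$$\hat{\bar{Y}}_R^{(2)}=\left(\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}}{\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip_i}}\right)\frac{1}{N}\sum_{i\in U}x_i=\frac{\hat{\bar{Y}}_r}{\hat{\bar{X}}_r}\bar{X}=\hat{R}_r^{(2)}\bar{X},$$

where $\hat{\bar{Y}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}$, $\hat{\bar{X}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip_i}$ and $\hat{R}_r^{(2)}=\hat{\bar{Y}}_r(\hat{\bar{X}}_r)^{-1}$. Furthermore, the estimator of $p_i$ can be obtained from a probit or logistic regression model under the MAR mechanism. Then, the new ratio estimator of the population total is

$$\hat{Y}_R^{(2)}=N\hat{\bar{Y}}_R^{(2)}=N\bar{X}\hat{R}_r^{(2)}.$$

The variance and associated estimators of $\hat{Y}_R^{(2)}$ are discussed in Theorem 2.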
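The estimator for condition (B2) differs from the (B1) version only in that the denominator is also restricted to respondents and reweighted by the response probabilities. A minimal sketch, with a hypothetical function name and toy data and with the response probabilities taken as given:

```python
def ratio_total_x_known(y, x, pi, r, p, x_bar, N):
    """Ratio estimator of the total of y when y and x are both missing for
    nonrespondents and the population mean of x (x_bar) is known.

    p[i] is the (known or previously estimated) response probability of unit i.
    """
    resp = [i for i in range(len(pi)) if r[i]]
    y_bar_r = sum(y[i] / (pi[i] * p[i]) for i in resp) / N
    x_bar_r = sum(x[i] / (pi[i] * p[i]) for i in resp) / N
    return N * x_bar * y_bar_r / x_bar_r


# Toy data (hypothetical): y = 2x for every respondent, third unit missing.
pi = [0.2, 0.4, 0.5, 0.4]
p = [0.9, 0.8, 0.7, 0.6]
r = [1, 1, 0, 1]
x = [3.0, 5.0, None, 4.0]
y = [6.0, 10.0, None, 8.0]
y_total_hat = ratio_total_x_known(y, x, pi, r, p, x_bar=4.0, N=10)
```

Because $y$ is exactly proportional to $x$ among respondents in this toy example, the result equals $2N\bar{X}$ whatever the weights are, which illustrates why the ratio form is attractive when $y$ and $x$ are strongly correlated.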

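Theorem 2. Under condition (B2) with the reverse framework, and when the nonresponse mechanism is MAR:

(1) The variance of $\hat{Y}_R^{(2)}$ is

$$V(\hat{Y}_R^{(2)})=\sum_{i\in U}D_iA_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}A_iA_j+\sum_{i\in U}E_iy_i^2,$$

where $A_i=y_i-Rx_i$, $R=\bar{Y}\bar{X}^{-1}$ and $E_i=\frac{1-p_i}{p_i}$.

(2) The estimator of $V(\hat{Y}_R^{(2)})$ is

$$\hat{V}(\hat{Y}_R^{(2)})=\sum_{i\in s}\hat{D}_i\hat{A}_i^{(2)2}+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{A}_i^{(2)}\hat{A}_j^{(2)}+\sum_{i\in s}\hat{E}_iy_i^2,$$

where $\hat{A}_i^{(2)}=y_i-\hat{R}_r^{(2)}x_i$, $\hat{R}_r^{(2)}=\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}\left(\sum_{i\in s}\frac{r_ix_i}{\pi_ip_i}\right)^{-1}$, $\hat{D}_i=\frac{r_iD_i}{\pi_ip_i}$, $\hat{D}_{ij}=\frac{r_iD_{ij}}{\pi_ip_i}$ and $\hat{E}_i=\frac{r_iE_i}{\pi_ip_i}$. If $p_i$ is unknown, it is replaced by $\hat{p}_i$, the estimator of $p_i$ from a probit or logistic regression model.

The proof of Theorem 2 is similar to that of Theorem 1.

In Theorem 2 we investigated the variance of $\hat{Y}_R^{(2)}$ and its estimator. We note that the variance formulas of $\hat{Y}_R^{(1)}$ and $\hat{Y}_R^{(2)}$ are the same, but their variance estimators differ slightly because the estimators of $A_i=y_i-Rx_i$ are different.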

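In (28) and (29) we proposed new ratio estimators of the population mean and population total of the study variable when nonresponse occurs in both $y$ and $x$ but $\bar{X}$ is known, under the MAR mechanism. The variance and its estimator were discussed in Theorem 2. Next, we consider the special case of $\hat{Y}_R^{(2)}$ in which the response probability follows the MCAR mechanism ($p_i=p$ for all $i\in U$). Under the MCAR mechanism, the common factor $p$ cancels between numerator and denominator, and the population mean estimator is

$$\hat{\bar{Y}}_R^{(2)}=\left(\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip}}{\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip}}\right)\frac{1}{N}\sum_{i\in U}x_i=\left(\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_i}}{\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_i}}\right)\frac{1}{N}\sum_{i\in U}x_i=\frac{\hat{\bar{Y}}_r}{\hat{\bar{X}}_r}\bar{X}=\hat{R}_r^{(2)}\bar{X},$$

where $\hat{\bar{Y}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_i}$, $\hat{\bar{X}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_i}$ and $\hat{R}_r^{(2)}=\hat{\bar{Y}}_r(\hat{\bar{X}}_r)^{-1}$. Then, the population total estimator is

$$\hat{Y}_R^{(2)}=N\hat{\bar{Y}}_R^{(2)}=N\bar{X}\hat{R}_r^{(2)}.$$

Finally, the variance and associated estimators of $\hat{Y}_R^{(2)}$ are discussed in Lemma 3.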

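Lemma 3. Under condition (B2) with the reverse framework, and when the nonresponse mechanism is MCAR:

(1) The variance of $\hat{Y}_R^{(2)}$ is

$$V(\hat{Y}_R^{(2)})=\sum_{i\in U}D_iA_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}A_iA_j+\sum_{i\in U}E_iy_i^2,$$

where $A_i=y_i-Rx_i$, $R=\bar{Y}\bar{X}^{-1}$ and $E_i=\frac{1-p}{p}$.

(2) The estimator of $V(\hat{Y}_R^{(2)})$ is

$$\hat{V}(\hat{Y}_R^{(2)})=\sum_{i\in s}\hat{D}_i\hat{A}_i^{(2)2}+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{A}_i^{(2)}\hat{A}_j^{(2)}+\sum_{i\in s}\hat{E}_iy_i^2,$$

where $\hat{A}_i^{(2)}=y_i-\hat{R}_r^{(2)}x_i$, $\hat{R}_r^{(2)}=\sum_{i\in s}\frac{r_iy_i}{\pi_i}\left(\sum_{i\in s}\frac{r_ix_i}{\pi_i}\right)^{-1}$, $\hat{D}_i=\frac{r_iD_i}{\pi_ip}$, $\hat{D}_{ij}=\frac{r_iD_{ij}}{\pi_ip}$ and $\hat{E}_i=\frac{r_iE_i}{\pi_ip}$. If $p$ is unknown, it is replaced by its estimator under the MCAR mechanism, $\hat{p}=\left(\sum_{i\in s}\frac{r_i}{\pi_i}\right)\left(\sum_{i\in s}\frac{1}{\pi_i}\right)^{-1}$.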

The new ratio estimator when X¯ is unknown

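Assume that condition (B3) is satisfied, that is, $\bar{X}$ is unknown and nonresponse occurs in both $y$ and $x$. However, information on the variables $u_1,u_2,\ldots,u_q$ is available for all $i\in s$ and $\bar{\mathbf{U}}$ is known. Furthermore, the variables $u_1,u_2,\ldots,u_q$ are highly correlated with $x$. We therefore extend the GREG estimator of Ponkaew (2018) to the MAR mechanism, defined by

$$\hat{\bar{X}}_{GREG}=\hat{\bar{X}}_r+\left[\bar{\mathbf{U}}-\hat{\bar{\mathbf{U}}}_r\right]'\hat{\boldsymbol{\beta}}_r,$$

where $\hat{\bar{X}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip_i}$, $\hat{\bar{\mathbf{U}}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_i\mathbf{u}_i}{\pi_ip_i}$, $\bar{\mathbf{U}}=\frac{1}{N}\sum_{i\in U}\mathbf{u}_i$ and $\hat{\boldsymbol{\beta}}_r=\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_i\mathbf{u}_i'}{\pi_ip_i}\right)^{-1}\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_ix_i}{\pi_ip_i}\right)$.

The new ratio estimator of the population mean is

$$\hat{\bar{Y}}_R^{(3)}=\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}}{\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip_i}}\hat{\bar{X}}_{GREG}.$$

Then, the new ratio estimator of the population total is

$$\hat{Y}_R^{(3)}=N\hat{\bar{Y}}_R^{(3)}.$$

The variance and associated estimators of $\hat{Y}_R^{(3)}$ are discussed in Theorem 4.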
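The estimator for condition (B3) chains two pieces: a GREG estimate of the unknown mean of $x$, followed by the ratio step. The sketch below illustrates this for a single calibration variable with an intercept, so the regression coefficients solve a 2 x 2 system. The restriction to one calibration variable, the function name and the toy data are assumptions made for illustration only.

```python
def ratio_total_x_unknown(y, x, u, pi, r, p, u_bar, N, q=None):
    """Ratio estimator of the total of y when the mean of x is unknown and is
    replaced by a GREG estimate built from one calibration variable u.

    p[i] is the response probability of unit i; u_bar is the known population
    mean of u. y[i] and x[i] may be None when r[i] == 0.
    """
    n = len(pi)
    q = [1.0] * n if q is None else q
    resp = [i for i in range(n) if r[i]]
    wgt = {i: 1.0 / (pi[i] * p[i]) for i in resp}  # 1 / (pi_i * p_i)
    # Weighted normal equations for beta_hat = (b0, b1) with u_i = (1, u_i)'
    s00 = sum(q[i] * wgt[i] for i in resp)
    s01 = sum(q[i] * u[i] * wgt[i] for i in resp)
    s11 = sum(q[i] * u[i] ** 2 * wgt[i] for i in resp)
    t0 = sum(q[i] * x[i] * wgt[i] for i in resp)
    t1 = sum(q[i] * u[i] * x[i] * wgt[i] for i in resp)
    det = s00 * s11 - s01 ** 2
    b0 = (s11 * t0 - s01 * t1) / det
    b1 = (s00 * t1 - s01 * t0) / det
    # Response-adjusted means of the constant 1, of u, of x and of y
    one_bar_r = sum(wgt[i] for i in resp) / N
    u_bar_r = sum(u[i] * wgt[i] for i in resp) / N
    x_bar_r = sum(x[i] * wgt[i] for i in resp) / N
    y_bar_r = sum(y[i] * wgt[i] for i in resp) / N
    # GREG estimate of the population mean of x
    x_bar_greg = x_bar_r + (1.0 - one_bar_r) * b0 + (u_bar - u_bar_r) * b1
    return N * x_bar_greg * y_bar_r / x_bar_r


# Toy data (hypothetical): x = 3 + 2u and y = 0.5x exactly; third unit missing.
u = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [5.0, 7.0, None, 11.0, 13.0]
y = [2.5, 3.5, None, 5.5, 6.5]
r = [1, 1, 0, 1, 1]
pi = [0.2, 0.4, 0.5, 0.4, 0.25]
p = [0.9, 0.8, 0.7, 0.6, 0.85]
y_total_hat = ratio_total_x_unknown(y, x, u, pi, r, p, u_bar=3.2, N=20)
```

In this toy example both relationships are exact, so the GREG step returns $3+2\bar{U}=9.4$ and the total estimate is $20\times 0.5\times 9.4=94$, independent of the weights.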

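Theorem 4. Under condition (B3) with the reverse framework, and when the nonresponse mechanism is MAR:

(1) The variance of $\hat{Y}_R^{(3)}$ is

$$V(\hat{Y}_R^{(3)})=\sum_{i\in U}D_iB_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}B_iB_j+\sum_{i\in U}E_iB_i^2,$$

where $D_i=\frac{1-\pi_i}{\pi_i}$, $D_{ij}=\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}$, $E_i=\frac{1-p_i}{p_i}$ and $B_i=x_i-R\mathbf{u}_i'\boldsymbol{\beta}$.

(2) The estimator of $V(\hat{Y}_R^{(3)})$ is

$$\hat{V}(\hat{Y}_R^{(3)})=\sum_{i\in s}\hat{D}_i\hat{B}_i^2+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{B}_i\hat{B}_j+\sum_{i\in s}\hat{E}_i\hat{B}_i^2,$$

where $\hat{D}_i=\frac{r_iD_i}{\pi_ip_i}$, $\hat{D}_{ij}=\frac{r_iD_{ij}}{\pi_ip_i}$, $\hat{E}_i=\frac{r_iE_i}{\pi_ip_i}$ and $\hat{B}_i=x_i-\hat{R}_r\mathbf{u}_i'\hat{\boldsymbol{\beta}}_r$. If $p_i$ is unknown, it is replaced by $\hat{p}_i$, the estimator of $p_i$ from a probit or logistic regression model.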

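Proof. Let $\hat{Y}_R^{(3)}$ be defined as in (38). Because the new ratio estimator $\hat{\bar{Y}}_R^{(3)}$ is a function of the GREG estimator $\hat{\bar{X}}_{GREG}$, we use the modified automated linearization approach to transform $\hat{\bar{X}}_{GREG}$ into a simpler form, defined by

$$\hat{\bar{X}}_{GREG}^{(1)}=\bar{\mathbf{U}}'\boldsymbol{\beta}+\frac{1}{N}\sum_{i\in s}\frac{r_iC_i}{\pi_ip_i},$$

where $C_i=x_i-\mathbf{u}_i'\boldsymbol{\beta}$. Then, the new ratio estimator $\hat{\bar{Y}}_R^{(3)}$ can be approximated by

$$\hat{\bar{Y}}_{R(1)}^{(3)}\approx\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip_i}}{\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip_i}}\hat{\bar{X}}_{GREG}^{(1)}.$$

Therefore, the variance of $\hat{Y}_R^{(3)}$ can be approximated from

$$V(\hat{Y}_R^{(3)})=V(N\hat{\bar{Y}}_R^{(3)})=N^2V(\hat{\bar{Y}}_R^{(3)})\approx N^2V(\hat{\bar{Y}}_{R(1)}^{(3)}).$$

Furthermore, the estimator of $V(\hat{Y}_R^{(3)})$ can be obtained by

$$\hat{V}(\hat{Y}_R^{(3)})\approx N^2\hat{V}(\hat{\bar{Y}}_{R(1)}^{(3)}).$$

We note that $\hat{\bar{Y}}_{R(1)}^{(3)}$ is a nonlinear estimator, so we use steps (1) to (5) to investigate the value of $V(\hat{\bar{Y}}_{R(1)}^{(3)})$, which is

$$V(\hat{\bar{Y}}_{R(1)}^{(3)})\approx\frac{1}{N^2}\left[\sum_{i\in U}D_iB_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}B_iB_j+\sum_{i\in U}E_iB_i^2\right],$$

where $B_i=x_i-R\mathbf{u}_i'\boldsymbol{\beta}$. Substituting (45) into (43) gives

$$V(\hat{Y}_R^{(3)})\approx\sum_{i\in U}D_iB_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}B_iB_j+\sum_{i\in U}E_iB_i^2.$$

Furthermore, the estimator of $V(\hat{Y}_R^{(3)})$ is

$$\hat{V}(\hat{Y}_R^{(3)})=\sum_{i\in s}\hat{D}_i\hat{B}_i^2+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{B}_i\hat{B}_j+\sum_{i\in s}\hat{E}_i\hat{B}_i^2,$$

where $\hat{D}_i=\frac{r_iD_i}{\pi_ip_i}$, $\hat{D}_{ij}=\frac{r_iD_{ij}}{\pi_ip_i}$, $\hat{E}_i=\frac{r_iE_i}{\pi_ip_i}$ and $\hat{B}_i=x_i-\hat{R}_r\mathbf{u}_i'\hat{\boldsymbol{\beta}}_r$.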

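Next, we consider $\hat{Y}_R^{(3)}$ under the MCAR mechanism. The new ratio estimator of the population mean when $\bar{X}$ is unknown and nonresponse occurs in both $y$ and $x$ under the MCAR mechanism is

$$\hat{\bar{Y}}_R^{(3)}=\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_ip}}{\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip}}\hat{\bar{X}}_{GREG}=\frac{\frac{1}{N}\sum_{i\in s}\frac{r_iy_i}{\pi_i}}{\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_i}}\hat{\bar{X}}_{GREG},$$

where $\hat{\bar{X}}_{GREG}=\hat{\bar{X}}_r+\left[\bar{\mathbf{U}}-\hat{\bar{\mathbf{U}}}_r\right]'\hat{\boldsymbol{\beta}}_r$, $\hat{\bar{X}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_ix_i}{\pi_ip}$, $\hat{\bar{\mathbf{U}}}_r=\frac{1}{N}\sum_{i\in s}\frac{r_i\mathbf{u}_i}{\pi_ip}$, $\bar{\mathbf{U}}=\frac{1}{N}\sum_{i\in U}\mathbf{u}_i$ and $\hat{\boldsymbol{\beta}}_r=\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_i\mathbf{u}_i'}{\pi_ip}\right)^{-1}\left(\sum_{i\in s}\frac{r_iq_i\mathbf{u}_ix_i}{\pi_ip}\right)$.

Then, the new ratio estimator of the population total is

$$\hat{Y}_R^{(3)}=N\hat{\bar{Y}}_R^{(3)}.$$

The variance and associated estimators of $\hat{Y}_R^{(3)}$ are discussed in Lemma 5.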

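Lemma 5. Under condition (B3) with the reverse framework, and when the nonresponse mechanism is MCAR:

(1) The variance of $\hat{Y}_R^{(3)}$ is

$$V(\hat{Y}_R^{(3)})=\sum_{i\in U}D_iB_i^2+\sum_{i\in U}\sum_{j\in U\setminus\{i\}}D_{ij}B_iB_j+\sum_{i\in U}E_iB_i^2,$$

where $D_i=\frac{1-\pi_i}{\pi_i}$, $D_{ij}=\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}$, $E_i=\frac{1-p}{p}$ and $B_i=x_i-R\mathbf{u}_i'\boldsymbol{\beta}$.

(2) The estimator of $V(\hat{Y}_R^{(3)})$ is

$$\hat{V}(\hat{Y}_R^{(3)})=\sum_{i\in s}\hat{D}_i\hat{B}_i^2+\sum_{i\in s}\sum_{j\in s\setminus\{i\}}\hat{D}_{ij}\hat{B}_i\hat{B}_j+\sum_{i\in s}\hat{E}_i\hat{B}_i^2,$$

where $\hat{D}_i=\frac{r_iD_i}{\pi_ip}$, $\hat{D}_{ij}=\frac{r_iD_{ij}}{\pi_ip}$, $\hat{E}_i=\frac{r_iE_i}{\pi_ip}$ and $\hat{B}_i=x_i-\hat{R}_r\mathbf{u}_i'\hat{\boldsymbol{\beta}}_r$. The value of $p$ is defined in (35).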

Simulation studies

In this section, the performance of the proposed new ratio estimators and their variance estimators under the MAR mechanism is compared with that under the MCAR mechanism via simulation studies. We generated the study variable $y_i$ from the auxiliary variables $x_i$ and $w_i$, the size variable $k_i$ and the calibration variable $u_i$ following the model from Sichera (2020), defined by $y_i=0.2x_i+0.1w_i+2k_i+3.7k_i^{1/2}+2u_i+\varepsilon_i$, where $k_i\sim\text{gamma}(10,5)$, $w_i\sim\text{gamma}(5,10)$, $\varepsilon_i\sim N(0,1)$, $(x_i,u_i)'\sim N\left[(15,5)',\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\right]$, $\rho=0.70$, $i=1,2,\ldots,N$. Samples of four sizes, $n=100$, 200, 600 and 1,200, are drawn from a population of size $N=3{,}000$, and samples of sizes $n=10$, 20, 60 and 120 are drawn from a population of size $N=300$, using Midzuno's (1952) scheme. We consider the MAR response mechanism with two response rates, 60% and 80%, and repeated the simulation 10,000 times ($M=10{,}000$) using Program R (R Core Team, 2021). We consider the case where the response probability is unknown; it is estimated by the logistic regression model for the MAR mechanism and by $\hat{p}=\left(\sum_{i\in s}\frac{r_i}{\pi_i}\right)\left(\sum_{i\in s}\frac{1}{\pi_i}\right)^{-1}$ for the MCAR mechanism. The relative root mean square error ($RRMSE$) was used to compare the efficiency of the proposed ratio estimators and their variance estimators. Its formula is

$$RRMSE(\hat{A})=\frac{\sqrt{\frac{1}{M-1}\sum_{m=1}^{M}(\hat{A}_m-A)^2}}{A},$$

where $\hat{A}_m$ is the value of the proposed estimator or variance estimator $\hat{A}$ in the $m$-th replication and $A$ is the expectation of $\hat{A}$, $E(\hat{A})$. The results are shown in Tables 1 and 2.
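The RRMSE criterion is straightforward to compute from the replicated estimates. The short sketch below is an illustration only; the function name `rrmse` is hypothetical, and using the Monte Carlo mean of the replicates as a stand-in for $E(\hat{A})$ when no target is supplied is our assumption.

```python
import math


def rrmse(estimates, target=None):
    """Relative root mean square error over M replicated estimates.

    If target is None, the Monte Carlo mean of the replicates is used as a
    stand-in for the expectation of the estimator (an assumption of this sketch).
    """
    M = len(estimates)
    if target is None:
        target = sum(estimates) / M
    mse = sum((a - target) ** 2 for a in estimates) / (M - 1)
    return math.sqrt(mse) / target


# Three hypothetical replicates centred on 10.
value = rrmse([9.0, 10.0, 11.0])
```

For these three replicates the squared deviations sum to 2, so the value is $\sqrt{2/2}/10=0.1$.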

Table 1:
The relative root mean square error of the new ratio estimators and associated variance estimators for N = 3,000.
Response rate (%) n The relative root mean square error of the proposed estimators The relative root mean square error of the variance estimators
X¯ is known X¯ is known X¯ is unknown X¯ is unknown
PRMSE(Y^R(2)) PRMSE(Y^R(3)) PRMSE(V^(Y^R(2))) PRMSE(V^(Y^R(3)))
MAR MCAR MAR MCAR MAR MCAR MAR MCAR
60 100 0.0470 0.0472 0.0465 0.0467 0.1702 0.1703 0.2763 0.3201
200 0.0350 0.0351 0.0354 0.0361 0.1321 0.1316 0.1761 0.2069
600 0.0322 0.0324 0.0330 0.0340 0.1150 0.1158 0.1408 0.1658
1,200 0.0104 0.0105 0.0105 0.0107 0.0427 0.0697 0.0512 0.0804
80 100 0.0364 0.0373 0.0366 0.0375 0.1453 0.1490 0.1930 0.2526
200 0.0258 0.0261 0.0266 0.0269 0.1127 0.1141 0.1855 0.2048
600 0.0134 0.0139 0.0146 0.0150 0.0588 0.0614 0.1108 0.1149
1,200 0.0086 0.0089 0.0088 0.0090 0.0433 0.0552 0.0660 0.0875
DOI: 10.7717/peerj.14551/table-1
Table 2:
The relative root mean square error of the new ratio estimators and associated variance estimators with population size N = 300.
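Entries are relative root mean square errors (RRMSE) of the proposed estimators and of their variance estimators. $\hat{Y}_R^{(2)}$ is used when $\bar{X}$ is known and $\hat{Y}_R^{(3)}$ when $\bar{X}$ is unknown.

| Response rate (%) | $n$ | RRMSE($\hat{Y}_R^{(2)}$), MAR | RRMSE($\hat{Y}_R^{(2)}$), MCAR | RRMSE($\hat{Y}_R^{(3)}$), MAR | RRMSE($\hat{Y}_R^{(3)}$), MCAR | RRMSE($\hat{V}(\hat{Y}_R^{(2)})$), MAR | RRMSE($\hat{V}(\hat{Y}_R^{(2)})$), MCAR | RRMSE($\hat{V}(\hat{Y}_R^{(3)})$), MAR | RRMSE($\hat{V}(\hat{Y}_R^{(3)})$), MCAR |
|---|---|---|---|---|---|---|---|---|---|
| 60 | 10 | 0.1166 | 0.1196 | 0.1169 | 0.1198 | 0.6605 | 0.6902 | 2.2661 | 2.7669 |
| 60 | 20 | 0.1038 | 0.1056 | 0.1049 | 0.1067 | 0.5423 | 0.5452 | 1.3307 | 1.3510 |
| 60 | 60 | 0.0624 | 0.0626 | 0.0631 | 0.0634 | 0.2420 | 0.2489 | 0.3666 | 0.4646 |
| 60 | 120 | 0.0478 | 0.0479 | 0.0483 | 0.0485 | 0.1548 | 0.1829 | 0.2500 | 0.2634 |
| 80 | 10 | 0.0949 | 0.0962 | 0.0957 | 0.0968 | 0.4347 | 0.4533 | 2.2106 | 2.5923 |
| 80 | 20 | 0.0863 | 0.0869 | 0.0869 | 0.0871 | 0.3722 | 0.3736 | 1.3123 | 1.3319 |
| 80 | 60 | 0.0480 | 0.0484 | 0.0481 | 0.0485 | 0.2084 | 0.2150 | 0.3628 | 0.4144 |
| 80 | 120 | 0.0298 | 0.0301 | 0.0298 | 0.0303 | 0.1392 | 0.1514 | 0.2164 | 0.2325 |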
DOI: 10.7717/peerj.14551/table-2

The simulation results in Table 1 for $N = 3{,}000$ show that the new population total estimators under missing at random performed better than the estimators under missing completely at random, both when $\bar{X}$ is known and when it is unknown. For all estimators, the relative root mean square errors generally decreased as the response rate increased, and likewise as the sample size increased. When $\bar{X}$ is unknown and needs to be estimated, the relative root mean square errors are generally larger because of the additional estimation step. Similar results were found for the variance estimators. Similar results are also found in Table 2 for the smaller population size $N = 300$.

An application to water demand in Thailand

The new estimators are applied to estimate the water demand in Thailand. The data are from the provincial waterworks for July and August 2022. Midzuno's (1952) scheme is used to select a sample of 40 provinces from the total of 74 provinces. The demand for water in August 2022 is the study variable $y$. The two auxiliary variables $x$ and $w$ are the water supply in August 2022 and the water demand in July 2022, respectively. The variable $x$ is used to construct the new ratio estimators and the variable $w$ is used to estimate the response probabilities with the logistic regression model under the MAR mechanism. The calibration variable $u$ is the water supply in July 2022 and the size variable $k$ is the number of water users in August 2022. The nonresponse rate in this study is 7.5%.
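
The logistic-regression step for the response probabilities can be sketched as follows. This is a minimal sketch, assuming an unweighted fit with a single covariate by Newton-Raphson; the function names and the toy values of `w` and `r` are hypothetical and are not the waterworks data. With covariates as large as monthly water demand, `w` would need to be standardized first to keep the exponentials numerically stable.

```python
import math


def response_probability(b0, b1, wi):
    """Logistic model P(r = 1 | w) = 1 / (1 + exp(-(b0 + b1 * w)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * wi)))


def fit_logistic(w, r, iterations=25):
    """Maximum likelihood fit of intercept b0 and slope b1 by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for wi, ri in zip(w, r):
            p = response_probability(b0, b1, wi)
            v = p * (1.0 - p)
            # Score vector (gradient of the log-likelihood).
            g0 += ri - p
            g1 += (ri - p) * wi
            # Information matrix (negative Hessian), symmetric 2 x 2.
            h00 += v
            h01 += v * wi
            h11 += v * wi * wi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        # Newton step: solve the 2 x 2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical auxiliary values and response indicators (1 = responded).
w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
r = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(w, r)
p_hat = [response_probability(b0, b1, wi) for wi in w]
```

The fitted probabilities `p_hat` play the role of the estimated response probabilities under MAR; under MCAR a single common value would be used for every unit instead.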

Table 3 shows the estimated total water demand in Thailand in August 2022. The estimated water demand when $\bar{X}$ is known is higher than when $\bar{X}$ is unknown under both the MAR and MCAR nonresponse mechanisms. In contrast, the variance estimates when $\bar{X}$ is unknown are roughly nine times as large as those when $\bar{X}$ is known, owing to the estimation of the unknown population mean of the auxiliary variable. The new estimators can be useful in real-world applications where nonresponse occurs and must be handled before estimation and further analysis.

Table 3:
The total estimates of water demand in August 2022.
| Nonresponse mechanism | Information on the auxiliary variable | Estimated water demand | Variance estimate |
|---|---|---|---|
| MAR | $\bar{X}$ is known | 122,763,533 | 5,621,837,076,813 |
| MAR | $\bar{X}$ is unknown | 112,079,391 | 49,945,263,902,570 |
| MCAR | $\bar{X}$ is known | 122,752,240 | 5,958,276,721,564 |
| MCAR | $\bar{X}$ is unknown | 111,926,154 | 52,711,445,906,451 |
DOI: 10.7717/peerj.14551/table-3
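
To put the variance estimates in Table 3 on the same scale as the totals, they can be converted to relative standard errors (standard error divided by the estimate). The short sketch below does only this arithmetic on the published figures; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
import math

# (mechanism, status of the auxiliary mean) -> (estimated total, variance estimate),
# copied from Table 3.
table3 = {
    ("MAR", "known"): (122_763_533, 5_621_837_076_813),
    ("MAR", "unknown"): (112_079_391, 49_945_263_902_570),
    ("MCAR", "known"): (122_752_240, 5_958_276_721_564),
    ("MCAR", "unknown"): (111_926_154, 52_711_445_906_451),
}


def relative_standard_error(estimate, variance):
    """Standard error (square root of the variance) relative to the estimate."""
    return math.sqrt(variance) / estimate


rse = {key: relative_standard_error(*value) for key, value in table3.items()}

# Ratio of the variance estimates, unknown versus known auxiliary mean, under MAR.
variance_ratio_mar = table3[("MAR", "unknown")][1] / table3[("MAR", "known")][1]

for key, value in rse.items():
    print(key, f"{100 * value:.2f}%")
```

On this scale the cost of estimating the auxiliary mean is easier to compare across the four rows than with the raw variance figures.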

Figure 1 shows the conclusion for all the cases of the simulation studies in an empirical study.

Figure 1: The conclusion for all the cases of the simulation studies in an empirical study.

Conclusions

New ratio estimators for the population total and population mean are proposed under UPWOR for the case where data are missing at random in both the study and auxiliary variables, when the population mean of an auxiliary variable is either known or unknown. In the latter case we suggest estimating it from other variables using the GREG estimator. The efficiency of the new ratio estimators under the MAR and MCAR nonresponse mechanisms is compared through simulation studies and an empirical study using water demand data in Thailand. The results show that the new ratio estimators under the MAR mechanism are more efficient than the ratio estimators under the MCAR mechanism for all response rates and sample sizes. The proposed estimators are applied to estimate the demand for water, and this information can be used to plan policies and strategies for preventing water shortages that may occur in the future. The proposed estimators are more useful in practice than the estimators proposed by Ponkaew & Lawson (2018), which were considered only under MCAR with only the study variable missing and which also required the population mean of the auxiliary variable to be known, a quantity that is difficult to obtain. The proposed estimators are also more flexible in real-life applications because they can be used whether the nonresponse mechanism is uniform or nonuniform, the latter being more likely to occur in real-world problems. If the population mean of the auxiliary variable is unknown, it can be estimated using the GREG estimator, which uses related variables in the estimation process to improve the efficiency of the population total estimators. The new estimators could be extended to complex survey designs such as stratified cluster sampling and studied under the not missing at random (NMAR) nonresponse mechanism.

Supplemental Information

Code for Simulation Studies.

DOI: 10.7717/peerj.14551/supp-1

Code for water demand data.

DOI: 10.7717/peerj.14551/supp-2