
Estimation of finite population mean for a sensitive variable using dual auxiliary information in the presence of measurement errors

  • Erum Zahid

    Roles Data curation, Formal analysis, Funding acquisition, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    erumzahid22@gmail.com

    Affiliation Department of Statistics, Quaid-i-Azam University, Islamabad, Pakistan

  • Javid Shabbir

    Roles Investigation, Project administration, Supervision, Visualization, Writing – review & editing

    Affiliation Department of Statistics, Quaid-i-Azam University, Islamabad, Pakistan

Abstract

In this study, we propose a new improved estimator of the population mean of a sensitive variable in the presence of measurement error under simple and stratified random sampling. The estimator uses the auxiliary information as well as the ranks of the auxiliary variable. Theoretical and numerical studies show that the proposed estimator performs better than the existing estimators considered in the study.

1 Introduction

In survey sampling, it is usually assumed that all observations on the characteristics under study are recorded carefully, so that the information obtained is error free. In practice this assumption often fails for many reasons, including non-response, which may arise because respondents refuse to give the information, are not at home, lack interest, or find the issue sensitive. In analysis, a basic assumption is that all observations are measured correctly. In the multiple regression model, it is assumed that all observations on the study variable and the auxiliary variable are observed without error. In many situations these assumptions are violated for the following reasons. (i) Some variables are qualitative and hard to measure (e.g., intelligence, taste, ability, climate, education, poverty), so dummy variables are used and observations are recorded in terms of the values of the dummy variables. (ii) Some variables are clearly defined but correct observations are hard to obtain (e.g., age is under-reported or over-reported in completed years). (iii) Some variables are conceptually well defined but cannot be observed correctly, so observations are taken on closely related variables instead (e.g., level of education is measured by the number of years of schooling). In all of these cases it is not possible to obtain the true value of the variable; it is recorded with error. Measurement error (ME) is the difference between the observed and the true value, and it also arises from the use of an imperfect measure of the true values of the variables. Suppose we are interested in the average level of anxiety among students. We take a random sample of students, measure their level of anxiety, and calculate the sample mean. The normality assumption says that if this process is repeated many times and the sample means are plotted, their distribution will be normal. Measurement error and randomized response are usually studied separately using known auxiliary information. In reality, when the variable of interest is sensitive, respondents hesitate to provide personal information, which gives rise to measurement error.

A few researchers have discussed the problem of measurement error in estimating the population mean. [1] discussed some important sources of measurement error in survey data. [2] studied estimation of the population mean in the presence of measurement error for ratio-product type estimators. [3] and [4] presented the ratio method of estimation in the presence of measurement error. This work was further extended by [5]. [6], [7] and [8] studied measurement error and non-response together. [9] suggested an estimator of the population mean in the presence of measurement error and non-response under stratified random sampling.

In survey sampling, when the variable of interest is sensitive, respondents hesitate to provide their personal information. A direct survey on a sensitive question increases the relative bias. [10] introduced the randomized response technique (RRT), which reduces the possible bias and is used to obtain the true information while ensuring the privacy of the respondents. The randomized response model (RRM) was extended by [11] to the estimation of the mean of a sensitive quantitative variable. [12] introduced the scrambled randomized response method. [13] proposed the optional RRT method, and this work was further extended by [14]. [15] used the scrambled response technique to estimate the population mean when the coefficient of variation is known. [16] used empirical Bayes estimation for a sensitive variable. [17] studied the estimation of the population mean of a sensitive variable in the presence of nonsensitive auxiliary information. [18] and [19] studied improved estimation of the population mean in simple and stratified random sampling.

When the correlation between the study variable and the auxiliary variable is sufficiently high, the ranks of the auxiliary variable are also correlated with the study variable, and consequently the precision of the estimator increases. [20] suggested using the ranks of the auxiliary variable to obtain more efficient estimates. Little literature is available on estimating the population mean of a sensitive variable in the presence of measurement error based on the dual use of the auxiliary information.
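The link between the auxiliary variable, its ranks and the study variable can be checked numerically. The following is a minimal sketch on a hypothetical bivariate normal population; the population size, the correlation of 0.8 and all variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bivariate normal population with corr(X, Y) = 0.8 (assumed value).
N = 1000
cov = [[1.0, 0.8], [0.8, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

# Ranks of X: 1 for the smallest value, N for the largest.
# Ties have probability zero for continuous data, so a double argsort suffices.
rank_x = np.argsort(np.argsort(x)) + 1

corr_xy = np.corrcoef(x, y)[0, 1]
corr_ry = np.corrcoef(rank_x, y)[0, 1]
print(f"corr(X, Y)  = {corr_xy:.3f}")
print(f"corr(Rx, Y) = {corr_ry:.3f}")
```

In this sketch the ranks carry most of the linear association that X has with Y, which is why they can serve as a second source of auxiliary information.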

The paper is organized as follows. Section 2 presents the existing estimators and the proposed improved estimator of the population mean of a sensitive variable in the presence of measurement error under simple random sampling, together with theoretical and numerical comparisons. In Section 3, some existing estimators are reviewed and an improved class of estimators is suggested for estimating the finite population mean by incorporating measurement error and sensitive information simultaneously under stratified random sampling; efficiency comparisons, numerical results and a simulation study are also presented there. The conclusion is given in Section 4.

2 Estimators under simple random sampling

Let Ω = {Ω1, Ω2, …, ΩN} be a finite population of size N. Suppose that a sample of size n is drawn from Ω by simple random sampling without replacement. Let Y be the sensitive study variable, which is not observed directly, and let X be the non-sensitive auxiliary variable, which is positively correlated with Y. Let Rx be the ranks of the auxiliary variable X. Let S be a scrambling variable which is independent of Y and X. We assume that S has zero mean and finite variance. The respondent is asked to give a scrambled response for the study variable Y, given by Z = Y + S, and in addition to provide a true response for X.

Let (xi, rx,i, yi, zi) be the observed values and (Xi, Rx,i, Yi, Zi) be the actual values of the variables (X, Rx, Y, Z), respectively. The measurement errors are then Vi = xi − Xi, Ui = zi − Zi and Ti = rx,i − Rx,i. These measurement errors are assumed to be uncorrelated and normally distributed with zero mean, each with its own variance. The population variances of X, Z and Rx are defined in the usual way, and ρXZ, ρZRx and ρXRx denote the coefficients of correlation between the variables in their subscripts.
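The setup can be made concrete with a small simulation. The sketch below generates a hypothetical population, scrambles the sensitive variable additively, draws a simple random sample without replacement, and adds zero-mean normal measurement errors to the sampled values. All numerical values (means, variances, N and n) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population; sizes, means and variances are assumed for illustration.
N, n = 1000, 100
X = rng.normal(50.0, 10.0, size=N)              # true auxiliary values
Y = 2.0 + 0.5 * X + rng.normal(0.0, 3.0, N)     # true sensitive values
S = rng.normal(0.0, 1.0, size=N)                # scrambling variable with E(S) = 0
Z = Y + S                                       # scrambled response Z = Y + S
Rx = np.argsort(np.argsort(X)) + 1.0            # true ranks of X

# Simple random sample without replacement.
idx = rng.choice(N, size=n, replace=False)

# Zero-mean normal measurement errors on the sampled units.
U = rng.normal(0.0, 1.0, size=n)   # error on the scrambled response
V = rng.normal(0.0, 1.0, size=n)   # error on the auxiliary variable
T = rng.normal(0.0, 1.0, size=n)   # error on the rank

z = Z[idx] + U      # observed z_i, so that U_i = z_i - Z_i
x = X[idx] + V      # observed x_i, so that V_i = x_i - X_i
rx = Rx[idx] + T    # observed rank, so that T_i = r_x,i - R_x,i

print("population mean of Y:     ", Y.mean())
print("population mean of Z:     ", Z.mean())
print("sample mean of observed z:", z.mean())
```

Because S has zero mean, the population means of Z and Y nearly coincide, which is what allows the scrambled responses to be used for estimating the mean of Y.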

2.1 Existing estimators in literature

In this section we consider the following existing estimators.

Mean estimator.

The usual unbiased mean per unit estimator is given by (1), where the sample mean of the observed scrambled responses is given in Eq (12). The variance of this estimator is given by (2), where λ = (1/n − 1/N).

Ratio estimator.

The traditional ratio estimator, is given by, (3) where is the sample mean (see Eq (13)) and is known population mean. The bias and mean square error of to first degree of approximation, are given by (4) and (5) where .

Difference estimator.

The usual difference estimator is given by, (6) where d is the constant, whose value is to be determined optimally. The minimum variance of , is given by (7) where optimum value of d is .
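The display equations for these three estimators are not reproduced in this text. As a hedged illustration, the sketch below implements their standard textbook forms applied to the observed scrambled responses: the sample mean, the ratio estimator (sample mean times the ratio of the known population mean of X to the sample mean of x), and the difference estimator with a user-supplied constant d. The function names are hypothetical, and the code makes no claim to match Eqs (1), (3) and (6) symbol for symbol.

```python
import numpy as np

def mean_estimator(z):
    """Sample mean of the observed scrambled responses."""
    return float(np.mean(z))

def ratio_estimator(z, x, X_bar):
    """Textbook ratio estimator: z_bar * X_bar / x_bar."""
    return float(np.mean(z) * X_bar / np.mean(x))

def difference_estimator(z, x, X_bar, d):
    """Textbook difference estimator: z_bar + d * (X_bar - x_bar)."""
    return float(np.mean(z) + d * (X_bar - np.mean(x)))

def srswor_factor(n, N):
    """The factor 1/n - 1/N used under simple random sampling without replacement."""
    return 1.0 / n - 1.0 / N

# Toy data (illustrative values only).
z = [4.0, 6.0, 8.0]
x = [1.0, 2.0, 3.0]
X_bar = 2.5

print(mean_estimator(z))
print(ratio_estimator(z, x, X_bar))
print(difference_estimator(z, x, X_bar, d=2.0))
print(srswor_factor(100, 1000))
```

In practice d would be set to its optimum value, which depends on population quantities that must be known or estimated.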

Khalil estimator.

Recently [21] proposed the generalized randomized response estimator, given by, (8) where and ;

k and g are constants, and ϕ is assumed to be an unknown constant which is determined optimally as . Also α(≠0) and γ are assumed to be some known parameters of the auxiliary variable X. The bias and minimum MSE of to first degree approximation, are given by (9) and (10) which is exactly equal to the variance of the difference estimator , but is preferable over because of unbiasedness.

2.2 The proposed estimator

We propose an improved randomized response estimator for estimating the population mean of the sensitive variable, dealing with the problem of measurement error. Measurement error is considered on both the study and the auxiliary variables. A scrambled response of Y is observed in form of Z = Y + S, where S is distributed as . The proposed estimator, is given by (11) where, m1 and m2 are constants whose values are to be determined. For obtaining the bias and mean square error, we assume that

Adding δZ and δU, we get

Dividing both sides by n, and then simplifying, we get (12)

Similarly, we can write (13) and (14)
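The definitions of δZ and δU are not reproduced in this text. A sketch of how a relation like Eq (12) can arise, under the assumption (common in the measurement-error literature, but not confirmed by the source) that δZ is the sum of the deviations of the true sampled values from the population mean and δU is the sum of the measurement errors:

```latex
% Assumed definitions: \delta_Z = \sum_{i=1}^{n}(Z_i - \bar{Z}), \quad
%                      \delta_U = \sum_{i=1}^{n} U_i, \quad U_i = z_i - Z_i.
\begin{align*}
\delta_Z + \delta_U
  &= \sum_{i=1}^{n}(Z_i - \bar{Z}) + \sum_{i=1}^{n}(z_i - Z_i)
   = \sum_{i=1}^{n} z_i - n\bar{Z}, \\
\frac{1}{n}\,(\delta_Z + \delta_U)
  &= \bar{z} - \bar{Z}
  \quad\Longrightarrow\quad
  \bar{z} = \bar{Z} + \frac{1}{n}\,(\delta_Z + \delta_U).
\end{align*}
```

Under the same assumption, the analogous steps for the auxiliary variable and its ranks would express their sample means as the corresponding population mean plus a sampling term and a measurement-error term.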

Let , and . In order to get the bias and MSE of the proposed estimator, we consider the following relative error terms:

Let , , , E(ej) = 0, j = 0, 1, 2. , , , , and

Solving Eq (11) in terms of errors, we have (15)

Further simplifying, and keeping the terms up to power 2, we have (16)

On the lines of [22] and [23], we use the approximation method to derive the MSE of our proposed estimator in simple and stratified random sampling. The signal to noise ratio can easily be obtained by using the expression . Using above equation the bias of , is given by (17)

Squaring and taking expectation in Eq (16), we have (18)

The optimum values of m1 and m2 are (19) and (20)

Substitute the optimum values of m1 and m2 in Eq (18), we get the minimum MSE of , given by (21) where,

2.3 Efficiency comparison

We compare the proposed estimator with the mean, ratio, difference and Khalil estimators as follows:

  1. From Eqs (2) and (21)
    , if
  2. From Eqs (5) and (21)
    , if
  3. From Eqs (7) and (21)
    , if
  4. From Eqs (10) and (21)
    , if

The proposed estimator is more efficient than the existing estimators when Conditions 1 to 4 above are satisfied.

2.4 Numerical results

In this section two populations are generated for simulation study and two are based on real data sets.

2.4.1 Simulation study.

We generated two populations of size 1,000 from the multivariate normal distribution with different covariance matrices. The results of the simulation are given in Tables 1 and 2. The population means and covariance matrices are given below:

  1. Population I and ρXY = 0.8820, and
  2. Population II and ρXY = 0.5897, and

Table 1. MSE of different estimators for Population I under simulation.

https://doi.org/10.1371/journal.pone.0212111.t001

Table 2. MSE of different estimators for Population II under simulation.

https://doi.org/10.1371/journal.pone.0212111.t002

The covariance matrices describe the joint distribution of the sensitive variable Y, the auxiliary variable X and the ranks of the auxiliary variable Rx. The correlation is high in Population I and weak in Population II. The scrambling response S is distributed as N(0, 0.01σX). The response variable is Z = Y + S. We estimate the MSE using k = 1000 samples of each size selected from each population. Three sample sizes, n = 100, 150 and 200, are taken from both populations. For each estimator i = 0, R, D, K, P, the empirical MSE is the average over the k samples of the squared deviation of the estimate from the population mean.
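The Monte Carlo procedure described here can be sketched as follows. The example assumes the usual empirical MSE, the mean of the squared deviations of the replicated estimates from the true population mean, and applies it to the mean and textbook ratio estimators only. The population parameters are illustrative and are not the paper's covariance matrices; measurement error is omitted to keep the sketch short.

```python
import numpy as np

def empirical_mse(estimates, true_mean):
    """Average squared deviation of replicated estimates from the true mean."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_mean) ** 2))

rng = np.random.default_rng(2)

# Hypothetical population (assumed values, for illustration only).
N, n, k = 1000, 100, 1000
X = rng.normal(5.0, 1.0, size=N)
Y = 1.0 + 0.9 * X + rng.normal(0.0, 0.5, size=N)
Z = Y + rng.normal(0.0, 0.1, size=N)     # scrambled response Z = Y + S
Y_bar, X_bar = Y.mean(), X.mean()

mean_est, ratio_est = [], []
for _ in range(k):
    idx = rng.choice(N, size=n, replace=False)    # one SRSWOR sample
    z_bar, x_bar = Z[idx].mean(), X[idx].mean()
    mean_est.append(z_bar)                        # mean estimator
    ratio_est.append(z_bar * X_bar / x_bar)       # textbook ratio estimator

mse_mean = empirical_mse(mean_est, Y_bar)
mse_ratio = empirical_mse(ratio_est, Y_bar)
print(f"empirical MSE, mean estimator:  {mse_mean:.5f}")
print(f"empirical MSE, ratio estimator: {mse_ratio:.5f}")
```

The same loop extends to any other estimator by appending its value for each replicate and passing the resulting list to `empirical_mse`.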

Tables 1 and 2 show that the proposed estimator performs better than all the existing estimators for both populations. The MSE of the proposed estimator is smaller for Population I than for Population II because the correlation between the variables is higher in Population I. As the sample size increases, the MSE of every estimator decreases. The MSEs of the difference estimator and the Khalil estimator are the same, but the difference estimator is preferable because of its unbiasedness.

2.4.2 Application to real data.

In this section we have considered two data sets for numerical comparisons. Both data sets consist of 654 observations. The data summary is given below (see Tables 3 and 4) and results are given in Tables 5 and 6.

Table 5. MSE of different estimators for Population III under real data.

https://doi.org/10.1371/journal.pone.0212111.t005

Table 6. MSE of different estimators for Population IV under real data.

https://doi.org/10.1371/journal.pone.0212111.t006

  1. Population III (Source: [24])
  2. Population IV (Source: [24])

In both populations the study and the auxiliary variables are identical, but the scrambling responses are different. The correlation coefficient between X and Y is ρXY = 0.7564 in both populations. In Populations III and IV, smoke (No = 0, Yes = 1) and sex (Female = 0, Male = 1) are taken as the scrambling responses, respectively.

Tables 5 and 6 show that the proposed estimator is more efficient than all the other estimators considered in both Populations (III and IV). The MSEs of the difference estimator and the Khalil estimator are equal, but the difference estimator is preferable because of its unbiasedness.

3 Estimators under stratified random sampling

Consider a finite population of N identifiable units partitioned into L homogeneous subgroups called strata, such that the hth stratum consists of Nh units, where h = 1, 2, …, L and N1 + N2 + … + NL = N. Let Yh be the sensitive variable, which is not observed directly, and let Xh be the non-sensitive auxiliary variable, which is positively correlated with Yh. Let Rx,h be the ranks of the auxiliary variable Xh and Sh be a scrambling variable which is independent of Yh and Xh. Sh has zero mean and finite variance. The respondent is asked to give a scrambled response for the study variable Yh, given by Zh = Yh + Sh, and additionally to provide a true response for Xh.

A simple random sample of size nh is drawn without replacement from the hth stratum such that n1 + n2 + … + nL = n. Let (xhi, rx,hi, yhi, zhi) be the observed values and (Xhi, Rx,hi, Yhi, Zhi) be the actual values of the variables (Xh, Rx,h, Yh, Zh) for the ith (i = 1, 2, …, nh) sampled unit in the hth stratum. The measurement errors are then Vhi = xhi − Xhi, Uhi = zhi − Zhi and Thi = rx,hi − Rx,hi. These measurement errors are assumed to be uncorrelated and normally distributed with zero mean, each with its own variance. The population variances of Xh, Zh and Rx,h are defined in the usual way, and ρhXZ, ρhZRx and ρhXRx denote the coefficients of correlation between the variables in their subscripts.

3.1 Existing estimators in literature

In this section we consider the following existing estimators.

Mean estimator.

The usual unbiased mean per unit estimator, is given by (22) where is the known stratum weight and is the mean of the sensitive variable Zh in the stratum h, (see Eq (33)). The variance of , is given by (23) where .
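As an illustration of the stratified mean per unit estimator, the sketch below computes the weighted sum of the stratum sample means of the scrambled responses. It assumes the standard stratum weight Wh = Nh/N, which the text describes only as "the known stratum weight"; the function name and the toy data are hypothetical.

```python
import numpy as np

def stratified_mean(samples, stratum_sizes):
    """Weighted mean sum_h W_h * z_bar_h, assuming W_h = N_h / N."""
    N_h = np.asarray(stratum_sizes, dtype=float)
    W_h = N_h / N_h.sum()                                  # stratum weights
    z_bar_h = np.array([np.mean(s) for s in samples])      # stratum sample means
    return float(np.sum(W_h * z_bar_h))

# Toy example with two strata (illustrative values only).
samples = [[2.0, 4.0], [10.0, 12.0, 14.0]]   # observed scrambled responses per stratum
stratum_sizes = [400, 600]                   # N_1 and N_2

estimate = stratified_mean(samples, stratum_sizes)
print(estimate)   # weights 0.4 and 0.6 applied to stratum means 3 and 12
```

The stratum sample sizes enter only through the stratum means, so unequal nh are handled without any change to the function.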

Ratio estimator.

The traditional ratio estimator, is given by (24) where is the known population mean and is the sample mean of the auxiliary variable in stratum h, (see Eq (34)). The bias and mean square error of , are given by (25) and (26) where .

Difference estimator.

The usual difference estimator, is given by (27) where dh is the constant, whose value is to be determined optimally. The minimum variance of , is given by (28) where .

Khalil randomized response estimator.

[21] proposed the estimator, which is given by, (29) where and ;

k and g are constants, and ϕh is assumed to be an unknown constant whose value is to be determined from optimality considerations . Also αh(≠0) and γh are assumed to be some known parameters of the auxiliary variable X. The bias and minimum MSE of , are given by (30) and (31) which is exactly equal to the variance of the difference estimator , but is preferable over because of unbiasedness.

3.2 The proposed estimator

An improved randomized response estimator for estimating the population mean of a sensitive variable in the presence of measurement error is proposed. A scrambling response of Yh is observed in the form of Zh = Yh + Sh, where Sh is distributed as . The suggested estimator is given by (32) where, m1h and m2h are constants whose values are to be determined. and are the population mean and sample mean of the ranked of the auxiliary variable, respectively(see Eq (35)). For obtaining the bias and mean square error, we define:

Adding δhZ and δhU, we get

Dividing both sides by nh, and then simplifying, we get (33)

Similarly, we can get (34) and (35)

Let , and .

In order to get the bias and MSE of the suggested estimator, we consider the following relative error terms:

Let, , , , E(ejh) = 0, j = 0, 1, 2. , , , , and .

Using Eq (32) in terms of errors, we have (36)

Further simplifying, and keeping the terms up to power 2, we have (37)

Using above equation, the bias of , is given by (38)

Squaring and then taking expectations of Eq (37), we have (39)

From Eq (39), the optimum values of m1h and m2h are (40) and (41)

Substitute the optimum values of m1h and m2h in Eq (39), the minimum MSE is given by (42) where,

3.3 Efficiency comparison

The efficiency comparisons of the proposed estimator with the mean, ratio, difference and Khalil estimators are as follows:

  1. From Eqs (23) and (42)
    , if
  2. From Eqs (26) and (42)
    , if
  3. From Eqs (28) and (42)
    , if
  4. From Eqs (31) and (42)
    , if

The proposed class of estimators is more efficient than the existing estimators when Conditions 1 to 4 above are satisfied.

3.4 Numerical results

In this section two populations are generated for the simulation study and one real data set is used.

3.4.1 Simulation study.

We generated two populations of size 1,000 from the multivariate normal distribution with different covariance matrices. The results are given in Tables 7 and 8. The means and covariance matrices are given below:

  1. Population V. and
    N1 = 500 and N2 = 500,
    ρ1XY = 0.8554, and
    ρ2XY = 0.8797, and
  2. Population VI. and
    N1 = 400 and N2 = 600
    ρ1XY = 0.7172, and
    ρ2XY = 0.7592, and

Table 7. MSE of different estimators for Population V under simulation.

https://doi.org/10.1371/journal.pone.0212111.t007

Table 8. MSE of different estimators for Population VI under simulation.

https://doi.org/10.1371/journal.pone.0212111.t008

The covariance matrices describe the joint distribution of the sensitive variable Yh, the auxiliary variable Xh and the ranks of the auxiliary variable Rx,h. Population V consists of two equal strata and Population VI comprises two unequal strata. The correlation among the variables is high in Population V and lower in Population VI. The response variable is Zh = Yh + Sh, where Sh is the scrambling response. We estimate the MSE using kh = 1000 samples of each size selected from each stratum. Three different sample sizes, 10%, 15% and 20%, are taken for both populations. For each estimator i = 0, R, D, K, P, the empirical MSE is the average over the samples of the squared deviation of the estimate from the population mean.
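A stratified version of the Monte Carlo study can be sketched as below. The sketch assumes that the 10%, 15% and 20% sample sizes are applied within each stratum (proportional allocation), uses the weighted stratum mean as the only estimator, and omits measurement error. The two-stratum population and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical two-stratum population (assumed sizes, means and variances).
N_h = [400, 600]
strata = []
for size, mu in zip(N_h, [10.0, 20.0]):
    Y = rng.normal(mu, 2.0, size=size)      # sensitive variable in the stratum
    S = rng.normal(0.0, 0.2, size=size)     # scrambling variable with zero mean
    strata.append({"Y": Y, "Z": Y + S})     # scrambled response Z_h = Y_h + S_h

W_h = np.array(N_h) / sum(N_h)              # stratum weights, assumed N_h / N
Y_bar = sum(w * s["Y"].mean() for w, s in zip(W_h, strata))

def mc_mse(fraction, k=1000):
    """Empirical MSE of the stratified mean over k replicated samples."""
    n_h = [int(round(fraction * size)) for size in N_h]
    sq_err = np.empty(k)
    for j in range(k):
        est = 0.0
        for w, s, n in zip(W_h, strata, n_h):
            idx = rng.choice(len(s["Z"]), size=n, replace=False)  # SRSWOR in stratum
            est += w * s["Z"][idx].mean()
        sq_err[j] = (est - Y_bar) ** 2
    return float(sq_err.mean())

results = {f: mc_mse(f) for f in (0.10, 0.15, 0.20)}
for f, mse in results.items():
    print(f"sampling fraction {f:.0%}: empirical MSE = {mse:.5f}")
```

With these assumed settings the empirical MSE shrinks as the sampling fraction grows, which is the pattern the section reports for increasing sample size.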

Tables 7 and 8 show that the proposed estimator performs better than the mean, ratio, difference and Khalil estimators. The efficiency of the estimator improves when there is sufficient correlation between the study variable and the auxiliary variable. As the sample size increases, the MSE values decrease. Because the MSEs of the difference and Khalil estimators are equal, their numerical results are identical for both populations.

3.4.2 Application to real data.

In this section we consider a real-life data set for numerical comparisons. Stratum I consists of 318 observations and Stratum II contains 336 observations. The data summary is given below (see Tables 9 and 10). The results are given in Table 11.

  1. Population VII. (Source: [24])
    ρ1XY = 0.7564 and ρ2XY = 0.8109

Table 11. MSE of different estimators for Population VII under real data.

https://doi.org/10.1371/journal.pone.0212111.t011

Table 11 shows that the proposed estimator performs better than the mean, ratio, difference and Khalil estimators. The difference and Khalil estimators have the same MSE, but the difference estimator is preferable because of its unbiasedness. As the sample size increases the MSE values decrease, as expected.

4 Conclusion

In the present paper, we have proposed a new improved estimator of the finite population mean that incorporates additional information on the auxiliary variable as well as on the ranks of the auxiliary variable in the presence of measurement error under simple and stratified random sampling. The simulation study and the real-life data sets (see Tables 1, 2, 5, 6, 7, 8 and 11) show that the proposed estimators under both sampling designs perform better than the existing estimators, particularly when there is sufficient correlation between the study variable and the auxiliary variable. It is also concluded that the difference estimator and the estimator of [21] are equally efficient, but the difference estimator is preferable because of its unbiasedness.

Supporting information

S1 File. Data used in the manuscript “S1_File.csv”.

https://doi.org/10.1371/journal.pone.0212111.s001

(CSV)

Acknowledgments

The authors are grateful to anonymous referees for their valuable comments and feedback.

References

  1. Cochran WG. Errors of measurement in statistics. Technometrics. 1968;10(4):637–666.
  2. Fuller WA. Estimation in the presence of measurement error. International Statistical Review/Revue Internationale de Statistique. 1995; p. 121–141.
  3. Shalabh S. Ratio method of estimation in the presence of measurement errors. Jour Ind Soc Agri Statist. 1997;52:150–155.
  4. Biemer PP, Groves RM, Lyberg LE, Mathiowetz NA, Sudman S. Measurement Errors in Surveys. John Wiley & Sons; 2011.
  5. Shukla D, Pathak S, Thakur N. An estimator for mean estimation in presence of measurement error. Research and Reviews: A Journal of Statistics. 2012;1(1):1–8.
  6. Singh RS, Sharma P. Method of Estimation in the Presence of Non-response and Measurement Errors Simultaneously. Journal of Modern Applied Statistical Methods. 2015;14(1):12.
  7. Kumar S. Improved estimation of population mean in presence of nonresponse and measurement error. Journal of Statistical Theory and Practice. 2016;10(4):707–720.
  8. Azeem M, Hanif M. Joint influence of measurement error and non response on estimation of population mean. Communications in Statistics-Theory and Methods. 2017;46(4):1679–1693.
  9. Zahid E, Shabbir J. Estimation of population mean in the presence of measurement error and non response under stratified random sampling. PLoS ONE. 2018;13(2):e0191572. pmid:29401519
  10. Warner SL. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association. 1965;60(309):63–69. pmid:12261830
  11. Greenberg BG, Kuebler RR Jr, Abernathy JR, Horvitz DG. Application of the randomized response technique in obtaining quantitative data. Journal of the American Statistical Association. 1971;66(334):243–250.
  12. Eichhorn BH, Hayre LS. Scrambled randomized response methods for obtaining sensitive quantitative data. Journal of Statistical Planning and Inference. 1983;7(4):307–316.
  13. Gupta S, Shabbir J. Sensitivity estimation for personal interview survey questions. Statistica. 2004;64(4):643–653.
  14. Gupta S, Shabbir J, Sehra S. Mean and sensitivity estimation in optional randomized response models. Journal of Statistical Planning and Inference. 2010;140(10):2870–2874.
  15. Singh HP, Mathur N. Estimation of population mean when coefficient of variation is known using scrambled response technique. Journal of Statistical Planning and Inference. 2005;131(1):135–144.
  16. Chaudhuri A, Pal S. On efficacy of empirical Bayes estimation of a finite population mean of a sensitive variable through randomized responses. Model Assisted Statistics and Applications. 2015;10(4):283–288.
  17. Gupta S, Shabbir J, Sousa R, Corte-Real P. Improved Exponential Type Estimators of the Mean of a Sensitive Variable in the Presence of Nonsensitive Auxiliary Information. Communications in Statistics-Simulation and Computation. 2016;45(9):3317–3328.
  18. Shabbir J, Gupta S. On estimating finite population mean in simple and stratified random sampling. Communications in Statistics-Theory and Methods. 2010;40(2):199–212.
  19. Haq A, Shabbir J. Improved family of ratio estimators in simple and stratified random sampling. Communications in Statistics-Theory and Methods. 2013;42(5):782–799.
  20. Haq A, Khan M, Hussain Z. A new estimator of finite population mean based on the dual use of the auxiliary information. Communications in Statistics-Theory and Methods. 2017;46(9):4425–4436.
  21. Khalil S, Gupta S, Hanif M. Estimation of finite population mean in stratified sampling using scrambled responses in the presence of measurement errors. Communications in Statistics-Theory and Methods. 2018; p. 1–9.
  22. Cochran WG. Sampling Techniques: 3d Ed. Wiley New York; 1977.
  23. Sukhatme PV, Sukhatme B, Sukhatme S, Asok C. Sampling theory with applications. Indian Society of Agricultural Statistics, New Delhi & Iowa State University Press, Ames, USA. 1984.
  24. Rosner B. Fundamentals of Biostatistics. 2006. Duxbury Press. 2015.