Abstract
Consider estimation of a population mean of a response variable when the observations are missing at random with respect to the covariate. Two common approaches to imputing the missing values are the nonparametric regression weighting method and the Horvitz-Thompson (HT) inverse weighting approach. The regression approach includes kernel regression imputation and nearest neighbor imputation. The HT approach, which employs inverse kernel-estimated weights, includes the basic estimator, the ratio estimator, and the estimator using inverse kernel-weighted residuals. Asymptotic normality of the nearest neighbor imputation estimators is derived and compared with that of the kernel regression imputation estimator under standard regularity conditions on the regression function and the missing pattern function. A comprehensive simulation study shows that the basic HT estimator is the most sensitive to discontinuity in the missing data patterns, and that the nearest neighbor estimators can be insensitive to missing data patterns that are unbalanced with respect to the distribution of the covariate. Empirical studies show that nearest neighbor imputation is the most effective of these imputation methods for estimating a finite population mean and for classifying the species of the iris flower data.
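To make the estimators named in the abstract concrete, the following is a minimal sketch for a single scalar covariate. It is not the authors' implementation: the function names, the Gaussian kernel, the fixed bandwidth, and the choice of k are all assumptions made for illustration, and the residual-based HT estimator is omitted. The basic HT estimator divides the inverse-weighted sum of observed responses by the sample size, while the ratio estimator divides it by the sum of the inverse weights.

```python
import numpy as np

def nn_impute_mean(x, y, observed, k=1):
    """Mean of y after replacing each missing response with the average
    response of its k nearest observed neighbors in the covariate x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    x_obs, y_obs = x[observed], y[observed]
    filled = y.copy()
    for i in np.flatnonzero(~observed):
        nearest = np.argsort(np.abs(x_obs - x[i]))[:k]
        filled[i] = y_obs[nearest].mean()
    return filled.mean()

def kernel_propensity(x, observed, bandwidth):
    """Kernel (Nadaraya-Watson) estimate of P(response observed | x),
    using a Gaussian kernel (an illustrative assumption)."""
    x = np.asarray(x, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    scaled = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * scaled**2)
    return weights @ observed.astype(float) / weights.sum(axis=1)

def ht_means(x, y, observed, bandwidth):
    """Return (basic, ratio) inverse-weighted estimates of the mean of y."""
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    p_hat = kernel_propensity(x, observed, bandwidth)
    inv_w = np.zeros(len(y))
    inv_w[observed] = 1.0 / p_hat[observed]   # missing units get weight 0
    y_safe = np.where(observed, y, 0.0)       # ignore NaN placeholders
    weighted_sum = np.sum(inv_w * y_safe)
    return weighted_sum / len(y), weighted_sum / np.sum(inv_w)

# Hypothetical toy data: the response is missing for the third and sixth units.
x = np.array([0.0, 1.0, 2.2, 3.0, 4.0, 5.5])
y = np.array([0.0, 2.0, np.nan, 6.0, 8.0, np.nan])
observed = ~np.isnan(y)
print(nn_impute_mean(x, y, observed, k=1))
print(ht_means(x, y, observed, bandwidth=1.0))
```

In this sketch the ratio form normalizes by the realized weights, so it returns a weighted average of the observed responses, whereas the basic form can drift when a few estimated response probabilities are small.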
Cite this article
Ning, J., Cheng, P.E. A comparison study of nonparametric imputation methods. Stat Comput 22, 273–285 (2012). https://doi.org/10.1007/s11222-010-9223-y
DOI: https://doi.org/10.1007/s11222-010-9223-y