Semi-parametric optimization for missing data imputation

Applied Intelligence

Abstract

Missing data imputation is an important issue in machine learning and data mining. In this paper, we propose a new and efficient imputation method for one kind of missing data: semi-parametric data. The method aims at an optimal evaluation of the Root Mean Square Error (RMSE), the distribution function, and the quantiles after the missing data are imputed. We evaluate the approach experimentally on both simulated and real data, and demonstrate that our stochastic semi-parametric regression imputation is much better than the existing deterministic semi-parametric regression imputation in both efficiency and effectiveness.
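
The abstract contrasts deterministic and stochastic semi-parametric regression imputation without giving the model, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes a partially linear model y = beta*x + g(t) + eps, estimates beta with a double-residual fit and g with a Nadaraya-Watson kernel smoother, and makes the stochastic variant by adding a residual resampled from the complete cases. All function names (`nw_smooth`, `fit_beta`, `impute`, `simulate`, `rmse`), the bandwidth, and the toy data are hypothetical choices made for illustration.

```python
import numpy as np


def nw_smooth(t_train, v_train, t_eval, h):
    """Nadaraya-Watson estimate of E[v | t] at t_eval (Gaussian kernel, bandwidth h)."""
    w = np.exp(-0.5 * ((t_eval[:, None] - t_train[None, :]) / h) ** 2)
    return (w @ v_train) / w.sum(axis=1)


def fit_beta(x, t, y, h):
    """Double-residual estimate of beta in y = beta*x + g(t) + eps (scalar x)."""
    x_res = x - nw_smooth(t, x, t, h)
    y_res = y - nw_smooth(t, y, t, h)
    return float(x_res @ y_res) / float(x_res @ x_res)


def impute(x, t, y, observed, h=0.1, stochastic=False, rng=None):
    """Fill y where observed is False, using only the complete cases."""
    xo, to, yo = x[observed], t[observed], y[observed]
    beta = fit_beta(xo, to, yo, h)
    partial = yo - beta * xo  # the part of y that g(t) has to explain
    resid = partial - nw_smooth(to, partial, to, h)
    missing = ~observed
    pred = beta * x[missing] + nw_smooth(to, partial, t[missing], h)
    if stochastic:
        # Assumption: the random part is a residual resampled from complete cases.
        rng = np.random.default_rng() if rng is None else rng
        pred = pred + rng.choice(resid, size=pred.size, replace=True)
    filled = y.copy()
    filled[missing] = pred
    return filled


def simulate(n=400, missing_rate=0.3, seed=0):
    """Toy data: y = 1.5*x + sin(2*pi*t) + noise, responses missing completely at random."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    t = rng.uniform(size=n)
    y_true = 1.5 * x + np.sin(2 * np.pi * t) + 0.5 * rng.normal(size=n)
    observed = rng.uniform(size=n) > missing_rate
    y = np.where(observed, y_true, np.nan)
    return x, t, y, y_true, observed


def rmse(a, b):
    """Root mean square error between two equally shaped arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


if __name__ == "__main__":
    x, t, y, y_true, observed = simulate()
    det = impute(x, t, y, observed)
    sto = impute(x, t, y, observed, stochastic=True, rng=np.random.default_rng(1))
    m = ~observed
    print("RMSE deterministic:", rmse(det[m], y_true[m]))
    print("RMSE stochastic:   ", rmse(sto[m], y_true[m]))
    print("median true / det / sto:", np.median(y_true), np.median(det), np.median(sto))
```

The script prints the RMSE on the imputed entries and the sample median of the completed data for both variants, which loosely mirrors the three evaluation targets named in the abstract. It is a toy comparison and makes no attempt to reproduce the paper's reported results.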

Author information

Corresponding author

Correspondence to Shichao Zhang.

Additional information

This work is partially supported by Australian large ARC grants (DP0449535, DP0559536 and DP0667060), a China NSF major research Program (60496327), China NSF grants (60463003, 10661003), an Overseas Outstanding Talent Research Program of Chinese Academy of Sciences (06S3011S01), a High-level Studying-Abroad Talent Program of the China Human-Resource Ministry and an Innovation Project of Guangxi Graduate Education (2006106020812M35).

About this article

Cite this article

Qin, Y., Zhang, S., Zhu, X. et al. Semi-parametric optimization for missing data imputation. Appl Intell 27, 79–88 (2007). https://doi.org/10.1007/s10489-006-0032-0

  • DOI: https://doi.org/10.1007/s10489-006-0032-0