On the choice of the best imputation methods for missing values considering three groups of classification methods

Luengo, Julián; García, Salvador; Herrera, Francisco

doi:10.1007/s10115-011-0424-2

Regular Paper
Published: 14 June 2011

Volume 32, pages 77–108, (2012)
Knowledge and Information Systems

Julián Luengo¹,
Salvador García² &
Francisco Herrera¹

Abstract

In real-life data, information is frequently lost because attributes contain missing values, which hampers data mining. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing and is formally known as imputation. In this work, we focus on a classification task and present and analyze fourteen imputation approaches to missing-value treatment across twenty-three classification methods. The analysis follows a group-based approach in which we distinguish three categories of classification methods. Each category behaves differently, and the evidence obtained shows that using certain missing-value imputation methods could improve the accuracy obtained for these methods. The study establishes the convenience of using imputation methods to preprocess data sets with missing values. The analysis suggests that particular imputation methods, chosen according to the group of classifiers, are required.
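
As an illustration of what imputation as a preprocessing step looks like, the following minimal sketch fills missing numeric values in two ways: with the column mean, and with the value of the nearest complete neighbour. It is not taken from the paper; the abstract does not name its fourteen methods, so the two techniques, the function names `mean_impute` and `knn_impute`, the `None` missing marker, and the toy data are all assumptions made for the example.

```python
from statistics import mean

MISSING = None  # marker for a missing attribute value (an assumption of this sketch)


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is MISSING else v for j, v in enumerate(r)]
        for r in rows
    ]


def knn_impute(rows, k=1):
    """Replace each missing value with the mean of that column over the k nearest rows.

    Distance is a root-mean-square difference over attributes observed in both rows.
    Assumes every column has at least one observed value in some other row.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not MISSING and y is not MISSING]
        if not shared:
            return float("inf")  # nothing to compare on, so treat as farthest
        return (sum((x - y) ** 2 for x, y in shared) / len(shared)) ** 0.5

    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is not MISSING:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [r for t, r in enumerate(rows)
                      if t != i and r[j] is not MISSING]
            donors.sort(key=lambda r: distance(row, r))
            new_row[j] = mean([r[j] for r in donors[:k]])
        filled.append(new_row)
    return filled


# Toy data set: two clusters, one value missing in the second row.
data = [
    [1.0, 2.0],
    [1.2, MISSING],
    [5.0, 6.0],
    [5.2, 6.4],
]

mean_filled = mean_impute(data)     # fills with the overall column mean
knn_filled = knn_impute(data, k=1)  # fills from the closest row, here [1.0, 2.0]
```

On this toy data the two strategies disagree: the column mean is pulled toward the distant cluster, while the nearest-neighbour fill stays with the local one. A classifier trained on each version sees different inputs, which is the kind of effect the study measures per group of classifiers.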

Author information

Authors and Affiliations

Department of Computer Science and Artificial Intelligence, CITIC-University of Granada, 18071, Granada, Spain
Julián Luengo & Francisco Herrera
Dept. of Computer Science, University of Jaén, 23071, Jaén, Spain
Salvador García

Corresponding author

Correspondence to Julián Luengo.

About this article

Cite this article

Luengo, J., García, S. & Herrera, F. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl Inf Syst 32, 77–108 (2012). https://doi.org/10.1007/s10115-011-0424-2

Received: 24 June 2009
Revised: 10 March 2011
Accepted: 22 May 2011
Published: 14 June 2011
Issue Date: July 2012
DOI: https://doi.org/10.1007/s10115-011-0424-2
