Abstract
Missing data recurrently affect datasets in almost every field of quantitative research. The subject is vast and complex, and has given rise to a literature rich in very different approaches to the problem. Within an exploratory framework, distance-based methods such as nearest-neighbour imputation (NNI), or procedures involving multivariate data analysis (MVDA) techniques, appear well suited to the problem. In NNI, the metric and the number of donors can be chosen at will. MVDA-based procedures expressly account for variable associations. The new approach proposed here, called Forward Imputation, combines these features. It is designed as a sequential procedure that imputes missing data in a step-by-step process involving subsets of units according to their “completeness rate”. Two methods within this context are developed for the imputation of quantitative data. One applies NNI with the Mahalanobis distance; the other combines NNI and principal component analysis. Statistical properties of the two methods are discussed, and their performance is assessed, also in comparison with alternative imputation methods. To this end, a simulation study in the presence of different data patterns, along with an application to real data, is carried out, and practical hints for users are also provided.
Acknowledgments
N. Solaro’s work was partly funded by the MIUR PRIN “MISURA—Multivariate models for risk assessment” project. The authors would like to thank the Coordinating Editor, Associate Editor and the two anonymous referees for their valuable comments and suggestions, which greatly improved the paper.
Appendix
In FIP, the requirement \(n_{k} \ge p\) for each \(k=0,1, \ldots , K\) is not binding. For a given k, suppose that \(n_{k} < p\), and consider the correlation matrix \(\mathbf {R}_{k}\) computed from the complete matrix \({\mathbf {X}}_{k}\) (similar arguments hold for the variance-covariance matrix \(\varvec{\Sigma }_{k}\)). Let \(\mathbf {Z}_{k}\) be the standardized matrix derived from \({\mathbf {X}}_{k}\). It follows that \( \mathbf {R}_{k} = \frac{1}{n_{k}} \mathbf {Z}_{k}^t \mathbf {Z}_{k} \), so that \(\mathrm{rank}(\mathbf {R}_{k}) \le n_{k} < p\). Now consider the \(n_{k} \times n_{k}\) product matrix \( \mathbf {P}_{k} = \mathbf {Z}_{k} \mathbf {Z}_{k}^t \). Standard results of matrix algebra ensure that \(\mathbf {P}_{k}\) and \(\mathbf {R}_{k}\) have the same rank. In particular, since \(\mathbf {Z}_{k}\) has zero-mean columns, it always holds that \( \mathbf {Z}_{k} \mathbf {Z}_{k}^t \mathbf {1}= 0 \cdot \mathbf {1}= \mathbf {0}\), where \(\mathbf {1}\) is a vector of \(n_{k}\) ones, so that \(\mathbf {P}_{k}\) always admits a zero eigenvalue. Therefore, \( \mathrm{rank}(\mathbf {P}_{k}) = \mathrm{rank}(\mathbf {R}_{k}) \le n_{k} -1 \). Moreover, \(n_{k} \mathbf {R}_{k}\) has the same non-null eigenvalues as \(\mathbf {P}_{k}\), with (at least) \(p - n_{k} + 1 \) additional eigenvalues equal to zero.
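The rank argument above can be verified numerically. The following sketch (using NumPy; the data and variable names are our own illustrative choices, not from the paper) builds a standardized matrix with \(n_{k} < p\) and checks that \(\mathbf {P}_{k}\) and \(\mathbf {R}_{k}\) share the same rank, bounded by \(n_{k}-1\), and that \(\mathbf {P}_{k}\) annihilates the vector of ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_k, p = 5, 8                    # fewer complete units than variables
X_k = rng.normal(size=(n_k, p))  # hypothetical complete data block

# Standardize columns (zero mean, unit variance), as in the definition of Z_k
Z_k = (X_k - X_k.mean(axis=0)) / X_k.std(axis=0)

R_k = Z_k.T @ Z_k / n_k          # p x p correlation matrix
P_k = Z_k @ Z_k.T                # n_k x n_k product matrix

# Same rank, bounded by n_k - 1 because the rows of Z_k sum to zero
assert np.linalg.matrix_rank(R_k) == np.linalg.matrix_rank(P_k)
assert np.linalg.matrix_rank(R_k) <= n_k - 1

# Zero-mean columns imply P_k @ 1 = 0, hence a zero eigenvalue of P_k
ones = np.ones(n_k)
assert np.allclose(P_k @ ones, 0.0)
```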
Let \(\eta ^{(k)}_{s}\) be a non-null eigenvalue of \(\mathbf {P}_{k}\), and \(\mathbf {v}^{(k)}_{s}\) the corresponding normalized eigenvector (\(s = 1, \ldots , n^{\prime }_k \le n_{k} -1\), where \(n^{\prime }_k = \mathrm{rank}(\mathbf {P}_{k})\)). By definition, \( \mathbf {P}_{k} \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} \mathbf {v}^{(k)}_{s} \). Pre-multiplying both sides by \(\mathbf {Z}_{k}^t\) yields \( n_{k} \mathbf {R}_{k} \mathbf {Z}_{k}^t \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} \mathbf {Z}_{k}^t \mathbf {v}^{(k)}_{s} \), from which it is apparent that \( \varvec{\xi }^{(k)}_{s} = \mathbf {Z}_{k}^t \mathbf {v}^{(k)}_{s} \) is the s-th p-dimensional eigenvector of \(n_{k} \mathbf {R}_{k}\). Finally, since \( {\varvec{\xi }^{(k)}_{s}}^{t} \varvec{\xi }^{(k)}_{s} = {\mathbf {v}^{(k)}_{s}}^{t} \mathbf {P}_{k} \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} {\mathbf {v}^{(k)}_{s}}^{t} \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} \), it follows that the s-th normalized eigenvector of \(n_{k} \mathbf {R}_{k}\), which is also an eigenvector of \(\mathbf {R}_{k}\), is given by \( \varvec{\omega }^{(k)}_{s} = \varvec{\xi }^{(k)}_{s} / \sqrt{\eta ^{(k)}_{s}} \), while the s-th eigenvalue of \(\mathbf {R}_{k}\) is \( \lambda ^{(k)}_{s} = \eta ^{(k)}_{s} / n_{k} \), for \(s= 1, \ldots , n^{\prime }_k \le n_{k} -1 < p\).
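The eigen-relationship derived above can likewise be checked numerically. This sketch (NumPy; data and names are our own illustrative choices) recovers the non-null eigenpairs of \(\mathbf {R}_{k}\) from those of the smaller matrix \(\mathbf {P}_{k}\), via \(\varvec{\omega }_{s} = \mathbf {Z}_{k}^t \mathbf {v}_{s} / \sqrt{\eta _{s}}\) and \(\lambda _{s} = \eta _{s} / n_{k}\):

```python
import numpy as np

rng = np.random.default_rng(1)
n_k, p = 5, 8
X_k = rng.normal(size=(n_k, p))
Z_k = (X_k - X_k.mean(axis=0)) / X_k.std(axis=0)
R_k = Z_k.T @ Z_k / n_k          # p x p correlation matrix
P_k = Z_k @ Z_k.T                # n_k x n_k product matrix

eta, V = np.linalg.eigh(P_k)     # eigenpairs of the symmetric P_k
keep = eta > 1e-10               # discard the (at least one) null eigenvalue

for eta_s, v_s in zip(eta[keep], V[:, keep].T):
    lam_s = eta_s / n_k                       # eigenvalue of R_k
    omega_s = Z_k.T @ v_s / np.sqrt(eta_s)    # normalized eigenvector of R_k
    assert np.isclose(np.linalg.norm(omega_s), 1.0)
    assert np.allclose(R_k @ omega_s, lam_s * omega_s)
```

Working with the \(n_{k} \times n_{k}\) matrix \(\mathbf {P}_{k}\) instead of the \(p \times p\) matrix \(\mathbf {R}_{k}\) is the same device used in kernel PCA and in PCA for high-dimensional, low-sample-size data.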
Cite this article
Solaro, N., Barbiero, A., Manzi, G. et al. A sequential distance-based approach for imputing missing data: Forward Imputation. Adv Data Anal Classif 11, 395–414 (2017). https://doi.org/10.1007/s11634-016-0243-0