Abstract
Missing data recurrently affect datasets in almost every field of quantitative research. The subject is vast and complex, and has given rise to a literature rich in very different approaches to the problem. Within an exploratory framework, distance-based methods such as nearest-neighbour imputation (NNI), or procedures involving multivariate data analysis (MVDA) techniques, appear well suited to the problem. In NNI, the metric and the number of donors can be chosen at will. MVDA-based procedures expressly account for variable associations. The new approach proposed here, called Forward Imputation, combines these features. It is designed as a sequential procedure that imputes missing data in a step-by-step process involving subsets of units according to their “completeness rate”. Two methods within this context are developed for the imputation of quantitative data. One applies NNI with the Mahalanobis distance; the other combines NNI and principal component analysis. Statistical properties of the two methods are discussed, and their performance is assessed, also in comparison with alternative imputation methods. To this end, a simulation study in the presence of different data patterns, along with an application to real data, is carried out, and practical hints for users are also provided.
Acknowledgments
N. Solaro’s work was partly funded by the MIUR PRIN “MISURA—Multivariate models for risk assessment” project. The authors would like to thank the Coordinating Editor, Associate Editor and the two anonymous referees for their valuable comments and suggestions, which greatly improved the paper.
Appendix
In FIP, the requirement \(n_{k} \ge p\) for each \(k=0,1, \ldots , K\) is not binding. For a given k, suppose that \(n_{k} < p\), and consider the correlation matrix \(\mathbf {R}_{k}\) computed from the complete matrix \({\mathbf {X}}_{k}\) (similar arguments hold for the variance-covariance matrix \(\varvec{\Sigma }_{k}\)). Let \(\mathbf {Z}_{k}\) be the standardized matrix derived from \({\mathbf {X}}_{k}\). It follows that \( \mathbf {R}_{k} = \frac{1}{n_{k}} \mathbf {Z}_{k}^t \mathbf {Z}_{k} \), so that \(\mathrm{rank}(\mathbf {R}_{k}) \le n_{k} < p\). Now consider the \(n_{k} \times n_{k}\) product matrix \( \mathbf {P}_{k} = \mathbf {Z}_{k} \mathbf {Z}_{k}^t \). Standard results of matrix algebra ensure that \(\mathbf {P}_{k}\) and \(\mathbf {R}_{k}\) have the same rank. In particular, since \(\mathbf {Z}_{k}\) has zero-mean columns, it always holds that \( \mathbf {Z}_{k} \mathbf {Z}_{k}^t \mathbf {1}= 0 \cdot \mathbf {1}= \mathbf {0}\), where \(\mathbf {1}\) is a vector of \(n_{k}\) ones, so that \(\mathbf {P}_{k}\) always admits a zero eigenvalue. Therefore, \( \mathrm{rank}(\mathbf {P}_{k}) = \mathrm{rank}(\mathbf {R}_{k}) \le n_{k} -1 \). Moreover, \(n_{k} \mathbf {R}_{k}\) has the same non-null eigenvalues as \(\mathbf {P}_{k}\), with (at least) \(p - n_{k} + 1 \) additional eigenvalues equal to zero.
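The rank argument above can be verified numerically. The following sketch (using NumPy; the data and variable names are our own illustrative choices, not from the paper) builds a standardized matrix with \(n_{k} < p\) and checks that \(\mathbf {P}_{k}\) and \(\mathbf {R}_{k}\) share the same rank, bounded by \(n_{k}-1\), and that \(\mathbf {P}_{k}\) annihilates the vector of ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_k, p = 5, 8                    # fewer complete units than variables
X_k = rng.normal(size=(n_k, p))  # hypothetical complete data block

# Standardize columns (zero mean, unit variance), as in the definition of Z_k
Z_k = (X_k - X_k.mean(axis=0)) / X_k.std(axis=0)

R_k = Z_k.T @ Z_k / n_k          # p x p correlation matrix
P_k = Z_k @ Z_k.T                # n_k x n_k product matrix

# Same rank, bounded by n_k - 1 because the rows of Z_k sum to zero
assert np.linalg.matrix_rank(R_k) == np.linalg.matrix_rank(P_k)
assert np.linalg.matrix_rank(R_k) <= n_k - 1

# Zero-mean columns imply P_k @ 1 = 0, hence a zero eigenvalue of P_k
ones = np.ones(n_k)
assert np.allclose(P_k @ ones, 0.0)
```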
Let \(\eta ^{(k)}_{s}\) be a non-null eigenvalue of \(\mathbf {P}_{k}\), and \(\mathbf {v}^{(k)}_{s}\) the corresponding normalized eigenvector (\(s = 1, \ldots , n^{\prime }_k \le n_{k} -1\), where \(n^{\prime }_k = \mathrm{rank}(\mathbf {P}_{k})\)). By definition, \( \mathbf {P}_{k} \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} \mathbf {v}^{(k)}_{s} \). Pre-multiplying both sides by \(\mathbf {Z}_{k}^t\) yields \( n_{k} \mathbf {R}_{k} \mathbf {Z}_{k}^t \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} \mathbf {Z}_{k}^t \mathbf {v}^{(k)}_{s} \), from which it is apparent that \( \varvec{\xi }^{(k)}_{s} = \mathbf {Z}_{k}^t \mathbf {v}^{(k)}_{s} \) is the s-th p-dimensional eigenvector of \(n_{k} \mathbf {R}_{k}\). Finally, since \( {\varvec{\xi }^{(k)}_{s}}^{t} \varvec{\xi }^{(k)}_{s} = {\mathbf {v}^{(k)}_{s}}^{t} \mathbf {P}_{k} \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} {\mathbf {v}^{(k)}_{s}}^{t} \mathbf {v}^{(k)}_{s} = \eta ^{(k)}_{s} \), it follows that the s-th normalized eigenvector of \(n_{k} \mathbf {R}_{k}\), which is also an eigenvector of \(\mathbf {R}_{k}\), is given by \( \varvec{\omega }^{(k)}_{s} = \varvec{\xi }^{(k)}_{s} / \sqrt{\eta ^{(k)}_{s}} \), while the s-th eigenvalue of \(\mathbf {R}_{k}\) is \( \lambda ^{(k)}_{s} = \eta ^{(k)}_{s} / n_{k} \), for \(s= 1, \ldots , n^{\prime }_k \le n_{k} -1 < p\).
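The eigen-relationship derived above can likewise be checked numerically. This sketch (NumPy; data and names are our own illustrative choices) recovers the non-null eigenpairs of \(\mathbf {R}_{k}\) from those of the smaller matrix \(\mathbf {P}_{k}\), via \(\varvec{\omega }_{s} = \mathbf {Z}_{k}^t \mathbf {v}_{s} / \sqrt{\eta _{s}}\) and \(\lambda _{s} = \eta _{s} / n_{k}\):

```python
import numpy as np

rng = np.random.default_rng(1)
n_k, p = 5, 8
X_k = rng.normal(size=(n_k, p))
Z_k = (X_k - X_k.mean(axis=0)) / X_k.std(axis=0)
R_k = Z_k.T @ Z_k / n_k          # p x p correlation matrix
P_k = Z_k @ Z_k.T                # n_k x n_k product matrix

eta, V = np.linalg.eigh(P_k)     # eigenpairs of the symmetric P_k
keep = eta > 1e-10               # discard the (at least one) null eigenvalue

for eta_s, v_s in zip(eta[keep], V[:, keep].T):
    lam_s = eta_s / n_k                       # eigenvalue of R_k
    omega_s = Z_k.T @ v_s / np.sqrt(eta_s)    # normalized eigenvector of R_k
    assert np.isclose(np.linalg.norm(omega_s), 1.0)
    assert np.allclose(R_k @ omega_s, lam_s * omega_s)
```

Working with the \(n_{k} \times n_{k}\) matrix \(\mathbf {P}_{k}\) instead of the \(p \times p\) matrix \(\mathbf {R}_{k}\) is the same device used in kernel PCA and in PCA for high-dimensional, low-sample-size data.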
Cite this article
Solaro, N., Barbiero, A., Manzi, G. et al. A sequential distance-based approach for imputing missing data: Forward Imputation. Adv Data Anal Classif 11, 395–414 (2017). https://doi.org/10.1007/s11634-016-0243-0