A sequential distance-based approach for imputing missing data: Forward Imputation

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Missing data recurrently affect datasets in almost every field of quantitative research. The subject is vast and complex and has given rise to a literature rich in very different approaches to the problem. Within an exploratory framework, distance-based methods such as nearest-neighbour imputation (NNI), or procedures involving multivariate data analysis (MVDA) techniques, seem to treat the problem properly. In NNI, the metric and the number of donors can be chosen at will. MVDA-based procedures expressly account for variable associations. The new approach proposed here, called Forward Imputation, ideally combines these features. It is designed as a sequential procedure that imputes missing data in a step-by-step process involving subsets of units according to their “completeness rate”. Two methods within this context are developed for the imputation of quantitative data. One applies NNI with the Mahalanobis distance, the other combines NNI and principal component analysis. Statistical properties of the two methods are discussed, and their performance is assessed, also in comparison with alternative imputation methods. To this end, a simulation study in the presence of different data patterns and an application to real data are carried out, and practical hints for users are also provided.
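
To make the idea of sequential, distance-based imputation concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that units are visited one at a time from the most complete to the least complete, that donors are the nearest units (in Mahalanobis distance on the observed variables) within the currently complete pool, and that missing entries are filled with the donors' mean. The function name `forward_nn_impute` and the parameter `n_donors` are hypothetical. The sketch also assumes at least two complete units and at least one observed value per unit.

```python
import numpy as np

def forward_nn_impute(X, n_donors=1):
    """Toy sequential nearest-neighbour imputation (illustrative only)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_missing = np.isnan(X).sum(axis=1)
    # The donor pool starts with the units that have no missing values.
    complete = list(np.flatnonzero(n_missing == 0))
    # Visit incomplete units from the most complete to the least complete.
    order = [i for i in np.argsort(n_missing, kind="stable") if n_missing[i] > 0]
    for i in order:
        obs = ~np.isnan(X[i])            # variables observed for unit i
        donors = filled[complete]        # current (completed) donor pool
        # Covariance of the observed variables, estimated on the donor pool.
        cov = np.atleast_2d(np.cov(donors[:, obs], rowvar=False))
        prec = np.linalg.pinv(cov)       # pseudo-inverse guards against singularity
        diff = donors[:, obs] - X[i, obs]
        # Squared Mahalanobis distance from unit i to every donor.
        d2 = np.einsum("ij,jk,ik->i", diff, prec, diff)
        nearest = np.argsort(d2)[:n_donors]
        # Fill the missing entries with the mean of the nearest donors.
        filled[i, ~obs] = donors[nearest][:, ~obs].mean(axis=0)
        complete.append(i)               # the imputed unit can now act as a donor
    return filled

# Usage on a small matrix with one incomplete unit.
X_demo = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [2.1, np.nan]])
X_filled = forward_nn_impute(X_demo, n_donors=1)
```

Adding each imputed unit to the donor pool is what makes the procedure sequential in this sketch; how the actual method groups units by completeness rate is not specified here and may differ.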

Acknowledgments

N. Solaro’s work was partly funded by the MIUR PRIN “MISURA—Multivariate models for risk assessment” project. The authors would like to thank the Coordinating Editor, Associate Editor and the two anonymous referees for their valuable comments and suggestions, which greatly improved the paper.

Author information

Corresponding author

Correspondence to Nadia Solaro.

Appendix

In FIP, the requirement \(n_{k} \ge p\) for each \(k = 0, 1, \ldots, K\) is not binding. For a given \(k\), suppose that \(n_{k} < p\), and consider the correlation matrix \(\mathbf{R}_{k}\) computed from the complete matrix \(\mathbf{X}_{k}\) (similar arguments hold for the variance-covariance matrix \(\boldsymbol{\Sigma}_{k}\)). Let \(\mathbf{Z}_{k}\) be the standardized matrix derived from \(\mathbf{X}_{k}\). It follows that \(\mathbf{R}_{k} = \frac{1}{n_{k}} \mathbf{Z}_{k}^{t} \mathbf{Z}_{k}\), for which \(\mathrm{rank}(\mathbf{R}_{k}) \le n_{k} < p\). Now consider the \(n_{k} \times n_{k}\) product matrix \(\mathbf{P}_{k} = \mathbf{Z}_{k} \mathbf{Z}_{k}^{t}\). Standard results of matrix algebra ensure that \(\mathbf{P}_{k}\) and \(\mathbf{R}_{k}\) have the same rank. In particular, as \(\mathbf{Z}_{k}\) has zero-mean columns, \(\mathbf{Z}_{k}^{t} \mathbf{1} = \mathbf{0}\), and hence it always holds that \(\mathbf{Z}_{k} \mathbf{Z}_{k}^{t} \mathbf{1} = 0 \cdot \mathbf{1} = \mathbf{0}\), where \(\mathbf{1}\) is a vector of \(n_{k}\) ones, so that \(\mathbf{P}_{k}\) always admits a zero eigenvalue. Therefore, \(\mathrm{rank}(\mathbf{P}_{k}) = \mathrm{rank}(\mathbf{R}_{k}) \le n_{k} - 1\). Moreover, \(n_{k} \mathbf{R}_{k}\) has the same non-null eigenvalues as \(\mathbf{P}_{k}\), with (at least) \(p - n_{k} + 1\) additional eigenvalues equal to zero.
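
The rank statements above can be checked numerically. The snippet below is a small sketch on simulated data, assuming standardization with the \(1/n_{k}\) variance convention so that \(\mathbf{R}_{k} = \mathbf{Z}_{k}^{t} \mathbf{Z}_{k} / n_{k}\) holds exactly. The sizes, the seed and the variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_k, p = 4, 7                        # fewer units than variables (n_k < p)
X_k = rng.normal(size=(n_k, p))      # simulated complete data matrix

# Standardize columns; np.std uses the 1/n_k convention by default.
Z_k = (X_k - X_k.mean(axis=0)) / X_k.std(axis=0)

R_k = Z_k.T @ Z_k / n_k              # p x p correlation matrix
P_k = Z_k @ Z_k.T                    # n_k x n_k product matrix

ones = np.ones(n_k)
zero_vec = P_k @ ones                # numerically zero: P_k has a zero eigenvalue

rank_R = np.linalg.matrix_rank(R_k)
rank_P = np.linalg.matrix_rank(P_k)

# Count the (numerically) zero eigenvalues of R_k.
n_zero_eigs = int(np.sum(np.linalg.eigvalsh(R_k) < 1e-10))
```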

Let \(\eta^{(k)}_{s}\) be a non-null eigenvalue of \(\mathbf{P}_{k}\), and \(\mathbf{v}^{(k)}_{s}\) the corresponding normalized eigenvector (\(s = 1, \ldots, n^{\prime}_{k}\), with \(n^{\prime}_{k} = \mathrm{rank}(\mathbf{P}_{k}) \le n_{k} - 1\)). By definition, \(\mathbf{P}_{k} \mathbf{v}^{(k)}_{s} = \eta^{(k)}_{s} \mathbf{v}^{(k)}_{s}\). Pre-multiplying both sides by \(\mathbf{Z}_{k}^{t}\) leads to \(n_{k} \mathbf{R}_{k} \mathbf{Z}_{k}^{t} \mathbf{v}^{(k)}_{s} = \eta^{(k)}_{s} \mathbf{Z}_{k}^{t} \mathbf{v}^{(k)}_{s}\), from which it is apparent that \(\boldsymbol{\xi}^{(k)}_{s} = \mathbf{Z}_{k}^{t} \mathbf{v}^{(k)}_{s}\) is the \(s\)-th \(p\)-dimensional eigenvector of \(n_{k} \mathbf{R}_{k}\). Finally, since \({\boldsymbol{\xi}^{(k)}_{s}}^{t} \boldsymbol{\xi}^{(k)}_{s} = {\mathbf{v}^{(k)}_{s}}^{t} \mathbf{P}_{k} \mathbf{v}^{(k)}_{s} = \eta^{(k)}_{s} {\mathbf{v}^{(k)}_{s}}^{t} \mathbf{v}^{(k)}_{s} = \eta^{(k)}_{s}\), it follows that the \(s\)-th normalized eigenvector of \(n_{k} \mathbf{R}_{k}\), which is also an eigenvector of \(\mathbf{R}_{k}\), is given by \(\boldsymbol{\omega}^{(k)}_{s} = \boldsymbol{\xi}^{(k)}_{s} / \sqrt{\eta^{(k)}_{s}}\), while the \(s\)-th eigenvalue of \(\mathbf{R}_{k}\) is \(\lambda^{(k)}_{s} = \eta^{(k)}_{s} / n_{k}\), for \(s = 1, \ldots, n^{\prime}_{k} \le n_{k} - 1 < p\).
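
As an illustration of this result, the sketch below recovers the non-null eigenvalues and normalized eigenvectors of \(\mathbf{R}_{k}\) from the eigen-decomposition of the smaller matrix \(\mathbf{P}_{k}\). It uses simulated data and assumes the \(1/n_{k}\) standardization convention. Variable names mirror the symbols above, while the sizes, the seed and the numerical tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_k, p = 5, 8                        # fewer units than variables (n_k < p)
X_k = rng.normal(size=(n_k, p))      # simulated complete data matrix

# Standardize columns with the 1/n_k convention, so R_k = Z_k^t Z_k / n_k.
Z_k = (X_k - X_k.mean(axis=0)) / X_k.std(axis=0)
R_k = Z_k.T @ Z_k / n_k              # p x p correlation matrix
P_k = Z_k @ Z_k.T                    # n_k x n_k product matrix

# Eigen-decomposition of the small symmetric matrix P_k (ascending order).
eta, V = np.linalg.eigh(P_k)
keep = eta > 1e-10 * eta.max()       # drop the numerically null eigenvalue(s)
eta = eta[keep][::-1]                # non-null eigenvalues, descending
V = V[:, keep][:, ::-1]              # matching normalized eigenvectors v_s

# Map back to p dimensions: xi_s = Z_k^t v_s, omega_s = xi_s / sqrt(eta_s).
Omega = (Z_k.T @ V) / np.sqrt(eta)
lam = eta / n_k                      # eigenvalues of R_k

# Residual of the eigen-equation R_k omega_s = lambda_s omega_s.
residual = np.abs(R_k @ Omega - Omega * lam).max()
```

Working with \(\mathbf{P}_{k}\) rather than \(\mathbf{R}_{k}\) means decomposing an \(n_{k} \times n_{k}\) matrix instead of a \(p \times p\) one, which is the smaller problem whenever \(n_{k} < p\).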

About this article

Cite this article

Solaro, N., Barbiero, A., Manzi, G. et al. A sequential distance-based approach for imputing missing data: Forward Imputation. Adv Data Anal Classif 11, 395–414 (2017). https://doi.org/10.1007/s11634-016-0243-0

  • DOI: https://doi.org/10.1007/s11634-016-0243-0
