How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment

Quality & Quantity

Abstract

This study compares three approaches to handling missing values in categorical variables: complete case analysis (CCA), multiple imputation (MI) based on random forest, and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice should be based primarily on the missingness mechanism: CCA under MCAR (missing completely at random), MI under MAR (missing at random), and, again, CCA under MNAR (missing not at random). Although MIM also produces almost unbiased estimates under MCAR and MNAR, it yields inefficient regression coefficients, that is, coefficients with inflated standard errors and, consequently, incorrect p-values.
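To make the comparison concrete, the following is a minimal sketch of how CCA and MIM could be set up for an OLS model with one categorical predictor under MCAR. It is not the authors' experiment: the sample size, the true coefficients, the 30% missingness rate, the independence of `x` and `d`, and the helper names `simulate`, `fit_cca`, and `fit_mim` are all assumptions chosen for illustration.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients via a least-squares solve."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def simulate(n=5000, miss_rate=0.3, seed=0):
    """Simulate y from a continuous x and a 3-level categorical d (assumed design)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    d = rng.integers(0, 3, size=n)  # categories 0, 1, 2; category 0 is the reference
    y = 1.0 + 2.0 * x + 1.5 * (d == 1) - 1.0 * (d == 2) + rng.normal(size=n)
    # MCAR: whether d is missing does not depend on x, d, or y.
    missing = rng.random(n) < miss_rate
    return x, d, y, missing


def fit_cca(x, d, y, missing):
    """Complete case analysis: drop every row where d is missing."""
    keep = ~missing
    X = np.column_stack(
        [np.ones(keep.sum()), x[keep], d[keep] == 1, d[keep] == 2]
    ).astype(float)
    return ols(X, y[keep])


def fit_mim(x, d, y, missing):
    """Missing-indicator method: keep all rows, treat 'missing' as its own category."""
    X = np.column_stack(
        [np.ones(len(y)), x, (d == 1) & ~missing, (d == 2) & ~missing, missing]
    ).astype(float)
    return ols(X, y)


x, d, y, missing = simulate()
beta_cca = fit_cca(x, d, y, missing)  # [intercept, x, d==1, d==2]
beta_mim = fit_mim(x, d, y, missing)  # [intercept, x, d==1, d==2, missing]
print("CCA:", np.round(beta_cca, 3))
print("MIM:", np.round(beta_mim, 3))
```

In this setup both fits should recover a coefficient on `x` close to the assumed true value of 2.0. Comparing standard errors, or repeating the exercise under MAR and MNAR, would require changing how `missing` is generated and is not shown here.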

Data availability and material

All data files are available upon request.

Code availability

All scripts are available upon request.

Notes

  1. The only exceptions are Choi et al. (2019) and Donders et al. (2006), who use simulated data but limit their analysis to continuous variables.

  2. The exception is Henry et al. (2013), in which the race variable contains missing values; however, the authors use real data and do not control for factors that may affect the results of the comparison.

  3. All the technical files are available upon request.

  4. Random forest-based multiple imputation was carried out with the ‘sklearn’ package (specifically its IterativeImputer class) in Python (Pedregosa et al. 2011), which is equivalent to the ‘mice’ package in R.

  5. θ is the true value of a parameter.

  6. ANOVA was carried out with the ‘statsmodels’ package (specifically its anova_lm function) in Python (Seabold and Perktold 2010).

  7. CHAID was carried out with the ‘randan’ package (specifically its CHAIDRegressor class) in Python.
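As an illustration of the tooling named in note 4, here is a minimal sketch of random forest-based imputation with scikit-learn's `IterativeImputer`. It does not reproduce the authors' scripts: the toy data, the 0/1 coding of the category, the rounding step, and the choice to obtain several imputed datasets by varying `random_state` are assumptions. (A random forest estimator does not support `sample_posterior=True`, so changing the seed is one common workaround.)

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
# Assumed toy data: a binary category (coded 0/1) that depends on x.
d = (x + rng.normal(scale=0.5, size=n) > 0).astype(float)
data = np.column_stack([x, d])

# Delete about 20% of the categorical column at random.
mask = rng.random(n) < 0.2
data_missing = data.copy()
data_missing[mask, 1] = np.nan

imputed_sets = []
for m in range(5):  # five imputed datasets, one per seed
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=m),
        max_iter=5,
        random_state=m,
    )
    completed = imputer.fit_transform(data_missing)
    # The regressor returns values in [0, 1]; round back to the 0/1 category codes.
    completed[:, 1] = np.clip(np.rint(completed[:, 1]), 0, 1)
    imputed_sets.append(completed)

print(len(imputed_sets), imputed_sets[0].shape)
```

Each completed dataset would then be analysed separately and the estimates pooled; the pooling step is omitted here.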

Funding

The publication was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2020 (Grant No. 20-04-016) and by the Russian Academic Excellence Project "5–100".

Author information

Corresponding author

Correspondence to Svetlana Zhuchkova.

Ethics declarations

Conflict of interest

We declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhuchkova, S., Rotmistrov, A. How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment. Qual Quant 56, 1–22 (2022). https://doi.org/10.1007/s11135-021-01114-w

  • DOI: https://doi.org/10.1007/s11135-021-01114-w
