Comparative Study of Missing Value Imputation Techniques on E-Commerce Product Ratings

Dimple Chehal, Parul Gupta, Payal Gulati, Tanisha Gupta

Abstract


Missing data occur in practically all studies and add a layer of ambiguity to data interpretation. Missing values, which are entries absent from a dataset and usually recorded as NaNs, blanks, or other placeholders, are among the most common data quality issues and entail a loss of important information. They create imbalanced observations and biased estimates, and sometimes lead to misleading results. Because the majority of real-world datasets contain missing values, they must be handled appropriately to deliver an efficient and valid analysis. Filling in the missing values yields a complete dataset and avoids the challenge of dealing with complex patterns of missingness. Missing values can be of both continuous and categorical types, and a variety of techniques can be used to fill them in to obtain more precise results. In the present study, nine imputation methods were compared: Simple Imputer, Last Observation Carried Forward (LOCF), KNN Imputation (KNN), Hot Deck, Linear Regression, MissForest, Random Forest Regression, DataWig, and Multivariate Imputation by Chained Equations (MICE). The comparison was performed on an Amazon real-time dataset using three evaluation criteria: R-squared (R2), mean squared error (MSE), and mean absolute error (MAE). For R-squared, which ranges from 0 to 1, KNN had the best results and DataWig the worst. In terms of MSE and MAE, hot deck imputation fared best, whereas MissForest performed worst on MAE. The hot deck imputation method therefore appears to be of interest and merits further investigation in practice.
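
The evaluation protocol summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes a small synthetic table in place of the Amazon data, hides a fraction of known ratings, fills them with only two of the nine methods (scikit-learn's SimpleImputer and KNNImputer), and scores the imputed values with R2, MSE, and MAE. The helper names `make_synthetic_ratings` and `evaluate_imputers`, the 20% missing fraction, and `n_neighbors=5` are all illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def make_synthetic_ratings(n_rows=200, seed=0):
    """Hypothetical stand-in data: two numeric features plus a 1-5 rating."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_rows, 2))
    rating = 3.0 + 0.8 * features[:, 0] - 0.5 * features[:, 1]
    rating = np.clip(rating + rng.normal(scale=0.3, size=n_rows), 1.0, 5.0)
    return np.column_stack([features, rating])


def evaluate_imputers(data, missing_fraction=0.2, seed=0):
    """Hide known ratings, impute them, and score each imputer on the hidden cells."""
    rng = np.random.default_rng(seed)
    n_rows = data.shape[0]
    n_missing = int(missing_fraction * n_rows)
    hidden_rows = rng.choice(n_rows, size=n_missing, replace=False)

    # Blank out the rating column (last column) for the selected rows.
    masked = data.copy()
    masked[hidden_rows, -1] = np.nan
    truth = data[hidden_rows, -1]

    # Only two of the nine compared methods are shown here.
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "knn": KNNImputer(n_neighbors=5),
    }

    scores = {}
    for name, imputer in imputers.items():
        filled = imputer.fit_transform(masked)
        predicted = filled[hidden_rows, -1]
        scores[name] = {
            "r2": r2_score(truth, predicted),
            "mse": mean_squared_error(truth, predicted),
            "mae": mean_absolute_error(truth, predicted),
        }
    return scores


scores = evaluate_imputers(make_synthetic_ratings())
for name, metrics in scores.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Scoring only the deliberately hidden cells keeps the comparison fair, since the true values are known for exactly those entries. Note that scikit-learn's `r2_score` can return a value below 0 when an imputer does worse than predicting the mean of the held-out values, so results on other data need not fall in the 0 to 1 range.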







DOI: https://doi.org/10.31449/inf.v47i3.4156

This work is licensed under a Creative Commons Attribution 3.0 License.