Abstract
In this paper, we study the problem of computing Random Forest (RF) distances in the presence of missing data. We present a general framework that avoids pre-imputation and uses the information contained in the input points in an agnostic way. We centre our investigation on RatioRF, an RF-based distance recently introduced in the context of clustering and shown to outperform most known RF-based distance measures. We also show that the same framework can be applied to several other state-of-the-art RF-based measures, and we provide their extensions to the missing-data case. We provide significant empirical evidence of the effectiveness of the proposed framework through extensive experiments with RatioRF on 15 datasets. Finally, we show that our method compares favourably with many alternative distances from the literature that can be computed with missing values.
Index Terms
- Computing Random Forest-distances in the presence of missing data