
Computing Random Forest-distances in the presence of missing data

Online AM: 8 April 2024

Abstract

In this paper, we study the problem of computing Random Forest-distances in the presence of missing data. We present a general framework that avoids pre-imputation and exploits, in an agnostic way, the information contained in the input points. We centre our investigation on RatioRF, an RF-based distance recently introduced in the context of clustering and shown to outperform most known RF-based distance measures. We also show that the same framework can be applied to several other state-of-the-art RF-based measures, providing their extensions to the missing-data case. We give significant empirical evidence of the effectiveness of the proposed framework through extensive experiments with RatioRF on 15 datasets. Finally, we also report favourable comparisons of our method with many alternative literature distances that can be computed in the presence of missing values.
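To make the setting concrete, the sketch below shows one generic way an RF-based distance can be evaluated without pre-imputation: a classic forest-proximity distance (fraction of trees in which two points reach the same leaf), where a point missing the split feature is sent down both branches with equal probability mass, in the spirit of C4.5's handling of unknown attribute values. This is an illustrative toy, not the RatioRF framework of the paper; the tree encoding, the equal-split rule, and all names (`leaf_distribution`, `rf_proximity`, `rf_distance`) are assumptions made for the example.

```python
import math

# Toy decision trees: an internal node is (feature, threshold, left, right),
# a leaf is an integer id. Missing feature values are encoded as None.

def leaf_distribution(tree, x, weight=1.0):
    """Return {leaf_id: probability mass} reached by point x.

    If the split feature is missing, the mass is divided equally
    between the two subtrees (an illustrative choice)."""
    if isinstance(tree, int):                       # leaf node
        return {tree: weight}
    feat, thr, left, right = tree
    if x[feat] is None:                             # missing value: follow both branches
        dist = leaf_distribution(left, x, weight / 2)
        for leaf, w in leaf_distribution(right, x, weight / 2).items():
            dist[leaf] = dist.get(leaf, 0.0) + w
        return dist
    branch = left if x[feat] <= thr else right
    return leaf_distribution(branch, x, weight)

def rf_proximity(forest, a, b):
    """Expected fraction of trees in which a and b land in the same leaf."""
    total = 0.0
    for tree in forest:
        da, db = leaf_distribution(tree, a), leaf_distribution(tree, b)
        total += sum(da[leaf] * db.get(leaf, 0.0) for leaf in da)
    return total / len(forest)

def rf_distance(forest, a, b):
    """Shi-Horvath-style distance derived from the proximity."""
    return math.sqrt(1.0 - rf_proximity(forest, a, b))

# Two decision stumps over features 0 and 1.
forest = [(0, 0.5, 0, 1), (1, 0.3, 0, 1)]
a = [0.2, None]   # feature 1 is missing
b = [0.4, 0.1]
print(rf_distance(forest, a, b))
```

Both points reach the same leaf in the first stump; in the second, the point with the missing value contributes half its mass to each leaf, so the expected proximity is 0.75 and the distance 0.5. Note how no value is ever imputed: the missing entry only changes how the point is routed through each tree.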



• Published in

  ACM Transactions on Knowledge Discovery from Data (Just Accepted)
  ISSN: 1556-4681
  EISSN: 1556-472X

        Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Online AM: 8 April 2024
        • Accepted: 29 March 2024
        • Revised: 25 March 2024
        • Received: 28 November 2023
Published in TKDD (Just Accepted)

Qualifiers

• research-article