Computing Random Forest-distances in the presence of missing data

Authors:
Manuele Bicego, University of Verona, Verona, Italy
Ferdinando Cicalese, University of Verona, Verona, Italy

ACM Transactions on Knowledge Discovery from Data
Accepted: March 2024
https://doi.org/10.1145/3656345
Online AM: 08 April 2024

Abstract

In this paper, we study the problem of computing Random Forest (RF) distances in the presence of missing data. We present a general framework that avoids pre-imputation and uses the information contained in the input points in an agnostic way. We centre our investigation on RatioRF, an RF-based distance recently introduced in the context of clustering and shown to outperform most known RF-based distance measures. We also show that the same framework can be applied to several other state-of-the-art RF-based measures, and we provide their extensions to the missing-data case. We provide extensive empirical evidence of the effectiveness of the proposed framework through experiments with RatioRF on 15 datasets. Finally, we show that our method compares favourably with many alternative distances from the literature that can be computed with missing values.


Index Terms

Computing Random Forest-distances in the presence of missing data
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining
      1. Data cleaning


Published in

ACM Transactions on Knowledge Discovery from Data Just Accepted
ISSN:1556-4681
EISSN:1556-472X

Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Online AM: 8 April 2024
- Accepted: 29 March 2024
- Revised: 25 March 2024
- Received: 28 November 2023
Author Tags
Random Forest distances
missing data
RatioRF measure
Qualifiers
- research-article