ABSTRACT
Outlier detection, also known as anomaly detection, is an active research topic in data mining. Isolation Forest (iForest) and Local Outlier Factor (LOF) are two well-known and widely used outlier detection algorithms. However, iForest is sensitive only to global outliers and handles local outliers poorly, while LOF detects local outliers well but has high time complexity. To overcome the weaknesses of both algorithms, a two-layer progressive ensemble method for outlier detection is proposed that accurately detects outliers in complex datasets with low time complexity. The method first uses the low-complexity iForest to scan the dataset quickly, prune data that are clearly normal, and generate an outlier candidate set. To further improve pruning accuracy, an outlier coefficient is introduced to set the pruning threshold according to the outlier degree of the data. LOF is then applied to the outlier candidate set to identify the outliers more accurately. The proposed ensemble method combines the strengths of the two algorithms and concentrates computing resources on the key stage. Finally, extensive experiments are carried out to verify the ensemble method. The results show that, compared with existing methods, it significantly improves the outlier detection rate and greatly reduces the time complexity.
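The two-layer pipeline described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn's `IsolationForest` and `LocalOutlierFactor`. It is not the authors' implementation. In particular, the abstract does not give the outlier-coefficient formula, so the sketch substitutes a fixed, hypothetical `candidate_fraction` as the pruning threshold. The function name `two_layer_outliers` and all parameter defaults are illustrative assumptions, as is the choice to run LOF on the candidate set alone.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def two_layer_outliers(X, candidate_fraction=0.1, n_outliers=5,
                       n_neighbors=10, random_state=0):
    """Return (candidate_idx, outlier_idx) as row indices into X.

    Assumes X has at least two rows.
    """
    X = np.asarray(X, dtype=float)

    # Layer 1: fast iForest scan. score_samples is lower for more
    # abnormal points, so ascending order puts likely outliers first.
    forest = IsolationForest(n_estimators=100, random_state=random_state).fit(X)
    scores = forest.score_samples(X)

    # Pruning step. ASSUMPTION: a fixed fraction stands in for the
    # paper's outlier-coefficient threshold, which the abstract omits.
    n_candidates = max(n_neighbors + 1, int(np.ceil(candidate_fraction * len(X))))
    n_candidates = min(n_candidates, len(X))
    candidate_idx = np.argsort(scores)[:n_candidates]

    # Layer 2: LOF on the candidate set only (an assumption of this sketch).
    k = min(n_neighbors, n_candidates - 1)
    lof = LocalOutlierFactor(n_neighbors=k).fit(X[candidate_idx])
    lof_scores = -lof.negative_outlier_factor_  # larger means more outlying

    # Report the candidates with the highest LOF values.
    top = np.argsort(lof_scores)[::-1][:n_outliers]
    return candidate_idx, candidate_idx[top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 300 points from one Gaussian cluster plus one planted far point.
    X = np.vstack([rng.normal(size=(300, 2)), [[20.0, 20.0]]])
    candidates, outliers = two_layer_outliers(X)
    print("candidates kept after pruning:", len(candidates))
    print("reported outlier indices:", sorted(outliers.tolist()))
```

The time saving comes from the second layer: LOF needs neighbor distances for every point it scores, so restricting it to the pruned candidate set shrinks the expensive stage. The trade-off is that a true outlier pruned by the first layer cannot be recovered later, which is presumably why the paper invests in a data-driven pruning threshold.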
Index Terms
- Outlier detection using isolation forest and local outlier factor