Abstract
Anomaly detection has numerous applications in diverse fields. For example, it has been widely used for discovering network intrusions and malicious events. It has also been used in numerous other applications, such as identifying medical malpractice or credit fraud. Detection of anomalies in quantitative data has received considerable attention in the literature and has a venerable history. By contrast, and despite the widespread availability and use of categorical data in practice, anomaly detection in categorical data has received relatively little attention. This is because detection of anomalies in categorical data is a challenging problem. Some anomaly detection techniques depend on identifying a representative pattern and then measuring distances between objects and this pattern. Objects that are far from this pattern are declared anomalies. However, identifying patterns and measuring distances are harder in categorical data than in quantitative data. Fortunately, several papers focusing on the detection of anomalies in categorical data have been published in the recent literature. In this article, we provide a comprehensive review of the research on the anomaly detection problem in categorical data. Previous review articles focus on either the statistics literature or the machine learning and computer science literature. This review article combines both literatures. We review 36 methods for the detection of anomalies in categorical data in both literatures and classify them into 12 different categories based on the conceptual definition of anomalies they use. For each approach, we survey anomaly detection methods and then show the similarities and differences among them. We emphasize two important issues: the number of parameters each method requires and its time complexity. The first issue is critical because the performance of these methods is sensitive to the choice of these parameters. Time complexity is also very important in real applications, especially in big data applications. We report the time complexity if it is reported by the authors of the methods; if it is not, we derive it ourselves and report it in this article. In addition, we discuss common problems and future directions for anomaly detection in categorical data.
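As a rough illustration of the pattern-and-distance idea mentioned in the abstract, the sketch below takes the per-attribute mode as the representative pattern and the Hamming distance (the number of mismatching attributes) as the distance. These choices, the function names, the toy records, and the threshold are all assumptions made for illustration; this is not one of the 36 reviewed methods.

```python
from collections import Counter


def mode_pattern(rows):
    """Return the most frequent category of each attribute (assumed 'representative pattern')."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))


def hamming_distance(row, pattern):
    """Count the attributes on which a record disagrees with the pattern."""
    return sum(a != b for a, b in zip(row, pattern))


def flag_anomalies(rows, threshold):
    """Return the records whose distance to the pattern is at least `threshold`."""
    pattern = mode_pattern(rows)
    return [row for row in rows if hamming_distance(row, pattern) >= threshold]


# Hypothetical toy records: (protocol, service, flag).
rows = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("tcp", "smtp", "SF"),
    ("tcp", "http", "SF"),
    ("udp", "dns", "REJ"),
]

pattern = mode_pattern(rows)
anomalies = flag_anomalies(rows, threshold=2)
print(pattern)
print(anomalies)
```

Even this toy version has one user-set parameter (`threshold`), and its output changes with that value, which is the kind of parameter sensitivity the abstract highlights. For n records and m attributes, the sketch makes one pass to build the pattern and one pass to score, so it runs in O(nm) time.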
- Abror Abduvaliyev, Al-Sakib Khan Pathan, Jianying Zhou, Rodrigo Roman, and Wai-Choong Wong. 2013. On the vital areas of intrusion detection systems in wireless sensor networks. IEEE Commun. Surveys Tutor. 15, 3 (2013), 1223--1237.Google ScholarCross Ref
- Hala Abukhalaf, Jianxin Wang, and Shigeng Zhang. 2015. Outlier detection techniques for localization in wireless sensor networks: A survey. Int. J. Future Gen. Commun. Netw. 8, 6 (2015), 99--114.Google Scholar
- Charu C. Aggarwal. 2017. Outlier Analysis, 2nd ed. Springer, Cham. Google ScholarDigital Library
- Charu C. Aggarwal and Philip S. Yu. 2001. Outlier detection for high dimensional data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’01). 37--46. Google ScholarDigital Library
- Charu C. Aggarwal, Yuchen Zhao, and Philip S. Yu. 2011. Outlier detection in graph streams. In Proceedings of the ACM IEEE International Conference on Data Engineering (ICDE’11). 399--409. Google ScholarDigital Library
- Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of International Conference on Very Large Data Bases (VLDB’94). 487--499. Google ScholarDigital Library
- A. Agresti. 2010. Analysis of Ordinal Categorical Data (2nd ed.). John Wiley 8 Sons, New York, NY.Google Scholar
- A. Agresti. 2013. Categorical Data Analysis (3rd ed.). John Wiley 8 Sons, New York, NY.Google Scholar
- Malik Agyemang, Ken Barker, and Rada Alhajj. 2006. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal. 10(6) (2006), 521--538. Google ScholarDigital Library
- Mohiuddin Ahmed, Abdun Naser Mahmood, and Jiankun Hu. 2016. A survey of network anomaly detection techniques. Netw. Comput. Appl. 60 (2016), 19--31. Google ScholarDigital Library
- Mohiuddin Ahmed, Abdun Naser Mahmood, and Md. Rafiqul Islam. 2016. A survey of anomaly detection techniques in financial domain. Future Gen. Comput. Syst. 55 (2016), 278--288. Google ScholarDigital Library
- P. Ajitha and E. Chandra. 2015. A survey on outliers detection in distributed data mining for big data. J. Basic Appl. Sci. Res. 5, 2 (2015), 31--38.Google Scholar
- Leman Akoglu, Mary Mcglohon, and Christos Faloutsos. 2010. OddBall: Spotting anomalies in weighted graphs. In Proceedings of the Pacific Asia Knowledge Discovery and Data Mining (PAKDD’10). 420--431. Google ScholarDigital Library
- Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph-based anomaly detection and description: A survey. Data Min. Knowl. Discov. 29, 3 (2015), 626--688. Google ScholarDigital Library
- Leman Akoglu, Hanghang Tong, Jilles Vreeken, and Christos Faloutsos. 2012. Fast and reliable anomaly detection in categorical data. In Proceedings of the ACM International Conference on Information and Knowledge Management, (CIKM’12). 415--424. Google ScholarDigital Library
- Fabrizio Angiulli, Stefano Basta, and Clara Pizzuti. 2006. Distance-based detection and prediction of outliers. IEEE Trans. Knowl. Data Eng. 18(2) (2006), 145--160. Google ScholarDigital Library
- Fabrizio Angiulli and Fabio Fassetti. 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the European Conference on the Principles of Data Mining and Knowledge Discovery. 19--26. Google ScholarDigital Library
- Yagnik N. Ankur and Ajay Shanker Singh. 2014. Oulier analysis using frequent pattern mining: A review. Int. J. Comput. Sci. Info. Technol. 5, 1 (2014), 47--50.Google Scholar
- N. Archana and S. S. Pawar. 2014. Survey on outlier pattern detection techniques for time-series data. Int. J. Sc. Res. 1, 1 (2014), 1852--1856.Google Scholar
- Tony Bailetti, Mahmoud Gad, and Ahmed Shah. 2016. Intrusion learning: An overview of an emergent discipline. Technol. Innovat. Manage. Rev. 6, 2 (2016), 15--20.Google ScholarCross Ref
- U. A. B. U. A. Bakar, Hemant Ghayvat, S. F. Hasanm, and S. C. Mukhopadhyay. 2016. Activity and anomaly detection in smart home: A survey. In Next Generation Sensors and Systems, Subhas Chandra Mukhopadhyay (Ed.). Springer, New York, NY, Chapter 9, 191--220.Google Scholar
- Zuriana Abu Bakar, Rosmayati Mohemad, Akbar Ahmad, and Mustafa Mat Deris. 2006. A comparative study for outlier detection techniques in data mining. In Proceedings of IEEE International Conference on Cybernetics and Intelligent Systems. 1--6.Google ScholarCross Ref
- V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley 8 Sons, New York, NY.Google Scholar
- S. Bay and M. Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, SIGKDD. 29--38. Google ScholarDigital Library
- Eric J. Beh. 2008. Simple correspondence analysis of nominal-ordinal contingency tables.J. Appl. Math. Decis. Sci. 228 (2008), 1--17.Google ScholarCross Ref
- Alka P. Beldar and Vinod S. Wadne. 2015. The detail survey of anomaly/outlier detection methods in data mining. Int. J. Multidisc. Curr. Res. 3 (2015), 462--472.Google Scholar
- Clauber Gomes Bezerra, Bruno Sielly Jales Costa, Luiz Affonso Guedes, and Plamen Parvanov Angelov. 2015. A comparative study of autonomous learning outlier detection methods applied to fault detection. In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE’15). 1--7.Google ScholarDigital Library
- Kanishka Bhaduri, Bryan L. Matthews, and Chris R. Giannella. 2011. Algorithms for speeding up distance-based outlier detection. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, (SIGKDD’11). 895--867. Google ScholarDigital Library
- Umale Bhagyashree and M. Nilav. 2014. Overview of k-means and expectation maximization algorithm for document clustering. In Proceedings of the International Conference on Quality Up-gradation in Engineering, Science and Technology (ICQUEST’14). 5--8.Google Scholar
- N. Billor, Ali S. Hadi, and P. Velleman. 2000. Blocked adaptive computationally-efficient outlier nominators. Comput. Stat. Data Anal. 34 (2000), 279--298. Google ScholarDigital Library
- Christian Böhm, Katrin Haegler, Nikola S Müller, and Claudia Plant. 2009. CoCo: Coding cost for parameter-free outlier detection. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, (SIGKDD’09). 149--158. Google ScholarDigital Library
- Shyam Boriah, Varun Chandola, and Vipin Kumar. 2008. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the International SIAM Data Mining Conference (SDM’08). 243--254.Google ScholarCross Ref
- Mohamed Bouguessa. 2014. A mixture model-based combination approach for outlier detection. Int. J. Artific. Intell. Tools 23, 4 (2014), 1--21.Google Scholar
- Mohamed Bouguessa. 2015. A practical outlier detection approach for mixed-attribute data. Expert Syst. Appl. 42 (2015), 8637--8649. Google ScholarDigital Library
- M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. 2000. LOF: Identifying density--based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’00). 93--104. Google ScholarDigital Library
- Guilherme O Campos, Arthur Zimek, Jörg Sander, Ricardo JGB Campello, Barbora Micenková, Erich Schubert, Ira Assent, and Michael E Houle. 2016. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Min. Knowl. Discov. 30, 4 (2016), 891--927. Google ScholarDigital Library
- E. Castillo, J. M. Gutiérrez, and A. S. Hadi. 1997. Expert Systems and Probabilistic Network Models. Springer-Verlag, New York, NY. Google ScholarDigital Library
- V. Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Comput. Surveys 41(3) (2009), 1--58. Google ScholarDigital Library
- V. Chandola, A. Banerjee, and V. Kumar. 2012. Anomaly detection for discrete sequences: A survey. Trans. Knowl. Data Eng. 24(5) (2012), 823--839. Google ScholarDigital Library
- V. Chandola, S. Boriah, and V. Kumar. 2008. Understanding Categorical Similarity Measures for Outlier Detection. Technical Report. University of Minnesota, Department of Computer Science and Engineering, 1-46.Google Scholar
- V. Chandola, S. Boriah, and V. Kumar. 2009. A framework for exploring categorical data. In Proceedings of the International SIAM Data Mining Conference (SDM’09). 187--198.Google Scholar
- S. Chatterjee and Ali S. Hadi. 1986. Influential observations, high leverage points, and outliers in regression. Stat. Sci. 1 (1986), 379--416.Google ScholarCross Ref
- S. Chatterjee and Ali S. Hadi. 1988. Sensitivity Analysis in Linear Regression. John Wiley 8 Sons, New York, NY. Google ScholarDigital Library
- Sanjay Chawla and Pei Sun. 2006. SLOM: A new measure for local spatial outliers. Knowl. Info. Syst. 9 (2006), 412--429.Google ScholarDigital Library
- Haibin Cheng, Pang-Ning Tan, Christopher Potter, and Steven A. Klooster. 2009. Detection and characterization of anomalies in multivariate time series. In Proceedings of the SIAM International Conference on Data Mining (SDM’09). 413--424.Google Scholar
- HyungJun Cho and Soo-Heang Eo. 2016. Outlier detection for mass spectrometric data. In Statistical Analysis in Proteomics, Klaus Jung (Ed.). Springer, New York, NY, Chapter 5, 91--102.Google Scholar
- Gregory F. Cooper. 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artific. Intell. 42 (1990), 393--405. Google ScholarDigital Library
- Denis Cousineau and Sylvain Chartier. 2015. Outliers detection and treatment: A review. Int. J. Psychol. Res. 3, 1 (2015), 58--67.Google ScholarCross Ref
- J. Vijay Daniel, S. Joshna, and P. Manjula. 2013. A survey of various intrusion detection techniques in wireless sensor networks. Int. J. Comput. Sci. Mobile Comput. 2, 9 (2013), 235--246.Google Scholar
- K. Das and J. Schneider. 2007. Detecting anomalous records in categorical datasets. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’07). 220--229. Google ScholarDigital Library
- K. Das, J. Schneider, and D. B. Neill. 2008. Anomaly pattern detection in categorical datasets. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’08). 169--176. Google ScholarDigital Library
- Dhwani Dave and Tanvi Varma. 2014. A review of various statistical methods for outlier detection. Int. J. Comput. Sci. Eng. Technol. 5, 2 (2014), 137--140.Google Scholar
- Herv Debar, Marc Dacier, and Andreas Wespi. 1999. Towards a taxonomy of intrusion-detection systems. Comput. Netw. 31, 9 (1999), 805--822. Google ScholarDigital Library
- Alfonso Iodice D’Enza and Michael Greenacre. 2012. Multiple correspondence analysis for the quantification and visualization of large categorical data sets. In Advanced Statistical Methods for the Analysis of Large Data-Sets, Agostino Di Ciaccio, Mauro Coli, and Jose Miguel Angulo Ibañez (Eds.). Springer, 453--463.Google Scholar
- Mr. Mukesh K. Deshmukh and A. S. Kapse. 2016. A survey on outlier detection technique in streaming data using data clustering approach. Int. J. Engineering and Computer Science 5, 1 (2016), 15453--15456.Google Scholar
- Christian Desrosiers and George Karypis. 2011. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. Springer-Verlag New York, NY, 107--144.Google Scholar
- R. Lakshmi Devi and R. Amalraj. 2015. Hubness in unsupervised outlier detection techniques for high dimensional data--A survey. Int. J. Comput. Appl. Technol. Res. 4, 11 (2015), 797--801.Google Scholar
- Jiten Harishbhai Dhimmar and Raksha Chauhan. 2014. A survey on profile-injection attacks in recommender systems using outlier analysis. Int. J. Adv. Res. Comput. Sci. Manage. Studies 2, 12 (2014), 356--359.Google Scholar
- Xuemei Ding, Yuhua Li, Ammar Belatreche, and Liam P. Maguire. 2014. An experimental evaluation of novelty detection methods. Neurocomputing 135 (2014), 313--327. Google ScholarDigital Library
- K. T. Divya and N. S. Kumaran. 2016. Survey on outlier detection techniques using categorical data. Int. Res. J. Eng. Technol. 3 (2016), 899--904.Google Scholar
- Paul Dokas, Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, and Pang-Ning Tan. 2002. Data mining for network intrusion detection. In Proceedings of the NSF Workshop on Next Generation Data Mining. 21--30.Google Scholar
- Jin Du, Qinghua Zheng, Haifei Li, and Wenbin Yuan. 2005. The research of mining association rules between personality and behavior of learner under web-based learning environment. In Proceedings of the the International Conference on Advances in Web-Based Learning (ICWL’05). 15--26. Google ScholarDigital Library
- David Ebdon. 1991. Statistics in Geography: A Practical Approach-Revised with 17 Programs. Wiley-Blackwell, Hoboken, NJ.Google Scholar
- Syed Masum Emran and Nong Ye. 2001. Robustness of Canberra metric in computer intrusion detection. In Proceedings of the IEEE Workshop on Information Assurance and Security. New York, NY, 80--84.Google Scholar
- Hadi Fanaee-T and João Gama. 2016. Tensor-based anomaly detection: An interdisciplinary survey. Knowl-Based Syst. 98 (2016), 130--147. Google ScholarDigital Library
- Elaine R. Faria, Isabel J. C. R. Goncalves, A. C. P. L. F. de Carvalho, and J. Gama. 2015. Novelty detection in data streams. Artific. Intell. Rev. 45, 2 (2015), 235--269. Google ScholarDigital Library
- E. W. Forgy. 1965. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 21 (1965), 768--780.Google Scholar
- A. Frank and A. Asuncion. 2018. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml/datasets.html.Google Scholar
- Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han. 2010. On community outliers and their efficient detection in information networks. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’10). 813--822. Google ScholarDigital Library
- Pedro Garcia-Teodoro, J. Diaz-Verdejo, Gabriel Maciá-Fernández, and Enrique Vázquez. 2009. Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput. Security 28, 1 (2009), 18--28. Google ScholarDigital Library
- Yong Ge, Hui Xiong, Zhi-Hua Zhou, Hasan Ozdemir, Jannite Yu, and K. C. Lee. 2010. TOP-EYE: Top-k evolving trajectory outlier detection. In Proceedings of the ACM Conference on Information and Knowledge Management, (CIKM’10). 1--4. Google ScholarDigital Library
- Dhiren Ghosh and Andrew Vogt. 2012. Outliers: An evaluation of methodologies. In Proceedings of the Joint Statistical Meetings. American Statistical Association, 3455--3460.Google Scholar
- A. Ghoting, M. E. Otey, and S. Parthasarathy. 2004. Loaded: Link-based outlier and anomaly detection in evolving data sets. In Proceedings of the IEEE International Conference on Data Mining (ICDM’04). 387--390. Google ScholarDigital Library
- Amol Ghoting, Srinivasan Parthasarathy, and Matthew Eric Otey. 2008. Fast mining of distance-based outliers in high dimensional datasets. Data Min. Knowl. Discov. J. 16(3) (2008), 349--364. Google ScholarDigital Library
- Prasanta Gogoi, D. K. Bhattacharyya, Bhogeswar Borah, and Jugal K. Kalita. 2011. A survey of outlier detection methods in network anomaly identification. Comput. J. 54, 4 (2011), 570--588. Google ScholarDigital Library
- Gene H. Golub and Charles F. van Loan. 2012. Matrix Computations, 3rd ed. John Hopkins University Press. Google ScholarDigital Library
- Geoffrey Grimmett and David Stirzaker. 2001. Probability and Random Processes, 3rd ed. Oxford University Press, Oxford, UK.Google Scholar
- V. Gunamani and M. Abarna. 2013. A survey on intrusion detection using outlier detection techniques. Int. J. Sci. Eng. Technol. Res. 2, 11 (2013), 2063 --2068.Google Scholar
- Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. 2014. Outlier detection for temporal data. Synth. Lect. Data Min. Knowl. Discov. 5, 1 (2014), 1--129.Google ScholarDigital Library
- Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. 2014. Outlier detection for temporal data: A survey. IEEE Trans. Knowl. Data Eng. 26, 9 (2014), 2250--2267.Google ScholarCross Ref
- Ali S. Hadi. 1992. Identifying multiple outliers in multivariate data. J. Roy. Stat. Soc., Ser. B 54 (1992), 761--771.Google Scholar
- Ali S. Hadi. 1992. A new measure of overall potential influence in linear regression. Comput. Stat. Data Anal. 14 (1992), 1--27. Google ScholarDigital Library
- Ali S. Hadi. 1994. A modification of a method for the detection of outliers in multivariate samples. J. Roy. Stat. Soc., Ser. B 56 (1994), 393--396.Google Scholar
- Ali S. Hadi, A. H. M. Rahmatullah Imon, and Mark Werner. 2009. Detection of outliers. Wiley Interdisc. Rev.: Comput. Stat. 1 (2009), 57--70.Google ScholarDigital Library
- Ali S. Hadi and J. S. Simonoff. 1993. Procedure for the identification of outliers in linear models. J. Amer. Stat. Assoc. 88 (1993), 1264--1272.Google ScholarCross Ref
- Xiaojuan Han, Yong Yan, Cheng Cheng, Yueyan Chen, and Yanglin Zhu. 2014. Monitoring of oxygen content in the flue gas at a coal-fired power plant using cloud modeling techniques. IEEE Trans. Instrument. Measure. 63, 4 (2014), 953--963.Google ScholarCross Ref
- Z. He, X. Xu, and S. Deng. 2005. An optimization model for outlier detection in categorical data. In Proceedings of the International Conference on Advances in Intelligent Computing. 400--409. Google ScholarDigital Library
- Z. He, X. Xu, and S. Deng. 2006. A fast greedy algorithm for outlier mining. In Proceedings of the Pacific Asia Knowledge Discovery and Data Mining (PAKDD’06). Singapore, 567--576. Google ScholarDigital Library
- Z. He, X. Xu, J. Z. Huang, and S. Deng. 2005. FP-outlier: Frequent pattern based outlier detection. Comput. Sci. Info. Syst. 2 (2005), 726--732.Google Scholar
- S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. 2011. Statistical outlier detection using direct density ratio estimation. Knowl. Info. Syst. 26, 2 (2011), 309--336.Google ScholarDigital Library
- V. J Hodge and J. Austin. 2004. A survey of outlier detection methodologies. Artific. Intell. Rev. 22 (2004), 85--126. Google ScholarDigital Library
- Zhexue Huang. 1997. A fast clustering algorithm to cluster very large categorical data sets in data mining. In Proceedings of the International Data Mining and Knowledge Discovery (DMKM’97), Workshop at the ACM International Conference on Mangagement of Data (SIGKDD). 1--8.Google Scholar
- Z. Huang and M. K. Ng. 1999. A fuzzy k-modes algorithm for clustering categoircal data. IEEE Trans. Fuzzy Syst. 7 (1999), 446--452. Google ScholarDigital Library
- Dino Ienco, Ruggero G. Pensa, and Rosa Meo. 2012. From context to distance: Learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data 6, 1 (2012), 1--12. Google ScholarDigital Library
- Dino Ienco, Ruggero G. Pensa, and Rosa Meo. 2017. A semisupervised approach to the detection and characterization of outliers in categorical data. IEEE Trans. Neural Netw. Learn. 28, 5 (2017), 1017--1029.Google ScholarCross Ref
- Francesca Ieva and Anna Maria Paganoni. 2015. Detecting and visualizing outliers in provider profiling via funnel plots and mixed effect models. Health Care Manage. Sci. 18, 2 (2015), 166--172.Google Scholar
- ShengYi Jiang, Xiaoyu Song, Hui Wang, Jian-Jun Han, and Qing-Hua Li. 2006. A clustering-based method for unsupervised intrusion detections. Pattern Recogn. Lett. 27 (2006), 802--810. Google ScholarDigital Library
- Vineet Joshi and Raj Bhatnagar. 2014. CBOF: Cohesiveness-based outlier factor a novel definition of outlier-ness. In Proceedings of the International Workshop on Machine Learning and Data Mining in Pattern Recognition (MLDM’14). 175--189.Google ScholarCross Ref
- Hossein Joudaki, Arash Rashidian, Behrouz Minaei-Bidgoli, Mahmood Mahmoodi, Bijan Geraili, Mahdi Nasiri, and Mohammad Arab. 2015. Using data mining to detect health care fraud and abuse: A review of literature. Global J. Health Sci. 7, 1 (2015), 194--202.Google Scholar
- Leonid Kalinichenko, Ivan Shanin, and Ilia Taraban. 2014. Methods for anomaly detection: A survey. In Proceedings of the All-Russian Conference Digital Libraries: Advanced Methods and Technologies, Digital Collections (RCDL’14). 20--25.Google Scholar
- V. Kathiresan and N. A. Vasanthi. 2015. A survey on outlier detection techniques useful for financial card fraud detection. Int. J. Innovat. Eng. Technol. 6, 1 (2015), 226--235.Google Scholar
- Ravneet Kaur and Sarbjeet Singh. 2015. A survey of data mining and social network analysis based anomaly detection techniques. Egypt. Info. J. 39 (2015), 1--18.Google Scholar
- E. M. Knorr, R. T. Ng, and V. Tucakov. 2000. Distance-based outliers: Algorithms and applications. VLDB J. 8 (2000), 237--253. Google ScholarDigital Library
- Edwin M. Knorr and Raymond T. Ng. 1997. A unified approach for mining outliers. In Proceedings of the International Conference of the Centre for Advanced Studies on Collaborative Research (CASCON’97). 236--248. Google ScholarDigital Library
- A. Koufakou, M. Georgiopoulos, and G. Anagnostopoulos. 2008. Detecting outliers in high-dimensional datasets with mixed attributes. In Proceedings of the International Conference on Data Mining (DMIN’08).Google Scholar
- A. Koufakou, E. Ortiz, M. Georgiopoulos, G. Anagnostopoulos, and K. Reynolds. 2007. A scalable and efficient outlier detection strategy for categorical data. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (ICTAI’07). 210--217. Google ScholarDigital Library
- Anna Koufakou, Jimmy Secretan, and Michael Georgiopoulos. 2011. Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowl. Info. Syst. 29, 3 (2011), 697--725. Google ScholarDigital Library
- Aleksandar Lazarevic, Levent Ertöz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM International Conference on Data Mining (SDM’03). 25--36.Google ScholarCross Ref
- Dajiang Lei, Liping Zhang, and Lisheng Zhang. 2013. Cloud model-based outlier detect algorithm for categorical data. Int. J. Database Theory Appl. 6, 14 (2013), 199--213.Google Scholar
- Deyi Li. 2000. Uncertainty in knowledge representation. Chinese Eng. Sci. 2, 10 (2000), 73--79.Google Scholar
- Jingchao Li and Jian Guo. 2015. A new feature extraction algorithm based on entropy cloud characteristics of communication signals. Math. Problems Eng. 2015 (2015), 1--8.Google Scholar
- Junli Li, Jifu Zhang, Ning Pang, and Xiao Qin. 2018. Weighted outlier detection of high-dimensional categorical data using feature grouping. IEEE Trans. Syst. Man Cybernet.: Syst. (2018), 1--14.Google Scholar
- Shuxin Li, Robert Lee, and Sheau-Dong Lang. 2007. Mining distance-based outliers from categorical data. In Proceedings of the IEEE International Conference on Data Mining Workshops (ICDM’07). 225--230. Google ScholarDigital Library
- J. Y. Liang, K. S. Chin, and C. Y. Dang. 2002. A new method for measuring uncertainty and fuzziness in rough set theory. Int. J. Gen. Syst. 31 (2002), 331--342.Google ScholarCross Ref
- Song Lin and Donald E. Brown. 2006. An outlier-based data association method for linking criminal incidents. Decis. Support Syst. 41 (2006), 604--615. Google ScholarDigital Library
- Wei Liu, Yu Zheng, Sanjay Chawla, Jing Yuan, and Xing Xie. 2011. Discovering spatio-temporal causal interactions in traffic data streams. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’11). 1010--1018. Google ScholarDigital Library
- Xutong Liu, Feng Chen, and Chang-Tien Lu. 2014. On detecting spatial categorical outliers. GeoInformatica 18, 3 (2014), 501--536. Google ScholarDigital Library
- Arunanshu Mahapatro and Pabitra Mohan Khilar. 2013. Fault diagnosis in wireless sensor networks: A survey. IEEE Commun. Surveys Tutor. 15, 4 (2013), 2000--2026.Google ScholarCross Ref
- Kamal Malik, H. Sadawarti, and G. S. Kalra. 2014. Comparative analysis of outlier detection techniques. Int. J. Comput. Appl. 97, 8 (2014), 12--21.Google Scholar
- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK. Google Scholar
- José Marinho, Jorge Granjal, and Edmundo Monteiro. 2015. A survey on security attacks and countermeasures with primary user detection in cognitive radio networks. EURASIP J. Info. Secur. 2015, 1 (2015), 1--14.Google ScholarCross Ref
- Markos Markou and Sameer Singh. 2003. Novelty detection: A review-part 1: Statistical approaches. Signal Process. 83 (2003), 2481--2497. Google ScholarDigital Library
- Markos Markou and Sameer Singh. 2003. Novelty detection: A review-part 2: Neural network based approaches. Signal Process. 83 (2003), 2499--2521. Google ScholarDigital Library
- Manoj Mishra and Nitesh Gupta. 2015. To detect outlier for categorical data streaming. Int. J. Sci. Eng. Res. 6, 5 (2015), 1--5.Google Scholar
- Andrew Moore, Mary Soon Lee, and Brigham Anderson. 1998. Cached sufficient statistics for efficient machine learning with large datasets. J. Artific. Intell. Res. 8 (1998), 67--91. Google ScholarDigital Library
- Andrew Moore and W. K. Wong. 2003. Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In Proceedings of the 20th International Conference on Machine Learning. 552--559. Google ScholarDigital Library
- Kazuyo Narita and Hiroyuki Kitagawa. 2008. Detecting outliers in categorical record databases based on attribute associations. In Progress in WWW Research and Development. Springer, Berlin, 111--123. Google ScholarDigital Library
- K. Noto, C. Brodley, and D. Slonim. 2010. Anomaly detection using an ensemble of feature models. In Proceedings of the IEEE International Conference on Data Mining (ICDM’10). 953--958. Google ScholarDigital Library
- K. Noto, C. Brodley, and D. Slonim. 2012. FRaC: A feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Min. Knowl. Discov. 25, 1 (2012), 109--133. Google ScholarDigital Library
- Colin O’Reilly, Alexander Gluhak, Muhammad Ali Imran, and Sutharshan Rajasegarar. 2014. Anomaly detection in wireless sensor networks in a non-stationary environment. IEEE Commun. Surveys Tutor. 16, 3 (2014), 1413--1432.Google ScholarCross Ref
- M. E. Otey, A. Ghoting, and S. Parthasarathy. 2006. Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov. 12, 2--3 (May 2006), 203--228. Google ScholarDigital Library
- Matthew Eric Otey, Srinivasan Parthasarathy, and Amol Ghoting. 2005. An empirical comparison of outlier detection algorithms. In Proceedings of the International Workshop on Data Mining Methods for Anomaly Detection at ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’05). 1--8.Google Scholar
- Guansong Pang, Longbing Cao, and Ling Chen. 2016. Outlier detection in complex categorical data by modeling the feature value couplings. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. 1902--1908. Google ScholarDigital Library
- Animesh Patcha and Jung-Min Park. 2007. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw. 51(12) (2007), 3448--3470. Google ScholarDigital Library
- M. S. Pawar, D. Amruta, and S. N. Tambe. 2014. A survey on outlier detection techniques for credit card fraud detection. IOSR J. Comput. Eng. 16, 2 (2014), 44--48.Google ScholarCross Ref
- Zdzisław Pawlak. 1982. Rough sets. Int. J. Comput. Info. Sci. 11, 5 (1982), 341--356.Google ScholarCross Ref
- C. Phua, D. Alahakoon, and V. Lee. 2004. Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explor. Newslett. 6, 1 (2004), 50--59. Google ScholarDigital Library
- Clifton Phua, Vincent C. S. Lee, Kate Smith-Miles, and Ross W. Gayler. 2010. A comprehensive survey of data mining-based fraud detection research. Retrieved from http://arxiv.org/abs/1009.6119.Google Scholar
- Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal Process. 99 (2014), 215--249. Google ScholarDigital Library
- Srijoni Saha Pradip, Jesica Fernandes Robert, and Jasmine Faujdar Hamza. 2015. Information-theoretic outlier detection for large-scale categorical data. Int. J. Comput. Sci. Mobile Comput. 4, 4 (2015), 873--881.Google Scholar
- Raghav M. Purankar and Pragati Patil. 2015. A survey paper on an effective analytical approaches for detecting outlier in continuous time variant data stream. Int. J. Eng. Comput. Sci. 4, 11 (2015), 14946--14949.Google Scholar
- Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’00). 427--438. Google ScholarDigital Library
- Stephen Ranshous, Shitian Shen, Danai Koutra, Steve Harenberg, Christos Faloutsos, and Nagiza F. Samatova. 2015. Anomaly detection in dynamic networks: A survey. Wiley Interdisc. Rev.: Comput. Stat. 7, 3 (2015), 223--247.
- Lida Rashidi, Sattar Hashemi, and Ali Hamzeh. 2011. Anomaly detection in categorical datasets using Bayesian networks. In Proceedings of the 3rd International Conference on Artificial Intelligence and Computational Intelligence, Part II (AICI’11). 610--619.
- Murad A. Rassam, M. A. Maarof, and Anazida Zainal. 2012. A survey of intrusion detection schemes in wireless sensor networks. Amer. J. Appl. Sci. 9, 10 (2012), 1636--1652.
- Murad A. Rassam, Anazida Zainal, and Mohd Aizaini Maarof. 2013. Advancements of data anomaly detection research in wireless sensor networks: A survey and open issues. Sensors 13, 8 (2013), 10087--10122.
- D. Lakshmi Sreenivasa Reddy, B. Raveendra Babu, and A. Govardhan. 2013. Outlier analysis of categorical data using navf. Informat. Econom. 17, 1 (2013), 1--5.
- Abdolazim Rezaei, Zarinah M. Kasirun, Vala Ali Rohani, and Touraj Khodadadi. 2013. Anomaly detection in online social networks using structure-based technique. In Proceedings of the International Conference for Internet Technology and Secured Transactions (ICITST’13). 619--622.
- Ritika, Tarun Kumar, and Amandeep Kaur. 2013. Outlier detection in WSN: A survey. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3, 7 (2013), 609--617.
- N. Rokhman, Subanar, and E. Winarko. 2016. Improving the performance of outlier detection methods for categorical data by using weighting function. J. Theor. Appl. Info. Technol. 83 (2016), 327--336.
- Peter J. Rousseeuw and Katrien Van Driessen. 1998. A fast algorithm for the minimum covariance determinant estimator. Technometrics 41 (1998), 212--223.
- Ashwini G. Sagade and Ritesh Thakur. 2014. Excess entropy based outlier detection in categorical data set. Int. J. Adv. Comput. Eng. Netw. 2, 8 (2014), 56--61.
- Aiman Moyaid Said, Dhanapal Durai Dominic, and Brahim Belhaouari Samir. 2013. Outlier detection scoring measurements based on frequent pattern technique. Res. J. Appl. Sci. Eng. Technol. 6, 8 (2013), 1340--134.
- Arif Sari. 2015. A review of anomaly detection systems in cloud networks and survey of cloud security measures in cloud storage applications. J. Info. Secur. 6, 2 (2015), 142--154.
- Debajit Sen Sarma and Samar Sen Sarma. 2015. A survey on different graph based anomaly detection techniques. Indian J. Sci. Technol. 8, 31 (2015), 1--7.
- David Savage, Xiuzhen Zhang, Xinghuo Yu, Pauline Chou, and Qingmai Wang. 2014. Anomaly detection in online social networks. Soc. Netw. 39 (2014), 62--70.
- Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Comput. 13, 7 (2001), 1443--1471.
- Junhee Seok and Yeong Seon Kang. 2015. Mutual information between discrete variables with many categories using recursive adaptive partitioning. Sci. Rep. 5 (2015), 1--10.
- Nauman Shahid, Ijaz Haider Naqvi, and Saad Bin Qaisar. 2015. Characteristics and classification of outlier detection techniques for wireless sensor networks in harsh environments: A survey. Artific. Intell. Rev. 43, 2 (2015), 193--228.
- Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell Tele. Syst. Techn. Publ. 27, 3 (1948), 379--423.
- Deep Shikha Shukla, Avinash Chandra Pandey, and Ankur Kulhari. 2014. Outlier detection: A survey on techniques of WSNs involving event and error based outliers. In Proceedings of the International Conference of Innovative Applications of Computational Intelligence on Power, Energy and Controls with their Impact on Humanity (CIPECH’14). 113--116.
- M. Shyu, K. Sarinnapakorn, I. Kuruppu-Appuhamilage, S. Chen, L. W. Chang, and T. Goldring. 2005. Handling nominal features in anomaly intrusion detection problems. In Proceedings of the International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications. 55--62.
- Karanjit Singh and Shuchita Upadhyaya. 2012. Outlier detection: Applications and techniques. Int. J. Comput. Sci. Iss. 9, 1 (2012), 307--323.
- Koen Smets and Jilles Vreeken. 2011. The odd one out: Identifying and characterising anomalies. In Proceedings of the SIAM International Conference on Data Mining (SDM’11). 804--815.
- Angela A. Sodemann, Matthew P. Ross, and Brett J. Borghetti. 2012. A review of anomaly detection in automated surveillance. IEEE Trans. Syst. Man Cybernet., Part C: Appl. Rev. 42, 6 (2012), 1257--1272.
- Garule Supriya and Sharmila M. Shinde. 2015. Outliers detection using subspace method: A survey. Int. J. Comput. Appl. 112, 16 (2015), 20--22.
- N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2012. An algorithm for mining outliers in categorical data through ranking. In Proceedings of the 12th IEEE International Conference on Hybrid Intelligent Systems (HIS’12). 247--252.
- N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2013. A rough clustering algorithm for mining outliers in categorical data. In Proceedings of the 4th International Conference on Pattern Recognition and Machine Intelligence (PReMI’13). 170--175.
- N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2014. A ranking-based algorithm for detection of outliers in categorical data. Int. J. Hybrid Intell. Syst. 11 (2014), 1--11.
- N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2016. Detecting outliers in categorical data through rough clustering. Nat. Comput. 15 (2016), 385--394.
- Ayman Taha and Ali S. Hadi. 2013. A general approach for automating outliers identification in categorical data. In Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA’13). 1--8.
- Ayman Taha and Ali S. Hadi. 2016. Pair-wise association for categorical and mixed attributes. Info. Sci. 346 (2016), 73--89.
- Ayman Taha and Osman Hegazy. 2010. A proposed outliers identification algorithm for categorical data sets. In Proceedings of the International Conference on Informatics and Systems (INFOS’10). 1--5.
- Yun Wang. 2008. Statistical Techniques for Network Security: Modern Statistically-Based Intrusion Detection and Protection. IGI Global, New York, NY.
- Yibo Wang and Wei Xu. 2018. Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud. Decis. Support Syst. 105 (2018), 87--95.
- Li Wei, Weining Qian, Aoying Zhou, Wen Jin, and Jeffrey X. Yu. 2003. Hypergraph-based outlier test for categorical data. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’03). 399--410.
- David J. Weller-Fahy, Brett J. Borghetti, and Angela A. Sodemann. 2015. A survey of distance and similarity measures used within network intrusion anomaly detection. IEEE Commun. Surveys Tutor. 17, 1 (2015), 70--91.
- Jarrod West and Maumita Bhattacharya. 2016. Intelligent financial fraud detection: A comprehensive review. Comput. Secur. 57 (2016), 47--66.
- Shu Wu and Shengrui Wang. 2011. Parameter-free anomaly detection for categorical data. Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science 6871 (2011), 112--126.
- Shu Wu and Shengrui Wang. 2013. Information-theoretic outlier detection for large-scale categorical data. IEEE Trans. Knowl. Data Eng. 25, 3 (2013), 589--602.
- Warusia Yassin, Nur Izura Udzir, Zaiton Muda, and Nasir Sulaiman. 2013. Anomaly-based intrusion detection through k-means clustering and naives Bayes classification. In Proceedings of the International Conference on Computing and Informatics (ICOCI’13). 298--303.
- Jeffrey Xu Yu, Weining Qian, Hongjun Lu, and Aoying Zhou. 2006. Finding centric local outliers in categorical/numerical spaces. Knowl. Info. Syst. 9 (2006), 309--338.
- Rose Yu, Huida Qiu, Zhen Wen, Ching-Yung Lin, and Yan Liu. 2016. A survey on social media anomaly detection. Retrieved from http://arxiv.org/pdf/1601.01102.
- Ji Zhang. 2013. Advancements of outlier detection: A survey. ICST Trans. Scal. Info. Syst. 13, 1 (2013), 1--26.
- Yang Zhang, Nirvana Meratnia, and Paul Havinga. 2010. Outlier detection techniques for wireless sensor networks: A survey. IEEE Commun. Surveys Tutor. 12, 2 (2010), 159--170.
- Xingwang Zhao, Jiye Liang, and Fuyuan Cao. 2014. A simple and effective outlier detection algorithm for categorical data. Int. J. Mach. Learn. Cybernet. 5 (2014), 469--477.
- Wobbe P. Zijlstra, L. Andries van der Ark, and Klaas Sijtsma. 2011. Outliers in questionnaire data: Can they be detected and should they be removed. J. Edu. Behav. Stat. 36 (2011), 186--212.
- Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5, 5 (2012), 363--387.