Abstract
Detecting outliers or anomalies is a common data analysis task. As a sub-field of unsupervised machine learning, a large variety of approaches exist, but the vast majority treats the input features as independent and often fails to recognize even simple (linear) relationships in the input feature space. Hence, we introduce RECol, a generic data pre-processing approach that generates additional columns (features) in a leave-one-out fashion: for each column, we try to predict its values based on the other columns, generating reconstruction error columns. We run experiments across a large variety of common baseline approaches and benchmark datasets, with and without our RECol pre-processing method. From more than 88k experiments, we conclude that the generated reconstruction error feature space generally supports common outlier detection methods and often considerably improves their ROC-AUC and PR-AUC values. Further, we provide parameter recommendations, such as starting with a simple squared-error-based random forest regression to generate RECols for new practical use cases.
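The leave-one-out idea described above can be sketched in a few lines. The sketch below uses ordinary least squares as the per-column regressor purely to stay dependency-free; the paper recommends a squared-error random forest regression, which can be swapped in for the least-squares step. All function and variable names here are illustrative, not from the paper's codebase.

```python
import numpy as np

def recol(X):
    """Generate reconstruction error columns (RECols) for a numeric
    feature matrix X of shape (n_samples, n_features).

    For each column j, a regressor is fit on all other columns to
    predict column j; the squared prediction error per row becomes a
    new feature. This is a minimal sketch using least squares with an
    intercept as the predictor (an assumption for brevity); the paper
    suggests random forest regression instead.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    recols = np.empty((n, d))
    for j in range(d):
        others = np.delete(X, j, axis=1)          # leave column j out
        A = np.column_stack([others, np.ones(n)])  # add intercept term
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        pred = A @ coef
        recols[:, j] = (X[:, j] - pred) ** 2       # squared reconstruction error
    return recols
```

The resulting `recols` matrix can then be fed (alone or concatenated with `X`) into any standard outlier detector; rows whose columns are poorly explained by the other features receive large reconstruction errors.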
This paper represents the authors’ personal opinions and does not necessarily reflect the views of the Deutsche Bundesbank, the Eurosystem or their staff.
Notes
1. This can only harm us, because we might miss many possible experiments that could outperform the baseline.
2. These choices can only harm us in that we restrict ourselves to fewer options compared to the baseline results.
3. Results and code available at https://github.com/DayanandVH/RECol.
Acknowledgements
This work was supported by the BMWK project EuroDaT (Grant 68GX21010K) and XAINES (Grant 01IW20005).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Herurkar, D., Meier, M., Hees, J. (2023). RECol: Reconstruction Error Columns for Outlier Detection. In: Seipel, D., Steen, A. (eds) KI 2023: Advances in Artificial Intelligence. KI 2023. Lecture Notes in Computer Science(), vol 14236. Springer, Cham. https://doi.org/10.1007/978-3-031-42608-7_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42607-0
Online ISBN: 978-3-031-42608-7
eBook Packages: Computer Science, Computer Science (R0)