A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes

Koufakou, Anna; Georgiopoulos, Michael

doi:10.1007/s10618-009-0148-z

Published: 11 November 2009

Volume 20, pages 259–289, (2010)

Data Mining and Knowledge Discovery

Anna Koufakou¹,² & Michael Georgiopoulos²

Abstract

Outlier detection has attracted substantial attention in many applications and research areas; some of the most prominent applications are network intrusion detection and credit card fraud detection. Many of the existing approaches are based on calculating distances among the points in the dataset. These approaches cannot easily adapt to current datasets, which usually contain a mix of categorical and continuous attributes and may be distributed among different geographical locations. In addition, current datasets usually have a large number of dimensions. These datasets tend to be sparse, and traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose a fast distributed outlier detection strategy intended for datasets containing mixed attributes. The proposed method takes into consideration the sparseness of the dataset, and is experimentally shown to be highly scalable with the number of points and the number of attributes in the dataset. Experimental results show that the proposed outlier detection method compares very favorably with other state-of-the-art outlier detection strategies proposed in the literature and that the speedup achieved by its distributed version is very close to linear.

Author information

Authors and Affiliations

U.A. Whitaker School of Engineering, Florida Gulf Coast University, Fort Myers, FL, 33965, USA
Anna Koufakou
School of EECS, University of Central Florida, Orlando, FL, 32816, USA
Anna Koufakou & Michael Georgiopoulos

Corresponding author

Correspondence to Anna Koufakou.

Additional information

Responsible editor: Sanjay Chawla.

About this article

Cite this article

Koufakou, A., Georgiopoulos, M. A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min Knowl Disc 20, 259–289 (2010). https://doi.org/10.1007/s10618-009-0148-z

Received: 22 May 2008
Accepted: 24 August 2009
Published: 11 November 2009
Issue Date: March 2010
DOI: https://doi.org/10.1007/s10618-009-0148-z
