An effective and efficient algorithm for high-dimensional outlier detection

Aggarwal, Charu C.; Yu, Philip S.

doi:10.1007/s00778-004-0125-5

An effective and efficient algorithm for high-dimensional outlier detection

Published: April 2005

Volume 14, pages 211–221, (2005)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Charu C. Aggarwal¹ &
Philip S. Yu¹

822 Accesses
140 Citations
Explore all metrics

Abstract.

The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are most important for high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms have been proposed for outlier detection that use several concepts of proximity in order to find the outliers based on their relationship to the other points in the data. However, in high-dimensional space, the data are sparse and concepts using the notion of proximity fail to retain their effectiveness. In fact, the sparsity of high-dimensional data can be understood in a different way so as to imply that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, the notion of finding meaningful outliers becomes substantially more complex and nonobvious. In this paper, we discuss new techniques for outlier detection that find the outliers by studying the behavior of projections from the data set.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Aggarwal CC (2001) Re-designing distance functions and distance based applications for high dimensional data. ACM SIGMOD Rec 30(1):13-18
Google Scholar
Aggarwal CC(1999) Fast algorithms for projected clustering. In: Proceedings of ACM SIGMOD, pp 61-72
Aggarwal CC, Yu P (2000) Finding generalized projected clusters in high dimensional spaces. In: Proceedings of ACM SIGMOD, pp 70-81
Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of ICDT, pp 420-434
Aggarwal CC, Orlin JB, Tai RP (1997) Optimized crossover for the independent set problem. Operat Res 45(2):226-234
MathSciNet MATH Google Scholar
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD, pp 94-105
Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of ACM SIGMOD, pp 207-216
Arning A, Agrawal R, Raghavan P (1996) A linear method for deviation detection in large databases. In: Proceedings of KDD, pp 164-169
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbors meaningful? In: Proceedings of ICDT, pp 217-235
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD, pp 93-104
Chakrabarti K, Mehrotra S (2000) Local dimensionality reduction: a new approach to indexing high dimensional spaces. In: Proceedings of the VLDB conference, pp 89-104
Darwin C (1859) The origin of species by natural selection. Available at: http://www.literature.org/authors/darwin-charles/the-origin-of-species/
De Jong KA (1975) Analysis of the behaviour of a class of genetic adaptive systems. PhD dissertation, University of Michigan, Ann Arbor, MI
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp 226-231
Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD, pp 73-84
Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the VLDB conference, pp 506-515
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4589):671-680
MathSciNet Google Scholar
Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large data sets. In: Proceedings of the VLDB conference, pp 392-403
Knorr E, Ng R (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of the VLDB conference, pp 211-222
Parthasarathy, S Aggarwal CC (2003) On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Trans Knowl Data Eng 15(6):1512-1531
Article Google Scholar
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of ACM SIGMOD, pp 427-438
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103-114

Download references

Author information

Authors and Affiliations

IBM T.J. Watson Research Center, 19 Skyline Drive, NY 10532, Hawthorne, USA
Charu C. Aggarwal & Philip S. Yu

Authors

Charu C. Aggarwal
View author publications
You can also search for this author in PubMed Google Scholar
Philip S. Yu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Charu C. Aggarwal.

Additional information

Received: 19 November 2002, Accepted: 6 February 2004, Published online: 19 August 2004

Edited by: R. Ng.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Aggarwal, C.C., Yu, P.S. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB Journal 14, 211–221 (2005). https://doi.org/10.1007/s00778-004-0125-5

Download citation

Issue Date: April 2005
DOI: https://doi.org/10.1007/s00778-004-0125-5

Keywords:

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An effective and efficient algorithm for high-dimensional outlier detection

Abstract.

Access this article

Similar content being viewed by others

Outlier Detection Techniques: A Comparative Study

On normalization and algorithm selection for unsupervised outlier detection

Outlier Detection in High Dimension Using Regularization

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords:

Navigation

An effective and efficient algorithm for high-dimensional outlier detection

Abstract.

Access this article

Similar content being viewed by others

Outlier Detection Techniques: A Comparative Study

On normalization and algorithm selection for unsupervised outlier detection

Outlier Detection in High Dimension Using Regularization

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords:

Search

Navigation