Skip to main content
Log in

An effective and efficient algorithm for high-dimensional outlier detection

  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract.

The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are most important for high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms have been proposed for outlier detection that use several concepts of proximity in order to find the outliers based on their relationship to the other points in the data. However, in high-dimensional space, the data are sparse and concepts using the notion of proximity fail to retain their effectiveness. In fact, the sparsity of high-dimensional data can be understood in a different way so as to imply that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, the notion of finding meaningful outliers becomes substantially more complex and nonobvious. In this paper, we discuss new techniques for outlier detection that find the outliers by studying the behavior of projections from the data set.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Aggarwal CC (2001) Re-designing distance functions and distance based applications for high dimensional data. ACM SIGMOD Rec 30(1):13-18

    Google Scholar 

  2. Aggarwal CC(1999) Fast algorithms for projected clustering. In: Proceedings of ACM SIGMOD, pp 61-72

  3. Aggarwal CC, Yu P (2000) Finding generalized projected clusters in high dimensional spaces. In: Proceedings of ACM SIGMOD, pp 70-81

  4. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of ICDT, pp 420-434

  5. Aggarwal CC, Orlin JB, Tai RP (1997) Optimized crossover for the independent set problem. Operat Res 45(2):226-234

    MathSciNet  MATH  Google Scholar 

  6. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD, pp 94-105

  7. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of ACM SIGMOD, pp 207-216

  8. Arning A, Agrawal R, Raghavan P (1996) A linear method for deviation detection in large databases. In: Proceedings of KDD, pp 164-169

  9. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbors meaningful? In: Proceedings of ICDT, pp 217-235

  10. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD, pp 93-104

  11. Chakrabarti K, Mehrotra S (2000) Local dimensionality reduction: a new approach to indexing high dimensional spaces. In: Proceedings of the VLDB conference, pp 89-104

  12. Darwin C (1859) The origin of species by natural selection. Available at: http://www.literature.org/authors/darwin-charles/the-origin-of-species/

  13. De Jong KA (1975) Analysis of the behaviour of a class of genetic adaptive systems. PhD dissertation, University of Michigan, Ann Arbor, MI

  14. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp 226-231

  15. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA

  16. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD, pp 73-84

  17. Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the VLDB conference, pp 506-515

  18. Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI

  19. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4589):671-680

    MathSciNet  Google Scholar 

  20. Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large data sets. In: Proceedings of the VLDB conference, pp 392-403

  21. Knorr E, Ng R (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of the VLDB conference, pp 211-222

  22. Parthasarathy, S Aggarwal CC (2003) On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Trans Knowl Data Eng 15(6):1512-1531

    Article  Google Scholar 

  23. Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of ACM SIGMOD, pp 427-438

  24. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103-114

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Charu C. Aggarwal.

Additional information

Received: 19 November 2002, Accepted: 6 February 2004, Published online: 19 August 2004

Edited by: R. Ng.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Aggarwal, C.C., Yu, P.S. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB Journal 14, 211–221 (2005). https://doi.org/10.1007/s00778-004-0125-5

Download citation

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-004-0125-5

Keywords:

Navigation