DOI: 10.1145/502512.502539
Article

Efficient discovery of error-tolerant frequent itemsets in high dimensions

Published: 26 August 2001

ABSTRACT

We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.
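The abstract does not state the formal definition of an error-tolerant itemset, so the following is only a minimal sketch under an assumed reading: a transaction "supports" an itemset if it is missing at most a fraction `eps` of the items, and the itemset is error-tolerant frequent if enough transactions support it. The names `tolerant_support`, `is_error_tolerant_frequent`, `eps`, and `min_support` are hypothetical and are not taken from the paper, and this brute-force check is not the efficient algorithm the paper describes.

```python
from typing import Iterable, Set


def tolerant_support(transactions: Iterable[Set[str]], itemset: Set[str], eps: float) -> float:
    """Fraction of transactions missing at most eps * |itemset| of the items.

    Assumed definition for illustration only; eps = 0 reduces to the
    classical (exact) frequent-itemset support.
    """
    items = frozenset(itemset)
    if not items:
        raise ValueError("itemset must be non-empty")
    rows = list(transactions)
    if not rows:
        return 0.0
    allowed_missing = eps * len(items)
    # Count rows whose number of absent items stays within the tolerance.
    hits = sum(1 for row in rows if len(items - row) <= allowed_missing)
    return hits / len(rows)


def is_error_tolerant_frequent(transactions, itemset, eps, min_support):
    """True if the tolerant support reaches the (hypothetical) threshold."""
    return tolerant_support(transactions, itemset, eps) >= min_support


# Toy sparse binary data: each transaction is the set of items present.
transactions = [
    {"bread", "milk", "eggs", "butter"},   # contains all four items
    {"bread", "milk", "eggs"},             # missing one item
    {"bread", "milk"},                     # missing two items
    {"milk", "eggs", "butter", "jam"},     # missing one item
    {"jam"},                               # missing all four items
]
itemset = {"bread", "milk", "eggs", "butter"}

print(tolerant_support(transactions, itemset, eps=0.0))   # 0.2 (exact match only)
print(tolerant_support(transactions, itemset, eps=0.25))  # 0.6 (one missing item allowed)
```

With `eps=0.0` only the first transaction counts, so the four-item set would fail a 50% support threshold. Allowing one missing item raises the support to three of five transactions, which is the kind of structure an exact frequent-itemset definition would overlook in sparse data.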


            Published in

            KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
            August 2001
            493 pages
            ISBN: 158113391X
            DOI: 10.1145/502512

            Copyright © 2001 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            KDD '01 Paper Acceptance Rate: 31 of 237 submissions, 13%
            Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
