Article

Efficient discovery of error-tolerant frequent itemsets in high dimensions

Authors:
Cheng Yang, Stanford University, Stanford, CA
Usama Fayyad, digiMine, Inc., Bellevue, WA
Paul S. Bradley, digiMine, Inc., Bellevue, WA

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, August 2001, Pages 194–203. https://doi.org/10.1145/502512.502539

Published: 26 August 2001

ABSTRACT

We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.
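To make the idea of tolerating errors concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes one plausible reading of "error-tolerant": a transaction supports an itemset if it is missing at most a fraction `eps` of the itemset's items. The function name `tolerant_support`, the parameter `eps`, and this per-row criterion are all illustrative assumptions; the abstract itself does not state the formal definition.

```python
from typing import Hashable, Iterable, Sequence, Set


def tolerant_support(
    transactions: Sequence[Set[Hashable]],
    itemset: Iterable[Hashable],
    eps: float,
) -> float:
    """Fraction of transactions that contain the itemset up to an error budget.

    Assumption (not from the abstract): a transaction counts as a match when
    it is missing at most eps * len(itemset) of the itemset's items.
    With eps = 0 this reduces to ordinary exact-match support.
    """
    items = frozenset(itemset)
    if not items:
        raise ValueError("itemset must be non-empty")
    if not transactions:
        return 0.0
    max_missing = eps * len(items)
    # Count rows whose number of absent items stays within the budget.
    hits = sum(1 for row in transactions if len(items - row) <= max_missing)
    return hits / len(transactions)


# Toy sparse binary data: each set lists the items present in one row.
rows = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]
exact = tolerant_support(rows, {"a", "b", "c"}, eps=0.0)
relaxed = tolerant_support(rows, {"a", "b", "c"}, eps=0.5)
```

On this toy data only the first row contains all three items, so the exact support is 0.25. Allowing up to half of the items to be absent lets the second and third rows match as well, raising the support to 0.75. This is how a relaxed definition can surface large item groups that exact matching misses in sparse data.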

Index Terms

Efficient discovery of error-tolerant frequent itemsets in high dimensions
1. Applied computing
  1. Physical sciences and engineering
    1. Mathematics and statistics
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
    2. Decision support systems
      1. Data analytics

Published in
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
August 2001
493 pages
ISBN: 158113391X
DOI: 10.1145/502512
Conference Chair: Doheon Lee, Chonnam National University, Korea
General Chair: Mario Schkolnick, SGI
Program Chairs: Foster Provost, New York University; Ramakrishnan Srikant, IBM Almaden Research Center
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Error-tolerant frequent itemset
clustering
collaborative filtering
high dimensions
query selectivity estimation
Qualifiers
- Article
Conference

Acceptance Rates
KDD '01 Paper Acceptance Rate: 31 of 237 submissions, 13%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
Article Metrics
- Total Citations: 94
- Total Downloads: 824
- Downloads (Last 12 months): 6
- Downloads (Last 6 weeks): 1