Mining top-K frequent itemsets through progressive sampling

Pietracaprina, Andrea; Riondato, Matteo; Upfal, Eli; Vandin, Fabio

doi:10.1007/s10618-010-0185-7

Published: 23 July 2010

Data Mining and Knowledge Discovery, Volume 21, pages 310–326 (2010)

Abstract

We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets as a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets’ frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
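
The general shape of a progressive sampling loop can be illustrated with a small Python sketch. It is not the algorithm of the article: the preview does not state the stopping conditions or the sample-size bound, so the rule used here (a Hoeffding-style deviation for a single itemset, with no union bound over itemsets) is an assumption chosen only for illustration and carries none of the paper's guarantees. The names `count_itemsets`, `progressive_top_k`, `epsilon`, `delta` and `n0` are likewise hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, w):
    """Count every itemset of cardinality at most w in the given transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    return counts


def progressive_top_k(dataset, k, w, epsilon=0.05, delta=0.1, n0=100, seed=0):
    """Grow a random sample until an illustrative stopping rule fires.

    Returns (top_k, n): the k itemsets with the highest sample frequency,
    paired with those frequencies, and the final sample size n.
    """
    rng = random.Random(seed)
    counts = Counter()  # frequencies of all itemsets encountered so far
    n = 0
    batch = n0
    while True:
        # Draw transactions uniformly at random, with replacement.
        sample = [rng.choice(dataset) for _ in range(batch)]
        counts.update(count_itemsets(sample, w))
        n += batch

        # ASSUMPTION: Hoeffding-style deviation for one itemset's frequency.
        dev = math.sqrt(math.log(2 / delta) / (2 * n))

        ranked = counts.most_common(k + 1)
        freqs = [c / n for _, c in ranked]
        # Gap between the K-th and (K+1)-th sample frequencies, if both exist.
        gap = freqs[k - 1] - freqs[k] if len(freqs) > k else 0.0

        # Stop when estimates are accurate enough, or when the top-K
        # itemsets are clearly separated from the rest.
        if dev <= epsilon or gap > 2 * dev:
            return [(iset, c / n) for iset, c in ranked[:k]], n
        batch = n  # double the total sample size on the next round


if __name__ == "__main__":
    data = [["a", "b"]] * 80 + [["a", "c"]] * 20
    top, size = progressive_top_k(data, k=1, w=2)
    print(top, size)
```

Like the approach described in the abstract, this sketch keeps a count for every itemset it encounters, which is why it is only practical for small w.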

Author information

Authors and Affiliations

Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padova, Italy
Andrea Pietracaprina
Department of Computer Science, Brown University, Providence, RI, USA
Matteo Riondato, Eli Upfal & Fabio Vandin

Corresponding author

Correspondence to Matteo Riondato.

Additional information

Responsible editors: José L Balcázar, Francesco Bonchi, Aristides Gionis, Michèle Sebag.

About this article

Cite this article

Pietracaprina, A., Riondato, M., Upfal, E. et al. Mining top-K frequent itemsets through progressive sampling. Data Min Knowl Disc 21, 310–326 (2010). https://doi.org/10.1007/s10618-010-0185-7

Received: 30 April 2010
Accepted: 20 June 2010
Published: 23 July 2010
Issue Date: September 2010
DOI: https://doi.org/10.1007/s10618-010-0185-7
