
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Authors:
Leonardo Pellegrina, Università di Padova, Padova, Italy (ORCID 0000-0002-6601-5526)
Cyrus Cousins, Brown University, Providence, RI (ORCID 0000-0002-1691-0282)
Fabio Vandin, Università di Padova, Padova, Italy (ORCID 0000-0003-2244-2320)
Matteo Riondato, Amherst College, Amherst, MA (ORCID 0000-0003-2523-4420)

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 6, Article No. 124, pp. 1–29. https://doi.org/10.1145/3532187

Published: 30 July 2022

Abstract

“I’m an MC still as honest” – Eminem, Rap God

We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations; thus, it can be used to find both (1) statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility is a major advantage of MCRapper over previously proposed solutions, each of which could achieve only one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm, TFP-R, for the task of True Frequent Pattern (TFP) mining. TFP-R appropriately computes approximations of the negative and positive borders of the collection of patterns of interest, which allow effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.
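To make the quantity in the abstract concrete, the following is a minimal sketch of a Monte-Carlo estimate of the empirical Rademacher average for a small, explicitly enumerated family of itemset indicator functions. It is a brute-force illustration under stated assumptions, not the MCRapper algorithm: it evaluates every candidate pattern for every trial and performs none of the bound-based pruning the abstract describes. The function name `naive_mcera`, its parameters, and the toy data are hypothetical and do not come from the paper.

```python
import random


def naive_mcera(transactions, patterns, n_trials, seed=0):
    """Brute-force n-trial Monte-Carlo empirical Rademacher average.

    Hypothetical helper for illustration only. Each pattern p defines the
    indicator function f_p(t) = 1 if p is a subset of transaction t, else 0.
    For each trial we draw one random sign per transaction, take the largest
    signed sample mean over all patterns, and average these maxima.
    """
    rng = random.Random(seed)
    m = len(transactions)
    # values[k][i] is f_p(t_i) for the k-th pattern and i-th transaction.
    values = [[1 if p <= t else 0 for t in transactions] for p in patterns]
    total = 0.0
    for _ in range(n_trials):
        # One vector of independent uniform +1/-1 (Rademacher) signs.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the (finite) family of the signed sample mean.
        best = max(sum(s * v for s, v in zip(sigma, row)) for row in values)
        total += best / m
    return total / n_trials


# Toy sample of four transactions and six candidate itemsets (assumed data).
sample = [frozenset("ab"), frozenset("bc"), frozenset("abc"), frozenset("a")]
candidates = [frozenset(c) for c in ("a", "b", "c", "ab", "ac", "bc")]
print(naive_mcera(sample, candidates, n_trials=100, seed=1))
```

Enumerating the whole family is only feasible for tiny examples like this one; the poset structure mentioned in the abstract is what lets a search avoid visiting every pattern.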



Published in
ACM Transactions on Knowledge Discovery from Data Volume 16, Issue 6
December 2022
631 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3543989
Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 July 2022
- Online AM: 25 April 2022
- Accepted: 1 April 2022
- Revised: 1 December 2021
- Received: 1 August 2021
Author Tags
Approximation algorithms
frequent patterns
itemsets
sampling
significant patterns
statistical testing
statistical learning theory
subgroup discovery
Qualifiers
- research-article
- Refereed
