MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Published: 30 July 2022

Abstract

“I’m an MC still as honest” – Eminem, Rap God

We present MCRapper, an algorithm for the efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds on the maximum deviation of sample means from their expectations, so it can be used to find both (1) statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility is a major advantage of MCRapper over previously proposed solutions, each of which could achieve only one of the two. MCRapper uses upper bounds on the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop TFP-R, an algorithm for the task of True Frequent Pattern (TFP) mining. TFP-R computes approximations of the negative and positive borders of the collection of patterns of interest, which allow effective pruning of the pattern space and the computation of strong bounds on the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.
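To make the quantity in the abstract concrete, the following is a minimal brute-force sketch of the n-trial MCERA under its standard definition: draw n vectors of m independent ±1 (Rademacher) signs, and average over the n draws the supremum, over the family, of the sign-weighted sample mean. It is not MCRapper itself: it enumerates every function and does no pruning of the search space. The names `mcera` and `itemset_indicators` and the toy transactions are assumptions made up for illustration.

```python
import random
from itertools import combinations


def itemset_indicators(items):
    """Return one 0/1 indicator function per non-empty itemset over `items`."""
    patterns = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
    ]
    # Bind `p` as a default argument so each lambda keeps its own pattern.
    return [lambda t, p=p: 1.0 if p <= t else 0.0 for p in patterns]


def mcera(sample, functions, n_trials, rng):
    """Brute-force n-trial MCERA: mean over trials of the supremum over
    functions of (1/m) * sum_i sigma_i * f(s_i)."""
    m = len(sample)
    # Evaluate every function on every sample point once.
    values = [[f(s) for s in sample] for f in functions]
    total = 0.0
    for _ in range(n_trials):
        # One vector of m independent Rademacher (+1/-1) signs.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(
            sum(sg * v for sg, v in zip(sigma, row)) / m for row in values
        )
    return total / n_trials


# Toy transactional sample (hypothetical data).
sample = [
    frozenset(t)
    for t in ({"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a"}, {"b", "c"})
]
family = itemset_indicators(["a", "b", "c"])
estimate = mcera(sample, family, n_trials=100, rng=random.Random(0))
print(f"100-trial MCERA over {len(family)} itemsets: {estimate:.4f}")
```

The cost of this sketch grows with the number of functions, which is exponential in the number of items for itemsets. That cost is what motivates exploring the poset with upper bounds and pruning, as the abstract describes, rather than enumerating the family exhaustively.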

Published in

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 6
December 2022
631 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3543989

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 30 July 2022
• Online AM: 25 April 2022
• Accepted: 1 April 2022
• Revised: 1 December 2021
• Received: 1 August 2021

Qualifiers

• research-article
• Refereed