Abstract
“I’m an MC still as honest” – Eminem, Rap God
We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds on the maximum deviation of sample means from their expectations, so it can be used to find both (1) statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility is a major advantage of MCRapper over previously proposed solutions, each of which could achieve only one of the two. MCRapper uses upper bounds on the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm, TFP-R, for the task of True Frequent Pattern (TFP) mining. TFP-R computes approximations of the negative and positive borders of the collection of patterns of interest, which allow effective pruning of the pattern space and the computation of strong bounds on the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.
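To make the quantity in the abstract concrete, here is a minimal brute-force sketch of the n-trial MCERA, assuming the usual definition: for a sample of m points and n vectors of independent uniform ±1 signs, average over the trials the supremum, over the function family, of the sign-weighted sample mean. The family is assumed here to be the indicator functions of itemsets over a toy transactional dataset. The names `itemset_covers`, `mcera`, and `draw_sigmas` are hypothetical, and the exhaustive enumeration is for illustration only; it is not MCRapper's pruned search, which is designed to avoid visiting every pattern.

```python
import itertools
import random


def itemset_covers(transactions):
    """Map every non-empty itemset over the observed items to the indices
    of the transactions that contain it (exhaustive, exponential in #items)."""
    items = sorted(set().union(*transactions))
    covers = {}
    for k in range(1, len(items) + 1):
        for combo in itertools.combinations(items, k):
            pattern = frozenset(combo)
            covers[pattern] = [
                i for i, t in enumerate(transactions) if pattern <= t
            ]
    return covers


def mcera(transactions, sigmas):
    """Brute-force n-trial MCERA for itemset indicator functions.

    `sigmas` is a list of n sign vectors, each of length m = len(transactions),
    with entries in {-1, +1}. For each trial we take the supremum over all
    itemsets of (1/m) * sum_i sigma_i * f(s_i), then average over trials.
    """
    m = len(transactions)
    covers = itemset_covers(transactions)
    sups = [
        max(sum(sigma[i] for i in idx) / m for idx in covers.values())
        for sigma in sigmas
    ]
    return sum(sups) / len(sigmas)


def draw_sigmas(n_trials, m, seed=0):
    """Draw n_trials vectors of m independent uniform Rademacher signs."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(m)] for _ in range(n_trials)]


if __name__ == "__main__":
    data = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
    print(mcera(data, draw_sigmas(n_trials=10, m=len(data))))
```

Passing the sign vectors explicitly keeps the computation deterministic and easy to check by hand. The cost of this sketch grows exponentially with the number of items, which is the bottleneck that bounding and pruning over the poset structure are meant to address.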