Abstract
Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information of the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy for achieving this, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computationally intensive approaches based on swap randomizations, which allow empirical p-values to be computed as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations at a lower computational cost.
More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
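As a rough illustration of the row and column sum use case, the sketch below fits a MaxEnt model to a small binary matrix. It relies on a standard property of maximum entropy distributions under expectation constraints that the abstract does not spell out: with constraints on the expected row and column sums, the model factorizes into independent Bernoulli entries with P(D_ij = 1) = 1 / (1 + exp(-(λ_i + μ_j))). The function names `fit_maxent_margins` and `tile_information`, the plain gradient descent on the dual, and the step size are assumptions made for this sketch and are not claimed to be the algorithm or the interestingness measure used in the paper.

```python
import numpy as np


def fit_maxent_margins(D, tol=1e-9, max_iter=20000):
    """Return the matrix of Bernoulli probabilities whose expected row and
    column sums match the row and column sums of the binary matrix D.

    Illustrative only: plain gradient descent on the convex dual problem.
    Assumes the margins admit a finite solution (for instance, no row or
    column that is entirely 0 or entirely 1).
    """
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    row_sums, col_sums = D.sum(axis=1), D.sum(axis=0)
    lam = np.zeros(m)  # one dual variable per row-sum constraint
    mu = np.zeros(n)   # one dual variable per column-sum constraint
    # The dual gradient is Lipschitz with constant at most max(m, n) / 2,
    # so this step size is a conservative choice.
    step = 2.0 / max(m, n)
    for _ in range(max_iter):
        P = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        g_row = P.sum(axis=1) - row_sums  # mismatch in expected row sums
        g_col = P.sum(axis=0) - col_sums  # mismatch in expected column sums
        if max(np.abs(g_row).max(), np.abs(g_col).max()) < tol:
            break
        lam -= step * g_row
        mu -= step * g_col
    # Recompute the probabilities from the final dual variables.
    return 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))


def tile_information(P, rows, cols):
    """Negative log-probability (in nats) that every entry of the tile
    rows x cols equals 1 under the independent Bernoulli model P."""
    return float(-np.log(P[np.ix_(rows, cols)]).sum())


if __name__ == "__main__":
    D = np.array([[1, 1, 0, 0, 1],
                  [0, 1, 1, 0, 0],
                  [1, 0, 0, 1, 0],
                  [0, 1, 0, 1, 1]])
    P = fit_maxent_margins(D)
    print(np.round(P, 3))
    print(tile_information(P, rows=[0, 3], cols=[1, 4]))
```

Because the fitted model is an explicit distribution, quantities such as the probability of a tile can be evaluated in closed form from `P`, whereas a swap randomization approach would estimate them by sampling many randomized matrices.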
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on Very large databases (VLDB94), pp 487–499
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439): 509–512
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery in databases (KDD99), pp 254–260
Calders T (2008) Itemset frequency satisfiability: complexity and axiomatization. Theor Comput Sci 394(1-2): 84–111
Chung F, Lu L (2004) The average distance in a random graph with given expected degrees. Internet Math 1(1): 91–113
Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, Hoboken
De Bie T (2009a) Explicit probabilistic models for databases and networks. Tech. Rep. 123931, arXiv:0906.5148v1, University of Bristol
De Bie T (2009b) Finding interesting itemsets using a probabilistic model for binary databases. Tech. Rep. 123930, University of Bristol
De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 2007 SIAM international conference on Data mining (SDM07), pp 237–248
Gallo A, De Bie T, Cristianini N (2007) MINI: Mining informative non-redundant itemsets. In: Proceedings of the 11th European conference on Principles and practice of knowledge discovery in databases (PKDD07), pp 438–445
Gallo A, Mammone A, De Bie T, Turchi M, Cristianini N (2009) From frequent itemsets to informative patterns. Tech. Rep. 123936, University of Bristol
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of the 7th international conference on Discovery science (DS04), pp 278–289
Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3): 9
Gentle JE (2005) Elements of computational statistics. Springer, New York
Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0-1 data. In: Principles and practice of knowledge discovery in databases (PKDD04), pp 173–184
Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3): 14
Gull S, Skilling J (1984) Maximum entropy method in image processing. Communications, radar and signal processing. IEE Proc F 131(6): 646–659
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD09), pp 379–388
Jaroszewicz S, Simovici DA (2004) Interestingness of frequent itemsets using bayesian networks as background knowledge. In: Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD04), pp 178–186
Jaynes E (1957) Information theory and statistical mechanics I. Phys Rev 106(4): 620–630
Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9): 939–952
Khuller S, Moss A, Naor J (1999) The budgeted maximum coverage problem. Inf Process Lett 70(1): 39–45
Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding informative noisy tiles in binary databases. In: Proceedings of the 2010 SIAM international conference on Data mining (SDM10), pp 153–164
Lehmann E, Romano J (2005) Testing statistical hypotheses, 3rd edn. Springer, New York
Mannila H (2008) Randomization techniques for data mining methods. In: Proceedings of the 12th East European conference on Advances in databases and information systems (ADBIS08), p 1
Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362
Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594): 824–827
Minoux M (1978) Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques. Springer, Berlin, pp 234–243
Newman M (2003) The structure and function of complex networks. SIAM Rev 45(2): 167–256
Ojala M, Vuokko N, Kallio A, Haiminen N, Mannila H (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM international conference on Data mining (SDM08), pp 494–505
Padmanabhan B, Tuzhilin A (1998) A belief-driven method for discovering unexpected patterns. In: Proceedings of the 4th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD98), pp 94–100
Padmanabhan B, Tuzhilin A (2000) Small is beautiful: discovering the minimal set of unexpected patterns. In: Proceedings of the 6th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD00), pp 54–63
Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15: 1409–1421
Rasch G (1961) On general laws and the meaning of measurement in psychology. In: Proceedings of the fourth Berkeley symposium on Mathematical statistics and probability, vol IV, pp 321–333
Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Netw 29(2): 173–191
Savinov A (2004) Mining dependence rules by finding largest itemset support quota. In: Proceedings of the 2004 ACM symposium on Applied computing, pp 525–529
Shewchuk J (1994) An introduction to the conjugate gradient method without the agonizing pain. Tech. rep, CMU
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 2006 SIAM international conference on Data mining (SDM06), pp 393–404
Silberschatz A, Tuzhilin A (1995) On subjective measures of interestingness in knowledge discovery. In: Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), pp 275–281
Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1): 57–77
Topsøe F (1979) Information-theoretical optimization techniques. Kybernetika 15(1): 8–27
Tribus M (1961) Thermostatics and thermodynamics: an introduction to energy, information and states of matter, with engineering applications. Van Nostrand, Princeton
Wainwright M, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1-2): 1–305
Zaki M, Hsiao C (2002) CHARM: an efficient algorithm for closed itemsets mining. In: Proceedings of the 2002 SIAM international conference on Data mining (SDM02), pp 457–473
Additional information
Responsible editor: Johannes Fürnkranz.
About this article
Cite this article
De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Disc 23, 407–446 (2011). https://doi.org/10.1007/s10618-010-0209-3
DOI: https://doi.org/10.1007/s10618-010-0209-3