Maximum entropy models and subjective interestingness: an application to tiles in binary databases

Data Mining and Knowledge Discovery

Abstract

Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information held by the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy for achieving this, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computation-intensive approaches based on swap randomizations, allowing for the computation of empirical p-values as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations but computationally more efficiently. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
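To make the row-and-column-sum setting concrete, the following is a minimal sketch, not the paper's implementation. It assumes the constraints are on the expected row and column sums of a binary database; under that assumption the standard Lagrangian argument yields a product of independent Bernoulli variables with P(D[i,j] = 1) = 1 / (1 + exp(-(lam[i] + mu[j]))). The function names `fit_maxent` and `tile_information`, the plain gradient scheme, its step size and iteration count, and the toy matrix are all hypothetical choices made for illustration. The tile score shown (negative log-probability that every cell of the tile equals 1) is one possible analytically computable quantity, not necessarily the measure defined in the paper.

```python
import numpy as np


def fit_maxent(row_sums, col_sums, n_iter=2000, step=2.0):
    """Fit P[i, j] = sigmoid(lam[i] + mu[j]) so that the expected row and
    column sums match the given ones (illustrative gradient scheme)."""
    r = np.asarray(row_sums, dtype=float)
    c = np.asarray(col_sums, dtype=float)
    m, n = r.size, c.size
    lam = np.zeros(m)  # one Lagrange multiplier per row (assumed name)
    mu = np.zeros(n)   # one Lagrange multiplier per column (assumed name)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        # Gradient of the convex dual: observed sums minus expected sums.
        lam += step * (r - p.sum(axis=1)) / n
        mu += step * (c - p.sum(axis=0)) / m
    return 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))


def tile_information(p, rows, cols):
    """Bits needed to encode 'all cells in rows x cols are 1' under the
    independent-Bernoulli model p (illustrative score)."""
    block = p[np.ix_(rows, cols)]
    return float(-np.log2(block).sum())


if __name__ == "__main__":
    # Toy binary database; no row or column is all zeros or all ones,
    # so the multipliers stay finite.
    D = np.array([[1, 1, 0, 0, 1],
                  [1, 0, 0, 1, 0],
                  [1, 1, 1, 0, 0],
                  [0, 1, 0, 0, 1]])
    P = fit_maxent(D.sum(axis=1), D.sum(axis=0))
    print(np.round(P, 3))
    print(tile_information(P, rows=[0, 2], cols=[0, 1]))
```

Because the fitted model has only one parameter per row and one per column, it can be stored and queried directly, whereas a swap-randomization approach would need many sampled databases to estimate the same tile probability empirically.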

References

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on Very large databases (VLDB94), pp 487–499

  • Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html

  • Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439): 509–512

  • Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge

  • Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery in databases (KDD99), pp 254–260

  • Calders T (2008) Itemset frequency satisfiability: complexity and axiomatization. Theor Comput Sci 394(1-2): 84–111

  • Chung F, Lu L (2004) The average distance in a random graph with given expected degrees. Internet Math 1(1): 91–113

  • Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, Hoboken

  • De Bie T (2009a) Explicit probabilistic models for databases and networks. Tech. Rep. 123931, arXiv:0906.5148v1, University of Bristol

  • De Bie T (2009b) Finding interesting itemsets using a probabilistic model for binary databases. Tech. Rep. 123930, University of Bristol

  • De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 2007 SIAM international conference on Data mining (SDM07), pp 237–248

  • Gallo A, De Bie T, Cristianini N (2007) MINI: Mining informative non-redundant itemsets. In: Proceedings of the 11th European conference on Principles and practice of knowledge discovery in databases (PKDD07), pp 438–445

  • Gallo A, Mammone A, De Bie T, Turchi M, Cristianini N (2009) From frequent itemsets to informative patterns. Tech. Rep. 123936, University of Bristol

  • Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of the 7th international conference on Discovery science (DS04), pp 278–289

  • Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3): 9

  • Gentle JE (2005) Elements of computational statistics. Springer, New York

  • Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0-1 data. In: Principles and practice of knowledge discovery in databases (PKDD04), pp 173–184

  • Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3): 14

  • Gull S, Skilling J (1984) Maximum entropy method in image processing. Communications, radar and signal processing. IEE Proc F 131(6): 646–659

  • Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD09), pp 379–388

  • Jaroszewicz S, Simovici DA (2004) Interestingness of frequent itemsets using Bayesian networks as background knowledge. In: Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD04), pp 178–186

  • Jaynes E (1957) Information theory and statistical mechanics I. Phys Rev 106(4): 620–630

  • Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9): 939–952

  • Khuller S, Moss A, Naor J (1999) The budgeted maximum coverage problem. Inf Process Lett 70(1): 39–45

  • Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding informative noisy tiles in binary databases. In: Proceedings of the 2010 SIAM international conference on Data mining (SDM10), pp 153–164

  • Lehmann E, Romano J (1995) Testing statistical hypotheses, 3rd edn. Springer, New York

  • Mannila H (2008) Randomization techniques for data mining methods. In: Proceedings of the 12th East European conference on Advances in databases and information systems (ADBIS08), p 1

  • Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362

  • Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594): 824–827

  • Minoux M (1978) Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques. Springer, Berlin, pp 234–243

  • Newman M (2003) The structure and function of complex networks. SIAM Rev 45(2): 167–256

  • Ojala M, Vuokko N, Kallio A, Haiminen N, Mannila H (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM international conference on Data mining (SDM08), pp 494–505

  • Padmanabhan B, Tuzhilin A (1998) A belief-driven method for discovering unexpected patterns. In: Proceedings of the 4th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD98), pp 94–100

  • Padmanabhan B, Tuzhilin A (2000) Small is beautiful: discovering the minimal set of unexpected patterns. In: Proceedings of the 6th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD00), pp 54–63

  • Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15: 1409–1421

  • Rasch G (1961) On general laws and the meaning of measurement in psychology. In: Proceedings of the fourth Berkeley symposium on Mathematical statistics and probability, vol IV, pp 321–333

  • Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Netw 29(2): 173–191

  • Savinov A (2004) Mining dependence rules by finding largest itemset support quota. In: Proceedings of the 2004 ACM symposium on Applied computing, pp 525–529

  • Shewchuk J (1994) An introduction to the conjugate gradient method without the agonizing pain. Tech. Rep., CMU

  • Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 2006 SIAM international conference on Data mining (SDM06), pp 393–404

  • Silberschatz A, Tuzhilin A (1995) On subjective measures of interestingness in knowledge discovery. In: Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), pp 275–281

  • Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1): 57–77

  • Topsøe F (1979) Information-theoretical optimization techniques. Kybernetika 15(1): 8–27

  • Tribus M (1961) Thermostatics and thermodynamics: an introduction to energy, information and states of matter, with engineering applications. Van Nostrand, Princeton

  • Wainwright M, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1-2): 1–305

  • Zaki M, Hsiao C (2002) CHARM: an efficient algorithm for closed itemsets mining. In: Proceedings of the 2002 SIAM international conference on Data mining (SDM02), pp 457–473

Author information

Corresponding author

Correspondence to Tijl De Bie.

Additional information

Responsible editor: Johannes Fürnkranz.

About this article

Cite this article

De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Disc 23, 407–446 (2011). https://doi.org/10.1007/s10618-010-0209-3

  • DOI: https://doi.org/10.1007/s10618-010-0209-3
