Maximum entropy models and subjective interestingness: an application to tiles in binary databases

De Bie, Tijl

doi:10.1007/s10618-010-0209-3

Published: 11 December 2010

Volume 23, pages 407–446, (2011)

Data Mining and Knowledge Discovery

Tijl De Bie¹

Abstract

Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information of the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy for how this could be achieved, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computationally intensive approaches based on swap randomizations, allowing for the computation of empirical p-values as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations but at lower computational cost. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
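
As a rough illustration of the row-and-column-sum setting described in the abstract, the Python sketch below assumes the sums are constrained in expectation. Under that assumption the MaxEnt distribution over binary matrices has a standard form: independent Bernoulli entries with P(D[i, j] = 1) = sigmoid(lam[i] + mu[j]). The function names (`fit_maxent`, `tile_self_information`), the plain gradient-descent fit, and the use of the negative log-probability as a tile score are all assumptions made for illustration. The abstract does not say which optimisation method or which interestingness measures the paper uses.

```python
import numpy as np


def fit_maxent(row_sums, col_sums, n_iter=2000, lr=1.0):
    """Fit p[i, j] = sigmoid(lam[i] + mu[j]) so that the expected row and
    column sums match the given ones (hypothetical helper, not from the paper).

    Uses plain gradient descent on the convex dual of the MaxEnt problem.
    Assumes sum(row_sums) == sum(col_sums) and that no row or column is
    entirely 0 or entirely 1 (otherwise the parameters diverge).
    """
    row_sums = np.asarray(row_sums, dtype=float)
    col_sums = np.asarray(col_sums, dtype=float)
    m, n = len(row_sums), len(col_sums)
    lam = np.zeros(m)  # one parameter per row
    mu = np.zeros(n)   # one parameter per column
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        # Dual gradient: expected sum under the model minus the target sum.
        lam += lr * (row_sums - p.sum(axis=1)) / n
        mu += lr * (col_sums - p.sum(axis=0)) / m
    return 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))


def tile_self_information(p, rows, cols):
    """-log P(all entries in rows x cols equal 1) under the fitted model.

    Entries are independent, so the tile probability is a product of the
    individual p[i, j]. Treating this as an interestingness score is an
    illustrative assumption.
    """
    return float(-np.log(p[np.ix_(rows, cols)]).sum())


# Toy binary database with 3 rows and 3 columns.
D = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
p = fit_maxent(D.sum(axis=1), D.sum(axis=0))
score = tile_self_information(p, rows=[0, 1], cols=[0])
```

Because the fitted model is an explicit distribution, a score like this needs no sampling, in contrast to empirical p-values estimated from randomized copies of the database.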

References

Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on Very large databases (VLDB94), pp 487–499
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439): 509–512
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery in databases (KDD99), pp 254–260
Calders T (2008) Itemset frequency satisfiability: complexity and axiomatization. Theor Comput Sci 394(1-2): 84–111
Chung F, Lu L (2004) The average distance in a random graph with given expected degrees. Internet Math 1(1): 91–113
Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, Hoboken
De Bie T (2009a) Explicit probabilistic models for databases and networks. Tech. Rep. 123931, arXiv:0906.5148v1, University of Bristol
De Bie T (2009b) Finding interesting itemsets using a probabilistic model for binary databases. Tech. Rep. 123930, University of Bristol
De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 2007 SIAM international conference on Data mining (SDM07), pp 237–248
Gallo A, De Bie T, Cristianini N (2007) MINI: Mining informative non-redundant itemsets. In: Proceedings of the 11th European conference on Principles and practice of knowledge discovery in databases (PKDD07), pp 438–445
Gallo A, Mammone A, De Bie T, Turchi M, Cristianini N (2009) From frequent itemsets to informative patterns. Tech. Rep. 123936, University of Bristol
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of the 7th international conference on Discovery science (DS04), pp 278–289
Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3): 9
Gentle JE (2005) Elements of computational statistics. Springer, New York
Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0-1 data. In: Principles and practice of knowledge discovery in databases (PKDD04), pp 173–184
Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3): 14
Gull S, Skilling J (1984) Maximum entropy method in image processing. Communications, radar and signal processing. IEE Proc F 131(6): 646–659
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD09), pp 379–388
Jaroszewicz S, Simovici DA (2004) Interestingness of frequent itemsets using Bayesian networks as background knowledge. In: Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD04), pp 178–186
Jaynes E (1957) Information theory and statistical mechanics I. Phys Rev 106(4): 620–630
Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9): 939–952
Khuller S, Moss A, Naor J (1999) The budgeted maximum coverage problem. Inf Process Lett 70(1): 39–45
Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding informative noisy tiles in binary databases. In: Proceedings of the 2010 SIAM international conference on Data mining (SDM10), pp 153–164
Lehmann E, Romano J (1995) Testing statistical hypotheses, 3rd edn. Springer, New York
Mannila H (2008) Randomization techniques for data mining methods. In: Proceedings of the 12th East European conference on Advances in databases and information systems (ADBIS08), p 1
Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362
Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594): 824–827
Minoux M (1978) Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques. Springer, Berlin, pp 234–243
Newman M (2003) The structure and function of complex networks. SIAM Rev 45(2): 167–256
Ojala M, Vuokko N, Kallio A, Haiminen N, Mannila H (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM international conference on Data mining (SDM08), pp 494–505
Padmanabhan B, Tuzhilin A (1998) A belief-driven method for discovering unexpected patterns. In: Proceedings of the 4th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD98), pp 94–100
Padmanabhan B, Tuzhilin A (2000) Small is beautiful: discovering the minimal set of unexpected patterns. In: Proceedings of the 6th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD00), pp 54–63
Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15: 1409–1421
Rasch G (1961) On general laws and the meaning of measurement in psychology. In: Proceedings of the fourth Berkeley symposium on Mathematical statistics and probability, vol IV, pp 321–333
Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Netw 29(2): 173–191
Savinov A (2004) Mining dependence rules by finding largest itemset support quota. In: Proceedings of the 2004 ACM symposium on Applied computing, pp 525–529
Shewchuk J (1994) An introduction to the conjugate gradient method without the agonizing pain. Tech. Rep., CMU
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 2006 SIAM international conference on Data mining (SDM06), pp 393–404
Silberschatz A, Tuzhilin A (1995) On subjective measures of interestingness in knowledge discovery. In: Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), pp 275–281
Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1): 57–77
Topsøe F (1979) Information-theoretical optimization techniques. Kybernetika 15(1): 8–27
Tribus M (1961) Thermostatics and thermodynamics: an introduction to energy, information and states of matter, with engineering applications. Van Nostrand, Princeton
Wainwright M, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1(1-2): 1–305
Zaki M, Hsiao C (2002) CHARM: an efficient algorithm for closed itemsets mining. In: Proceedings of the 2002 SIAM international conference on Data mining (SDM02), pp 457–473

Author information

Authors and Affiliations

Intelligent Systems Laboratory, University of Bristol, Bristol, UK
Tijl De Bie

Corresponding author

Correspondence to Tijl De Bie.

Additional information

Responsible editor: Johannes Fürnkranz.

About this article

Cite this article

De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Disc 23, 407–446 (2011). https://doi.org/10.1007/s10618-010-0209-3

Received: 15 March 2010
Accepted: 19 November 2010
Published: 11 December 2010
Issue Date: November 2011
DOI: https://doi.org/10.1007/s10618-010-0209-3
