Anytime approximation in probabilistic databases

  • Regular Paper
  • The VLDB Journal

Abstract

This article describes an approximation algorithm for computing the probability of propositional formulas over discrete random variables. It incrementally refines lower and upper bounds on the probability of the formulas until the desired absolute or relative error guarantee is reached. This algorithm is used by the SPROUT query engine to approximate the probabilities of results to relational algebra queries on expressive probabilistic databases.



Acknowledgments

We would like to thank the anonymous reviewers and Peter Haas for their insightful comments that helped improve this article. We also thank Christoph Koch and Swaroop Rath for their collaboration on earlier work on which this article is partially based. Jiewen Huang’s work was done while at Oxford.

Author information


Correspondence to Dan Olteanu.

Additional information

This research was funded by European Research Consortium grant agreement FOX number FP7-ICT-233599.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 152 KB)

Appendix A: Proofs


This section contains proofs of formal statements in previous sections. Due to space constraints, further proofs can be found in the electronic supplementary material.

A.1 Proof of Theorem 1

Let us first show the direction MLB \(\Rightarrow \) GLB. Let \(\varPhi \) be an irreducible positive DNF formula and \(\varPhi _L\) an MLB for \(\varPhi \). Assume \(\varPhi _L\) is no GLB for \(\varPhi \). Then, there exists an iDNF formula \(\varPhi ^{\prime }_L\) such that \({\fancyscript{M}}(\varPhi _L) \subset {\fancyscript{M}}(\varPhi ^{\prime }_L) \subseteq {\fancyscript{M}}(\varPhi )\) and the following properties hold:

  (i) \(\varPhi _L \subseteq \varPhi \) (every clause in \(\varPhi _L\) is also a clause in \(\varPhi \))

  (ii) Every clause \(\varphi \) in \(\varPhi \) and not in \(\varPhi _L\) contains a conflicting variable, i.e., a variable \(x \in \varphi \) such that there exists a clause \(\varphi _L \in \varPhi _L\) with \(x \in \varphi _L\) (\(\varPhi _L\) is an MLB for \(\varPhi \))

  (iii) \(\forall \varphi ^{\prime }_L \in \varPhi ^{\prime }_L: \exists \varphi \in \varPhi : \varphi ^{\prime }_L \supseteq \varphi \) (\(\varPhi ^{\prime }_L \models \varPhi \) and Lemma 1)

  (iv) \(\forall \varphi _L \in \varPhi _L: \exists \varphi ^{\prime }_L \in \varPhi ^{\prime }_L: \varphi _L \supseteq \varphi ^{\prime }_L\) (\(\varPhi _L \models \varPhi ^{\prime }_L\) and Lemma 1)

  (v) No two clauses \(\varphi , {\bar{\varphi }} \in \varPhi \) satisfy \(\varphi \models {\bar{\varphi }}\) (\(\varPhi \) is irreducible)

Properties (i) and (ii) follow from \(\varPhi _L\) being an MLB for \(\varPhi \), (iii)–(iv) from \({\fancyscript{M}}(\varPhi _L) \subset {\fancyscript{M}}(\varPhi ^{\prime }_L) \subseteq {\fancyscript{M}}(\varPhi )\) together with Lemma 1, and (v) from the irreducibility of \(\varPhi \). We derive a contradiction by case distinction:

Case 1 \(\top \in \varPhi \). Then, since \(\top \) has no conflicting variables with any other clause, it follows that \(\top \in \varPhi _L\) and thus we have \({\fancyscript{M}}(\varPhi _L) = {\fancyscript{M}}(\varPhi ^{\prime }_L) = {\fancyscript{M}}(\varPhi )\) which is a contradiction to the assumption that \(\varPhi _L\) is no GLB for \(\varPhi \).

Case 2 \(\top \not \in \varPhi \). Let \(\varphi ^{\prime }_L \in \varPhi ^{\prime }_L\) be any clause in \(\varPhi ^{\prime }_L\) and \(\varphi \in \varPhi \) a clause such that \(\varphi ^{\prime }_L \supseteq \varphi \) according to (iii). With respect to property (i), we have the following two cases:

Case 2(a) \(\varphi \in \varPhi _L\). Let \({\bar{\varphi ^{\prime }}}_L \in \varPhi ^{\prime }_L\) be a clause with \(\varphi \supseteq {\bar{\varphi ^{\prime }}}_L\) according to (iv). Together with \(\varphi ^{\prime }_L \supseteq \varphi \) from above, we have \(\varphi ^{\prime }_L \supseteq \varphi \supseteq {\bar{\varphi ^{\prime }}}_L\), and since none of \(\varphi , \varphi ^{\prime }_L, {\bar{\varphi ^{\prime }}}_L\) can be the empty clause (see Case 1), \(\varphi ^{\prime }_L\) and \({\bar{\varphi ^{\prime }}}_L\) must share at least one variable. Since \(\varPhi ^{\prime }_L \in \) iDNF, it follows that \(\varphi ^{\prime }_L\) and \({\bar{\varphi ^{\prime }}}_L\) are the same clause. Thus, \(\varphi \supseteq {\bar{\varphi ^{\prime }}}_L = \varphi ^{\prime }_L \supseteq \varphi \), i.e., \(\varphi = \varphi ^{\prime }_L\). It follows that \(\varPhi _L = \varPhi ^{\prime }_L\), which contradicts the assumption \({\fancyscript{M}}(\varPhi _L) \subset {\fancyscript{M}}(\varPhi ^{\prime }_L)\).

Case 2(b) \(\varphi \not \in \varPhi _L\). According to (ii), there is a variable \(x \in \varphi \) such that there exists a clause \(\varphi _L \in \varPhi _L\) with \(x \in \varphi _L\); furthermore, \(\varphi _L \in \varPhi \) due to (i). From \(\varphi ^{\prime }_L \supseteq \varphi \), it follows that \(x \in \varphi ^{\prime }_L\). From (iv), it follows that there exists a clause \({\bar{\varphi ^{\prime }}}_L \in \varPhi ^{\prime }_L\) such that \(\varphi _L \supseteq {\bar{\varphi ^{\prime }}}_L\). We distinguish two cases:

Case 2(b) i \(x \in {\bar{\varphi ^{\prime }}}_L\). It follows \({\bar{\varphi ^{\prime }}}_L = \varphi ^{\prime }_L\), because \(\varPhi ^{\prime }_L \in \) iDNF. From \(\varphi _L \supseteq {\bar{\varphi ^{\prime }}}_L = \varphi ^{\prime }_L \supseteq \varphi \), it follows that \(\varphi _L \supseteq \varphi \). Since \(\varphi _L, \varphi \in \varPhi \), this is a contradiction to the assumption that \(\varPhi \) is irreducible, property (v).

Case 2(b) ii \(x \not \in {\bar{\varphi ^{\prime }}}_L\). Then, according to (iii), there is a \({\bar{\varphi }} \in \varPhi \) with \({\bar{\varphi ^{\prime }}}_L \supseteq {\bar{\varphi }}\) and thus \(x \not \in {\bar{\varphi }}\), which in turn implies \({\bar{\varphi }} \ne \varphi \) because \(x \in \varphi \). Transitivity of \(\supseteq \) implies \(\varphi _L \supseteq {\bar{\varphi }}\), which contradicts the assumption that \(\varPhi \) is irreducible, property (v).

Secondly, we prove the direction GLB \(\Rightarrow \) MLB. Assume that a GLB \(\varPhi _L\) for \(\varPhi \) is no MLB for \(\varPhi \). Then, at least one of the two MLB properties in Definition 2 is unsatisfied. We show that in either case, \(\varPhi _L\) is no GLB for \(\varPhi \).

Case 1 Assume \(\varPhi _L \not \subseteq \varPhi \). Then, \(\varPhi _L\) contains a clause \(\varphi _L\) that is not in \(\varPhi \).

Case 1(a) If there is a clause \(\varphi \in \varPhi \) such that \(\varphi _L \supseteq \varphi \), then \(\varphi _L \supset \varphi \). Let \(x\) be a variable that occurs in \(\varphi _L\) but not in \(\varphi \). Then, the iDNF formula obtained from \(\varPhi _L\) by removing the variable \(x\) from \(\varphi _L\) has strictly more models than \(\varPhi _L\), and thus, \(\varPhi _L\) is no GLB for \(\varPhi \).

Case 1(b) If there is no such clause, then \(\varPhi _L \not \models \varPhi \) following Lemma 1 and thus \(\varPhi _L\) is no lower bound and in particular no greatest lower bound for \(\varPhi \).

Case 2 If there is a clause \(\varphi \in \varPhi \) such that \(\varPhi _L \cup \{\varphi \} \in \) iDNF, then \({\fancyscript{M}}(\varPhi _L) \subset {\fancyscript{M}}(\varPhi _L \cup \{\varphi \})\) because none of the variables in \(\varphi \) occur in \(\varPhi _L\) and thus \(\varPhi _L\) is no GLB for \(\varPhi \).
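The two syntactic MLB properties used in this proof (\(\varPhi _L \subseteq \varPhi \), and every remaining clause of \(\varPhi \) shares a variable with some clause of \(\varPhi _L\)) suggest a simple greedy construction. The following Python sketch is an illustration under stated assumptions, not the article's algorithm: clauses are modelled as frozensets of variable names, variables are assumed to be independent, and the names `greedy_mlb`, `idnf_probability`, and `brute_force_probability` as well as the example probabilities are hypothetical.

```python
from itertools import product
from math import prod

def greedy_mlb(phi):
    """Greedily pick a maximal set of pairwise variable-disjoint clauses of phi."""
    chosen, used = [], set()
    for clause in phi:
        if used.isdisjoint(clause):
            chosen.append(clause)
            used |= clause
    return chosen

def idnf_probability(clauses, p):
    """Probability of a positive DNF whose clauses share no variables
    (assumes independent variables with marginals p)."""
    return 1.0 - prod(1.0 - prod(p[x] for x in c) for c in clauses)

def brute_force_probability(phi, p):
    """Exact probability by enumerating all assignments (small inputs only)."""
    variables = sorted(set().union(*phi))
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        if any(all(world[x] for x in c) for c in phi):
            total += prod(p[x] if world[x] else 1.0 - p[x] for x in variables)
    return total

# Hypothetical example: Phi = xy v yz v zw with made-up marginals.
phi = [frozenset("xy"), frozenset("yz"), frozenset("zw")]
p = {"x": 0.5, "y": 0.4, "z": 0.3, "w": 0.6}
phi_l = greedy_mlb(phi)
lower = idnf_probability(phi_l, p)
exact = brute_force_probability(phi, p)
```

On this input the greedy pass keeps the clauses xy and zw and skips yz, which conflicts with both. Since the selected clauses form a sub-formula of \(\varPhi \), the value `lower` cannot exceed `exact`.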

A.2 Proof of Theorem 2

We use a graph representation of the witness relationships between clauses, cf. Fig. 13. The clauses are the nodes of the graph, and there is a directed edge between two clauses \(\varphi \) and \(\psi \) whenever \(\varphi \models \psi \), i.e., \(\varphi \) is a witness of \(\psi \).

Fig. 13 Witness graphs used in the proof of Theorem 2
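For positive clauses viewed as sets of variables, \(\varphi \models \psi \) holds exactly when \(\varphi \supseteq \psi \), so the witness graph is easy to compute. The sketch below is a minimal illustration, assuming clauses are frozensets of variable names and reading "critical witness" the way the proof uses it: a clause of \(\varPhi \) that implies exactly one clause of \(\varPhi _U\). The function names and the example formulas are hypothetical.

```python
def implies(phi, psi):
    """For positive clauses (variable sets): phi |= psi iff phi is a superset of psi."""
    return phi >= psi

def witness_edges(phi, phi_u):
    """Directed edges (w, c) of the witness graph: w in phi is a witness of c in phi_u."""
    return [(w, c) for w in phi for c in phi_u if implies(w, c)]

def critical_witnesses(phi, phi_u):
    """Map each clause of phi_u to the witnesses that imply it and no other clause of phi_u."""
    result = {c: [] for c in phi_u}
    for w in phi:
        targets = [c for c in phi_u if implies(w, c)]
        if len(targets) == 1:
            result[targets[0]].append(w)
    return result

# Hypothetical example: Phi = xy v yz v zw and the iDNF upper bound Phi_U = y v w.
phi = [frozenset("xy"), frozenset("yz"), frozenset("zw")]
phi_u = [frozenset("y"), frozenset("w")]
edges = witness_edges(phi, phi_u)
critical = critical_witnesses(phi, phi_u)
```

In this example every clause of `phi` has an outgoing edge, so `phi_u` is an upper bound, and every clause of `phi_u` has at least one critical witness.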

MUB \(\Rightarrow \) LUB. Assume \(\varPhi _U\) is an MUB for \(\varPhi \) but not an LUB for \(\varPhi \). Then, there exists a better iDNF upper bound \(\varPhi ^{\prime }_U\) for \(\varPhi \) that satisfies \({\fancyscript{M}}(\varPhi ) \subseteq {\fancyscript{M}}(\varPhi ^{\prime }_U) \subset {\fancyscript{M}}(\varPhi _U)\). Using Lemma 1, the second inclusion unfolds to:

$$\begin{aligned}&{\fancyscript{M}}(\varPhi ^{\prime }_U) \subset {\fancyscript{M}}(\varPhi _U) \\&\text{ iff }\; {\fancyscript{M}}(\varPhi ^{\prime }_U) \subseteq {\fancyscript{M}}(\varPhi _U) \text{ and not } {\fancyscript{M}}(\varPhi _U) \subseteq {\fancyscript{M}}(\varPhi ^{\prime }_U) \\&\text{ iff }\; \forall \varphi ^{\prime }_U \in \varPhi ^{\prime }_U: \exists \varphi _U \in \varPhi _U: \varphi ^{\prime }_U \supseteq \varphi _U \\&\text{ and } \; \exists \varphi _U \in \varPhi _U: \forall \varphi ^{\prime }_U \in \varPhi ^{\prime }_U: \lnot (\varphi _U \supseteq \varphi ^{\prime }_U). \end{aligned}$$

Collecting the assumptions and unfolding the definitions, we obtain:

  (i) \(\forall \varphi \in \varPhi : \exists \varphi ^{\prime }_U \in \varPhi ^{\prime }_U: \varphi \supseteq \varphi ^{\prime }_U\) (\(\varPhi \models \varPhi ^{\prime }_U\))

  (ii) \(\forall \varphi \in \varPhi : \exists \varphi _U \in \varPhi _U: \varphi \supseteq \varphi _U\) (\(\varPhi \models \varPhi _U\))

  (iii) \(\forall \varphi ^{\prime }_U \in \varPhi ^{\prime }_U: \exists \varphi _U \in \varPhi _U: \varphi ^{\prime }_U \supseteq \varphi _U\) (\(\varPhi ^{\prime }_U \models \varPhi _U\))

  (iv) \(\exists \varphi _U \in \varPhi _U: \forall \varphi ^{\prime }_U \in \varPhi ^{\prime }_U: \lnot (\varphi _U \supseteq \varphi ^{\prime }_U)\) (\(\varPhi _U \not \models \varPhi _U^{\prime }\))

  (v) There is no clause \(\varphi _U \in \varPhi _U\) that can be extended by a variable from \(\text{ vars }(\varPhi )\) such that the resulting formula is still in iDNF and implied by \(\varPhi \)

  (vi) Every \(\varphi \in \varPhi \) is a witness for at least one clause \(\varphi _U \in \varPhi _U\), i.e., \(\varPhi \models \varPhi _U\) according to Lemma 1

  (vii) Every clause \(\varphi _U \in \varPhi _U\) has a critical witness in \(\varPhi \).

Sentences (i)–(iv) are due to the assumption that \(\varPhi _U\) is an upper bound but not an LUB for \(\varPhi \), and sentences (v)–(vii) are the syntactic characterization of the MUB property of \(\varPhi _U\).

Let \(\varphi _U \in \varPhi _U\) be as in (iv). Let \(\varphi ^{\prime }_U \in \varPhi ^{\prime }_U\) be a clause in \(\varPhi ^{\prime }_U\) that shares a variable with \(\varphi _U\). If no such clause exists, then by transitivity of \(\supseteq \) and \(\varPhi _U, \varPhi ^{\prime }_U \in \) iDNF, the clause \(\varphi _U\) cannot have a witness in \(\varPhi \), which contradicts (vii).

Then, since \(\varPhi _U, \varPhi ^{\prime }_U \in \) iDNF, \(\varphi _U\) can be the only clause in \(\varPhi _U\) that satisfies sentence (iii) and together with \(\lnot (\varphi _U \supseteq \varphi ^{\prime }_U)\) from (iv) we can conclude \(\varphi ^{\prime }_U \supset \varphi _U\). Let \(x\) be a variable that occurs in \(\varphi ^{\prime }_U\), but not in \(\varphi _U\). We show a contradiction to the above sentences (i)–(vii) by case differentiation:

Case 1 \(x \not \in \varPhi _U\). According to (vii), \(\varphi _U\) has a critical witness \(w \in \varPhi \). Consider Fig. 13a.

Case 1(a) \(w \models \varphi ^{\prime }_U\). Then, \(\varphi _U\) can be extended by \(x\), and the resulting formula is still in iDNF and implied by \(\varPhi \). This is a contradiction to (v).

Case 1(b) \(w \not \models \varphi ^{\prime }_U\). Then, according to (i), there is a \({\bar{\varphi ^{\prime }}}_U \in \varPhi ^{\prime }_U\) such that \(w \models {\bar{\varphi ^{\prime }}}_U\). Since \(\varPhi ^{\prime }_U \in \) iDNF, \(\varphi ^{\prime }_U\) and \({\bar{\varphi ^{\prime }}}_U\) share no variables; due to (iii), there must be a \({\bar{\varphi }}_U \in \varPhi _U\) which is implied by \({\bar{\varphi ^{\prime }}}_U\) and is different from \(\varphi _U\) due to the iDNF property of \(\varPhi _U\). From the transitivity of the implication relation, it follows that \(w \models {\bar{\varphi }}_U\), which contradicts the assumption that \(w\) is a critical witness for \(\varphi _U\).

Case 2 \(x \in \varPhi _U\). Let \({\bar{\varphi }}_U \in \varPhi _U\) be such that \(x \in {\bar{\varphi }}_U\), and let \(w \in \varPhi \) be a critical witness for \({\bar{\varphi }}_U\). Consider Fig. 13b.

Case 2(a) \(w \models \varphi ^{\prime }_U\). As above, transitivity of \(\models \) implies \(w \models \varphi _U\) which is a contradiction to the assumption that \(w\) is a critical witness for \({\bar{\varphi }}_U\).

Case 2(b) \(w \not \models \varphi ^{\prime }_U\). Then, according to (i), there is a \({\bar{\varphi ^{\prime }}}_U \in \varPhi ^{\prime }_U\) such that \(w \models {\bar{\varphi ^{\prime }}}_U\). Since \(x \in \varphi ^{\prime }_U\) and \(\varPhi ^{\prime }_U \in \) iDNF, \({\bar{\varphi ^{\prime }}}_U\) does not contain the variable \(x\); due to (iii), there must be a \(\hat{\varphi }_U \in \varPhi _U\) which is implied by \({\bar{\varphi ^{\prime }}}_U\), thus does not contain the variable \(x\), and is hence different from \({\bar{\varphi }}_U\). Transitivity of \(\models \) implies \(w \models \hat{\varphi }_U\), which contradicts the assumption that \(w\) is a critical witness for \({\bar{\varphi }}_U\).

This completes the proof for the direction MUB \(\Rightarrow \) LUB.

LUB \(\Rightarrow \) MUB. Assume \(\varPhi _U\) is no MUB for \(\varPhi \). We need to show that whenever any of the three conditions in Definition 5 does not hold, then \(\varPhi _U\) is no LUB for \(\varPhi \).

Case 1 If there is a clause in \(\varPhi \) which is not a witness of clauses in \(\varPhi _U\), then \(\varPhi \not \models \varPhi _U\) and thus \(\varPhi _U\) is no upper bound for \(\varPhi \) and in particular no least upper bound.

Case 2 Let \(x\) be a variable such that a clause \(\varphi _U \in \varPhi _U\) can be extended by \(x\) and the resulting formula is in iDNF and implied by \(\varPhi \). It is clear that \(x\) does not occur in \(\varPhi _U\), because otherwise it would not be possible to extend the formula without violating the iDNF property. Then, the assignment with \(x \leftarrow \text{ false }\) and all other variables \(\text{ true }\) is a model of \(\varPhi _U\) but no model of the extended formula \(\varPhi _U^{ext}\). Thus, \({\fancyscript{M}}(\varPhi _U^{ext})\subset {\fancyscript{M}}(\varPhi _U)\) and \(\varPhi _U\) is no LUB for \(\varPhi \).

Case 3 Let \(\varphi _U \in \varPhi _U\) be a clause without a critical witness in \(\varPhi \). Then, removing \(\varphi _U\) from \(\varPhi _U\) creates a formula \({\bar{\varPhi }}_U\) with the following properties:

  (i) Since \(\varPhi _U\) is an iDNF formula, removing \(\varphi _U\) creates a formula with strictly fewer models than \(\varPhi _U\).

  (ii) Since all witnesses of \(\varphi _U\) are non-critical, they imply other clauses in \(\varPhi _U\) as well. Hence \({\bar{\varPhi }}_U\) is still implied by \(\varPhi \).

Thus, \({\bar{\varPhi }}_U\) is a better upper bound for \(\varPhi \) than \(\varPhi _U\), and \(\varPhi _U\) is no LUB.

A.3 Proof of Theorem 3

The proof of Theorem 3 comprises three parts: (i) Every formula returned by Algorithm 2 is an MUB, (ii) no two formulas returned by the algorithm are equivalent, and (iii) the delay between returning consecutive MUBs is polynomial in the size of the input formula. Properties (i) and (ii) are shown in Lemmata 4 and 5. Regarding (iii), let us consider the tree corresponding to the execution trace of the algorithm, where each iteration of the for-loop generates a new branch and each recursive call generates a child node; the leaves of the tree represent the MUBs discovered. The algorithm explores this tree in depth-first order. On each branch, the number of recursions and hence the height of the tree is bounded by the number \(n\) of clauses of the input formula. The formulas \(\varPhi _R\), \(\{\varphi {\setminus }\text{ vars }(\psi ) \;|\; \varphi \in \varPhi _R \wedge \varphi \ne \psi \}\), and \(\varPhi _U \cup \{\psi \}\) can be constructed in time at most quadratic in \(n\) (and linear in the max-arity of \(\varPhi \)). Hence, the time required to traverse the tree from the root to a leaf is bounded by \({\fancyscript{O}}(n^3)\) and constitutes an upper bound for the delay between consecutive MUBs.

Lemma 4

Each formula returned by Algorithm 2 is an iDNF minimal upper bound for the input formula \(\varPhi \).

Proof

By induction. Base case: If PolyMub is called with a (possibly empty) clause \(\varPhi \), then it returns \(\varPhi \) as its only MUB.

For the induction step, let \(\psi \) be a clause and \(\varPhi \) a DNF formula such that \(\psi \vee \varPhi \) is irreducible, and let \(\varPhi ^{r(\psi )}\) be the formula obtained from \(\varPhi \) by removing all occurrences of variables of \(\psi \), i.e., \(\varPhi ^{r(\psi )}= \{\varphi {\setminus } \text{ vars }(\psi ) \;|\; \varphi \in \varPhi \}\); we prove the following property: If \(\varPhi ^{r(\psi )}_U\) is an MUB for \(\varPhi ^{r(\psi )}\), then \(\psi \vee \varPhi ^{r(\psi )}_U\) is an MUB for \(\psi \vee \varPhi \). This composition property is exactly the induction step showing that PolyMub constructs MUBs.

Let \(\psi \), \(\varPhi \), \(\varPhi ^{r(\psi )}\) be as above, and let \(\varPhi ^{r(\psi )}_U\) be an MUB for \(\varPhi ^{r(\psi )}\). Then, \(\varPhi ^{r(\psi )}_U\) is an iDNF formula and satisfies the upper bound, maximality, and criticality conditions from Definition 5 with respect to \(\varPhi ^{r(\psi )}\). It remains to be shown that \(\psi \vee \varPhi ^{r(\psi )}_U\) is an iDNF formula and satisfies those three conditions with respect to \(\psi \vee \varPhi \). First, for the iDNF property: Since \(\varPhi ^{r(\psi )}\) does not contain variables from \(\psi \), and since \(\varPhi ^{r(\psi )}_U\) is an MUB for \(\varPhi ^{r(\psi )}\), it follows that \(\psi \) and \(\varPhi ^{r(\psi )}_U\) do not share variables, and thus \(\psi \vee \varPhi ^{r(\psi )}_U\) is an iDNF formula.

Upper bound. We need to show that every clause in \(\psi \vee \varPhi \) implies a clause in \(U = \psi \vee \varPhi ^{r(\psi )}_U\) (cf. Lemma 1). For \(\psi \), this is evident. By transitivity of \(\models \), and the fact that \(\varPhi ^{r(\psi )}\models \varPhi ^{r(\psi )}_U\) by induction hypothesis, it suffices to show \(\varPhi \models \varPhi ^{r(\psi )}\). Let \(\varphi \) be a clause in \(\varPhi \); then \(\varphi {\setminus } \text{ vars }(\psi )\) is a clause in \(\varPhi ^{r(\psi )}\) and is implied by \(\varphi \) (cf. Proposition 2).

Maximality. We show that every clause in \(U = \psi \vee \varPhi ^{r(\psi )}_U\) is maximal. By induction hypothesis, \(\varPhi ^{r(\psi )}_U\) is maximal with respect to \(\varPhi ^{r(\psi )}\) and variables \(\text{ vars }(\varPhi ^{r(\psi )})\); moreover, \(\varPhi ^{r(\psi )}_U\) is also maximal with respect to \(\varPhi \), since it cannot be extended by any of the remaining variables \(\text{ vars }(\psi )\), since \(\psi \) is a clause in \(U\) and this extension would violate the iDNF property of \(U\). \(\psi \) cannot be extended by a variable from \(\text{ vars }(\varPhi )\), as it would violate the upper bound property: Since \(\varPhi ^{r(\psi )}_U\) does not contain any variable from \(\text{ vars }(\psi )\), \(\psi \) only implies clause \(\psi \) in \(U\). Any extension to \(\psi \) would invalidate this implication.

Criticality. It needs to be shown that every clause in \(U = \psi \vee \varPhi ^{r(\psi )}_U\) has a critical witness in \(\psi \vee \varPhi \). Since \(\psi \vee \varPhi \) is irreducible, \(\psi \) is the only witness for \(\psi \in U\). Now let \(\varphi \) be a clause in \(\varPhi ^{r(\psi )}_U\); \(\varphi \) is not implied by \(\psi \) since they do not share any variables. We still need to show that \(\varphi \) has a critical witness in \(\varPhi \). By induction hypothesis, \(\varphi \) has a critical witness \(w \in \varPhi ^{r(\psi )}\); let \(w^{e(\psi )} \in \varPhi \) be the clause \(w\) extended by variables from \(\text{ vars }(\psi )\). \(w^{e(\psi )}\) is a critical witness for \(\varphi \), since: (i) \(w^{e(\psi )}\) cannot imply a different clause \(\varphi ^{\prime } \in \varPhi ^{r(\psi )}_U\), since \(\varphi ^{\prime }\) does not contain variables from \(\text{ vars }(\psi )\) and \(w\) does not imply \(\varphi ^{\prime }\); (ii) \(w^{e(\psi )}\) cannot imply \(\psi \), since this would require \(\psi \subset w^{e(\psi )}\) and hence \(\psi \vee \varPhi \) would not be irreducible which is a contradiction to our initial assumptions. \(\square \)
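The composition property of Lemma 4 can be exercised on small inputs. The following Python sketch is not Algorithm 2 (which enumerates all MUBs and is not reproduced in this appendix); it only follows one branch: pick a clause \(\psi \), remove its variables from the remaining clauses, and recurse. Clauses are modelled as frozensets of variable names. The re-reduction step `reduce_dnf`, the deterministic choice of \(\psi \), and all function names are assumptions of this sketch. The brute-force check only confirms the iDNF and upper-bound properties on the example; it does not test maximality or criticality.

```python
from itertools import product

def reduce_dnf(clauses):
    """Drop clauses that are proper supersets of another clause (they are redundant)."""
    clauses = set(clauses)
    return {c for c in clauses if not any(d < c for d in clauses)}

def one_upper_bound(phi):
    """Follow a single branch of the composition step: psi v (bound of phi without vars(psi))."""
    phi = reduce_dnf(phi)               # assumption: keep the formula irreducible
    if len(phi) <= 1:
        return set(phi)                 # base case: a single (or no) clause
    psi = min(phi, key=sorted)          # arbitrary but deterministic choice of psi
    rest = {c - psi for c in phi if c != psi}
    return {psi} | one_upper_bound(rest)

def entails(phi, upper):
    """Brute-force check that every model of phi is a model of upper (small inputs only)."""
    variables = sorted(set().union(*phi, *upper))
    for bits in product([False, True], repeat=len(variables)):
        true_vars = {v for v, b in zip(variables, bits) if b}
        if any(c <= true_vars for c in phi) and not any(c <= true_vars for c in upper):
            return False
    return True

# Hypothetical example: Phi = xy v yz v zw.
phi = {frozenset("xy"), frozenset("yz"), frozenset("zw")}
upper = one_upper_bound(phi)
is_upper_bound = entails(phi, upper)
is_idnf = all(a == b or a.isdisjoint(b) for a in upper for b in upper)
```

With this choice of \(\psi \), the sketch first selects the clause zw; removing z from yz leaves y, which subsumes xy, so the returned formula is zw ∨ y.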

Lemma 5

Let \(\varPhi \) and \(\varPsi \) be two formulas returned by Algorithm 2. Then \(\varPhi \not \equiv \varPsi \).

Proof

\(\varPhi \) and \(\varPsi \) are irreducible since they are iDNF formulas by Lemma 4. By construction, \(\varPhi \) and \(\varPsi \) contain two distinct clauses \(\varphi \) and \(\psi \) that share a variable \(x\). Since \(\varPhi \) and \(\varPsi \) are iDNF formulas, we thus have \(\varphi \in \varPhi \), \(\varphi \not \in \varPsi \), \(\psi \in \varPsi \), and \(\psi \not \in \varPhi \); it then follows from Corollary 1 that \(\varPhi \not \equiv \varPsi \). \(\square \)

A.4 Proof of Lemma 3

Let the point interval of each open leaf be [\(x,x\)], where \(x\) is a distinct variable. The upper and lower bounds of \(T\) can then be expressed as functions \(f_U\) and \(f_L\), respectively, of these variables. We show that for each such variable \(x\), \(\frac{\partial (f_U - f_L)}{\partial x} \le 0\), and hence \(f_U-f_L\) is maximized when \(x\) is minimized, that is, when \(x=L\), where \(L\) is the lower bound of that open leaf.

We denote by \(f^n_U\) and \(f^n_L\) the upper and lower bound functions in variable \(x\) for a node at depth \(n\). These functions are linear: \(f^n_U = a^n_U\cdot x + b^n_U\) and \(f^n_L = a^n_L\cdot x + b^n_L\).

Base case: Open leaf with variable \(x\), depth \(n\), and \(f^n_U = f^n_L = x\). Then, \(\frac{\partial (f^n_U - f^n_L)}{\partial x} = 1 - 1 = 0\).

Assume now that the property holds at a node \(c\) at depth \(j+1\), where \(c\) is either the open leaf with \(x\) or one of its ancestors. We show that the property also holds at the parent of \(c\) (at depth \(j\)).

Case 1 The parent of \(c\) is a \(\oplus \) node: \(\oplus (c_1,\ldots ,c_k)\), where \(c\) is one of \(c_1, \ldots ,c_k\). Then,

$$\begin{aligned} f^j_U&= f^{j+1}_U + \alpha _U = a^{j+1}_U\cdot x + b^{j+1}_U+\alpha _U\\ f^j_L&= f^{j+1}_L + \alpha _L = a^{j+1}_L\cdot x + b^{j+1}_L+\alpha _L \end{aligned}$$

where \(\alpha _U\) and \(\alpha _L\) represent the sum of the upper bounds, and lower bounds, respectively, of all the siblings of \(c\). We then immediately have that \(\frac{\partial (f^j_U - f^j_L)}{\partial x} = a^{j+1}_U - a^{j+1}_L \le 0\).

Case 2 The parent of \(c\) is a \(\odot \) node: \(\odot (c_1,\ldots , c_k)\), where \(c\) is one of \(c_1, \ldots ,c_k\). Recall that we only consider restricted \(\odot \) nodes, where at most one child is not a clause and can have different values for lower and upper bounds. If this child is \(c\), let \(q\) be the product of the (exact) probabilities of all other children. Then, \(a^j_U = a^{j+1}_U\cdot q\) and \(a^j_L = a^{j+1}_L\cdot q\) and thus the inequality \(a^j_U-a^j_L \le 0\) is preserved.

Case 3 The parent of \(c\) is a \(\otimes \) node: \(\otimes (c_1,\ldots , c_k)\), where \(c\) is one of \(c_1, \ldots ,c_k\). Let

$$\begin{aligned} \alpha _L&= \prod _{i=1,\, c_i\ne c}^{k} (1-L(c_i)),&\alpha _U&= \prod _{i=1,\, c_i\ne c}^{k} (1-U(c_i)) \end{aligned}$$

where \(L(c_i)\) and \(U(c_i)\) represent the formulas for the lower and upper bounds, respectively, of node \(c_i\). Given that \(L(c_i)\le U(c_i)\) for each node \(c_i\), we have \(1-U(c_i)\le 1-L(c_i)\) and hence \(\alpha _U\le \alpha _L\). Then,

$$\begin{aligned} f^j_U&= 1 - \alpha _U\cdot (1 - f^{j+1}_U) \\&= \alpha _U\cdot a^{j+1}_U\cdot x + 1 - \alpha _U + \alpha _U\cdot b^{j+1}_U\\ f^j_L&= 1 - \alpha _L\cdot (1 - f^{j+1}_L) \\&= \alpha _L\cdot a^{j+1}_L\cdot x + 1 - \alpha _L + \alpha _L\cdot b^{j+1}_L\\ \frac{\partial (f^j_U - f^j_L)}{\partial x}&= \alpha _U\cdot a^{j+1}_U - \alpha _L\cdot a^{j+1}_L \le 0. \end{aligned}$$

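The latter inequality holds since \(\alpha _U\le \alpha _L\) (as discussed above) and \(a^{j+1}_U\le a^{j+1}_L\) (by hypothesis), and since all four quantities are non-negative: the coefficient is 1 at the open leaf and is only multiplied by non-negative factors on the way up.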
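The three cases can be checked numerically on a small tree. The Python sketch below propagates (lower, upper) bound pairs through \(\oplus \), \(\odot \), and \(\otimes \) nodes using the formulas from Cases 1–3, and records the width of the root interval as the open leaf is closed at different points \(x\) of its interval. The tree shape, the numeric bounds, and the function names are assumptions made for illustration.

```python
from math import prod

def oplus(bounds):
    """Case 1: the bounds of the children are summed."""
    return sum(l for l, _ in bounds), sum(u for _, u in bounds)

def odot(bounds):
    """Case 2: the bounds of the children are multiplied."""
    return prod(l for l, _ in bounds), prod(u for _, u in bounds)

def otimes(bounds):
    """Case 3: one minus the product of (1 - bound) over the children."""
    return (1.0 - prod(1.0 - l for l, _ in bounds),
            1.0 - prod(1.0 - u for _, u in bounds))

def root_width(x):
    """Width U - L at the root when the open leaf is closed at the point x."""
    leaf = (x, x)                           # open leaf with point interval [x, x]
    inner = otimes([leaf, (0.2, 0.5)])      # sibling with bounds [0.2, 0.5] (made up)
    scaled = odot([inner, (0.5, 0.5)])      # restricted node: other child is exact
    lower, upper = oplus([scaled, (0.05, 0.1)])
    return upper - lower

# Close the leaf at 11 evenly spaced points of its interval [0.1, 0.4].
leaf_lower, leaf_upper = 0.1, 0.4
widths = [root_width(leaf_lower + k * (leaf_upper - leaf_lower) / 10) for k in range(11)]
```

In this example the width never increases as \(x\) grows, so the largest value in `widths` is the first one, obtained at the lower bound of the leaf, in line with the statement being proved.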

For relative approximation, we need to find the \(x\) that maximizes \((1-\epsilon )\cdot U - (1+\epsilon )\cdot L\). The previous proof extends directly to this case: it shows that, for \(U-L\), the coefficient of \(x\) in \(L\) is at least as large as in \(U\). Since \(1-\epsilon \le 1+\epsilon \), this property is preserved when \(U\) is scaled by \(1-\epsilon \) and \(L\) by \(1+\epsilon \).
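One way to turn the two quantities \(U-L\) and \((1-\epsilon )\cdot U - (1+\epsilon )\cdot L\) into stopping tests for an anytime refinement loop is sketched below in Python. The thresholds are assumptions of this sketch rather than statements from this appendix: an absolute guarantee is taken to be reached once \(U - L \le 2\epsilon \) (some value is then within \(\epsilon \) of every point in \([L,U]\)), and a relative guarantee once \((1-\epsilon )\cdot U - (1+\epsilon )\cdot L \le 0\). The sequence of bounds and the function names are hypothetical.

```python
def absolute_ok(lower, upper, eps):
    """Assumed absolute test: some estimate lies within eps of every value in [lower, upper]."""
    return upper - lower <= 2 * eps

def relative_ok(lower, upper, eps):
    """Assumed relative test: the expression from the text is non-positive."""
    return (1 - eps) * upper - (1 + eps) * lower <= 0

def refine_until(bounds_sequence, eps, check):
    """Return (step, lower, upper) for the first bounds passing the check, else None."""
    for step, (lower, upper) in enumerate(bounds_sequence):
        if check(lower, upper, eps):
            return step, lower, upper
    return None

# Hypothetical sequence of successively tighter bounds produced by refinement.
bounds = [(0.10, 0.90), (0.30, 0.60), (0.40, 0.50), (0.44, 0.46)]
abs_stop = refine_until(bounds, 0.06, absolute_ok)
rel_stop = refine_until(bounds, 0.06, relative_ok)
```

On this made-up sequence the absolute test is met one refinement step earlier than the relative test, because the relative condition tightens as the probability gets smaller.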


Cite this article

Fink, R., Huang, J. & Olteanu, D. Anytime approximation in probabilistic databases. The VLDB Journal 22, 823–848 (2013). https://doi.org/10.1007/s00778-013-0310-5
