Incremental discovery of denial constraints

  • Regular Paper
The VLDB Journal

Abstract

We investigate the problem of incremental denial constraint (DC) discovery: discovering DCs in response to a set \(\triangle r\) of tuple insertions to a given relational instance r, given the known set \(\varSigma \) of DCs holding on r. The need for the study is evident, since real-life data are frequently updated and it is often prohibitively expensive to perform DC discovery from scratch for every update. We tackle this problem in two steps. We first employ indexing techniques to efficiently identify the incremental evidences caused by \(\triangle r\). We present algorithms to build indexes for \(\varSigma \) and r in the pre-processing step, and to visit and update the indexes in response to \(\triangle r\). In particular, we propose a novel indexing technique for two inequality comparisons, possibly across the attributes of r. By leveraging the indexes, we can identify all the tuple pairs incurred by \(\triangle r\) that simultaneously satisfy the two comparisons, with a cost dependent on \(\log (|r|)\). We then compute the changes \(\triangle \varSigma \) to \(\varSigma \) based on the incremental evidences, such that \(\varSigma \oplus \triangle \varSigma \) is the set of DCs holding on \(r+\triangle r\). \(\triangle \varSigma \) may contain new DCs that are added to \(\varSigma \) and obsolete DCs that are removed from \(\varSigma \). Our experimental evaluations show that our incremental approach is faster by orders of magnitude than the two state-of-the-art batch DC discovery approaches that compute from scratch on \(r + \triangle r\), even when \(\triangle r\) is up to 30% of r.
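
To make the first step more concrete, the following is a minimal Python sketch of the general idea, not the index structure proposed in the paper. It assumes a hypothetical two-attribute relation of numeric tuples `(a, b)` and a single pair of inequality comparisons, `t.a < s.a` and `t.b > s.b`. The function name `incremental_pairs` and the sorted-list index are illustrative assumptions. The list is kept sorted on `a`, so the candidates for the first comparison are located by binary search in time logarithmic in the size of r; the second comparison is then checked by a plain scan over those candidates, which an index covering both comparisons would be expected to avoid.

```python
from bisect import bisect_left, bisect_right, insort
from math import inf


def incremental_pairs(index, delta):
    """Find ordered pairs (t, s) with t.a < s.a and t.b > s.b that involve
    at least one inserted tuple.

    index: list of numeric (a, b) tuples sorted ascending; updated in place.
    delta: iterable of inserted (a, b) tuples.
    """
    pairs = []
    for new in delta:
        a, b = new
        # Binary search on the first attribute: O(log |index|).
        lo = bisect_left(index, (a, -inf))   # index[:lo] has a-values < a
        hi = bisect_right(index, (a, inf))   # index[hi:] has a-values > a
        # Indexed tuples with smaller a and larger b play the role of t.
        pairs += [(old, new) for old in index[:lo] if old[1] > b]
        # Indexed tuples with larger a and smaller b play the role of s.
        pairs += [(new, old) for old in index[hi:] if old[1] < b]
        # Insert right away so later tuples of delta are compared with it.
        insort(index, new)
    return pairs
```

Only pairs that contain an inserted tuple are examined, so pairs made up entirely of tuples already in r are never revisited; this is what separates the incremental setting from recomputing on \(r + \triangle r\).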

Notes

  1. https://github.com/HPI-Information-Systems/metanome-algorithms/tree/hydra and https://github.com/HPI-Information-Systems/metanome-algorithms/tree/master/dcfinder.

  2. The predicate pairs are not shown because attribute names of Adult and UCE do not have semantic meaning.

References

  1. Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB J. 24(4), 557–581 (2015)

  2. Abedjan, Z., Golab, L., Naumann, F.: Data profiling: a tutorial. In SIGMOD, pp. 1747–1751 (2017)

  3. Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling. In: Synthesis lectures on data management. Morgan and Claypool Publishers, San Rafael (2018)

  4. Abedjan, Z., Quiané-Ruiz, J. A., Naumann, F.: Detecting unique column combinations on dynamic data. In ICDE, pp. 1036–1047 (2014)

  5. Birnick, J., Bläsius, T., Friedrich, T., Naumann, F., Papenbrock, T., Schirneck, M.: Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13(11), 2270–2283 (2020)

  6. Bleifuß, T., Kruse, S., Naumann, F.: Efficient denial constraint discovery with hydra. PVLDB 11(3), 311–323 (2017)

  7. Caruccio, L., Cirillo, S.: Incremental discovery of imprecise functional dependencies. ACM J. Data Inf. Qual. 12(4), 19:1-19:25 (2020)

  8. Caruccio, L., Cirillo, S., Deufemia, V., Polese, G.: Incremental discovery of functional dependencies with a bit-vector algorithm. In SEBD (2019)

  9. Caruccio, L., Deufemia, V., Naumann, F., Polese, G.: Discovering relaxed functional dependencies based on multi-attribute dominance. IEEE Trans. Knowl. Data Eng. 33(9), 3212–3228 (2021)

  10. Caruccio, L., Deufemia, V., Polese, G.: Mining relaxed functional dependencies from data. Data Min. Knowl. Discov. 34(2), 443–477 (2020)

  11. Cheng, Q., Gryz, J., Koo, F., Leung, T.Y.C., Liu, L., Qian, X., Schiefer, K.B.: Implementation of two semantic query optimization techniques in DB2 universal database. In VLDB, pp. 687–698 (1999)

  12. Chu, X., Ilyas, I.F., Papotti, P.: Discovering denial constraints. PVLDB 6(13), 1498–1509 (2013)

  13. Chu, X., Ilyas, I. F., Papotti, P.: Holistic data cleaning: Putting violations into context. In ICDE, pp. 458–469 (2013)

  14. Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In VLDB, pp. 315–326 (2007)

  15. Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A. K., Ilyas, I. F., Ouzzani, M., Tang, N.: Nadeef: a commodity data cleaning system. In SIGMOD, pp. 541–552 (2013)

  16. Fan, W., Geerts, F.: Foundations of Data Quality Management. In Synthesis lectures on data management. Morgan and Claypool Publishers, San Rafael (2012)

  17. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33(2), 6:1-6:48 (2008)

  18. Fan, W., Hu, C., Liu, X., Lu, P.: Discovering graph functional dependencies. ACM Trans. Database Syst. 45(3), 15:1-15:42 (2020)

  19. Ge, C., Ilyas, I.F., Kerschbaum, F.: Secure multi-party functional dependency discovery. PVLDB 13(2), 184–196 (2019)

  20. Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Cleaning data with llunatic. VLDB J. 29(4), 867–892 (2020)

  21. Giannakopoulou, S., Karpathiotakis, M., Ailamaki, A.: Cleaning denial constraint violations through relaxation. In SIGMOD, pp. 805–815 (2020)

  22. Gilad, A., Deutch, D., Roy, S.: On multiple semantics for declarative database repairs. In SIGMOD, pp. 817–831 (2020)

  23. Ginsburg, S., Hull, R.: Order dependency in the relational model. Theor. Comput. Sci. 26, 149–195 (1983)

  24. Ginsburg, S., Hull, R.: Sort sets in the relational model. J. ACM 33(3), 465–488 (1986)

  25. Heise, A., Quiané-Ruiz, J.-A., Abedjan, Z., Jentzsch, A., Naumann, F.: Scalable discovery of unique column combinations. PVLDB 7(4), 301–312 (2013)

  26. Ilyas, I.F., Chu, X.: Data Cleaning. ACM, New York City (2019)

  27. Jin, Y., Tan, Z., Zeng, W., Ma, S.: Approximate order dependency discovery. In ICDE, pp. 25–36 (2021)

  28. Jin, Y., Zhu, L., Tan, Z.: Efficient bidirectional order dependency discovery. In ICDE, pp. 61–72 (2020)

  29. Karegar, R., Godfrey, P., Golab, L., Kargar, M., Srivastava, D., Szlichta, J.: Efficient discovery of approximate order dependencies. In EDBT, pp. 427–432 (2021)

  30. Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J. A., Tang, N., Yin, S.: Bigdansing: a system for big data cleansing. In SIGMOD, pp. 1215–1230 (2015)

  31. Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Kalnis, P.: Lightning fast and space efficient inequality joins. PVLDB 8(13), 2074–2085 (2015)

  32. Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Kalnis, P.: Fast and scalable inequality joins. VLDB J. 26(1), 125–150 (2017)

  33. Kossmann, J., Papenbrock, T., Naumann, F.: Data dependencies for query optimization: a survey. VLDB J. 31(1), 1–22 (2022)

  34. Koumarelas, I.K., Naskos, A., Gounaris, A.: Flexible partitioning for selective binary theta-joins in a massively parallel setting. Distributed Parallel Databases 36(2), 301–337 (2018)

  35. Kruse, S., Naumann, F.: Efficient discovery of approximate dependencies. PVLDB 11(7), 759–772 (2018)

  36. Langer, P., Naumann, F.: Efficient order dependency detection. VLDB J. 25(2), 223–241 (2016)

  37. Livshits, E., Heidari, A., Ilyas, I.F., Kimelfeld, B.: Approximate denial constraints. PVLDB 13(10), 1682–1695 (2020)

  38. Ma, S., Fan, W., Bravo, L.: Extending inclusion dependencies with conditions. Theor. Comput. Sci. 515, 64–95 (2014)

  39. Nerone, M. A., Holanda, P., de Almeida, E. C., Manegold, S.: Multidimensional adaptive and progressive indexes. In ICDE, pp. 624–635 (2021)

  40. Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In SIGMOD, pp. 949–960 (2011)

  41. Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In SIGMOD, pp. 821–833 (2016)

  42. Pena, E. H. M., de Almeida, E. C.: BFASTDC: a bitwise algorithm for mining denial constraints. In DEXA, pp. 53–68 (2018)

  43. Pena, E.H.M., de Almeida, E.C., Naumann, F.: Discovery of approximate (and exact) denial constraints. PVLDB 13(3), 266–278 (2019)

  44. Pena, E.H.M., de Almeida, E.C., Naumann, F.: Fast detection of denial constraint violations. Proc. VLDB Endow. 15(4), 859–871 (2021)

  45. Pena, E. H. M., Filho, E. R. L., de Almeida, E. C., Naumann, F.: Efficient detection of data dependency violations. In CIKM, pp. 1235–1244 (2020)

  46. Pugh, W.: Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33(6), 668–676 (1990)

  47. Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: HoloClean: holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10(11), 1190–1201 (2017)

  48. Saxena, H., Golab, L., Ilyas, I. F.: Distributed discovery of functional dependencies. In ICDE, pp. 1590–1593 (2019)

  49. Saxena, H., Golab, L., Ilyas, I.F.: Distributed implementations of dependency discovery algorithms. Proc. VLDB Endow. 12(11), 1624–1636 (2019)

  50. Schirmer, P., Papenbrock, T., Koumarelas, I.K., Naumann, F.: Efficient discovery of matching dependencies. ACM Trans. Database Syst. 45(3), 13:1-13:33 (2020)

  51. Schirmer, P., Papenbrock, T., Kruse, S., Naumann, F., Hempfing, D., Mayer, T., Neuschäfer-Rube, D.: Dynfd: functional dependency discovery in dynamic datasets. In EDBT, pp. 253–264 (2019)

  52. Schmidl, S., Papenbrock, T.: Efficient distributed discovery of bidirectional order dependencies. VLDB J. 31(1), 49–74 (2022)

  53. Shaabani, N., Meinel, C.: Incrementally updating unary inclusion dependencies in dynamic data. Distrib. Parallel Databases 37(1), 133–176 (2019)

  54. Simmen, D. E., Shekita, E. J., Malkemus, T.: Fundamental techniques for order optimization. In SIGMOD, pp. 57–67 (1996)

  55. Song, S., Chen, L.: Discovering matching dependencies. In CIKM, pp. 1421–1424 (2009)

  56. Song, S., Chen, L.: Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng. 87, 146–166 (2013)

  57. Song, S., Gao, F., Huang, R., Wang, C.: Data dependencies extended for variety and veracity: A family tree. IEEE Trans. Knowl. Data Eng. 34(10), 4717–4736 (2022)

  58. Szlichta, J., Godfrey, P., Golab, L., Kargar, M., Srivastava, D.: Effective and complete discovery of order dependencies via set-based axiomatization. PVLDB 10(7), 721–732 (2017)

  59. Szlichta, J., Godfrey, P., Golab, L., Kargar, M., Srivastava, D.: Effective and complete discovery of bidirectional order dependencies via set-based axioms. VLDB J. 27(4), 573–591 (2018)

  60. Szlichta, J., Godfrey, P., Gryz, J.: Fundamentals of order dependencies. PVLDB 5(11), 1220–1231 (2012)

  61. Szlichta, J., Godfrey, P., Gryz, J., Ma, W., Qiu, W., Zuzarte, C.: Business-intelligence queries with order dependencies in DB2. In EDBT, pp. 750–761 (2014)

  62. Szlichta, J., Godfrey, P., Gryz, J., Zuzarte, C.: Expressiveness and complexity of order dependencies. PVLDB 6(14), 1858–1869 (2013)

  63. Tan, Z., Ran, A., Ma, S., Qin, S.: Fast incremental discovery of pointwise order dependencies. PVLDB 13(10), 1669–1681 (2020)

  64. Tschirschnitz, F., Papenbrock, T., Naumann, F.: Detecting inclusion dependencies on very many tables. ACM Trans. Database Syst. 42(3), 18:1-18:29 (2017)

  65. Vazirani, V.V.: Approximation algorithms. Springer, Heidelberg (2001)

  66. Wei, Z., Hartmann, S., Link, S.: Algorithms for the discovery of embedded functional dependencies. VLDB J. 30(6), 1069–1093 (2021)

  67. Wei, Z., Link, S.: Discovery and ranking of functional dependencies. In ICDE, pp. 1526–1537 (2019)

  68. Weise, J., Schmidl, S., Papenbrock, T.: Optimized theta-join processing through candidate pruning and workload distribution. In BTW, pp. 59–78 (2021)

  69. Xiao, R., Tan, Z., Wang, H., Ma, S.: Fast approximate denial constraint discovery. Proc. VLDB Endow. 16(2), 269–281 (2022)

  70. Xiao, R., Yuan, Y., Tan, Z., Ma, S., Wang, W.: Dynamic functional dependency discovery with dynamic hitting set enumeration. In ICDE, pp. 286–298 (2022)

  71. Zhu, L., Sun, X., Tan, Z., Yang, K., Yang, W., Zhou, X., Tian, Y.: Incremental discovery of order dependencies on tuple insertions. In DASFAA, pp. 157–174 (2019)

Acknowledgements

This work is supported by the National Natural Science Foundation of China (62172102 and 61925203). We thank the authors of [6, 32, 43] for sharing their code for our experimental evaluation.

Author information

Corresponding author

Correspondence to Zijing Tan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qian, C., Li, M., Tan, Z. et al. Incremental discovery of denial constraints. The VLDB Journal 32, 1289–1313 (2023). https://doi.org/10.1007/s00778-023-00788-y

  • DOI: https://doi.org/10.1007/s00778-023-00788-y
