research-article

Conditional functional dependencies for capturing data inconsistencies

Published: 24 June 2008

Abstract

We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of traditional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong's axioms for FDs, and show that the implication problem is coNP-complete for CFDs in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints by a single query. We also provide incremental methods for checking CFDs in response to changes to the database. We experimentally verify the effectiveness of our CFD-based methods for inconsistency detection. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
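To make the idea of "enforcing bindings of semantically related values" concrete, the following is a minimal Python sketch of a CFD violation check. It is an illustration under stated assumptions, not the SQL-based technique the article develops: a CFD is assumed to be an ordinary FD (left-hand and right-hand attribute lists) paired with a pattern tableau whose entries are either constants or the wildcard `_`, and the relation, attribute names, sample rows, and the function `cfd_violations` are all hypothetical.

```python
from itertools import combinations

WILDCARD = "_"  # assumed marker for "any value" in a pattern


def matches(values, pattern):
    """True if every value equals its pattern entry or the entry is a wildcard."""
    return all(p == WILDCARD or v == p for v, p in zip(values, pattern))


def cfd_violations(rows, lhs, rhs, tableau):
    """Return sorted indexes of rows involved in a violation of the CFD.

    rows    -- list of dicts, one per tuple
    lhs/rhs -- attribute names on each side of the embedded FD
    tableau -- list of (lhs_pattern, rhs_pattern) pairs
    """
    bad = set()
    for lhs_pat, rhs_pat in tableau:
        # Only rows whose left-hand side matches the pattern are constrained.
        hits = [i for i, r in enumerate(rows)
                if matches([r[a] for a in lhs], lhs_pat)]
        # Single-tuple check: constants in the right-hand pattern must hold.
        for i in hits:
            if not matches([rows[i][a] for a in rhs], rhs_pat):
                bad.add(i)
        # Pairwise check: equal left-hand sides force equal right-hand sides.
        for i, j in combinations(hits, 2):
            same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
            same_rhs = all(rows[i][a] == rows[j][a] for a in rhs)
            if same_lhs and not same_rhs:
                bad.update((i, j))
    return sorted(bad)


# Hypothetical customer data: country code, area code, city, zip, street.
ROWS = [
    {"CC": "44", "AC": "131", "city": "EDI", "zip": "EH4 1DT", "street": "Mayfield"},
    {"CC": "44", "AC": "131", "city": "EDI", "zip": "EH4 1DT", "street": "Crichton"},
    {"CC": "01", "AC": "212", "city": "NYC", "zip": "01202", "street": "Main Str"},
    {"CC": "01", "AC": "908", "city": "NYC", "zip": "07974", "street": "Tree Ave"},
]

# "When CC = 44, zip determines street": conditional on the data, not the schema.
uk_zip = cfd_violations(ROWS, ["CC", "zip"], ["street"], [(("44", "_"), ("_",))])

# "When CC = 01 and AC = 908, city must be MH": a constant binding.
us_city = cfd_violations(ROWS, ["CC", "AC"], ["city"], [(("01", "908"), ("MH",))])
```

In this sketch `uk_zip` is `[0, 1]` (two UK rows share a zip but disagree on street) and `us_city` is `[3]` (a single row contradicts the constant binding). A traditional FD corresponds to a tableau with one all-wildcard pattern, which is why a single row can never violate it. The pairwise loop is quadratic and only meant to show the semantics.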


    • Published in

      ACM Transactions on Database Systems, Volume 33, Issue 2
      June 2008
      309 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/1366102

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 June 2008
      • Accepted: 1 December 2007
      • Revised: 1 September 2007
      • Received: 1 February 2007

      Qualifiers

      • research-article
      • Research
      • Refereed
