Abstract
We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of traditional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong's axioms for FDs, and show that the implication problem is coNP-complete for CFDs in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints by a single query. We also provide incremental methods for checking CFDs in response to changes to the database. We experimentally verify the effectiveness of our CFD-based methods for inconsistency detection. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
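To make the idea of detecting CFD violations in SQL concrete, the following is a minimal sketch rather than the technique developed in the paper. It assumes a hypothetical relation `cust(id, cc, ac, zip, street, city)` and two illustrative CFDs that are not taken from the abstract: a constant CFD stating that country code `01` with area code `908` implies city `MH`, and a variable CFD stating that, among tuples with country code `44`, `zip` determines `street`. Each CFD is checked with its own query here; merging several CFDs into a single query, as the abstract mentions, is not attempted.

```python
import sqlite3


def find_violations(rows):
    """Return (ids violating the constant CFD, zips violating the variable CFD)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cust (id INTEGER PRIMARY KEY, cc TEXT, ac TEXT, "
        "zip TEXT, street TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO cust VALUES (?, ?, ?, ?, ?, ?)", rows)

    # Constant CFD (assumed): cc = '01' AND ac = '908' implies city = 'MH'.
    # A single tuple can violate it, so a plain selection is enough.
    constant = [
        r[0]
        for r in conn.execute(
            "SELECT id FROM cust "
            "WHERE cc = '01' AND ac = '908' AND city <> 'MH' ORDER BY id"
        )
    ]

    # Variable CFD (assumed): among tuples with cc = '44', zip determines street.
    # A violation needs two tuples that agree on zip but differ on street.
    variable = [
        r[0]
        for r in conn.execute(
            "SELECT zip FROM cust WHERE cc = '44' "
            "GROUP BY zip HAVING COUNT(DISTINCT street) > 1 ORDER BY zip"
        )
    ]
    conn.close()
    return constant, variable


# Hypothetical sample data.
ROWS = [
    (1, "44", "131", "EH4 1DT", "Mayfield", "EDI"),
    (2, "44", "131", "EH4 1DT", "Crichton", "EDI"),  # same zip as row 1, other street
    (3, "01", "908", "07974", "Mtn Ave", "NYC"),     # city should be 'MH'
    (4, "01", "908", "07974", "Mtn Ave", "MH"),
    (5, "01", "212", "10012", "Broadway", "NYC"),
]

constant_ids, conflicting_zips = find_violations(ROWS)
print(constant_ids, conflicting_zips)
```

The split into two queries reflects the two ways a CFD can fail: a constant pattern can be violated by one tuple in isolation, whereas the embedded FD can only be violated by a pair of tuples, which is why the second query groups on the left-hand-side attribute.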