Query processing over incomplete autonomous databases: query rewriting using learned data dependencies

Wolf, Garrett; Kalavagattu, Aravind; Khatri, Hemal; Balakrishnan, Raju; Chokshi, Bhaumik; Fan, Jianchun; Chen, Yi; Kambhampati, Subbarao

doi:10.1007/s00778-009-0155-0

Special Issue Paper
Published: 21 July 2009

Volume 18, pages 1167–1190 (2009)

The VLDB Journal

Garrett Wolf¹,
Aravind Kalavagattu¹,
Hemal Khatri¹,
Raju Balakrishnan¹,
Bhaumik Chokshi¹,
Jianchun Fan¹,
Yi Chen¹ &
Subbarao Kambhampati¹

Abstract

Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such possible answers and gauge their relevance by assessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. Such challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce a novel query rewriting and optimization framework, QPIAD, that tackles these challenges. Our technique involves reformulating the user query based on mined correlations among the database attributes. The reformulated queries are aimed at retrieving the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries, allowing tradeoffs in reducing the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates. We present empirical studies to demonstrate that our approach is able to effectively retrieve relevant possible answers with high precision, high recall, and manageable cost.
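The rewriting idea in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical in-memory car relation, a single assumed dependency from `model` to `body`, and a plain conditional-frequency estimate standing in for the learned classifier. None of the names or data below come from the article, and the sketch is not the QPIAD implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy relation (not from the article); None marks a missing value.
CARS = [
    {"make": "Honda", "model": "Civic", "body": "Sedan"},
    {"make": "Honda", "model": "Civic", "body": "Sedan"},
    {"make": "Honda", "model": "Civic", "body": "Coupe"},
    {"make": "Honda", "model": "Civic", "body": None},
    {"make": "BMW", "model": "Z4", "body": "Convt"},
    {"make": "BMW", "model": "Z4", "body": "Convt"},
    {"make": "BMW", "model": "Z4", "body": None},
    {"make": "Audi", "model": "A4", "body": "Sedan"},
    {"make": "Audi", "model": "A4", "body": "Convt"},
    {"make": "Audi", "model": "A4", "body": None},
]


def value_distribution(rows, determining, target):
    """Estimate P(target | determining) from rows where both values are known.

    Assumption: a simple frequency count replaces the learned classifier.
    """
    counts = defaultdict(Counter)
    for row in rows:
        if row[determining] is not None and row[target] is not None:
            counts[row[determining]][row[target]] += 1
    return {
        key: {val: n / sum(c.values()) for val, n in c.items()}
        for key, c in counts.items()
    }


def answer_with_possible(rows, target, value, determining):
    """Return certain answers plus ranked possible answers for target = value."""
    # Certain answers: tuples whose target attribute matches the query.
    certain = [row for row in rows if row[target] == value]
    dist = value_distribution(rows, determining, target)

    # One rewritten query per determining value seen among certain answers,
    # ordered by the estimated probability that the missing value matches.
    seen = {row[determining] for row in certain if row[determining] is not None}
    rewritten = sorted(seen, key=lambda d: dist[d][value], reverse=True)

    # Possible answers: tuples with a null target that match a rewritten query.
    possible = []
    for d in rewritten:
        for row in rows:
            if row[determining] == d and row[target] is None:
                possible.append((dist[d][value], row))
    return certain, possible


certain, possible = answer_with_possible(CARS, "body", "Convt", "model")
for score, row in possible:
    print(f"{score:.2f}", row)
```

In this toy data, the tuples with a missing `body` are ranked by how often their `model` co-occurs with the queried value, so a mediator could stop issuing rewritten queries once the estimated relevance falls below a chosen threshold.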

Author information

Authors and Affiliations

Arizona State University, Tempe, AZ, USA
Garrett Wolf, Aravind Kalavagattu, Hemal Khatri, Raju Balakrishnan, Bhaumik Chokshi, Jianchun Fan, Yi Chen & Subbarao Kambhampati

Corresponding author

Correspondence to Subbarao Kambhampati.

Additional information

This research was supported in part by the NSF grants IIS 308139, IIS 0624341, IIS 0740129 and IIS 0845647 (CAREER); the ONR grants N000140610058 and N000140910032, a Google research award, as well as support from ASU (via ECR A601), the ASU Prop 301 grant to ET-I3 initiative.

About this article

Cite this article

Wolf, G., Kalavagattu, A., Khatri, H. et al. Query processing over incomplete autonomous databases: query rewriting using learned data dependencies. The VLDB Journal 18, 1167–1190 (2009). https://doi.org/10.1007/s00778-009-0155-0

Received: 16 September 2008
Revised: 05 June 2009
Accepted: 26 June 2009
Published: 21 July 2009
Issue Date: October 2009
DOI: https://doi.org/10.1007/s00778-009-0155-0
