Query processing over incomplete autonomous databases: query rewriting using learned data dependencies

  • Special Issue Paper
The VLDB Journal

Abstract

Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, which users usually access through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if those tuples are relevant to the user query. Ideally, we would like the mediator to retrieve such possible answers and gauge their relevance by assessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. These challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce QPIAD, a novel query rewriting and optimization framework that tackles these challenges. Our technique reformulates the user query based on mined correlations among the database attributes. The reformulated queries aim to retrieve the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries, allowing tradeoffs that reduce the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates. We present empirical studies demonstrating that our approach effectively retrieves relevant possible answers with high precision, high recall, and manageable cost.
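To make the rewriting idea in the abstract concrete, the following is a minimal sketch, not the QPIAD implementation. It assumes a toy car relation, a single determining attribute (`model`) that is taken as given rather than mined as an Approximate Functional Dependency, and a plain conditional frequency standing in for the Naïve Bayes classifier. Selectivity estimation is omitted. All names and data (`CARS`, `estimate_conditional`, `rewrite_query`) are hypothetical.

```python
from collections import Counter, defaultdict

# Toy relation (hypothetical data); None marks a missing attribute value.
CARS = [
    {"make": "BMW", "model": "Z4", "body": "Convt"},
    {"make": "BMW", "model": "Z4", "body": "Convt"},
    {"make": "BMW", "model": "Z4", "body": None},
    {"make": "Honda", "model": "Civic", "body": "Sedan"},
    {"make": "Honda", "model": "Civic", "body": "Coupe"},
    {"make": "Honda", "model": "Civic", "body": None},
    {"make": "Audi", "model": "A4", "body": "Convt"},
    {"make": "Audi", "model": "A4", "body": "Sedan"},
    {"make": "Audi", "model": "A4", "body": "Sedan"},
    {"make": "Audi", "model": "A4", "body": None},
]


def estimate_conditional(rows, determining, target):
    """Estimate P(target = v | determining = d) from rows where both are known."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[determining] is not None and row[target] is not None:
            counts[row[determining]][row[target]] += 1
    return {
        d: {v: n / sum(c.values()) for v, n in c.items()}
        for d, c in counts.items()
    }


def rewrite_query(rows, target, value, determining):
    """Answer `target = value`, then look for possible answers with a null target.

    Assumption: `determining` is an attribute correlated with `target`;
    here it is supplied by the caller instead of being mined.
    """
    probs = estimate_conditional(rows, determining, target)

    # Certain answers: tuples that satisfy the original query outright.
    certain = [row for row in rows if row[target] == value]

    # One rewritten query per determining value seen in the certain answers,
    # ordered by the estimated probability that the null hides `value`.
    seen = {row[determining] for row in certain if row[determining] is not None}
    rewrites = sorted(((probs[d][value], d) for d in seen), reverse=True)

    # Each rewritten query asks for: determining = d AND target IS NULL.
    possible = [
        (p, row)
        for p, d in rewrites
        for row in rows
        if row[determining] == d and row[target] is None
    ]
    return certain, possible


certain, possible = rewrite_query(CARS, "body", "Convt", "model")
for p, row in possible:
    print(f"{p:.2f}", row)
```

In this toy run the query `body = Convt` returns three certain answers. The Z4 tuple with a missing body type is ranked ahead of the A4 tuple, because every Z4 with a known body type in the sample is a convertible while only one in three A4s is. The Civic tuple with a missing body type is never retrieved, since no certain answer points to that model.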

Author information

Corresponding author

Correspondence to Subbarao Kambhampati.

Additional information

This research was supported in part by the NSF grants IIS 308139, IIS 0624341, IIS 0740129 and IIS 0845647 (CAREER); the ONR grants N000140610058 and N000140910032, a Google research award, as well as support from ASU (via ECR A601), the ASU Prop 301 grant to ET-I3 initiative.

About this article

Cite this article

Wolf, G., Kalavagattu, A., Khatri, H. et al. Query processing over incomplete autonomous databases: query rewriting using learned data dependencies. The VLDB Journal 18, 1167–1190 (2009). https://doi.org/10.1007/s00778-009-0155-0

  • DOI: https://doi.org/10.1007/s00778-009-0155-0
