DOI: 10.1145/1376616.1376702
research-article

Bootstrapping pay-as-you-go data integration systems

Published: 09 June 2008

ABSTRACT

Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings and these mappings are improved over time as deemed necessary.

This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.
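
As an illustration only, and not the paper's implementation, a probabilistic schema mapping can be read as a set of candidate attribute mappings, each carrying a probability, with the probability of a query answer taken as the total probability of the candidate mappings that produce it. The sketch below assumes that reading; the source table, the attribute names, the probabilities, and the `answer_probabilities` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical source table with opaque attribute names.
source_rows = [
    {"name": "Alice", "hP": "123-4567", "oP": "765-4321"},
    {"name": "Bob", "hP": "555-0100", "oP": "555-0199"},
]

# Hypothetical probabilistic mapping: each candidate maps mediated
# attributes to source attributes; the probabilities sum to 1.
p_mapping = [
    ({"name": "name", "phone": "hP"}, 0.6),
    ({"name": "name", "phone": "oP"}, 0.4),
]


def answer_probabilities(rows, p_mapping, select):
    """Project the mediated attributes in `select` under every candidate
    mapping and sum the probabilities of the mappings yielding each tuple."""
    probs = defaultdict(float)
    for mapping, prob in p_mapping:
        # A set ensures a tuple is counted once per candidate mapping.
        answers = {tuple(row[mapping[attr]] for attr in select) for row in rows}
        for answer in answers:
            probs[answer] += prob
    return dict(probs)


result = answer_probabilities(source_rows, p_mapping, ["name", "phone"])
for answer, prob in sorted(result.items()):
    print(answer, round(prob, 2))
```

Under these assumed numbers, an answer that only one candidate mapping produces keeps that mapping's probability, while an answer produced by every candidate (for example, projecting `name` alone) accumulates the full probability mass.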


Published in

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
June 2008
1396 pages
ISBN: 9781605581026
DOI: 10.1145/1376616

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
