ABSTRACT
Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort to create a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, which motivates a pay-as-you-go approach to integration. Under that approach, a system starts with very few (or inaccurate) semantic mappings, and these mappings are improved over time as deemed necessary.
This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.
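To make the idea of a probabilistic mediated schema more concrete, the following is a minimal sketch, not the system described in this paper. It assumes that a probabilistic mediated schema can be modeled as a handful of candidate clusterings of source attributes, each with a probability, and that an answer's probability is the sum of the probabilities of the candidates that produce it. The attribute names, the probabilities, the toy row, and the `answers` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical probabilistic mediated schema: each candidate is a clustering
# of source attribute names paired with a probability. All values are made up.
P_MED_SCHEMA = [
    ({frozenset({"phone", "hPhone"}), frozenset({"oPhone"})}, 0.6),
    ({frozenset({"phone", "oPhone"}), frozenset({"hPhone"})}, 0.4),
]

# One toy source row (assumed data, for illustration only).
ROW = {"name": "Alice", "hPhone": "111", "oPhone": "222"}


def answers(query_attr, row, p_med_schema):
    """Return {value: probability} when projecting query_attr over one row."""
    result = defaultdict(float)
    for clusters, prob in p_med_schema:
        # Find the cluster that contains the queried attribute; fall back to
        # a singleton cluster if no candidate mentions it.
        cluster = next(
            (c for c in clusters if query_attr in c), frozenset({query_attr})
        )
        # Under this candidate, every source attribute in the cluster is
        # treated as equivalent to the queried attribute.
        values = {row[attr] for attr in cluster if attr in row}
        for value in values:
            result[value] += prob
    return dict(result)


if __name__ == "__main__":
    for value, prob in answers("phone", ROW, P_MED_SCHEMA).items():
        print(value, prob)
```

In this toy setting, a query on `phone` returns both stored numbers, each weighted by the probability of the clustering that groups `phone` with the corresponding source attribute. Improving the mappings over time would, under this simplified model, amount to shifting probability mass toward the correct clustering.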
Index Terms
- Bootstrapping pay-as-you-go data integration systems