research-article

Bootstrapping pay-as-you-go data integration systems

Authors:
Anish Das Sarma (Stanford University, Stanford, CA, USA)
Xin Dong (AT&T Labs-Research, New Jersey, NJ, USA)
Alon Halevy (Google, Mountain View, CA, USA)

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 2008, Pages 861–874
https://doi.org/10.1145/1376616.1376702

Published: 09 June 2008

ABSTRACT

Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort to create a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, which motivates a pay-as-you-go approach to integration. In that approach, a system starts with very few (or inaccurate) semantic mappings, and these mappings are improved over time as deemed necessary.

This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We also automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, involving 50 to 800 data sources, and show that our system is able to produce high-quality answers with no human intervention.
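
To make the idea of a probabilistic mediated schema more concrete, the following is a minimal toy sketch, not the system described in the paper. It assumes that a probabilistic mediated schema can be modeled as a small set of alternative clusterings of source attributes, each with a probability, and that a query over one attribute is answered under each alternative with the probabilities summed per value. The source names, attribute names, probabilities, and the `answer` function are all hypothetical, and the probabilistic schema mappings mentioned in the abstract are simplified away (each source attribute maps deterministically to its cluster).

```python
from collections import defaultdict

# Hypothetical toy sources with differently named attributes.
SOURCES = {
    "S1": [{"name": "Alice", "phone": "555-0100"}],
    "S2": [{"name": "Bob", "hPhone": "555-0199", "oPhone": "555-0142"}],
}

# Assumed model: alternative clusterings of source attributes, each paired
# with a probability. The probabilities are made up and sum to 1.
P_MED_SCHEMA = [
    (0.6, [{"name"}, {"phone", "hPhone"}, {"oPhone"}]),
    (0.4, [{"name"}, {"phone", "oPhone"}, {"hPhone"}]),
]


def answer(query_attr, p_med_schema, sources):
    """Return {value: probability} for a projection on one attribute."""
    scores = defaultdict(float)
    for prob, clusters in p_med_schema:
        # Attributes in the same cluster are treated as one mediated attribute.
        cluster = next((c for c in clusters if query_attr in c), {query_attr})
        values = set()
        for rows in sources.values():
            for row in rows:
                for attr in cluster:
                    if attr in row:
                        values.add(row[attr])
        # A value returned under this alternative contributes its probability.
        for value in values:
            scores[value] += prob
    return dict(scores)


if __name__ == "__main__":
    for value, prob in sorted(answer("phone", P_MED_SCHEMA, SOURCES).items()):
        print(value, round(prob, 2))
```

In this sketch, a value that is returned under every alternative clustering ends up with the highest score, while a value that depends on one uncertain grouping is ranked lower. That is the intuition behind returning ranked answers without asking a human to resolve the ambiguity first.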

Index Terms

Information systems

Published in
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
June 2008
1396 pages
ISBN: 9781605581026
DOI: 10.1145/1376616
General Chairs: Laks V. S. Lakshmanan (University of British Columbia, Canada), Raymond T. Ng (University of British Columbia, Canada), Dennis Shasha (New York University, USA)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
data integration
mediated schema
pay-as-you-go
schema mapping