research-article

Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching

Authors:
Ayman Alserafi, Universitat Politècnica de Catalunya and Université Libre de Bruxelles, Bruxelles, Belgium
Alberto Abelló, Universitat Politècnica de Catalunya, Barcelona, Catalunya, Spain
Oscar Romero, Universitat Politècnica de Catalunya, Barcelona, Catalunya, Spain
Toon Calders, Université Libre de Bruxelles and Universiteit Antwerpen, Antwerpen, Belgium

ACM Transactions on Information Systems, Volume 38, Issue 3, Article No. 26, pp. 1–30. https://doi.org/10.1145/3388870

Published: 18 May 2020

Abstract

Data lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is a growing need for efficient techniques to profile them and to detect the relationships among their schemata, a task commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. At the volume of state-of-the-art DLs, this is currently computationally expensive. To address this challenge, we propose a novel early-pruning approach to improve efficiency: we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting similar datasets, which are then proposed for further schema matching. We conduct extensive experiments on a real-world DL, which demonstrate that our approach effectively detects similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We empirically show the savings in computational cost, in both space and time, of our approach compared with instance-based schema matching techniques.
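
The pre-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the metadata fields (`attribute_names`, `numeric_fraction`, `mean_distinct_ratio`), the fixed weights, and the 0.6 threshold are all hypothetical choices made for illustration, and a hand-set weighted average stands in for the supervised model the abstract mentions. The sketch only shows the general shape of the pipeline: profile each dataset, compute a pairwise proximity from the metadata, and pass on only sufficiently similar pairs to the expensive schema matching step.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical per-dataset metadata; the fields are illustrative only."""
    name: str
    attribute_names: frozenset   # schema metadata
    numeric_fraction: float      # content metadata: share of numeric attributes
    mean_distinct_ratio: float   # content metadata: avg. distinct values / rows


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def proximity(p, q, weights=(0.5, 0.25, 0.25)):
    """Weighted average of three per-feature proximities, each in [0, 1].

    The weights are an assumption; a supervised model could learn them.
    """
    features = (
        jaccard(p.attribute_names, q.attribute_names),
        1.0 - abs(p.numeric_fraction - q.numeric_fraction),
        1.0 - abs(p.mean_distinct_ratio - q.mean_distinct_ratio),
    )
    return sum(w * f for w, f in zip(weights, features)) / sum(weights)


def candidate_pairs(profiles, threshold=0.6):
    """Return (name, name, score) for pairs at or above the threshold."""
    kept = []
    for p, q in combinations(profiles, 2):
        score = proximity(p, q)
        if score >= threshold:
            kept.append((p.name, q.name, score))
    return sorted(kept, key=lambda t: t[2], reverse=True)


# Toy metadata for three made-up datasets.
profiles = [
    DatasetProfile("customers", frozenset({"id", "name", "city", "age"}), 0.5, 0.9),
    DatasetProfile("clients", frozenset({"id", "name", "city", "phone"}), 0.25, 0.8),
    DatasetProfile("sensor_log", frozenset({"timestamp", "temp", "humidity"}), 1.0, 0.3),
]

pairs = candidate_pairs(profiles)
for left, right, _ in pairs:
    print(left, right)
```

With this toy data, only the `customers`/`clients` pair passes the threshold, so one of the three possible comparisons would go on to full schema matching. In practice the threshold trades recall against the number of comparisons pruned.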


Index Terms

1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
        Proximity search
  2. Information retrieval



Published in
ACM Transactions on Information Systems Volume 38, Issue 3
July 2020
311 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3394096
Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 May 2020
- Online AM: 7 May 2020
- Revised: 1 March 2020
- Accepted: 1 March 2020
- Received: 1 April 2019

Author Tags
Data lake
content metadata management
data governance
dataset similarity mining
early pruning
holistic schema matching
Qualifiers
- research-article
- Research
- Refereed