Abstract
Data lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is an increasing need for efficient techniques to profile them and to detect the relationships among their schemata, a task commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. At the volume of state-of-the-art DLs, this is computationally expensive. To handle this challenge, we propose a novel early-pruning approach to improve efficiency: we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting the similar datasets that are proposed for further schema matching. We conduct extensive experiments on a real-world DL, which show that our approach effectively detects similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We also show empirically the savings in space and time that our approach achieves compared with instance-based schema matching techniques.
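The pre-filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the metadata features, dataset names, values, weights, and threshold below are all hypothetical, and a fixed-weight average stands in for the supervised model that the abstract describes for combining proximities.

```python
from itertools import combinations

# Hypothetical metadata profiles; feature names and values are illustrative only.
PROFILES = {
    "sales_2017": {"num_attributes": 12, "numeric_ratio": 0.75, "avg_distinct": 340.0},
    "sales_2018": {"num_attributes": 13, "numeric_ratio": 0.70, "avg_distinct": 360.0},
    "tweets": {"num_attributes": 4, "numeric_ratio": 0.25, "avg_distinct": 9000.0},
}


def feature_proximity(a, b):
    """Return a proximity in [0, 1] for two non-negative metadata values."""
    largest = max(a, b)
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    return 1.0 - abs(a - b) / largest


def overall_proximity(p, q, weights):
    """Weighted average of per-feature proximities.

    Assumption: fixed weights replace the supervised model from the abstract.
    """
    total = sum(weights.values())
    return sum(w * feature_proximity(p[f], q[f]) for f, w in weights.items()) / total


def propose_pairs(profiles, weights, threshold):
    """Keep only the dataset pairs whose overall proximity reaches the threshold."""
    proposed = []
    for left, right in combinations(sorted(profiles), 2):
        score = overall_proximity(profiles[left], profiles[right], weights)
        if score >= threshold:
            proposed.append((left, right, score))
    return proposed


# Hypothetical equal weights and cut-off, chosen only for this example.
WEIGHTS = {"num_attributes": 1.0, "numeric_ratio": 1.0, "avg_distinct": 1.0}
CANDIDATES = propose_pairs(PROFILES, WEIGHTS, threshold=0.8)
```

With these made-up profiles, only the two sales datasets are proposed for schema matching, so two of the three pairwise comparisons are skipped. In practice the weights and cut-off would be learned from labeled dataset pairs rather than set by hand.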
Index Terms
- Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching