research-article

Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching

Published: 18 May 2020

Abstract

Data lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is an increasing need for efficient techniques to profile them and to detect the relationships among their schemata, a task commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. At the scale of state-of-the-art DLs, this is currently computationally expensive. To address this challenge, we propose a novel early-pruning approach that improves efficiency: we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting similar datasets, which are then proposed for further schema matching. We conduct extensive experiments on a real-world DL that demonstrate the success of our approach in detecting similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We empirically show the savings in computational cost, in both space and time, that our approach achieves compared with instance-based schema matching techniques.
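The pre-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two meta-features, the `proximity` formula, and the fixed `threshold` are hypothetical stand-ins chosen for brevity, and the fixed threshold takes the place of the supervised model that the abstract mentions. Only dataset pairs that pass the cheap metadata comparison would be handed on to full schema matching.

```python
from itertools import combinations


def profile(dataset):
    """Collect simple metadata from a dict mapping column name -> list of values."""
    n_cols = len(dataset)
    numeric_cols = sum(
        all(isinstance(v, (int, float)) for v in values)
        for values in dataset.values()
    )
    return {
        # Schema metadata: the set of lower-cased attribute names.
        "names": {name.lower() for name in dataset},
        # Content metadata: share of columns holding only numeric values.
        "numeric_ratio": numeric_cols / n_cols if n_cols else 0.0,
    }


def proximity(p, q):
    """Hypothetical proximity in [0, 1]: mean of name overlap and type-mix closeness."""
    union = p["names"] | q["names"]
    name_overlap = len(p["names"] & q["names"]) / len(union) if union else 0.0
    type_closeness = 1.0 - abs(p["numeric_ratio"] - q["numeric_ratio"])
    return (name_overlap + type_closeness) / 2


def candidate_pairs(lake, threshold=0.6):
    """Return the dataset pairs whose metadata proximity reaches the threshold."""
    profiles = {name: profile(ds) for name, ds in lake.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if proximity(profiles[a], profiles[b]) >= threshold
    ]


# Toy data lake (invented for illustration only).
lake = {
    "customers": {"id": [1, 2], "name": ["Ann", "Bo"], "city": ["Oslo", "Rome"]},
    "clients": {"id": [7, 8], "name": ["Cy", "Di"], "country": ["NO", "IT"]},
    "sensor_log": {"t": [0.1, 0.2], "temp": [20.5, 21.0], "humidity": [40, 42]},
}
pairs = candidate_pairs(lake)
print(pairs)
```

In this toy lake, only one of the three possible pairs survives the filter, so two of the three expensive schema matching comparisons are skipped. Profiles are computed once per dataset, which keeps the pairwise step cheap compared with comparing instances directly.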

Published in

ACM Transactions on Information Systems, Volume 38, Issue 3
July 2020
311 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3394096

                Copyright © 2020 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 18 May 2020
                • Online AM: 7 May 2020
                • Revised: 1 March 2020
                • Accepted: 1 March 2020
                • Received: 1 April 2019

                Qualifiers

                • research-article
                • Research
                • Refereed
