ABSTRACT
We argue that emerging federated data management architectures require a means of gathering, linking, curating and enriching metadata in a graph. We call the system that supports these tasks a metadata lake. We explain the underlying architectural principles that are required to achieve such a system and describe our current implementation. We show how our metadata lake is used to achieve certain advanced capabilities and report on its performance.