ABSTRACT
With the rise of big data, business intelligence had to find solutions for managing data of even greater volume and variety than data warehouses handle, a task for which warehouses proved ill-adapted. Data lakes answer these needs from a storage point of view, but require adequate metadata management to guarantee efficient access to data. Starting from a multidimensional metadata model designed for an industrial heritage data lake, which lacks schema evolvability, we propose in this paper to use ensemble modeling, and more precisely a data vault, to address this issue. To illustrate the feasibility of this approach, we instantiate our conceptual metadata model into logical and physical models, both relational and document-oriented. We also compare the physical models in terms of metadata storage and query response time.
Index Terms
- Modeling Data Lake Metadata with a Data Vault