Data Lakes: A Survey of Concepts and Architectures

This paper presents a comprehensive literature review on the evolution of data-lake technology, with a particular focus on data-lake architectures. By systematically examining the existing body of research, we identify and classify the major types of data-lake architectures that have been proposed and implemented over time. The review highlights key trends in the development of data-lake architectures, identifies the primary challenges faced in their implementation, and discusses future directions for research and practice in this rapidly evolving field. We have developed diagrammatic representations to highlight the evolution of various architectures. These diagrams use consistent notations across all architectures to further enhance the comparative analysis of the different architectural components. We also explore the differences between data warehouses and data lakes. Our findings provide valuable insights for researchers and practitioners seeking to understand the current state of data-lake technology and its potential future trajectory.


Introduction
Timely access to high-quality data is the life-giving force of modern decision-making systems and the foundation for accountable knowledge. In essence, the consistent focus on data and the adoption of data-driven approaches by academics and professionals exist because the knowledge extracted from data analysis leads to innovations that redefine enterprises and national economies [1].
In the era of the 4th Industrial Revolution, approximately 328.77 million terabytes of data are generated, captured, copied, or consumed globally every day [2]. Organizations of all sizes across various sectors are developing their technological capabilities to extract knowledge from large and complex datasets, commonly known as "big data". This term refers to large datasets that include structured, semi-structured, and unstructured data, which traditional data management tools struggle with due to their vast size, heterogeneity, and complexity. In addition, big data comes in multiple formats, including text, sound, video, and images, with unstructured data growing faster than structured data and accounting for 90% of all data [3]. Driven by the changing data landscape, new processing capabilities are required to gain insights from big data, leading to better decision-making.
The challenges associated with the data life cycle are primarily related to data processing and management issues that arise from the volume, velocity, variety, and veracity of big data [3]. Data processing challenges involve techniques that are used for acquiring, integrating, transforming, and analyzing big data. On the other hand, data management challenges deal with ensuring data security, privacy, governance, and operational cost issues. All of these challenges can lead to information overload and low productivity due to difficulties in obtaining actionable knowledge from data.
Data management is the systematic retrieval and administration of information assets within an organization to ensure that the data are accessible and shareable across different formats and systems. Data management systems utilize a variety of technical frameworks, methodologies, and tools that are applicable to different scenarios, use cases, and lines of business demands. However, the very nature of modern big data poses significant challenges to traditional data management systems [4].
The main limitations of traditional data management systems stem from their fixed schemas and rigid data models, which in many situations constrain an organization's capacity to handle diverse data types. It is challenging to include new data sources or to evolve with the business needs [5]. Also, data silos, created when departments or applications store data independently, obstruct data sharing and integration, resulting in incomplete and inconsistent datasets that undermine the trustworthiness of analysis results. Finally, scaling these systems to manage extremely large volumes of data is often excessively expensive in terms of the initial investment in hardware and software platforms, as well as the ongoing maintenance and upgrade costs, which can represent a significant cost factor for organizations dealing with big data [5].
To overcome these limitations, the concept of the "data lake" has been introduced. Data lakes consolidate data from various disparate sources into a single, unified management system. This approach eliminates the fragmentation and inconsistency of data silos and allows organizations to apply uniform data governance. Data lakes provide flexible data access, automated and efficient data preparation workflows, and reliable data pipelines [6]. These capabilities support a wide range of analytical applications, enabling organizations to conduct comprehensive analytics on all available data and facilitate end-to-end machine-learning workflows.
This survey aims to shed light on the existing works that explore data-lake architectures. The main research contributions made in this study are as follows:
• We provide a thorough explanation of the data-lake definition and concept.
• We provide a detailed overview of the differences between data warehouses and data lakes.
• We review and categorize existing data-lake solutions based on their architectures.
• We construct a chronological timeline graph to illustrate the evolution of data-lake architectures.
• We explore the significance of data lakes in modern data architecture, the challenges they pose, and future data-lake development.
The remainder of this paper is organized as follows: Section 2 presents the applied review methodology. Section 3 introduces the data-lake definition and its main characteristics. Recent existing studies related to data lakes are reviewed in Section 4. The findings of this study are presented in Section 5. Section 6 provides a chronological timeline graph to illustrate the evolution of data-lake architectures. Finally, Section 7 concludes the paper by addressing the importance of data lakes in modern data architecture, along with their adaptation challenges and future development trends.

Review Methodology
This study employed a systematic literature review approach to examine and synthesize the existing research on data-lake architectures. The review process followed these key steps:
1. Research question formulation: We defined the primary research question as "What are the major types of data-lake architectures that have been proposed and implemented, and how have they evolved over time?"
2. Literature search: A comprehensive search was conducted using academic databases including IEEE Xplore, ACM Digital Library, ScienceDirect, and Google Scholar. These databases were chosen for their extensive coverage of relevant literature:
• IEEE Xplore: Provides a vast collection of technical literature in electrical engineering, computer science, and electronics, essential for research on data-lake architectures.
• ACM Digital Library: Contains a comprehensive collection of full-text articles and bibliographic records in computing and information technology.
• ScienceDirect: Offers access to a wide range of scientific and technical research, including key journals and conference proceedings.
• Google Scholar: Ensures broad coverage across disciplines and indexes a variety of academic publications.
Search terms included "data lake architecture", "data lake design", "data lake implementation", and related keywords. The search covered publications from 2008 to 2024 to capture the full evolution of data-lake concepts. The year 2008 was chosen as the starting point because it marks the early development of data-lake technology, and the period up to 2024 includes the latest advancements and implementations in the field.
3. Study selection: Inclusion criteria were applied to select relevant papers that specifically discussed data-lake architectural models, implementations, or evaluations. Exclusion criteria filtered out papers that only mentioned data lakes tangentially or did not provide substantive architectural details.
4. Data extraction: Key information was extracted from each selected paper, including the proposed architecture type, key components, advantages, limitations, and use cases. A standardized data extraction form was used to ensure consistency.
5. Architectural diagram standardization: To facilitate easy evaluation and comparison, the extracted architecture information was sketched into new diagrams using consistent notations across all architectures. This standardization ensured that all architectural representations followed a uniform format, making it easier to identify similarities, differences, and trends across various data-lake designs.
6. Quality assessment: The selected papers were evaluated for quality based on criteria such as clarity of architectural description, empirical evidence provided, and relevance to practical implementation.
7. Data synthesis: The extracted information was synthesized to identify major categories of data-lake architectures, their defining characteristics, and trends in their development over time. A chronological analysis was conducted to map the evolution of architectural approaches.
8. Critical analysis: The strengths, weaknesses, and applicability of different architectural models were critically analyzed. Comparisons were made between different approaches to highlight their relative merits and limitations.
9. Findings compilation: The key findings from the analysis were compiled, including a classification of major data-lake architecture types, a timeline of architectural evolution, and insights into the drivers of architectural changes over time.
We followed this approach to present a comprehensive and structured review of the literature on data-lake architectures, enabling the identification of key trends, challenges, and future directions in this rapidly evolving field.

Data-Lake Definition and Characteristics
The term data lake was first introduced by Pentaho CTO James Dixon [7] to address the limitations of data marts, which are business-specific subsets of data warehouses that allow only a subset of questions to be answered. In the research literature, data lakes are also referred to as data reservoirs [8] and data hubs. However, data lake is the most commonly used term in the literature. In the early stages of development, data lakes were perceived as the equivalent of Hadoop technology. From this perspective, the data-lake concept refers to the practice of using open or low-cost technologies, typically Hadoop, to store, process, and explore raw data in the enterprise [9]. In recent years, data lakes have generally been seen as a central data repository where data of all types are stored in a loosely defined schema for future use. This definition is derived from two distinct characteristics of data lakes: data variety and the schema-on-read approach, or "late binding", in which schema and data requirements are not defined until the time of data querying [10].
The most elaborate definition of a data lake, as quoted in reference [11], is "a scalable storage and analysis system for data of any type, retained in their native format, and employed primarily by data specialists (statisticians, data scientists, or analysts) for knowledge discovery. The data-lake properties include: (1) a metadata catalogue that imposes data quality; (2) data governance policies and tools; (3) availability to different kinds of users; (4) integration of any kind of data; (5) a logical and physical organization; (6) scalability in terms of storage and processing".
The characteristics of a data lake have been widely explored in academic literature. First, data lakes are characterized by storing massive amounts of data, both structured and unstructured, from diverse and varying sources. Second, data lakes aim to be agile and scalable, i.e., data volumes and types stored in the data lake may vary as business demands change. Third, data lakes are envisioned to support a wide range of data processing and analytics technologies, such as batch processing, interactive analytic queries, and machine learning [4].
Another important characteristic of data lakes is that they utilize metadata to store, manage, and analyze the data stored in the lake. Metadata describes the data, such as its source, format, and schema [11]. In addition, data lakes provide data exploration and discovery, where data analysts can navigate and delve into data without requiring explicit schemas or structures to be defined prior to analysis. Finally, data lakes are characterized by the use of low-cost storage, which permits organizations to store petabytes of data without incurring significant expenses [12].

Literature Review
According to Dixon, "whilst a data warehouse seems to be a bottle of water cleaned and ready for consumption, then 'Data Lake' is considered as a whole lake of data in a more natural state" [7]. Understanding the distinctions and advantages of both approaches can help businesses make informed decisions on which option best suits their needs. In [13], the authors delve into whether data lakes will supplant data warehouses in the foreseeable future. They begin by introducing the data-lake concept as a new architecture that stores data in its raw format, allowing for flexible processing and analysis. The paper then performs a critical analysis of the advantages and disadvantages currently offered by data warehouses to compare the two concepts: the data lake (DL) and the data warehouse (DW). The authors explore the scalability, flexibility, and cost-effectiveness of data lakes, highlighting how they accommodate the increasing variety and volume of data in modern business environments. The paper contrasts these features with the structured, schema-on-write framework of traditional data warehouses, which is less adaptable to the dynamic nature of today's data needs. Furthermore, the discussion includes the transition from ETL to ELT processes, emphasizing how data lakes support real-time data processing and analytics.
The work presented in [14] discusses the fundamental differences between data warehouses (DW) and data lakes (DL), comparing them across various dimensions to clearly distinguish their functionalities and use cases. The authors focus on key aspects such as data handling, schema design, data volume and growth, data processing, agility, user access and tools, and integration and maintenance costs. This comparison highlights the distinct advantages and disadvantages of each system, providing insights into their suitability for different analytical needs and environments.
Nambiar and Mundra [15] conduct a comparative analysis between data warehouses and data lakes, emphasizing their key distinctions. In particular, their review details the definitions, characteristics, and principal differences of data warehouses and data lakes. Additionally, the architecture and design aspects of both storage systems are thoroughly discussed. Their paper also provides an extensive overview of popular tools and services associated with data warehouses and data lakes. Furthermore, the review critically examines the overarching challenges associated with big-data analytics, as well as specific hurdles related to the implementation of data warehouses and data lakes.
Harby and Zulkernine [16] present a comparative review of existing data-warehouse and data-lake technologies to highlight their strengths, weaknesses, and shortcomings. They analyze design choices, architecture, and metadata storage and processing features. Consequently, the authors propose the desired and necessary features of the data-lakehouse architecture, which has recently gained a lot of attention in the big-data management research community.
ElAissi et al. [17] discuss in depth the differences between data-lake and data-warehouse architectures. The authors emphasize that a dedicated architecture, known as a data lake, has been developed to extract valuable insights from large volumes of diverse data types, including structured, semi-structured, and unstructured data.
Hagstroem et al. [18] discuss the advantages and strategic implementation of data lakes in business settings, emphasizing their role in handling vast amounts of structured and unstructured data cost-effectively. Furthermore, there is a focus on the stages of data-lake development, particularly how organizations can use data lakes for scalable and low-cost data storage solutions. This includes the creation of raw-data zones and integration with existing data warehouses.
Hassan et al. [19] discuss different big-data storage options, including data lakes, big-data warehouses, and lakehouses, and compare their major characteristics. The paper then explores the transformation among these approaches to introduce the data lakehouse as a new concept that combines the flexibility of data lakes with the structured query capabilities of data warehouses. Additionally, the authors conduct a detailed comparison of data warehouses, big-data warehouses, data lakes, and lakehouses, highlighting their individual strengths and limitations in terms of storage capacity, data processing capabilities, security, and cost-effectiveness.
Over time, various architectures within the data-lake ecosystem have emerged, each designed to address specific needs related to data storage, analysis, and accessibility for end-users. According to [20], the idea of data lakes first emerged with the introduction of Hadoop, which Doug Cutting and Mike Cafarella developed while working to fix the Nutch search engine at Yahoo. Google's MapReduce [21] and Google File System [22] papers served as inspiration for them. Hadoop provided a framework for distributed storage and processing, paving the way for the first versions of data lakes. In his blog post "How to beat the CAP theorem", Nathan Marz [23] first introduced the concept of Lambda architecture. He developed this framework to efficiently manage large volumes of data by utilizing both batch and real-time processing techniques. Lambda architecture describes two processing layers: a batch layer and a speed layer. In some cases, a serving layer is also included, as in [24]. The designed system aims to optimize for fault tolerance, latency, and throughput by processing data through these layers. On the other hand, maintaining code that needs to produce the same result in two complex distributed systems was the main problem with Lambda architecture. In summer 2014, Kreps [25] posted an article addressing the pitfalls of the Lambda architecture described in [23] and introduced the idea of Kappa architecture to handle those drawbacks. To manage various data types, including structured and textual data, and to organize them into an analyzable structure, Bill Inmon [26] introduced the data-pond architecture model. This model comprises a series of components known as ponds, each distinct and logically segregated from the others.
Five major zone models (i.e., variants of the zone architecture) have been identified: Gorelik [27], IBM [28], Madsen [29], Ravat [30], and Zaloni, whose model exists in multiple versions by different authors [31,32]. Ravat and Zhao [30] review various architectural approaches and then propose a generic, extensible architecture for data lakes. The authors propose a new, more structured architecture for data lakes that consists of four key zones: raw-data ingestion, data processing, data access, and governance. Each zone is designed to handle specific functions, which enhances the management and utility of data within the lake. In [33], the authors present a three-step solution to ensure that ingested data are findable, accessible, interoperable, and reusable at all times. Firstly, they introduce a metadata management system that enables users to interact with and manage metadata easily. This system is crucial for ensuring that data ingested into the data lake can be efficiently used and managed over time. Then, they detail algorithms for managing the ingestion process, including storing data and the corresponding metadata in the data lake. These processes ensure that metadata are captured accurately and maintained throughout the data lifecycle. Finally, the metadata management system is illustrated through use cases, demonstrating its application in managing real datasets and facilitating easy data discovery and usage within a data-lake environment.
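The general idea of metadata-driven ingestion can be illustrated with a toy catalogue. The sketch below is our own minimal illustration, not the system proposed in [33]; the class, method, and field names are hypothetical. It shows the core mechanism: every ingested dataset is registered with descriptive metadata (source, format, optional schema, tags) so it remains findable later.

```python
import time
import uuid

class MetadataCatalog:
    """A toy metadata catalogue for a data lake (illustrative only).

    Each ingested dataset is registered with descriptive metadata
    (source, format, schema, tags) so it stays findable and reusable.
    """

    def __init__(self):
        self._entries = {}  # dataset_id -> metadata record

    def register(self, source, fmt, schema=None, tags=()):
        """Record metadata at ingestion time and return a dataset id."""
        dataset_id = str(uuid.uuid4())
        self._entries[dataset_id] = {
            "source": source,
            "format": fmt,
            "schema": schema,        # may stay None: schema-on-read
            "tags": list(tags),
            "ingested_at": time.time(),
        }
        return dataset_id

    def find(self, tag):
        """Return ids of all datasets carrying the given tag."""
        return [i for i, m in self._entries.items() if tag in m["tags"]]

    def describe(self, dataset_id):
        return self._entries[dataset_id]

catalog = MetadataCatalog()
ds = catalog.register("sensor-feed-01", "json", tags=["iot", "raw"])
print(catalog.find("iot") == [ds])  # the dataset is findable via its tag
```

Even this small sketch captures why the catalogue matters: without the `register` step at ingestion, raw files in the lake carry no searchable description and the lake degrades into an opaque store.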
The data lakehouse was first introduced by Armbrust et al. [34] as a recent approach to combine a data lake with a warehouse. Specifically, they proposed a unified data platform architecture, which integrates data warehousing capabilities with open data lake file formats, can achieve performance levels competitive with current data-warehouse systems, and can effectively tackle numerous challenges encountered by data-warehouse users. Thus, the lakehouse seeks to provide features unique to data warehouses (ACID characteristics, SQL queries) and data lakes (data versioning, lineage, indexing, polymorphism, inexpensive storage, semantic enrichment, etc.). Their lakehouse platform design was based on the design at Databricks through the Delta Lake, Delta Engine, and Databricks ML Runtime projects.
In recent years, data lakes have emerged as a platform for big-data management in various domains, such as healthcare and air traffic, to name a few. This new approach enables organizations to explore the value of their data using advanced analytics techniques such as machine learning [35]. The study in [36] examines various established enterprise data-lake solutions, highlighting that numerous organizations have opted to develop enhancements over Hadoop to mitigate its inherent limitations and enhance data security. This includes adaptations seen in platforms such as Amazon Web Services (AWS) data lake and Azure data lake. Additionally, the authors indicate how these solutions are increasingly being adopted across diverse sectors, including banking, business intelligence, manufacturing, and healthcare, underscoring their growing popularity and utility in handling extensive data needs.
A recent study in [37] proposes a complete fish farming data-lake architecture based on a multi-zone structure with three main zones: the Raw Zone (RZ), the Trusted Zone (TZ), and the Access Zone (AZ). This architecture is designed to collect data from different farms in various formats and types, which are then saved directly into the RZ without any transformation, aiming to form a solid historical repository. Following this, the TZ applies lean transformations to prepare the data for further analysis. Finally, the AZ consists of a dedicated layer with the data required for each team. Additionally, the paper highlights various big-data technologies and their uses. The authors also investigate the recently proposed data-lake functional architecture from a technical perspective by explaining each component of the Hadoop ecosystem used.
Ref. [38] discusses the limitations of the data-warehouse approach and how the concept of a data lake was introduced as a solution to overcome these limitations. Further, the paper explores the architecture of a data lake and how the healthcare sector can benefit from data lakes. The study also highlights the use of data lakes in developing a prediction system for cardiovascular diseases. An architectural overview of the Azure data lake is presented to offer a concrete example of how a data lake can be effectively implemented within the healthcare sector. This overview emphasizes the enhancement of data management and analytics capabilities, highlighting key components such as the raw, stage, and curated zones.
The authors in [39] implement a data-lake prototype on a hybrid infrastructure combining Dell servers and Amazon AWS cloud services. The prototype illustrates how air traffic data from various FAA sources like SFDPS, TFMData, TBFM, STDDS, and ITWS can be ingested, processed, and refined for analysis. Another work in [40] presents a comprehensive framework for managing diverse data types across Internet of Things (IoT) and big-data environments. The proposed architecture is a direct extension and implementation of the conceptual zone-based data-lake architecture outlined in [30].
Some survey articles have summarized recent approaches and the architecture of data lakes, such as the one in [11], which reviews several data-lake architectures and introduces a new classification system that categorizes data lakes based on their functionality and maturity. The authors investigate the architectures and technologies used for the implementation of data lakes and propose a new typology of data-lake architectures. This includes zone architectures, which organize data by stages of processing and refinement, and pond architectures, which treat subsets of data-lake contents with specific processing and storage strategies. Another survey in [41] identified two major architectures: the data pond and the data zone. In [42], several architecture models are taken into consideration, including the basic two-layered architecture, the multi-layered model, and the data-lakehouse model. Because additional architecture models have proven valuable to evaluate, our survey expands this classification to include further data-lake architectures.

Findings
Based on the review that has been conducted in Section 4, we can categorize the findings observed in this study as follows:

Distinguishing Data Lakes and Data Warehouses
Many previous research works have highlighted the position of data lakes in the current data architecture. However, it is equally important to distinguish between data lakes and traditional data warehouses, since both are storage solutions for large-scale data but follow different approaches and offer different functionalities.
Data warehouses are centralized repositories which maintain structured data obtained from diverse sources in a form optimized for querying and analysis. They are based on a schema-on-write strategy, meaning the data are initially structured according to a predefined schema and then stored [19]. This schema-first strategy aims to optimize the queries and ensure data consistency, which makes the data-warehouse architecture more suitable for complex analysis and reporting [17]. On the other hand, data lakes are generally considered as storage repositories that hold both structured and unstructured data in their native format, without pre-processing or modeling, known as schema-on-read [43]. This approach provides flexibility as the schema is applied only during read time, allowing the data to be stored without any predefined structure and enabling the easy incorporation of new data types [44].
Regarding data processing strategies, data warehouses employ a process named Extract, Transform, Load (ETL), wherein data are extracted from the source systems, transformed into a format suitable for analysis, and then loaded into the warehouse. On the contrary, data lakes employ a process named Extract, Load, and Transform (ELT), wherein data are ingested into the lake in their raw format and then transformed only when required for analysis. This strategy gives organizations the freedom to apply different data processing techniques and analytics tools to the data stored in the lake as needed, without the necessity of transforming all the data up front [13,45,46].
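The ETL/ELT contrast can be sketched in a few lines of plain Python. This is a toy illustration with invented records and function names, not any particular product's API: the warehouse path validates and transforms before storing, while the lake path stores raw records and imposes the schema only at query time.

```python
# Toy example: heterogeneous records arrive from source systems.
raw_records = [
    {"id": 1, "amount": "19.99", "currency": "USD"},
    {"id": 2, "amount": "5",     "note": "no currency given"},
]

# --- Schema-on-write (warehouse, ETL): transform BEFORE storing. ---
def etl_load(records):
    table = []
    for r in records:
        table.append({
            "id": int(r["id"]),
            "amount": float(r["amount"]),
            # missing fields must be resolved up front:
            "currency": r.get("currency", "UNKNOWN"),
        })
    return table

warehouse = etl_load(raw_records)

# --- Schema-on-read (lake, ELT): store raw, transform at query time. ---
lake = list(raw_records)  # ingested as-is, no upfront transformation

def query_amounts(storage):
    """The 'schema' (amount is a float) is imposed while reading."""
    return [float(r["amount"]) for r in storage]

print(query_amounts(lake))        # [19.99, 5.0]
print(warehouse[1]["currency"])   # UNKNOWN
```

The trade-off is visible even at this scale: the warehouse path must decide how to handle the missing `currency` at load time, while the lake keeps the record untouched and defers that decision to each consumer.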
Ease of use is another factor that differentiates data warehouses from data lakes. Owing to their unstructured architecture, data lakes offer more flexibility. In contrast, data warehouses are more structured, making them difficult and expensive to manipulate. Furthermore, data lakes are more agile and flexible than data warehouses since they are less structured, and developers and data scientists can modify or reconfigure them with relative ease. Table 1 provides a summarized description of the major differences between data warehouses and data lakes.

Data Lake Architecture Classification
In our review paper, we apply the term data-lake architecture only in the sense of its conceptual organization at the highest abstraction level, while excluding architectural technologies. A reference architecture, as described by [31], is a conceptual framework for understanding industry best practices, tracking processes, using solution templates, and understanding data structures and elements. It is inherently abstract in that it expresses basic design considerations without any implied constraints on the realization. A reference architecture is a generic, use-case-free design: not a technology blueprint that dictates choices, but rather a way of mapping requirements and establishing the overall pattern for implementation. In contrast, a technology architecture identifies the services or capabilities needed to support these activities, and which services to include or exclude in an implementation given particular requirements [29]. Data-lake architecture has evolved over the years with major upgrades since its first proposal in 2010, mainly driven by the demands for agility, flexibility, and ease of accessibility for data analysis. Different data-lake architectures have been proposed, each offering some merit to data storage, data analysis, and the consumer (end-user) in various combinations. This section provides further discussion of eight of the most well-known proposed architectural models, each illustrating different styles and merits in the data-lake architecture space.

Mono-Zone Architecture
The initial concept of a data lake involves creating a centralized pool of different data sources in one location, which allows businesses to use these data for future use cases [47]. According to this early description, the initial architecture to implement the data lake featured a simple, flat design consisting of a single zone. In this mono-zone architecture, as depicted in Figure 1, all raw data are stored in their native format. This setup is often closely associated with the Hadoop ecosystem, which helps to load diverse and large-scale data at a reduced cost. However, this basic architecture has significant limitations: it lacks the capability for users to process data within the lake, and it does not track or log any user activities [30].

Lambda Architecture
The Lambda architecture was first introduced to handle batch and real-time data processing within a single system. Therefore, this architecture puts more focus on data processing and consumption, rather than on data storage [6]. The Lambda architecture, as shown in Figure 2, consists of three layers: a batch layer, a speed layer, and a serving layer. In the batch layer, the data stored in the persistent memory are available for consumption, providing a historical overview of the data. In contrast to the batch layer, the speed layer processes only the incremental data that are not yet stored in the persistent memory. Once the data are stored in the persistent memory, they stop being available in the speed layer. Together, these two layers provide the data to the end-users through the views in the serving layer [48].
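As a rough illustration of how the three layers cooperate, the following sketch merges a batch view computed over persisted data with a speed view computed over not-yet-persisted increments. The data, layer functions, and names are invented for illustration; a real deployment would use distributed batch and stream engines rather than in-memory lists.

```python
# Minimal sketch of the Lambda pattern: a batch view over persisted data,
# a speed view over recent increments, merged by a serving layer.

persisted = [("page_a", 10), ("page_b", 4)]   # historical data (batch input)
incoming  = [("page_a", 1), ("page_c", 2)]    # recent events (speed input)

def batch_layer(events):
    """Recomputed periodically over ALL persisted data."""
    view = {}
    for key, n in events:
        view[key] = view.get(key, 0) + n
    return view

def speed_layer(events):
    """Processes only increments not yet absorbed by the batch layer."""
    view = {}
    for key, n in events:
        view[key] = view.get(key, 0) + n
    return view

def serving_layer(batch_view, speed_view):
    """Merges both views so queries see complete, fresh results."""
    merged = dict(batch_view)
    for key, n in speed_view.items():
        merged[key] = merged.get(key, 0) + n
    return merged

result = serving_layer(batch_layer(persisted), speed_layer(incoming))
print(result)  # {'page_a': 11, 'page_b': 4, 'page_c': 2}
```

Note that `batch_layer` and `speed_layer` implement the same aggregation logic twice; keeping those two implementations consistent in two distributed systems is exactly the maintenance burden that motivated the Kappa architecture.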

Kappa Architecture
Kappa architecture [25], as illustrated in Figure 3, is a minimalist version of the Lambda architecture which, for the sake of simplicity, removes the batch layer and retains only the speed layer. The primary goal is to eliminate the need to recompute a batch layer from scratch every time, by performing almost all processing in real time through the speed layer. One of the disadvantages of the Lambda architecture that is avoided in the Kappa architecture is having to code and execute the same logic twice [4].
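The single-codebase idea can be sketched as follows. This is our own toy illustration (invented data and function names, with plain lists standing in for a durable event log such as Kafka): there is one stream-processing path, and "recomputation" is simply replaying the same log through a new version of the job, with no separate batch codebase to keep in sync.

```python
# Kappa sketch: one immutable, append-only event log, one processing path.
event_log = [("page_a", 1), ("page_a", 1), ("page_b", 3)]

def stream_job_v1(log):
    """Current stream job: sum the values per key."""
    counts = {}
    for key, n in log:
        counts[key] = counts.get(key, 0) + n
    return counts

def stream_job_v2(log):
    """Revised logic: count events per key instead of summing values."""
    counts = {}
    for key, _ in log:
        counts[key] = counts.get(key, 0) + 1
    return counts

# Deploying v2 means replaying the same log through the new job --
# no batch layer has to be rewritten or recomputed separately.
print(stream_job_v1(event_log))  # {'page_a': 2, 'page_b': 3}
print(stream_job_v2(event_log))  # {'page_a': 2, 'page_b': 1}
```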

Data-Pond Architecture
The data pond was introduced by Bill Inmon [26] as a proposed data-lake architecture model. As illustrated in Figure 4, the data-pond architecture comprises five logically separated ponds, each serving a distinct purpose. The first pond is the raw-data pond, which houses the ingested raw data from sources and serves as a staging area for the other ponds. Raw data are held in this pond until they are transferred to other ponds, after which the raw data are purged and inaccessible for further processing. The analog-data pond stores semi-structured, high-velocity data such as those generated by IoT devices and APIs; during the transfer from the raw-data pond, some metadata may be applied to data in this pond. The application-data pond functions like a data warehouse that is populated through extract-transform-load processes to provide support to businesses and existing applications. The textual-data pond contains unstructured textual data that undergo a textual ETL process to achieve contextualization for further text analysis. The archival-data pond, on the other hand, serves to offload inactive data from the analog-, application-, and textual-data ponds and is only queried when such data are required for analysis.
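The routing step from the raw-data pond to the typed ponds can be sketched with a toy classifier. This is our own illustration in the spirit of Inmon's model, not his specification: the record fields and routing rules are invented, and a real system would route by source and data type rather than by dictionary keys.

```python
# Toy router: raw records land in a staging (raw-data) pond and are
# dispatched to the analog, application, or textual pond.

def route_to_pond(record):
    """Assign a record to a typed pond (illustrative rules only)."""
    if record.get("kind") == "sensor":
        return "analog"          # high-velocity, semi-structured data
    if "schema" in record:
        return "application"     # structured, warehouse-like data
    return "textual"             # unstructured text for textual ETL

records = [
    {"kind": "sensor", "value": 21.5},
    {"schema": "orders", "order_id": 7},
    {"body": "free-form complaint text"},
]

ponds = {"analog": [], "application": [], "textual": []}
for r in records:
    ponds[route_to_pond(r)].append(r)
# After routing, the raw-data pond would be purged, per Inmon's model.

print({name: len(items) for name, items in ponds.items()})
```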

Zone-Based Architecture
So-called zone architectures assign data to a zone according to their degree of refinement. Numerous variants of zone architecture have been proposed in the literature, such as those in [27-29,31,32]. These variants differ significantly in the zones they include, the number of zones, the user groups they support (either data scientists only or both data scientists and business users), and their focus (processing [29] versus governance [27]). However, the fundamental idea remains the same. For instance, the zone architecture proposed by Zaloni [31,32] is one of the most widely used models for data lakes in recent years. This model, shown in Figure 5, consists of four general zones and a sandbox zone, each with different data structures and uses. The transient landing zone is where the data first arrive and are temporarily stored in their raw format. The raw zone is where the raw data are permanently stored in their original form; initial processing is done here, resulting in data indexing and record enrichment with appropriate metadata. The trusted zone holds data that have undergone additional compliance and quality checks depending on their final purpose, and the refined zone is the source of data for users with restricted data access. Finally, the sandbox serves as a test area for ad-hoc analysis and data exploration.
The transient landing zone is similar to storage in data warehousing and includes preliminary data analysis and checks for business and technical compliance. The raw zone is the unique source of trusted data for analysis and further processing and does not check data quality or compliance. The trusted zone contains only technically and regulatorily compliant data that have undergone additional compliance and quality checks, and the refined zone stores data in a form adjusted to end-users' business needs. The sandbox provides unrestricted data access for exploration and analysis purposes, but this access should be limited to users who genuinely need the entire data lake.
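The progressive promotion of a record through Zaloni-style zones can be sketched as below; the function names, the metadata attached in the raw zone, and the quality check are all illustrative assumptions rather than anything prescribed by the model:

```python
# Sketch of zone promotion: a record moves from the transient landing
# zone toward the refined zone, gaining metadata and passing checks.
def land(raw_bytes):
    """Transient landing zone: arrive in raw format, temporarily held."""
    return {"payload": raw_bytes, "zone": "transient"}

def to_raw(item):
    """Raw zone: permanent storage plus indexing/metadata enrichment."""
    item.update(zone="raw", metadata={"indexed": True})
    return item

def to_trusted(item, quality_check):
    """Trusted zone: admits only data passing quality/compliance checks."""
    if not quality_check(item["payload"]):
        raise ValueError("failed quality/compliance check")
    item["zone"] = "trusted"
    return item

def to_refined(item):
    """Refined zone: data shaped for end-users with restricted access."""
    item["zone"] = "refined"
    return item

item = to_refined(to_trusted(to_raw(land(b"csv,data")),
                             lambda payload: len(payload) > 0))
print(item["zone"])  # refined
```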

Multi-Zone Functional Architecture
Multi-zone functional architecture is based on the model proposed by [30]. The architecture provides an efficient means of data analytics by enabling users to easily locate, access, interoperate, and reuse existing data, data preparation processes, and analyses. As illustrated in Figure 6, it comprises several zones: the raw ingestion zone, the process zone, the access zone, and the govern zone. The raw ingestion zone is where data are ingested and stored in their native format, whether as batch data or near-real-time data. The process zone is where users can prepare data according to their needs and store intermediate data and processes, including all treatments and business knowledge applied to the raw data. The access zone is where processed data are accessible and consumable, enabling visualization, real-time analytics, advanced machine learning, BI analytics, and other decision support systems.
Finally, the govern zone is responsible for ensuring data security, quality, life-cycle, access, and metadata management, which are critical aspects of data sustainability and reuse. The govern zone comprises two sub-zones, namely metadata storage and security mechanisms, which enable authentication, authorization, encryption, and multilevel security policies, maintaining quality of service by monitoring resource consumption at multiple levels.

Functional Data Lake Architecture
Multi-layered data-lake architecture, proposed by [49], uses a structured approach with distinct layers for data ingestion, storage, transformation, and interaction. This architecture ensures effective data management by separating concerns and enabling communication between layers. As reflected in Figure 7, the ingestion layer collects heterogeneous data from various sources and performs initial metadata extraction on structured and semi-structured data, which is then stored in the metadata repository. The storage layer includes the metadata repository and raw-data repositories to support various data forms and structures. It also hides data storage complexity behind a user interface that facilitates querying. The transformation layer executes data operations such as cleansing, transformation, and integration, creating models similar to data marts for user-specific data access. Finally, the interaction layer provides end-users access to the metadata repository and transformed data for data exploration, analytical queries, and visualization tasks.

Data Lakehouse Architecture
The data-lakehouse architecture represents a novel approach and an architectural pattern, emerging as a potential replacement for traditional data warehouses in the coming years [34]. It combines the flexibility of data lakes, allowing storage of various data formats, with the transactional integrity of data warehouses through its layered components. While the data lakehouse does not explicitly define a particular data model, it can be seen as bridging the two-layered architecture and the zone architecture by incorporating the well-defined model of the data warehouse. The data lakehouse employs a relational (structured) form as the final structure, chosen because most analytical and visualization tools support relational data, allowing for faster data analysis. Additionally, metadata is used to facilitate the analysis process when working with data from multiple sources.
As demonstrated in Figure 8, at its foundation, a lakehouse leverages cloud object storage to house data in open file formats like Parquet, ORC, and Avro. This open format allows any compatible processing engine to interact with the data [50]. Above the storage layer sits the transactional layer, also known as the metadata layer. This layer ensures data integrity through data-warehouse-like ACID (Atomicity, Consistency, Isolation, Durability) properties. It achieves this by managing metadata and organizing data. Additionally, it supports schema enforcement, data versioning, and data lineage, all of which contribute to enhanced data quality. Open table format technologies like Delta Lake, Apache Hudi, and Apache Iceberg play a crucial role in this layer by bringing transactional capabilities to the data lake [51]. To summarize, Table 2 provides a comparative analysis of different data-lake architectures, highlighting the main advantages and disadvantages of each, along with use cases that offer practical insights into how each architecture can be applied in real-world scenarios, supporting informed decision-making. The key points of that comparison are as follows.

Data-Pond Architecture [26]
Advantages:
• Enhances data management and retrieval by organizing data based on their type and purpose
• Facilitates better governance practices, as data are managed according to their type and usage requirements
• Eliminates data redundancy, because data are stored only in the pond that suits their type
• Each pond has a fixed infrastructure, including metadata and metaprocess definitions, to support scalability
Disadvantages:
• Loss of original raw data, as this design mandates transforming data after they leave the raw pond, contradicting the core data-lake concept that data should be stored permanently in their native form for future reference and integrity

Zone-Based Architecture
Advantages:
• Enhanced data quality and security, as data undergo several layers of processing and validation
• Better data governance, as metadata can be applied throughout the zones to ensure that data lineage, quality, privacy, and role-based security measures are maintained consistently
• Supports both operational use cases and ad hoc exploratory analysis through the sandbox and explorative zones, allowing data scientists to perform advanced analytics without interfering with the production environment
• Optimized storage costs and improved performance by categorizing data into hot and cold zones
Disadvantages:
• Adds complexity and maintenance overhead due to varying management needs for each zone
• Data duplication across multiple zones significantly challenges data lineage management, increases storage costs, and risks data inconsistencies
• Delayed data availability, as data must pass through several zones before becoming available for analysis, which can be a drawback for use cases requiring real-time or near-real-time data access
• Requires domain expertise in integration, metadata management, and enforcing governance across zones
Use cases: fish farming management [37], banking data management [60], cardiovascular disease prediction [38]

Functional Data Lake Architecture [49]
Advantages:
• Layering provides a logical framework for grouping services with similar functionality, which can enhance clarity and navigability for users and administrators
• Due to the clear separation of concerns, each layer can be independently maintained, upgraded, troubleshot, and scaled, ensuring efficient resource utilization and cost optimization while minimizing disruptions to other layers
• The inherent modularity enables the selection of technologies tailored to each layer's specific requirements, allowing for greater specialization and optimization
Disadvantages:
• The absence of a clear data organization within the storage layer can hinder the implementation and enforcement of strict data governance policies across layers
• Oversimplifies data flow as a linear progression (ingestion-storage-processing-consumption), neglecting the iterative nature of data cycling between storage and processing, which can lead to inefficient resource utilization
Use cases: environmental monitoring [61], NetFlow-based cyberattack detection [62], energy optimization analytics [63]

Multi-Zone Functional Architecture [30]
Advantages:
• Offers the flexibility and scalability of the functional approach, along with the enhanced governance and data lifecycle management benefits of the zone-based approach
• Supports various use cases, from simple data lakes for small teams to complex enterprise-level data platforms with strict governance requirements
• Unlike purely functional architectures that often depict a linear flow, the hybrid approach explicitly showcases the continuous data processing loop, offering a more realistic depiction of how data are refined and enriched over time

Data Lakehouse Architecture [34]
Advantages:
• Simplifies enterprise analytics architectures, which are often costly and slow due to the separate data platforms (warehouses and lakes) required for analytical workloads
• Reduces complexity by combining features of data warehouses and lakes into a single platform
• Avoids redundant storage of multiple data copies across several platforms, helping maintain a single source of truth
• Provides a unified data format across the platform to reduce errors in data handling
• Facilitates near-real-time analytics and advanced data processing features like stream processing
• Offers a single point of access to data for users, improving ease of use and transparency across various data analysis tasks

Timeline of Data-Lake Architecture Development
The understanding of the data-lake vision has shifted away from considering it solely as a storage solution to recognizing it as a data platform. The purpose of a storage solution is to fulfill specific requirements for holding data, whereas a data platform is designed to accommodate the needs of various applications with more generalization and support for diverse data processes, analysis, insights, and use cases [29]. The objective of this section is to map the progression of proposed architectures for data lakes. This has been accomplished, as shown in Figure 9, by creating a chronological timeline graph and categorizing the evolution into four distinctive phases. Additionally, we have explored the technological landscapes and business requirements that drive such development. The first phase of data-lake architecture, introduced between 2008 and 2010, laid the foundation for scalable data storage through big-data technologies in the form of the Apache Hadoop stack. This stack, represented by its distributed file system and MapReduce, laid the groundwork for the concept of data lakes. The new vision was first coined in industry in 2010 by Pentaho CTO James Dixon, as a solution that handles raw data from a single source and supports diverse user requirements [7]. This stands in sharp contrast to data warehouses, for which the structure and usage of the data must be predefined and fixed, and rigorous data extraction, transformation, and cleaning are necessary. Data lakes, on the other hand, could avoid the expensive standard processes of data warehouses, such as data transformation, by storing raw data in the original format. This enables the schema-on-read paradigm, which defers schema application until data retrieval, thus offering flexibility in handling diverse data types.
The initial implementation of the data lake, driven by this vision, featured a flat architecture with a mono-zone approach, as described in [30]. The main focus of such a design was to fulfill business and data demands by capturing and storing massive amounts of raw, unstructured data in their original formats. This facilitated the future analysis of large datasets for insights without the constraints imposed by traditional database systems, thereby breaking down silos and providing reusable data assets for various business functions.
The original Apache Hadoop stack relied heavily on a tight coupling of storage and computing within the nodes of a cluster, meaning that each Hadoop cluster node had both storage and computational power. Although this model functioned well for batch processing and large-scale data analytics, it presented challenges in independent resource scaling. For instance, if a workload required additional compute power but not additional storage, the entire node (with both storage and compute) had to be scaled up, leading to inefficient resource utilization [68]. In addition, with ongoing changes in data processing capabilities and processing demand, achieving high real-time performance in specific scenarios was a challenge, regardless of improvements in batch processing. Stream computing engines addressed this limitation, pushing the data-lake processing paradigm to its next evolutionary stage.

DL Architecture 2.0: Real-Time Data Processing (2011-2014)
In response to the growing need for both batch and streaming data processing, the second generation of data-lake architectures prioritized improvements to address the challenge of combining these requirements for timely insights and faster response times. Nathan Marz proposed the Lambda architecture [23] as a data processing architecture to address these new business needs by allowing a distributed data store to be simultaneously consistent, highly available, and partition tolerant. The Lambda architecture achieves this by creating two streams for the same input data: one is processed using a real-time framework and the other with a batch framework, and the results are then combined into a unified view. This architecture ensures data accuracy and low-latency processing [24]. Although the Lambda architecture successfully provided a solution, its complexity and the need for separate batch and streaming pipelines constrained its maintainability. Consequently, Jay Kreps, a co-founder of Apache Kafka, introduced the Kappa architecture [25] in a 2014 blog post. This architecture utilizes a single, unified processing pipeline for both real-time and batch processing and eliminates the need for a separate batch layer that can cause latency and data processing delays.
In general, reliability, scalability, and maintainability are the three most important considerations in the software engineering of these approaches. Technologies such as Apache Kafka for distributed streaming, Apache Parquet and ORC for open data storage formats, Apache Storm for real-time computation, and Apache Spark for unified batch and stream processing played crucial roles in enabling these architectures' capabilities. At the beginning of 2015, there were growing debates on the absence of proper data governance mechanisms in initial data-lake implementations, which turned them into unusable data swamps. As data continued to grow in variety and complexity, data-lake architectural proposals focused on data organization that enhances governance practices. Most architectural models proposed during this phase were originally inspired by industrial implementations of data-lake systems in [28,29], with a set of remarkable additions to data-lake research. Among these was the suggestion of a shared platform equipped with subsystems that support the distinct needs a data lake should deliver, including acquiring data, processing data, and delivering data for use. Each subsystem could be represented by a set of technology components, rather than prescriptive blueprints or products. These proposals emphasized that the core of the data lake is its data architecture, the structure of the data arranged to permit its three primary functions: acquiring data, managing data, and utilizing data. This arrangement minimizes reprocessing and user effort by separating components that change at different rates due to varying demands. Separating the storage and models used when the data are collected from those used to normalize or deliver the data facilitates this adaptability. This pattern, known as zone-data architecture, establishes discrete zones for storing data that correspond to these component functions.
There are several other high-level proposals for data-lake architecture that establish delineations for datasets. One proposal segments ingested data by type and use [26], while another organizes data by the extent to which they are refined [31]. For instance, there may be zones for loading and quality checking, raw-data storage, refined and validated-data storage, discovery and exploration, and business or scientific analysis of data. This zone architecture, when applied to the latter proposal, enables incremental and progressive refinement of data quality and structure as data pass through each zone. By standardizing and cleansing data progressively, the architecture facilitates easier management of the data lifecycle and of governance and security policies by applying distinct rules and permissions to each zone. However, this results in multiple copies of data, which complicates data lineage and analysis and becomes increasingly costly as the number of zones increases. Additionally, this proposal lacks technical detail concerning the functions, which impedes modular and reusable implementations. Addressing both of these deficiencies, a function-oriented architecture proposal [49] establishes a four-layer mapping between data-lake users and a storage layer consisting of potentially multiple technologies, covering data ingestion, storage, transformation/processing, and access. This architecture is more transparent in matching the required technologies since each layer has clearly defined functions. During this phase of the evolutionary journey of data-lake architecture, businesses encountered increasing regulatory demands and needed data governance frameworks to ensure that data quality, security, and compliance were maintained. This development resulted in the recognition of the need for controlled and structured data environments to enable advanced analytics while maintaining regulatory compliance. Governance tools like Apache Atlas for data governance
and metadata management, Informatica Axon Data Governance, and Alation Data Catalog have enabled this structured paradigm for data management in data lakes and addressed these concerns. While this phase established the foundational technologies and processing frameworks for handling large volumes of streaming and batch data, the rising demand for effective data governance and quality standards set the stage for the subsequent evolution, which emphasized creating more manageable and compliant data environments. The growing demand for machine-learning (ML) and large language-model (LLM) workloads was the primary driver pushing businesses to enhance their data-lake architectures and adopt robust data infrastructures. Such infrastructure must prioritize provisioning high-quality, voluminous, and diverse data, recognized as the raw material for training effective models. Furthermore, it is crucial to enforce data governance policies, including security and privacy measures, across the data lifecycle to ensure data integrity and compliance.
Early on-premises, Hadoop-based data lakes suffered from poor performance due to the need to replicate data from an external repository to another storage-compute node for analytics. Additional common problems with traditional data lakes included management complexity, scalability limitations, performance constraints, and data inconsistency. Building and maintaining data pipelines within these on-premises environments was complex, as it involved managing both hardware infrastructure (such as provisioning and configuring servers, scheduling batch Extract, Load, Transform (ELT) jobs, and handling outages) and software components (such as integrating diverse tools for data ingestion, organization, preprocessing, and querying) [68]. Furthermore, scaling on-premises data lakes was a manual process that required adding and configuring new servers, demanded constant monitoring of resource utilization, and resulted in increased maintenance and operational costs, primarily related to IT and engineering resources. Performance limitations often arose under high concurrency or when dealing with large datasets. The complex architecture could lead to broken data pipelines, slow data transformations, and error-prone data movement outside the data lake. These issues, coupled with the rigid architecture, also increased the risk of data governance and security breaches [68]. The absence of ACID transaction support in traditional data lakes created the potential for partial updates or data corruption during concurrent writes. As a result, the reliability of data queries and analyses was compromised, impacting the effectiveness of machine-learning models trained on these data [34].
Consequently, the migration of on-premises data lakes to the cloud as infrastructure-as-a-service offerings gained significant momentum. The increasing volume of data, the need for scalable storage solutions, and the maturation of cloud service offerings for handling both structured and unstructured data have motivated this shift. Cloud platforms now deliver comprehensive services for the entire data journey, from ingestion and storage to processing and analysis [68]. Moreover, cloud service providers continually enhance security features to guarantee high levels of data protection [69]. End-to-end cloud-native data lakes offer several benefits, including no data restrictions, a single storage layer without silos, the flexibility to run diverse compute tasks on the same data store, independent scaling of compute and storage, reduced costs through elastic scaling, and the ability to leverage advanced analytics and machine-learning tools [70].
Modern cloud data-lake architectures leverage two key hybrid approaches. The first, known as multi-zone functional architecture, integrates zone-based data organization (raw, curated, processed) with dedicated functional layers for tasks like ingestion, processing, and analysis, producing a more comprehensive data-lake model [30,64,71]. The second, the data lakehouse, focuses on unifying data access by integrating the cost-effective and flexible storage of a data lake with the transactional consistency (ACID guarantees) of a data warehouse [34]. The data-lakehouse model is gaining significant traction in both academic research and industry applications. This is attributed to its capability to unify schemas and seamlessly combine streaming and batch data processing, ensuring reliable and consistent outcomes. Lakehouse platforms such as Databricks also provide governance layers, such as Unity Catalog, to handle permissions and access controls across workspaces. Other data-lakehouse benefits include machine-learning support, promotion of data openness, and performance optimization. Hudi, Delta Lake, and Iceberg are table formats that enable ACID transactions, which serve as the foundation for building data lakehouses. These transactions rely on detailed transaction logs that record all data modifications within the lake. In the event of failures or errors, these logs enable the rollback or replay of operations. This feature simplifies data pipelines and accelerates SQL workloads by eliminating the need for complex error handling and data reconciliation processes typically required to ensure data consistency in non-ACID environments [51]. Businesses now require scalable, flexible, and cost-effective solutions that provide robust performance and governance to enable them to leverage data as a strategic asset across all operational aspects. This evolution underscores the ongoing refinement of data-lake architectures to meet modern data management challenges and business needs.
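The transaction-log mechanism described above can be illustrated with a deliberately simplified, library-free sketch. It mimics the idea behind the Delta Lake, Hudi, and Iceberg logs, not their actual on-disk formats: the table's state is always derived by replaying only committed entries, so an aborted writer leaves no trace.

```python
# Minimal sketch of a transaction log providing atomicity: every
# modification is appended to the log, and the current state is a
# replay of committed transactions only.
class Table:
    def __init__(self):
        self.log = []  # ordered record of all modifications

    def commit(self, ops):
        """Atomically record a batch of key -> value writes."""
        self.log.append({"status": "committed", "ops": ops})

    def abort(self, ops):
        """Record a failed transaction; its writes never take effect."""
        self.log.append({"status": "aborted", "ops": ops})

    def snapshot(self):
        """Current state = replay of committed transactions only."""
        state = {}
        for txn in self.log:
            if txn["status"] == "committed":
                state.update(txn["ops"])
        return state

t = Table()
t.commit({"row1": "v1"})
t.abort({"row1": "corrupt"})   # failed writer leaves no trace
t.commit({"row2": "v2"})
print(t.snapshot())  # {'row1': 'v1', 'row2': 'v2'}
```

Because the log, not the data files, is the source of truth, readers never observe a partial update, which is the property concurrent writers in non-ACID data lakes lack.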

Conclusions
This section addresses the importance of data lakes in modern data architecture, along with the challenges organizations may encounter while implementing and managing them. Furthermore, we have discussed future development trends of data-lake implementation and demonstrated how integrating AI and ML directly within data lakes revolutionizes their functionality. Finally, we have suggested some avenues for future research that build upon our current understanding and expand the scope of data-lake implementations.

Significance of Data Lakes in Modern Data Architecture
The diversity of data sources requires new integration solutions that can handle the data volume, velocity, and variety created by distributed systems. Schema-first integration, utilized by data warehouses and implemented using ETL (Extract-Transform-Load) frameworks, is not agile or dynamic enough to cope with the modern data management cycle, making it unsuited to the integration of modern data sources. This is where a data lake excels, offering the ability to integrate, manage, and analyze any data from different sources [4].
The advent of large data lakes represented a paradigm shift in the way organizations manage and analyze their data assets. First-generation data lakes were built in corporate data centers using large nodes and, while ambitious in intent, these early initiatives often faced conflicting storage and serving requirements. For example, the deep analytic queries required by a data scientist working on predictive analytics for revenue modeling clashed with the fast API responses needed for customer-facing applications using edge computing. This was largely due to the tightly coupled storage and computing architecture of the Apache Hadoop stack, upon which most of these implementations were based [51].
While first-generation data lakes struggled with these inherent conflicts, the underlying principle of the data lake represented a step change in data management. As data lakes matured, they came to be perceived as a major pillar of modern data architecture because they enabled enterprises to handle the vastness of the data they wished to capture, store, and process [42]. Data landed in their native, unstructured format, thus maintaining the full integrity of the data and avoiding unnecessary data loss due to preconceived pre-processing and transformation steps. Instead, the data could be integrated and analyzed when required in a schema-on-read layer, enabling agility in adapting to new data sources and changing business needs.
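The schema-on-read principle can be demonstrated with a small, self-contained sketch; the record layout and schema here are invented for illustration. Raw records are stored verbatim, and types are applied only when a consumer reads them:

```python
import json

# Schema-on-read sketch: raw records are stored untouched, and a
# schema is applied only at read time by the consumer.
raw_store = [
    '{"id": 1, "amount": "9.99", "note": "first sale"}',
    '{"id": 2, "amount": "12.50"}',  # missing "note" field is fine
]

def read_with_schema(raw_store, schema):
    """Apply types at read time; fields outside the schema are ignored."""
    for line in raw_store:
        rec = json.loads(line)
        yield {key: cast(rec.get(key)) for key, cast in schema.items()}

# Two consumers can read the same raw data with different schemas;
# this one only cares about id and a numeric amount.
schema = {"id": int, "amount": float}
rows = list(read_with_schema(raw_store, schema))
print(rows[1])  # {'id': 2, 'amount': 12.5}
```

The raw store never changes shape; only the reader's schema does, which is what makes adapting to new data sources cheap.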
Data lakes also simplify the integration of multiple data sources, enabling the enterprise to discover patterns and relationships between differing data types, in particular unstructured data, which constitute many big-data sources. Due to the heterogeneity of the data types and formats that a data lake supports, analysis is quicker and large-scale data analysis is more precise, leading to improved and better-informed decisions [12].
Another key advantage of a data lake is that it reduces reliance on proprietary data-warehousing solutions, which typically involve heavy investment in hardware and software. Data lakes instead leverage open-source tools and commodity hardware, making them far more cost-effective for storing and managing extremely large volumes of data.
Lastly, data lakes facilitate real-time data ingestion with no transformations, removing the latency between extraction and ingestion and enabling timely decision-making [27]. A data lake and the analysis performed around it are typically implemented using distributed systems, which provide high scalability and resilience to hardware and software failures.

Data Lake Challenges
Despite all the benefits of data lakes discussed in Section 7.1, there are also serious issues that organizations may encounter when deploying and maintaining their data lakes. First of all, data quality is a major concern [70]. Since data lakes store raw, unstructured data, it is hard to control the consistency and accuracy of such data. Therefore, organizations need to define data governance policies to maintain data quality, which can take time and incur high costs. Another issue is data security: data lakes give many users easy access to data, making them prone to security attacks [72]. Hence, proper security features should be developed to safeguard sensitive data from unauthorized users. Moreover, data lakes do not possess any standardized schema or data models, making it hard to integrate data from various sources. Lack of visibility into the data stored in the data lake is another issue. Data warehouses, by contrast, came with governance built in, whereas early data-lake use cases often made no attempt to manage the data. This led to data swamps with poor visibility into data type, lineage, and quality [4].
Unlike in data warehouses, governance practices in data lakes are often hard to deploy and enforce, as metadata is not automatically attached to the data when ingested into the data lake. As a result, it is difficult to structure the data lake and apply governance policies [48]. Metadata helps organizations implement data privacy and role-based security, which are essential for organizations operating in heavily regulated industries. To effectively utilize metadata, organizations need to integrate the data lake with existing metadata utilities in the overall ecosystem to track data usage and transformations outside the data lake.
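As an illustration of attaching metadata at ingestion time, consider the sketch below; the fields chosen (source, timestamp, checksum) are a minimal assumption on our part, not a standard, but they are the kind of technical metadata that lineage and governance policies need to act on:

```python
import datetime
import hashlib

def ingest(payload: bytes, source: str):
    """Wrap a raw payload with minimal technical metadata at ingestion."""
    return {
        "payload": payload,
        "metadata": {
            "source": source,  # where the data came from (lineage)
            "ingested_at": datetime.datetime.now(
                datetime.timezone.utc).isoformat(),
            "sha256": hashlib.sha256(payload).hexdigest(),  # integrity
        },
    }

# Hypothetical record from an illustrative source identifier.
item = ingest(b"sensor,23.5", source="plant-7/line-2")
print(item["metadata"]["source"])  # plant-7/line-2
```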
Lastly, creating a big-data-lake environment is not simple and involves integrating many different technologies. As a consequence, the strategy and architecture are also complex, as organizations need to integrate existing databases, systems, and applications to break down data silos, automate and operationalize certain processes, and implement and enforce enterprise-wide governance policies. However, most organizations lack the necessary in-house skills to successfully execute an enterprise-grade data-lake project, which can lead to expensive errors and delays [5].

Future Data Lake Developments Trends
The future of data lakes is dynamic and constantly changing, influenced by several important trends. One major shift is the increasing integration of AI and ML capabilities directly within data lakes, allowing for end-to-end workflows and real-time insights from vast amounts of structured and unstructured data. This integration requires APIs that improve data accessibility for downstream AI/ML applications and scalable storage solutions optimized for AI, such as vector and tensor storage [73]. Such storage solutions are necessary for deep learning because tensors can manage not only embedding vectors but also input datasets and model parameters like weights and biases, providing a unified approach to handling multi-dimensional data. The democratization of data is also gaining momentum, empowering business users with self-service analytics tools and intuitive interfaces that enable non-technical users to perform analyses and generate insights independently. Some work in this area focuses on developing NLP-based interfaces and low-code/no-code platforms to make data accessible to a wider audience [74]. Furthermore, organizations are striving to create unified data ecosystems that seamlessly integrate data lakes with other data management and analytics tools. This involves integrating with data warehouses, metadata management systems, and AI/ML platforms, with research focusing on metadata exchange standards and data-lakehouse architectures for enhanced interoperability and a unified view of data [75]. Advances in AI can also revolutionize data management within data lakes. AI-driven automation can simplify data ingestion, ensuring data are loaded accurately and efficiently into the lake. AI algorithms also help in identifying and resolving data quality issues, such as error detection and correction [76] and data cleaning [77], so that analysis can maintain high standards. Automated metadata generation and management through AI tools simplifies data
understanding and utilization.The augmented data catalog represents one of the innovations focusing on this trend.The augmented data catalogs leverage machine-learning capabilities to automatically manage the discovery, inventorying, profiling, tagging, and creation of semantic relationships among distributed and siloed data assets.The goal is to create active metadata that can guide and automate data management tasks.
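To make the idea of automated metadata generation concrete, the sketch below shows a minimal, hypothetical catalog profiler in Python. It is not a real catalog product's API: the entry structure, the heuristic tags (`candidate-key`, `categorical`, `pii-candidate`), and the thresholds are illustrative stand-ins for the machine-learning classifiers an augmented data catalog would apply at scale.

```python
import re

def profile_column(name, values):
    """Build one catalog entry for a column: inferred type, null rate, cardinality, tags."""
    non_null = [v for v in values if v is not None]
    entry = {
        "column": name,
        "inferred_type": type(non_null[0]).__name__ if non_null else "unknown",
        "null_fraction": round(1 - len(non_null) / len(values), 2) if values else 0.0,
        "distinct_count": len(set(non_null)),
        "tags": [],
    }
    # Heuristic tagging: simple stand-ins for ML-driven classification in a real catalog.
    if non_null and entry["distinct_count"] == len(non_null):
        entry["tags"].append("candidate-key")          # all values unique
    if non_null and entry["distinct_count"] <= max(1, len(non_null) // 10):
        entry["tags"].append("categorical")            # low cardinality
    if any(isinstance(v, str) and re.search(r"[^@\s]+@[^@\s]+", v) for v in non_null):
        entry["tags"].append("pii-candidate")          # looks like an e-mail address

    return entry

def profile_dataset(records):
    """Profile every column of a list-of-dicts dataset into catalog entries."""
    columns = {key for record in records for key in record}
    return {c: profile_column(c, [r.get(c) for r in records]) for c in sorted(columns)}

if __name__ == "__main__":
    rows = [
        {"id": 1, "email": "a@example.com", "region": "EU"},
        {"id": 2, "email": "b@example.com", "region": "EU"},
        {"id": 3, "email": None, "region": "US"},
    ]
    for entry in profile_dataset(rows).values():
        print(entry)
```

Entries like these become the "active metadata" the text describes: a downstream policy engine could, for example, automatically mask columns tagged `pii-candidate` or propose columns tagged `candidate-key` as join keys.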

Future Research Directions
Future work that expands on this study could include a multidisciplinary literature review integrating insights from fields such as big-data analytics, data governance, distributed computing, cloud computing, data integration, AI and machine learning, edge computing, the semantic web and ontologies, cybersecurity, and human-computer interaction; such breadth could significantly enhance data-lake research and lead to advanced data-lake solutions that address challenges in data management, security, interoperability, and user experience. Another important research initiative is the development of standardized reference architectures and implementation frameworks, which are crucial for guiding organizations in setting up and managing data lakes effectively. These frameworks should encompass best practices and guidelines for data modeling, metadata management, and integration with existing systems. In addition, establishing industry standards and compliance requirements would ensure data quality, security, and interoperability across different data-lake implementations. Future research should also prioritize a thorough, in-depth evaluation of the technological tools used in data-lake implementations, including feature analysis, performance benchmarking, and suitability assessment for use cases such as real-time analytics and machine learning. Practical case studies of real-world implementations and detailed comparative analyses of different tools can help practitioners make informed decisions when choosing the most appropriate solutions for their specific needs.

Table 1. Comparison of data-warehouse and data-lake characteristics.

Table 2. Comparison of data-lake architecture models.