Introduction

Nowadays, the volume of data generated by sources such as hospital information systems, medical sensors, medical devices, websites, and medical applications is growing rapidly, by terabytes and petabytes. Moreover, the users of such data may be located inside hospitals or far from them. Integrating and handling this huge amount of data, known as “Big Data”, is a real challenge. Since health data involves large collections of structured and unstructured datasets, and as demonstrated in [1], efficiently integrating, storing, and managing this huge amount of data on a single hard drive using traditional data warehousing platforms is very difficult. In addition, such traditional data management systems cannot handle the unstructured data of medical applications, they are limited to handling data in the order of megabytes, their performance degrades when data increases, and they fail to scale up. Indeed, several issues and problems arise over big data when using OLAP and data warehousing, as detailed in [2]. Among these problems, we quote:

  1. The fact table can easily grow in size, leading to severe computational issues.

  2. Complexity and hardness of designing methodologies for OLAP and data warehousing.

  3. Issues in the computation methodologies of the OLAP data cube.

  4. Complications in integrating models, techniques, algorithms, and computational platforms for OLAP over big data with classical platforms.

  5. Query languages and optimization.

An alternative solution that has emerged and is able to handle and manage such a volume of structured and unstructured data is the Hadoop ecosystem. It is the best alternative since it scales up well by distributing processing across several host servers, which are not necessarily high-performance computers, and it is based on a distributed file system [3]. Indeed, a major advantage of the Hadoop framework, apart from its scalability and robustness, is its low implementation cost, since it uses multiple existing ordinary computer servers instead of one high-performance cluster. Hadoop pushes processing to the data instead of moving data to the processing. Furthermore, it supports ETL (Extract, Transform, and Load) processes in parallel. However, several issues arise when integrating and warehousing medical big data, such as the lack of a standard and powerful integration architecture, which makes implementing a big data warehouse a major issue. Moreover, former design methodologies for data models cannot meet all the needs of new data analysis applications over big data warehousing, given new constraints such as the nature and complexity of medical big data.

The purpose of this paper is to address the problems that arise when using OLAP and data warehousing over big data, especially medical big data. The contributions of this paper can be summarized as follows:

  • Firstly, we give an overview of some previous traditional medical DWs (Data Warehouses), their limitations, and some recent Hadoop-based DWs.

  • Secondly, we propose a system architecture and a conceptual data model for a MBDW (Medical Big Data Warehouse).

  • Thirdly, we offer a solution to overcome both the growth of the fact table size and the lack of primary and foreign keys in the Apache Hive framework, which are required by the conceptual data model. This solution is based on nested partitioning according to the dimension table keys.

  • Finally, we apply our solution to implement an MBDW to improve the distribution of medical resources for the health sector of the Bejaia region (Algeria).

The remainder of this paper is organized as follows. A brief review of traditional and modern medical DWs is given in Section 2. Section 3 details some concepts and tools used in the rest of this work. The system architecture of the MBDW and a conceptual data model that exploits partitioning and bucketing are presented in Sections 4 and 5. The last section discusses the implementation and experimental results of our case study.

Related works

In this section, we introduce some medical DWs, then discuss their limitations and common drawbacks.

Traditional medical data warehouses

Despite the widespread interest in data integration and historization in the medical field [4], this field has been slow to adopt data warehousing. Once adopted, however, several studies and projects were conducted on medical DWs. Among the first ones, Ewen et al. [5] highlighted the need for a DW in the health sector. The authors of [6] showed the main differences between conventional business-oriented DWs and clinical DWs, and also identified key research issues in clinical data warehousing. GEDAW [7] is a bioinformatics project that consisted of building a DW on the hepatic transcriptome. This project aimed to bring together, within a single knowledge base, the complex, varied, and numerous data on liver genes for analysis. Its objective was to provide researchers with a comprehensive tool for integrating the transcriptome and to provide decision support to guide biological research. The DWs of different health insurance organizations in Austria were merged in an evidence-based medicine collaboration project [8] called HEWAF (Healthcare Warehouse Federation). Kerkri et al. [9] presented the EPIDWARE architecture to build DWs for epidemiological and medical studies; it improved the medical care of patients in care units. Pavalam et al. [10] proposed a DW-based architecture for EHR (Electronic Health Record) for the Rwandan health sector; it can be accessed from different applications and allows data analysis for a swift decision-making process as well as epidemic alerting. A data warehouse that gives information on epidemiology and public health, various diseases, their concentration, and resource repartition in the Bejaia department is proposed in [11,12,13]. The aim of this DW is to improve the allocation of medical resources in the Bejaia region.

Issues and limitations

The dramatic increase in devices and applications generating enormous amounts of medical data has raised major issues in managing, processing, and mining these massive datasets. New solutions are needed since traditional frameworks can no longer handle the volume, variety, and velocity of current medical applications. As a result, the previously presented DWs, and several others, face many common issues and limitations over medical big data. Among these issues, we highlight the most important ones. Firstly, the huge volume and fast growth of medical data: in the medical field, a large amount of information about patients’ medical histories, symptomatology, diagnoses, and responses to treatments and therapies is generated and must be collected. Secondly, the unstructured nature of medical data is another important issue. In fact, unstructured documents, images, and clinical notes represent over 80% of current medical data [14], and such unstructured data should be converted into analysis-ready datasets. Thirdly, the complexity of data modeling: it is not easy to compute OLAP data cubes over big data, mainly because of the explosive size of the datasets and the complexity of multi-dimensional data models [2]. For instance, with 10,000 diagnoses this would amount to 2^10,000 dimensions; therefore, such a solution is practically unusable [6]. Fourthly, the integration of very complex medical data: this aspect not only raises the typical problems known from the literature on data and schema integration, it also has deep consequences for the kind of analytics to be designed [15], since effective large-scale analysis often requires collecting data from multiple, strongly heterogeneous sources. For instance, obtaining an overall health view of a patient requires integrating and analyzing the medical health record along with readings from multiple meter types such as accelerometers, glucose meters, heart meters, etc. [14]. Finally, the temporal aspects of medical data, known as valid time and transaction time, must both be supported to provide bi-temporal support. Indeed, it is important to know when data is considered valid in the real world and when it is stored and changed in the database [6]. For instance, a blood test can be made multiple times for the same patient, yet each result is considered valid only for a small period of time. For more details about medical big data warehousing issues, see [16].

Hadoop-based data warehousing

Aside from the above issues and limitations of traditional data warehousing technologies, healthcare systems have rapidly adopted electronic health records, which has dramatically increased the size of clinical data. As a result, there are opportunities to use a big data framework such as Hadoop to process and store medical big data, since the most important aspect of big data analysis is to process the data and obtain the result within a time constraint. Researchers and practitioners have started to adopt such new platforms, and a small number of studies have been undertaken, as depicted in Table 1. Yao et al. [17] built a five-node Hadoop cluster to execute distributed Map-Reduce algorithms and study user behavior with respect to the various data produced by different hospital information systems in daily work. They showed that medical big data makes the design of hospital information systems more intelligent and easier to use by enabling personalized recommendations. In order to interact efficiently with structured and unstructured medical data in the epilepsy field and answer related queries, Istephan and Siadat [18] proposed a framework that dynamically integrates user-defined modules. Hadoop-GIS [19] is a spatial data warehousing system, built over Map-Reduce, for medical image processing; it supports various analytical queries. Saravanakumar et al. [20] developed a predictive analysis algorithm over a Hadoop environment to predict and classify the type of diabetes and the type of treatment to be provided. According to the authors, this system helps to deliver effective cure and care to patients while enhancing outcomes such as affordability and availability. Rodger [21] used Hadoop and Hive as a data warehouse infrastructure to improve the prediction of traumatic brain injury survival rates using data classification into predefined classes. To determine how the set of collected variables relates to body injuries, he collected data on three ship variables (Byrd, Boxer, and Kearsage) and injuries to different body regions such as head, torso, extremities, and abrasions, and proposed a hybrid approach on multiple ship databases employing various scalable machine-learning algorithms. Raja and Sivasankar [23] proposed a Hadoop-based framework to modernize healthcare informatics systems and infer knowledge from various healthcare centers in different geographic locations.

Table 1 Big data medical frameworks

Yang et al. [24] built a cloud-based storage and distributed processing platform to store and handle medical record data using HBase from the Hadoop ecosystem, providing diverse functions such as keyword search, data filtering, and basic statistics. Depending on the size of the data, the authors used either the single-threaded Put method or the Complete-Bulkload mechanism to improve the performance of data import into the database.

The works listed above [17,18,19,20,21,22,23,24] used high-performance computers and very powerful IT infrastructures that required a substantial investment to exploit the great processing capabilities of Hadoop technology. In this work, our goal is to re-use the existing IT infrastructure and workstations of the health institution, without a large investment in hardware, while still adopting data warehousing and OLAP methodologies. However, as indicated in [25], data warehousing with Hadoop and Hive faces many challenges. Firstly, Hadoop does not support OLAP operations, and the multi-dimensional model cannot meet all the needs of new data analysis applications. Secondly, the Apache Hive query language, HiveQL, does not support many standard SQL features. Overcoming such limitations and challenges in the medical field is the objective of this work.

Background review

This section details some concepts and tools used in this study.

Big data concept

The concept of ‘big data’ is often described by three Vs and, more recently, by six Vs: Volume, Velocity, Variety, Veracity, Variability, and Value. The high volume of generated data refers to Volume (1st V). The rate of data generation and the speed at which it should be processed refer to Velocity (2nd V). The heterogeneity and diversity of data types refer to Variety (3rd V). The degree of reliability and quality of the data sources refers to Veracity (4th V). The disparity and variation of data flow rates refer to Variability (5th V). Finally, the Value (6th V) of data depends on the volume of analyzed data. Traditional data management systems cannot handle data with such characteristics; therefore, new platforms and technologies were developed.

Big data technologies

To manage the volume, velocity, variety, and variability of big data, several technologies, tools, and frameworks were developed. The most important ones are: Map-Reduce-based systems, for instance BigTable, HadoopDB, and the Hadoop ecosystem with its main components such as HBase, Pig, and Hive; NoSQL databases, for instance MongoDB, Cassandra, and VoldemortDB; in-memory frameworks, for instance Apache Spark; graph databases, for instance Neo4J, AllegroGraph, InfiniteGraph, HyperGraphDB, InfoGrid, and Google Pregel; RDBMS-based systems, for example Greenplum, Aster, and Vertica; and stream data processing frameworks, for instance Apache S4, Apache Kafka, Apache Flume, Twitter Storm, and Spark Streaming.

Map-reduce paradigm

The Map-Reduce paradigm [26] is a programming model allowing parallel computation of tasks using two user-defined primitives: the map and reduce phases. In the first, “Map”, phase, a user-defined mapper loads a data chunk from the DFS and transforms it into a list of intermediate key/value pairs (ki, vi). These key/value pairs are then buffered, partitioned into r intermediate files, and sorted by key. In the second, “Reduce”, phase, a user-defined reducer combines the files coming from the different mappers. The final results are written back to the DFS [27].

Hadoop ecosystem

The Hadoop ecosystem consists of several projects and libraries, such as the massive-scale database management solution HBase, the data warehousing solution Hive, the machine learning library Mahout, the coordination service ZooKeeper, the data flow scripting platform Pig, and other related libraries and packages for massively parallel and distributed data management.

Hadoop

Apache Hadoop [28] is open-source software allowing large-scale parallel and distributed data analysis. It automatically replicates and collocates data across multiple nodes, then allows parallel processing across clusters. It is an open-source implementation of Google’s Map-Reduce computing model and is based on HDFS (Hadoop Distributed File System), which provides high-throughput access to application data and makes the platform robust, fault-tolerant, and reliable. Doug Cutting implemented the first version of Hadoop in 2004, and it became an Apache Software Foundation project in 2008 [29]. On November 17, 2017, release 2.9.0 of Apache Hadoop became available [28].

Hive

Apache Hive [30] was initially a subproject developed by the Facebook team. Hive is used in conjunction with the map-reduce model of Hadoop to structure data, run queries, create a variety of reports and summaries, and perform historical analysis over the data. In Hive, data is stored in tables; each table consists of a number of rows, and each row comprises a specified number of columns. Queries are expressed in an SQL-like declarative language called HiveQL; they are compiled into map-reduce jobs and executed using Hadoop. Hive supports primitive and complex types. It processes data for analysis rather than serving end users, so it does not need the ACID guarantees of a traditional relational database for data storage.
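
As an illustration, the following sketch shows the flavour of HiveQL; the table and column names are hypothetical and do not come from this work. A table is declared with SQL-like DDL, and the aggregation query is transparently compiled by Hive into map-reduce jobs.

    -- Hypothetical HiveQL example (table and column names are illustrative only).
    CREATE TABLE consultations (
      id_patient STRING,
      id_illness STRING,
      visit_date STRING,
      cost       DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- An aggregation query; Hive compiles it into one or more map-reduce jobs.
    SELECT id_illness, COUNT(*) AS nb_visits, AVG(cost) AS avg_cost
    FROM consultations
    GROUP BY id_illness;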

System architecture and data flow

Implementing an efficient medical data warehouse based on the Hadoop ecosystem requires a flexible and modular big data warehousing architecture. This section describes the proposed architecture along with the functionalities of each layer.

The overall architecture is depicted in Fig. 1. It is a scalable, reliable, and distributed architecture to extract, store, analyze, and visualize healthcare data extracted from various HIS (Hospital Information System) resources. In the following, we give some details about the components and levels of the proposed architecture and the data flow between these components.

Fig. 1 Hadoop-based system architecture of medical big data warehousing

  (1) Medical data source

In each region (the provinces of a department, according to the geographical division), medical data is extracted from the multiple distributed computers of the HIS (Hospital Information Systems), from other information systems, laboratory software, and radiology data, and from the regional directorate of Health.

  (2) Distributed ETL

The ETL is the component responsible for collecting data from the different data sources and for transforming and cleaning them based on business rules and requirements defined by the final user [31].

Our ETL approach is based on partitioning the input data horizontally according to the nested partitioning technique (Section 5.1). In order to run in a distributed and parallel way, the extraction phase is performed in each medical establishment of each geographic division (region) to capture data. During this phase, all source tables are extracted from the appropriate data sources and converted into a column structure (CSV format). The transformation phase involves data cleansing to comply with the target schema, based on the description of the multi-dimensional data model (partitioned schema), which is stored as metadata in an HDFS file. Finally, in the loading phase, the data is propagated into the regional storage system (database server). As we will see in the next section, the data model is partitioned according to the dimension tables (in our case study, the keys of the ‘Region’ dimension table, then the ‘Health-establishment’ dimension table, and then the ‘Date’ dimension table). The integrated and stored data of the database servers are then replicated in the DataNodes of the Hadoop cluster according to a replication strategy.
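
As a simple illustration of how such CSV extracts can later be exposed to Hive for the loading steps described in Section 5, the sketch below declares an external table over an HDFS directory; the path and column names are hypothetical assumptions and not taken from the actual ETL scripts.

    -- Hypothetical staging of transformed CSV extracts as a Hive external table.
    CREATE EXTERNAL TABLE staging_consultation (
      id_patient         STRING,
      id_illness         STRING,
      id_doctor          STRING,
      id_region          STRING,
      id_h_establishment STRING,
      id_date            STRING,
      cost               DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/mbdw/staging/consultation';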

New database servers, and automatically new DataNodes in the Hadoop cluster, are created for new sources (for instance, new hospitals). This feature ensures the scalability of the architecture.

  (3) Hadoop ecosystem

The core of our architecture is the Apache Hadoop [28] framework. It is an open-source implementation of Google’s Map-Reduce computing model. It is based on HDFS (Hadoop Distributed File System) which provides high-throughput access to application data.

  (3.1) Hadoop. A Hadoop cluster is the corpus of server nodes, possibly in different physical locations, grouped under Hadoop [32]. It automatically replicates and collocates data across multiple nodes, then performs parallel processing across the cluster using the Map-Reduce paradigm. Thus, Hadoop reduces infrastructure cost. Its most important components are:

  • NameNode acts as a metadata repository describing the location of the data blocks of each file stored in HDFS. This makes massive data processing easier.

  • Secondary NameNode is in charge of periodically checking the persistent state of the NameNode and downloading its current snapshot and log files.

  • DataNode runs separately on every machine in the Hadoop cluster. In the proposed architecture, the database servers in which data integration is done (point 2 of Section 4) are automatically replicated in one or more DataNodes. A DataNode is responsible for storing the structured and unstructured data and for answering read and write requests.

  (3.2) Hive. Apache Hive [30] is a data warehousing infrastructure used in conjunction with the map-reduce model of Hadoop to process data. Hive stores data in tables; each table consists of a number of rows, and each row comprises a specified number of columns. HiveQL is used to express queries over Hive: user queries are compiled into map-reduce jobs, which Hadoop (point 3.1 of Section 4) then executes. Such queries allow the creation of a variety of reports, summaries, and historical analyses. The main functional components of Hive are:

  • Metastore: it contains metadata about the tables stored in Hive, such as data formats, partitioning and bucketing information, and storage information including the location of the table’s data in the underlying file system. This metadata is specified during table creation and reused every time the table is referenced in HiveQL.

  • Hive engine: it converts queries written in HiveQL into map-reduce code, which is then executed on the Hadoop cluster.

  (4) Web-based Access

Our system allows trusted users (point 6 of Section 4) to access data in order to create a variety of reports and data summaries, after authentication through a Web browser.

  (5) Analysis and reporting tools

This includes several tools for reporting, planning, dashboards and data mining. These tools allow data to be analyzed statistically to track trends and patterns over time, and then produce regular reports tailored to the needs of various users of the decision support system.

  (6) MBDW users

They can be doctors, medical researchers, hospital managers, health administrators, and governments. All such users can interact with the system.

Design and data model adaptation

It is necessary to design an efficient data model that provides a better understanding of the health system information and better performance for the execution of complex analytical queries. In this section, we describe an optimized strategy for data modeling and its implementation in the Hive framework.

The multi-dimensional modeling technique has been widely adopted in data warehouse modeling [5, 12, 13, 20], and it can be extended to big data warehousing. Usually, multi-dimensional data warehouses are built from fact and dimension tables, whether in a star, constellation, or snowflake schema. Fact tables contain the keys of the dimension tables, as foreign keys, and the measurable facts to examine. Dimension tables describe the dimensions and contain the dimension attributes and key. However, directly adopting the multi-dimensional modeling technique in big data scenarios is not appropriate, because the tables, and especially the fact tables, grow very large, with millions or even billions of rows [2], and query performance becomes unacceptable to end users.

To overcome these issues, namely the growing size of the fact table and degraded query performance, we propose a nested partitioning technique for fact tables.

Nested partitioning technique

Our technique consists in applying nested partitioning, where the data warehouse fact table file is split into several smaller tables while keeping the same multi-dimensional schema semantics.

Let \( (F, D^1, D^2, \dots, D^n) \) be a multi-dimensional schema (star, snowflake, or constellation), where \( F = (d_1, \dots, d_n) \) is the fact table and \( D^1 = (d^1_1, \dots, d^1_m), \dots, D^n = (d^n_1, \dots, d^n_k) \) are the dimension tables, such that \( d^j_1 \) (\( 1 \le j \le n \)) is the primary key of the dimension table \( D^j \) and \( d_i \) (\( 1 \le i \le n \)) is a foreign key of F referring to \( d^i_1 \). Let c, with \( 1 \le c \le n \), be the number of foreign keys on which the fact table F is partitionable.

We define nested partitioning of the fact table F based on c dimension tables \( D^i \) (\( 1 \le i \le c \)), using their primary keys, as the process that takes as input the schema composed of the data warehouse table files \( (F, D^1, D^2, \dots, D^n) \) and outputs k schemas:

$$ \left({\boldsymbol{F}}_1,{\boldsymbol{D}}^1,{\boldsymbol{D}}^2,\dots, {\boldsymbol{D}}^n\right),\dots,\left({\boldsymbol{F}}_k,{\boldsymbol{D}}^1,{\boldsymbol{D}}^2,\dots, {\boldsymbol{D}}^n\right) $$

The fact table F is fragmented into k fragments \( F_1, \dots, F_k \), which are computed in the following way:
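
One way to write these fragments explicitly, consistent with the notation above (a sketch assuming that each fragment corresponds to one combination of key values of the c partitioning dimensions, and using the semi-join operator ⋉ introduced below), is:

$$ F_{\left(v_1,\dots,v_c\right)} = F \ltimes \sigma_{d^1_1 = v_1}\left(D^1\right) \ltimes \dots \ltimes \sigma_{d^c_1 = v_c}\left(D^c\right), \quad v_j \in \pi_{d^j_1}\left(D^j\right),\ 1 \le j \le c $$

where \( v_j \) ranges over the distinct key values of \( D^j \), and \( F_1, \dots, F_k \) are the non-empty fragments obtained in this way.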

The number of fragments k satisfies \( 1\le k\le {\prod}_{i=1}^c{n}_i \), where \( n_i \) is the number of distinct key values of the i-th partitioning dimension table and ⋉ is the semi-join operator.

Data partitioning techniques in Hive

Using Apache Hive implies that we do not have primary and foreign keys during the implementation and management of data. This is largely due to the fact that Hive focuses on the analytical aspects, bulkiness, and diversity of the data; in addition, Hive is not intended to run complex relational queries, but rather to access data in a simple and efficient manner. Nevertheless, these two concepts (primary and foreign keys) are important if we want to store and manipulate data from many facets and points of view: it is essential to be able to uniquely identify each tuple, manage relationships between tables, and ensure data consistency.

To overcome this issue, the lack of primary and foreign keys in Hive, we propose to use two Hive concepts: partitions and buckets.

Hive allows tables to be divided into partitions and buckets. The first technique, partitioning, is a way of decomposing a table into parts based on the values of a column. With the second technique, bucketing, tables or partitions may be further subdivided into buckets, which impose an extra structure on the data [3].

Partitioning

In Hive, each table can have one or more partitions, which determine the distribution of the data within sub-directories of the table directory: each partition has its own directory, according to the partitioning field, and can have its own columns and storage information [33]. Using partitions can make queries faster. For instance, a table can be partitioned by date, and all records with the same date are then stored in the same partition.
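
For instance, such a date-partitioned table could be declared as follows; this is a minimal sketch with hypothetical names, where each value of the partition column maps to its own sub-directory.

    -- Hypothetical partitioned table: one HDFS sub-directory per value of id_date.
    CREATE TABLE consultation_by_date (
      id_patient STRING,
      id_illness STRING,
      cost       DOUBLE
    )
    PARTITIONED BY (id_date STRING);

    -- Loading one partition: all records of that date end up in the same directory.
    INSERT INTO TABLE consultation_by_date PARTITION (id_date = '2015-06-01')
    SELECT id_patient, id_illness, cost
    FROM staging_consultation
    WHERE id_date = '2015-06-01';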

Bucketing

Bucketing is another technique to decompose Hive tables (datasets) into a defined number of parts stored as files. Indeed, the data in each partition can be divided into buckets based on the hash of a column of the table, and each bucket is stored as a file in the partition directory. Metadata about the bucketing of each table is stored in the Metastore [33]. Among the advantages of bucketing, the number of buckets is fixed, so it does not vary with the data.

As shown in Fig. 2, nested partitioning divides the fact table into several levels of partitions based on the different dimensions. The fact table is first partitioned using Hive partitions based on a first dimension (on the attribute that would be the primary key of the first dimension table); the partitioned tables are then divided into buckets based on a second dimension, and so on.

Fig. 2 Nested partitioning

Using partitioned tables distributes the data and reduces the data volume from a single fact table to many distributed fact tables. These partitioned fact tables therefore optimize the use of data warehouse resources (distributed processing, memory) and improve query execution performance.
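
For example, a query that filters on the partitioning columns only reads the matching partition sub-directories (partition pruning) instead of scanning the whole fact table; the sketch below uses hypothetical names for a fact table partitioned on two dimension keys.

    -- Only the sub-directories of the selected partitions are scanned,
    -- not the entire fact table.
    SELECT COUNT(*) AS nb_facts
    FROM fact_partitioned
    WHERE id_dim1 = 'D1-07' AND id_dim2 = 'D2-03';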

Case study: Improving healthcare resources distribution in Bejaia health sector

In the two previous sections, we presented our proposed medical big data warehouse (MBDW) architecture and data model, built on a Hadoop cluster to perform analytical processing. The purpose of this section is to apply and validate our solution on a case of medical data. An MBDW is designed for the health sector of the Wilaya of Bejaia in Algeria to help physicians and health managers understand, predict, and improve the availability and distribution of healthcare resources.

In this section, we first describe the study objectives and settings. Then we give the implementation details of the proposed MBDW solution, present some preliminary results, and finally discuss our solution in comparison with previous ones.

Study objectives

Decision-making about the alternative uses of healthcare resources is an issue of critical concern for governments and administrators in all healthcare systems [34]. On the other hand, data from medical sources can be used to identify healthcare trends, prevent diseases, combat social inequality, and so on [1]. Thus, using medical data to improve the distribution of health resources is an important challenge in emerging countries.

The purpose of this case study is to provide decision makers with a clear view of the Bejaia health sector. It will guide leaders to make better decisions leading to equity in the distribution of medical resources and to a significant reduction of costs, offering a more efficient method for accurate health management, improving the availability of clinical equipment and human resources, and increasing the quality of services and patient safety. It will allow managing the different orientations during the planning stages and having a complete predictive vision, through a better repartition of the care offering (hospital specialization, the number of health centers per town, the allocation of hospital equipment, the number of specialists per hospital, the number of beds by specialty and service). These facilities have to be regularly adapted to changing demand (changing health techniques, diseases, health structures, population age, and geographical location of the population).

Our decision support system gives information about:

  • the place and date for building new health centers and the required specialty;

  • the best strategy for recruiting medical professionals for each hospital;

  • the most appropriate equipment required for a hospital (e.g., CT scanner or radiological unit);

  • screening dates and places (e.g., breast cancer screening, cervical screening, etc.).

It also aims to provide information to support medical research.

Study settings

The plan of the Bejaia Health Sector is a part of the Algerian Health master plan. The latter provides, for the period 2009–2025, investments of 20 billion Euros for constructing new health facilities and modernizing existing hospitals. These investments have been initiated for the construction and maintenance of infrastructure and hospital equipment and for the education of medical professionals [35]. The outline of this program is to achieve 172 new hospitals, 45 specialized health complexes, 377 polyclinics, 1000 outpatient rooms, 17 paramedical training schools, and more than 70 specialized institutions for persons with disabilities.

Currently, the Bejaia Health Sector manages healthcare for a population of around 1 million people in the Wilaya (department) of Bejaia. The Wilaya of Bejaia is located in the north of Algeria, on the Mediterranean coast. It is administratively divided into 19 Daïras (sets of municipalities) and covers 51 municipalities stretching over an area of 3268 km². As shown in Fig. 3, it includes the following health structures: one university hospital, five public hospital institutions, and one specialized public hospital establishment, with a total of 1533 technical beds; 8 nearby public health establishments with 51 polyclinics; one paramedical school; and several laboratories. It employs 33 hospital-university practitioners, 245 specialist medical practitioners, 734 general medical practitioners, and 2742 paramedical staff.

Fig. 3 Geographic location of Bejaia medical institutions

To achieve analytical processing of medical data, we developed the MBDW by extracting, transferring, and collecting data from the different operating systems and software of the Bejaia health structures, which include PATIENT 3.40 (patient data management software), Microsoft EXCEL, and EPIPHARM (drug stock management software); the data is then automatically stored in the DW.

MBDW implementation

The implementation of the Hadoop-based architecture is built on the following hardware resources of the computer equipment of the medical institutions: memory (RAM) capacity ranging from 2 to 10 GB, processor speed ranging from 1.6 GHz to 2.4 GHz, and disk space ranging from 250 GB to 2 TB for data storage. Computer nodes are connected and networked with RJ45 LAN cables and switches.

Several software packages were used to build the MBDW platform, by deploying Apache Hadoop 2.6.0, Hive 1.2.1, ZooKeeper, and HDFS under the Ubuntu Linux operating system, as depicted in Table 2.

Table 2 Software’s specification

Table 3 shows some of the Hadoop cluster configurations used; such configuration parameters are essential for the operation and performance of our system.

Table 3 Configurations of the Hadoop cluster

MBDW data model

In this section, we detail the conceptual data model of our case study and its optimized version based on nested partitioning. We explain the data model used, considering two main subjects: patient hospitalization and outpatient visits. In our proposed data model, we used a constellation schema.

MBDW data model

Figure 4a describes the most important dimensions of the constellation schema, which consists of two fact tables (Hospitalization and Consultation) sharing the following dimension tables: Patient, Illness, Date, Doctor, Region, Health-Establishment, Equipment, Service, and Consultation-Center. The details of the two fact tables are:

  • Fact table ‘Hospitalization’. It represents all information about the patients’ hospitalization period. It holds the primary key Id-hospitalization, foreign keys of dimension tables: Id-patient, Id-illness, Id-doctor, Id-service, Id-equipment, Id-H-establishment, Id-region and Id-date, and several measures.

  • Fact table ‘Consultation’. It exposes all information about the outpatient visit in a health facility. It contains the primary key Id-consultation, foreign keys of dimension tables: Id-patient, Id-illness, Id-doctor, Id-C-center, Id-H-establishment, Id-region and Id-Date and also several measures about the outpatient visit.

Fig. 4 Constellation schema

MBDW adapted data model

A nested partitioning is used for both fact tables, Consultation and Hospitalization. As shown in Fig. 4b, we used the two concepts offered by Hive: partitioning and bucketing.

Each partitioning level used to partition the fact tables corresponds to a foreign key (not declared as such in Hive) of the multi-dimensional model given in the previous section.

Thus, the first level of partitioning is applied according to the values of the foreign key Id-region of the dimension table “Region”, which is the most general dimension. The dimension at the next level down is “Health-establishment”, which gives the second partitioning level (i.e., partitioning is applied according to the values of the foreign key “Id-H-establishment”). The dimension at the following level, “Date”, gives the third level; here we use bucketing, applied to the values of the column Id-date since they belong to a fixed interval. Further levels of partitioning according to other dimension tables could be added, but only if necessary. In our case, we did not partition on the foreign keys of the other tables (Patient, Doctor, Service, Equipment, and Consultation-center), to avoid a large number of partitions with little data, which would produce a large number of sub-directories and unnecessary overhead for the NameNode of HDFS.
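
Under these choices, the ‘Hospitalization’ fact table could be declared in Hive as sketched below. The column names follow the model of Fig. 4, but the exact DDL, the data types, and the number of buckets are assumptions and not the authors’ original script.

    -- Hypothetical DDL: first partition level on id_region, second on
    -- id_h_establishment, and rows hashed into buckets on id_date within
    -- each partition; nb_days stands for an example measure.
    SET hive.enforce.bucketing = true;

    CREATE TABLE hospitalization (
      id_hospitalization STRING,
      id_patient         STRING,
      id_illness         STRING,
      id_doctor          STRING,
      id_service         STRING,
      id_equipment       STRING,
      id_date            STRING,
      nb_days            INT
    )
    PARTITIONED BY (id_region STRING, id_h_establishment STRING)
    CLUSTERED BY (id_date) INTO 31 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

The number of buckets is arbitrary in this sketch; as noted in Section 5.2, it is fixed at table creation and does not vary with the data.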

In the next sub-section, we give some important information about the keys of the dimension tables. Such information is essential for the system users.

Keys of dimension tables

“Illness” dimension table key

The proposed “Id-illness” code is the identifier attribute of the Illness dimension table, designed using group coding, which involves several fields with specific meanings. “Id-illness” consists of three fields: a technical use field (3 + 4 digits and characters), an administrative use field (3 characters), and a field reserved for future use (2 characters), as shown in Fig. 5. The detail of each field is as follows:

Fig. 5 “Id-illness” code signification

  1) Technical use field: this field matches the current medical classification advocated by the World Health Organization (ICD-10), making it easier for specialists and doctors to identify diseases.

  2) Administrative use field: this field consists of three parts, α, β, and δ, as shown in Fig. 5.

  • Occupational disease (α): identifies occupational diseases, i.e., health problems that occur during work or occupational activity and are contracted under certain conditions. The Algerian social security system deals with the identification of these diseases.

  • Notifiable disease (β): identifies notifiable diseases, i.e., diseases under national surveillance subject to compulsory declaration to the national health authority in accordance with the procedure laid down in order No. 179 of 17 November 1990, as well as diseases under international surveillance subject to mandatory reporting to the national health authority and mandatory notification to the WHO (World Health Organization). Any doctor, whatever his type of practice, is required to declare notifiable diseases.

  • Chronic disease (δ): this part of the field identifies and indicates chronic diseases. Indeed, patients with chronic conditions benefit from broader rights under Algerian social security.

  3) Future use field: a field reserved for needs that may occur in the future.

“Doctor” dimension table key

The proposed “Id-doctor” code, illustrated in Fig. 6, is the identifier attribute of the “Doctor” dimension table (medical staff information), also designed using group coding. “Id-doctor” consists of four fields: the first three fields (2 + 3 characters + 4 digits) indicate the category and sub-category of the medical staff, and the fourth field indicates the recruitment date (6 digits). Indeed, the Algerian medical profession consists of several categories: university hospital staff, general medical practitioners, specialist medical practitioners, paramedics, laboratory staff, administrative staff, and technical staff. We detail the most important ones:

Fig. 6 “Id-doctor” code signification

  1) University hospital staff: physicians working in public institutions of a scientific nature providing training in the medical sciences, as well as in medical institutions and hospital-university centers. There are three sub-categories (assistant, lecturer, and university hospital professor).

  2) Public health general medical practitioners: medical practitioners without a specialty. There are three categories of general medical practitioners: general practitioners, general pharmacists, and general dentists. For example, general practitioners include three sub-categories: general practitioner, primary general practitioner, and general practitioner-in-chief. They can also take senior positions.

  3) Public health specialist medical practitioners: specialized physicians. There are three sub-categories of specialist medical practitioners: assistant specialist, senior specialist, and chief specialist. They can also take senior positions.

“Patient” dimension table key

The proposed “Id-patient” code, illustrated in Fig. 7, is the identifier attribute of the “Patient” dimension table, designed using group coding. It consists of three fields. The first field, “patient type” (1 character), identifies the category of the patient (insured, uninsured, or foreigner); the second field indicates the social security registration number (12 digits) if it exists, otherwise a number is attributed; and the third field (2 characters) identifies the insured person’s rightful claimants. We note that 80% of the Algerian population is covered by insurance and therefore has a social security registration number.

Fig. 7 “Id-patient” code signification

After the partitioning and bucketing operations over the data, performed as explained in subsection 5.4.B, the dimension tables are created and loaded using Hive, and the fact tables are then loaded by joining the necessary dimension tables. Table 4 shows the estimated size of the tables stored for the year 2015. The first nine rows show the minimum and maximum sizes (across the storing DataNodes) of the dimension tables (Illness, Patient, Doctor, Region, Date, Health-Establishment, Equipment, Service, and Consultation-Center). The last two rows show the minimum and maximum sizes of the fact tables (Hospitalization and Consultation).

Table 4 Fact and dimension tables’ size
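
A possible way to express this load in HiveQL, with dynamic partitioning so that the partition values are taken from the joined data, is sketched below; the staging and dimension table names are hypothetical and this is not the authors’ exact script. The staging table is assumed to be an external table over the CSV extracts, analogous to the one sketched in Section 4.

    -- Hypothetical load of the partitioned fact table using dynamic partitioning:
    -- the partition columns (id_region, id_h_establishment) come last in the SELECT.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT INTO TABLE hospitalization
    PARTITION (id_region, id_h_establishment)
    SELECT s.id_hospitalization, s.id_patient, s.id_illness, s.id_doctor,
           s.id_service, s.id_equipment, s.id_date, s.nb_days,
           r.id_region, e.id_h_establishment
    FROM staging_hospitalization s
    JOIN region r ON s.id_region = r.id_region
    JOIN health_establishment e ON s.id_h_establishment = e.id_h_establishment;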

Data replication strategy

Our proposed Hadoop-based warehouse must handle node failures, which can be frequent in such large clusters. To address this problem, we implemented data placement and replication strategies, depicted in Table 5, to improve data reliability, availability, and network bandwidth utilization, and to reduce the effect of failures such as single-node and whole-rack failures. As shown in Table 5, the data of each establishment is stored in its own DataNode server and replicated in two other DataNode servers. For instance, the data of the CHU-BJA establishment is stored in its own DataNode (DataNode 1) and replicated in two other DataNode servers (DataNode 9 and DataNode 15).

Table 5 Data replication strategy

Preliminary results

In this sub-section, we give examples of how the framework can be used to address the problem of medical resources distribution, using the dataset described in Table 4. We give two reports: the first one is based on the ‘Hospitalization’ fact table and the second one on the ‘Consultation’ fact table.

Figure 8 shows one of the first results of the reporting phase: a comparison between the daily average of patients requiring hospitalization and the number of available hospitalization places (empty beds) in the university hospital and the five public hospital institutions of the different cities.
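
Such a comparison can be produced with a simple HiveQL aggregation over the partitioned ‘Hospitalization’ fact table. The sketch below computes the daily average of hospitalizations per establishment for 2015; the bed-capacity figures would come from a separate reference table, and the date format is an assumption.

    -- Hypothetical query behind the report of Fig. 8: daily average number of
    -- hospitalizations per establishment during 2015.
    SELECT id_h_establishment,
           COUNT(*) / COUNT(DISTINCT id_date) AS avg_daily_hospitalizations
    FROM hospitalization
    WHERE id_date LIKE '2015%'
    GROUP BY id_h_establishment;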

Fig. 8 Graphic representation of the patients requiring hospitalization and hospitals empty beds

From the report of Fig. 8, we notice the lack of sufficient empty beds in the hospitals of the cities of AMIZOUR and AOKAS for patients requiring hospitalization (such hospitalizations are usually postponed if possible, or the patients are transferred to other neighboring hospitals). From this result, managers can generate analytical and informative reports to enhance the performance of the Bejaia health sector and make the right decisions, with the aim of ensuring an optimal distribution of the healthcare system components, for instance by increasing the capacity of the hospitals of AMIZOUR and AOKAS and giving them a high priority in the regional health master plan.

The report presented in Fig. 9 illustrates a comparison between the daily average of outpatient visits and the capacity of the outpatient health centers per city.

Fig. 9 Graphic representation of rates of the required medical visits and outpatient centers capacity

Figure 9 shows that in the cities of BEJAIA and TAZMALT, the rate of outpatient visits exceeds the capacity of the outpatient centers.

This situation reflects an inequitable distribution of health resources. To address this shortfall, decision-makers have to take the necessary action by increasing the capacity of the outpatient centers to meet the needs of both regions. In this situation, the decision to take is therefore to add new consultation rooms in the cities of BEJAIA and TAZMALT.

Discussion and comparisons to previous work

Several studies have suggested using medical data to ensure equity and equality in healthcare. For instance, Kuo et al. [1] argue that medical data can be used to identify healthcare trends, prevent diseases, combat social and health inequality, unlock new sources of economic value, provide fresh insights into science, and hold governments accountable. However, few studies have actually attempted to address the issue of an equitable and equal distribution of healthcare resources using data warehousing and big data technologies. This can be explained by the slow adoption of data warehousing technology in the clinical field and by the fact that most studies on clinical data warehousing focus on specific diseases, as detailed in Section 2.

Indeed, to ensure the availability and equitable distribution of health resources, most countries use the WHO guideline expressed as a ratio of a resource to the population size. For instance, the WHO-recommended critical threshold for health personnel (doctors and nurses) providing patient care is 2.5 per 1000 population. Although this indicator is very important, it is not sufficient to ensure greater equity in the distribution of medical resources, since it does not take into account the specificities of each region and each population, as illustrated in Fig. 10.

Fig. 10 Equality and equity of medical resources distribution

In our previous work [11,12,13, 16], we proposed a data warehousing based framework to address the problem of medical resources allocation. However, that framework fails to scale up and does not consider unstructured medical data. Through this work, we have demonstrated that a Hadoop-based architecture combined with our nested partitioning technique solves the scaling, heterogeneity, and data size issues. We proposed a scalable, cost-effective, highly available, and fault-tolerant solution: the architecture is scalable in that cluster nodes can be added as required; it is cost-effective since the nodes are not necessarily high-performance computers, so there is no need for a large hardware investment; and availability and fault tolerance are guaranteed through the replication strategy.

Conclusion

The recent works and projects on Hadoop-based medical data warehousing described in this study show that the Hadoop community in the medical field is growing, essentially because of the cost-effectiveness of Hadoop-based solutions, which also address the issues of traditional medical DWs. In this paper, we developed a Hadoop-based architecture and a conceptual data model for a medical big data warehouse, based on current research on big data modeling and tools. We showed that the lack of primary and foreign keys in Apache Hive can be addressed using nested partitioning. The proposed solution was applied to the presented case study by designing and implementing a DW platform to ensure an equitable distribution of health resources.