Big Data Analytics in Healthcare Systems

Big Data analytics can improve patient outcomes, advance and personalize care, improve provider relationships with patients, and reduce medical spending. This paper introduces healthcare data, big data in healthcare systems, and applications and advantages of Big Data analytics in healthcare. We also present the technological progress of big data in healthcare, such as cloud computing and stream processing. Challenges of Big Data analytics in healthcare systems are also discussed.


Introduction
Precisely matching patients with medical treatments for specific diseases can reduce unnecessary side effects, improve treatment quality, and avoid improper treatment or waste in medical services. It can also yield new medical treatments through the exploration of new drugs or the use of existing drugs for innovative or more targeted purposes (Price and Nicholson, 2015). Systems biology is a successful method that integrates multiple data resources and studies of biological processes. Many studies use network models to describe etiopathogenesis and immune responses, which helps discover novel biomarkers for early diagnosis; however, the bias of clinical data should be avoided when such models are used (Ren and Krawetz, 2015).
Many types of medical equipment, especially wearable devices, capture data continuously; the high velocity of the generated data often requires fast processing in an emergency. The value hidden in an isolated data source may be limited, but the value of healthcare data (e.g., for public health warnings and personalized health guidance) can be maximized through the fusion of electronic medical records (EMRs) and electronic health records (EHRs). Structural MRI, a method of visualizing a patient's brain, is a rich source of high-dimensional data and provides detailed brain maps at a high spatial resolution, which is very useful in both research and clinical settings for uncovering structural features of the brain (Ulfarsson et al., 2016).
Mobile and web applications in healthcare have been developed that allow patients to send a symptomatic query to providers through a server. These mobile applications may be equipped with first aid instructions; patients may be given emergency help for further treatment or directed to the respective departments (Panda et al., 2017). A healthcare system based on mobile cloud computing (MCC) was created for collecting and analyzing real-time biomedical signals (e.g., blood pressure, ECG) from users in various places. A personalized healthcare application is installed on the mobile device, and health data is synchronized to the cloud computing service of the healthcare system for storage and analysis (Lo'ai et al., 2016).
Big data in healthcare can be captured with the help of advanced information technology, which makes it possible to explore the information to improve policy-making. A life table is a suitable tool for research on population aging and medical expenses, and it provides evidence for policy-making. The costs associated with healthcare also increase as the population ages. Japan has already started using Big Data technologies to improve medical treatment and healthcare for elderly people. Big Data analytics can be used to extract valuable information from large and complicated datasets via data mining (Tsuji, 2017). A literature search was conducted on the Scopus and IEEE Xplore databases to complete this paper. Specifically, the keyword combinations "big data and healthcare", "big data and health care", and "big data and medical" were used to search for papers published between January 2015 and May 2018. Duplicate papers found in both databases were removed; 316 papers were selected for the literature review.

Healthcare Data
Healthcare data sources take diverse forms, including clinical text, biomedical images, EHRs, genomic data, biomedical signals, sensing data, and social media (Ta, 2016). The analysis of genomic data gives a much broader understanding of the relationships among different genetic markers, mutations, and disease conditions. However, transforming genetic discoveries into personalized medicine practice is a task with many unresolved challenges. Clinical text mining transforms data from clinical notes, which are organized in an unstructured format, into useful information. Information retrieval and natural language processing (NLP) are methods that extract useful information from large volumes of clinical text. Social network analysis helps discover knowledge and new patterns that can be leveraged to model and predict global health trends (e.g., outbreaks of infectious epidemics) based on various social media resources such as Web logs, Twitter, Facebook, social networking sites, and search engines (Ta et al., 2016).
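As a rough illustration of turning unstructured clinical notes into structured information, the following minimal Python sketch pulls blood pressure and heart rate readings out of free text and expands a few abbreviations. It uses plain regular expressions, which are far simpler than the NLP methods referred to above; the note text, the patterns, and the abbreviation table are all hypothetical and are not taken from the cited work.

```python
import re

# Hypothetical free-text clinical note (invented for illustration).
NOTE = "Pt c/o chest pain. BP 150/95 mmHg, HR 102 bpm. Hx of HTN. Repeat BP 138/88 after rest."

# Assumed patterns: "BP <systolic>/<diastolic>" and "HR <rate>".
BP_PATTERN = re.compile(r"\bBP\s+(\d{2,3})/(\d{2,3})\b")
HR_PATTERN = re.compile(r"\bHR\s+(\d{2,3})\b")

# Hypothetical abbreviation lookup table.
ABBREVIATIONS = {"Pt": "patient", "c/o": "complains of", "Hx": "history", "HTN": "hypertension"}


def extract_vitals(note):
    """Return every blood pressure pair and heart rate value found in the note."""
    return {
        "blood_pressure": [(int(s), int(d)) for s, d in BP_PATTERN.findall(note)],
        "heart_rate": [int(h) for h in HR_PATTERN.findall(note)],
    }


def expand_abbreviations(note, table):
    """Replace whole tokens found in the table, keeping trailing punctuation."""
    out = []
    for token in note.split():
        core = token.rstrip(".,;:")
        out.append(table.get(core, core) + token[len(core):])
    return " ".join(out)


vitals = extract_vitals(NOTE)
expanded = expand_abbreviations(NOTE, ABBREVIATIONS)
```

Real clinical text would need handling for negation, units, and context, which this sketch deliberately leaves out.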
Suitable diagnostic methods must be used before analyzing the severity of diseases. Table 1 (Verma and Sood, 2018) shows a diagnostic scheme used for the diagnosis of the disease. Table 2 (Mendelson, 2017) outlines five layers that illustrate personal data related to health. The fundamental rights of data subjects and their privacy should be protected, although laws have lagged behind technical development.
Clinical and other data related to health in identified forms which are collected, stored, and distributed to third parties.
Layer 3: Mainly private companies collect raw data from Layer 1, Layer 2, and other private and public sources. The processed results are distributed or sold in either a de-identified or an identifiable form.
Layer 4: National governments and international private or public entities re-process, re-distribute, or re-sell the data for various purposes.
Layer 5: International agreements and treaties governing the protection of privacy for personal data related to health.
In the wireless body sensor network (WBSN) described by You et al. (2018), six wireless physiological sensors collect a patient's six vital signs: body temperature, heart rate/pulse, blood glucose, blood pressure, ECG, and oximetry. Healthcare data integration has been an important issue, ranging from personal health information to epigenomics. Various integration methods have been developed, for example, data warehouses (bringing data into a common data schema), link integration (in a webpage presentation), service-oriented architectures (serving data dynamically on the web in a familiar format), view integration (putting various databases together), and mash-ups (combining data from more than one Web-based resource for a new Web application) (Murphy et al., 2017). Some critical aspects or challenges in data fusion are summarized in Table 3 (Capobianco, 2017).
Balancing data from different origins or sources
3. Dealing with inconsistent, contradicting, and conflicting data
4. Establishing loss or objective functions and regularization/penalty terms
5. Differentiating between soft and hard data links, i.e., considering a random process from which the data is generated as subject to the same parameters, or instead accounting just for dependencies, covariations, similarity/dissimilarity, etc.
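To make the data warehouse idea (bringing data into a common schema) and the challenge of conflicting data more concrete, here is a minimal Python sketch. It assumes two hypothetical sources, a hospital export and a wearable export, with different field names and temperature units; the schema, the field names, and the 15 bpm disagreement threshold are illustrative assumptions rather than part of any cited system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VitalRecord:
    """Hypothetical common schema that every source is mapped into."""
    patient_id: str
    body_temp_c: float
    heart_rate_bpm: int
    source: str


def from_hospital(row):
    # Assumed hospital export: temperature already in Celsius.
    return VitalRecord(row["pid"], float(row["temp_c"]), int(row["hr"]), "hospital")


def from_wearable(row):
    # Assumed wearable export: temperature in Fahrenheit, converted to Celsius.
    celsius = (float(row["tempF"]) - 32.0) * 5.0 / 9.0
    return VitalRecord(row["user"], round(celsius, 1), int(row["pulse"]), "wearable")


def find_conflicts(records, max_hr_gap=15):
    """Return ids of patients whose heart rates differ by more than max_hr_gap."""
    by_patient = {}
    for rec in records:
        by_patient.setdefault(rec.patient_id, []).append(rec)
    conflicts = []
    for pid, recs in by_patient.items():
        rates = [r.heart_rate_bpm for r in recs]
        if len(recs) > 1 and max(rates) - min(rates) > max_hr_gap:
            conflicts.append(pid)
    return conflicts


# Invented sample rows.
hospital_rows = [
    {"pid": "p1", "temp_c": "36.8", "hr": "70"},
    {"pid": "p2", "temp_c": "38.2", "hr": "95"},
]
wearable_rows = [
    {"user": "p1", "tempF": "98.6", "pulse": "72"},
    {"user": "p2", "tempF": "100.4", "pulse": "130"},
]

warehouse = [from_hospital(r) for r in hospital_rows] + [from_wearable(r) for r in wearable_rows]
conflicting_patients = find_conflicts(warehouse)
```

Flagging a conflict is only the first step; how to resolve it (trusting one source, averaging, or asking a clinician) is a separate design decision that this sketch does not address.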

Big Data Analytics in Healthcare Systems
As described in Table 4 (De , big data is often characterized by high volume, velocity, variety, variability, value, complexity, and sparseness. Big data has potential applications in healthcare, including disease surveillance, epidemic control, clinical decision support, population health management, etc. (Sabharwal et al., 2016). Big Data in healthcare can provide significant benefits such as detecting diseases at an early stage. The inclusion of Big Data analytics in smart healthcare systems brings innovative electronic and mobile health (e/m-health) that increases efficiency and reduces medical costs (Pramanik et al., 2017). Predictive analytics can be used to predict pharmaceutical outcomes, identify patients who benefit the most from pharmacist interventions, give pharmacists a better understanding of the risks of specific medication-related problems, and deliver interventions tailored to patients' needs (Hernandez and Zhang, 2017). Precision medicine deals with data tasks ranging from collection and management (such as data storage, sharing, and privacy) to analytics (such as data integration, data mining, and visualization). Complex biomedical data in huge volumes are becoming available due to advances in biotechnologies. Big Data analytics is required to use these heterogeneous data, and it covers application areas such as health informatics, sensor informatics, bioinformatics, imaging informatics, etc. (Wu et al., 2017).
Veracity is crucial for Big Data analytics. Personal health records (PHRs) may contain abbreviations, typographical errors, and cryptic notes. Ambulatory measurements may be taken in uncontrolled and less reliable environments, whereas clinical data is collected by trained practitioners in a clinical setting. Using spontaneous, unmanaged data from social media may result in inaccurate predictions. In addition, data sources are sometimes biased (Andreu-Perez et al., 2015). Noisy data is a massive problem, especially when it grows fast. Databases with varying degrees of completeness and quality lead to heterogeneous results, which increases the possibility of false discoveries and 'biased fact-finding excursions'. Low data quality and biases due to the absence of randomization are two major problems. Efforts to increase the value of big data are often made by linking different databases and analyzing all existing and related data (Sacristán and Dilla, 2015). Data pre-processing transforms raw data into an understandable format and often includes: 1) data cleaning, 2) data integration, 3) data transformation, 4) data reduction, and 5) data discretization. Pre-processing is an important step in Big Data analytics (Farid et al., 2016).
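The five pre-processing steps listed above can be sketched in a few lines of Python. This is a minimal, single-machine illustration using only the standard library; the records, field names, age bins, and the choice of mean imputation and min-max scaling are assumptions made for the example, not methods prescribed by the cited work.

```python
from statistics import mean

# Hypothetical raw records from two invented sources; None marks a missing value.
clinic = [
    {"patient_id": 1, "age": 34, "systolic_bp": 118.0},
    {"patient_id": 2, "age": 61, "systolic_bp": None},
    {"patient_id": 3, "age": 47, "systolic_bp": 152.0},
]
wearable = [
    {"patient_id": 1, "heart_rate": 62, "device": "w-a"},
    {"patient_id": 2, "heart_rate": 88, "device": "w-b"},
    {"patient_id": 3, "heart_rate": 75, "device": "w-a"},
]
AGE_BINS = [(40, "under_40"), (60, "40_to_59"), (float("inf"), "60_plus")]


def clean(records, field):
    """1) Data cleaning: fill missing values with the mean of the observed ones."""
    fill = mean(r[field] for r in records if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


def integrate(left, right, key):
    """2) Data integration: join two sources on a shared key."""
    index = {r[key]: r for r in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def transform(records, field):
    """3) Data transformation: min-max scale a numeric field into [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field + "_scaled": (r[field] - lo) / span} for r in records]


def reduce_fields(records, keep):
    """4) Data reduction: keep only the listed attributes."""
    return [{k: r[k] for k in keep} for r in records]


def discretize(records, field, bins):
    """5) Data discretization: map a numeric field to a labeled interval."""
    def label(value):
        for upper, name in bins:
            if value < upper:
                return name
        return bins[-1][1]
    return [{**r, field + "_group": label(r[field])} for r in records]


def preprocess(clinic_rows, wearable_rows):
    rows = clean(clinic_rows, "systolic_bp")
    rows = integrate(rows, wearable_rows, "patient_id")
    rows = transform(rows, "systolic_bp")
    rows = reduce_fields(rows, ["patient_id", "age", "systolic_bp_scaled", "heart_rate"])
    return discretize(rows, "age", AGE_BINS)


result = preprocess(clinic, wearable)
```

In a real pipeline each step would be chosen per dataset (for example, median rather than mean imputation for skewed measurements), and the steps would typically run on a distributed engine rather than in memory.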
Systems relying on big data streams have been developed; these streams include patient-level hospital discharge records, electronic death certificates, and medical claims data that use International Classification of Diseases (ICD) coding. Surveillance tactics using big data streams from crowdsourcing, social media, and Internet search queries have been proposed (Simonsen et al., 2016). Big Data technologies such as NoSQL databases have been used to process healthcare information, while features such as local access and a rational relationship between logical and physical data distribution are important for improving the performance of parallel processing in distributed databases (Salavati et al., 2017). A Big Data-driven approach and process was proposed that incorporates both clinical and molecular information. Candidate biomarkers and therapeutic targets/drugs are first identified in the approach. Subsequent clinical or preclinical validation is completed by cross-species analysis; therefore, the costs and time required for biomarker/therapeutic development are reduced (Wooden et al., 2017).
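As a small illustration of surveillance over ICD-coded record streams, the Python sketch below counts influenza-coded records per ISO week and flags weeks that reach a threshold. The records and the threshold are invented for the example; the prefixes J09, J10, and J11 are the ICD-10 influenza categories. This is only an assumed, simplified counting scheme, not the method of the cited surveillance systems.

```python
from collections import Counter
from datetime import date

INFLUENZA_PREFIXES = ("J09", "J10", "J11")  # ICD-10 influenza categories

# Hypothetical (record date, ICD-10 code) pairs from a discharge or claims stream.
RECORDS = [
    (date(2018, 1, 1), "J10.1"),
    (date(2018, 1, 2), "I10"),
    (date(2018, 1, 3), "J11.1"),
    (date(2018, 1, 8), "J10.0"),
    (date(2018, 1, 9), "J09.X2"),
    (date(2018, 1, 10), "J11.1"),
    (date(2018, 1, 11), "E11.9"),
]


def weekly_counts(records, prefixes):
    """Count records per (ISO year, ISO week) whose code starts with a prefix."""
    counts = Counter()
    for day, code in records:
        if code.startswith(prefixes):
            iso = day.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return counts


def flag_weeks(counts, threshold):
    """Return the weeks whose count reaches the (assumed) alert threshold."""
    return sorted(week for week, n in counts.items() if n >= threshold)


counts = weekly_counts(RECORDS, INFLUENZA_PREFIXES)
alerts = flag_weeks(counts, threshold=3)
```

A fixed threshold is the simplest possible rule; practical surveillance would usually compare each week against a seasonal baseline.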
A clinical data warehouse was created for structured data; a set of modules was also built for analyzing unstructured content. The research was conducted to build an initial implementation of a framework within a big data paradigm. The framework runs the modules in a Hadoop cluster and uses the distributed computing capability of Big Data (Istephan and Siadat, 2015). A Hadoop-based architecture was developed to manage Twitter health big data. Analyzing tweets in healthcare has the potential to change the way people and healthcare providers use advanced technologies to gain new clinical insights (Cunha et al., 2015). Open-source tools such as Hadoop, Kafka, Apache Storm, and the NoSQL database Cassandra have been used in Big Data analytics. Apache Storm provides a set of general primitives for real-time computation on big data (Vanathi and Khadir, 2017). Table 5 shows a comparison between Storm and Hadoop. Research on attribute reduction has been done using MapReduce based on Rough Set Theory (RST). The procedures include 1) using parallel large-scale rough set methods for feature acquisition and implementing them on MapReduce runtime systems such as Twister, Phoenix, and Hadoop to obtain features from big datasets through data mining; and 2) using the <key, value> pair structure to accelerate the computation of equivalence classes and attribute significance, and parallelizing the traditional attribute reduction process based on MapReduce (Ding et al., 2018). Traditional high-performance computing (HPC) is computation (CPU) oriented, with intensive computing through internal (supercomputing) or external high-performance networking (cluster or grid computing), while Hadoop-enhanced computing is intensive computing for large-scale distributed data through internal and external networking. Hadoop-based Big Data has three advantages: efficiency, reliability, and scalability (Ni et al., 2015).
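The use of <key, value> pairs to compute equivalence classes and attribute significance can be illustrated with a single-process Python imitation of the map and reduce phases. This is a sketch under stated assumptions, not the algorithm of Ding et al.: the decision table is invented, the dependency degree is the standard rough set ratio of consistently classified objects to all objects, and significance is taken as the drop in dependency when an attribute is removed.

```python
from collections import defaultdict

# Hypothetical decision table: two condition attributes and the decision "flu".
TABLE = [
    {"id": 1, "fever": "high", "cough": "yes", "flu": "yes"},
    {"id": 2, "fever": "high", "cough": "yes", "flu": "yes"},
    {"id": 3, "fever": "high", "cough": "no", "flu": "no"},
    {"id": 4, "fever": "none", "cough": "yes", "flu": "no"},
    {"id": 5, "fever": "none", "cough": "no", "flu": "no"},
    {"id": 6, "fever": "high", "cough": "no", "flu": "yes"},
]


def map_phase(rows, attrs, decision):
    """Emit one <key, value> pair per row: key = condition values, value = (id, decision)."""
    for row in rows:
        yield tuple(row[a] for a in attrs), (row["id"], row[decision])


def reduce_phase(pairs):
    """Group values by key; each group is one equivalence class."""
    classes = defaultdict(list)
    for key, value in pairs:
        classes[key].append(value)
    return dict(classes)


def dependency(rows, attrs, decision):
    """Share of rows lying in equivalence classes whose members agree on the decision."""
    classes = reduce_phase(map_phase(rows, attrs, decision))
    consistent = sum(
        len(members)
        for members in classes.values()
        if len({d for _, d in members}) == 1
    )
    return consistent / len(rows)


def significance(rows, attrs, attr, decision):
    """Drop in dependency when `attr` is removed from the condition attributes."""
    rest = [a for a in attrs if a != attr]
    return dependency(rows, attrs, decision) - dependency(rows, rest, decision)


ATTRS = ["fever", "cough"]
classes = reduce_phase(map_phase(TABLE, ATTRS, "flu"))
sig_fever = significance(TABLE, ATTRS, "fever", "flu")
sig_cough = significance(TABLE, ATTRS, "cough", "flu")
```

On a real cluster the map and reduce functions would run in parallel over data partitions; grouping by the attribute-value key is what lets each equivalence class be assembled independently.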
Table 6 (Olaronke and Oluwaseun, 2016) shows a comparison of tools used for analyzing big data in the healthcare system.
Does not support indexes.
Microsoft Windows Azure: Relational database; public cloud-based platform. Allows users to make relational queries against structured, semi-structured, and unstructured files. The size of the database is limited; it cannot handle huge databases.
Jaql: It is a query language for JavaScript Object Notation. It is a proprietary query language. Supports both structured and semi-structured data. No user-defined types; schema information only for possible values of a domain.
Industry 4.0 is a strategic plan in manufacturing, and the custom manufacturing of medical devices and drugs is included in Industry 4.0. Precision medicine is a Big Data application in health that benefits from multi-omics, IoT, Industry 4.0, etc. Industry 5.0 has been proposed, which makes sense of Big Data with artificial intelligence, IoT, and next-generation technology policy (Özdemir and Hekim, 2018). An intelligent healthcare framework has been developed based on IoT technology to provide ubiquitous healthcare for a person during his/her workout sessions. An artificial neural network model was used to predict the person's health-related vulnerability using a Bayesian belief network classifier. Data management, model development, visualization, and business models have been listed as four key areas of Big Data analytics (Verma and Sood, 2018). Some data mining methods for complex EHR big data are summarized in Table 7 (Wu et al., 2017).

Challenges of Big Data in Healthcare Systems
Healthcare big data poses challenges in capturing, storing, sharing, searching, and analyzing health data. Organizing the data after extraction from different layers and integrating it is another challenge (Reddy and Kumar, 2016). Integrating physiological data with high-throughput "-omics" techniques for clinical recommendations is also a challenge. The continuous increase in available genomic data, together with the effects of gene annotation and errors from experimental and analytical practice, has made analyzing functional effects with high-throughput sequencing methods challenging (Belle et al., 2015). Consent to the use of healthcare data such as genetic data has been a concern. Creating databases based on large, national populations for future research with ethics approval and governance has led to academic debates on legality. There are even arguments about whether Big Data is useful for improving healthcare systems (Knoppers and Thorogood, 2017). The following are general challenges of Big Data in healthcare (Mathew and Pillai, 2015):
- Security and privacy: Traditional privacy and security measures work on small datasets; applying the same measures to massive and streaming datasets may be a problem, particularly when dealing with patients' health data.
- Data quality: It affects the reliability of insights drawn from the data and of decision-making for patients' healthcare.
- Insufficient real-time processing: Delays in processing complex data models can lower the quality of patient care.
- Integration of heterogeneous data sources: Data fragmentation across hospitals, labs, electronic health records (EHRs), and financial IT systems is a major obstacle to combining data into an integrated database system.
- No fixed standards for healthcare data: Many kinds of healthcare data, such as practitioners' notes, medical images, and data from wearable sensors, are generated and collected by various agents. There are no unified standards for these data, which makes further processing difficult.

Conclusion
Traditional data processing techniques are not able to handle big data in healthcare systems. Big Data analytics overcomes the limitations of traditional data analytics and is expected to transform healthcare. Big Data analytics has potential in disease surveillance, epidemic control, clinical decision support, population health management, etc. Hadoop-enhanced computing is intensive computing for large-scale distributed data, and Hadoop-based Big Data has advantages in efficiency, reliability, and scalability.
There are challenges of Big Data analytics in healthcare systems. Capturing, storing, sharing, searching, and analyzing data are the challenges of Big Data in almost every area. In addition, data security and privacy, data quality, real-time processing, integration of heterogeneous or disparate data, and standards for healthcare data are also challenges of Big Data analytics in healthcare systems.