Journal of Biometrics & Biostatistics

Big Data science has recently become a hot topic in the scientific, industrial, and business worlds. The healthcare and biomedical sciences have rapidly become data-intensive as investigators generate and use large, complex, high-dimensional, and diverse domain-specific datasets. This paper provides a general survey of recent progress and advances in Big Data science, healthcare, and biomedical research. The impacts, important features, and infrastructures of Big Data science, together with basic and advanced analytical tools, are presented in detail. Additionally, various challenges, debates, and opportunities within this quickly emerging scientific field are explored. Human genome research, one of the most promising medical and health areas, is discussed as an example and application of Big Data science to demonstrate how adaptive, advanced computational analytical tools could be used to transform millions of data points into predictions and diagnostics for precision medicine and personalized healthcare with better patient outcomes.


The Big Data Impact and Potentials in Health and Medical Sciences
Big Data is a term more than a decade old that has recently become very popular in the life sciences and other fields. The healthcare industry has always been a large generator of biomedical data, with the U.S. healthcare system expected to reach the zettabyte (10²¹ bytes) scale from electronic health records, scientific instruments, clinical decision support systems, and even research articles in medical journals [1][2][3]. Biomedical enterprises, including human genomics (e.g., the NIH 1000 Genome project), medical imaging (e.g., the BRAIN initiative), and the growing areas of mHealth, telehealth, and telemedicine, have generated trillions of data points as a result of recent advances in biotechnology and the advent of new computing resources (such as the cloud) [4][5][6][7][8][9][10][11][12][13][14].
Big Data and its practices in health and medical science have become even more prominent due to new social media and networks (such as Facebook and Twitter), sensory/digital technology, and mobile devices with smartphone apps and personal health sensors that accumulate digital data in real time [15,16].
The National Institutes of Health announced the Big Data to Knowledge (BD2K) Initiative with its long-term goals in 2014. An important exemplar recently announced by NIH is the "Precision Medicine Initiative", which intends to assemble a longitudinal "cohort" of 1 million Americans and characterize them extensively, including cell populations, proteins, metabolites, RNA, DNA, and whole genome sequencing along with behavioral data, all linked to electronic health records, and eventually to develop genetically guided therapy in personalized and precision medicine for better prevention, early detection, and treatment of common complex diseases [14][17][18][19][20][21]. In the healthcare and public health domains, AHRQ and the Patient-Centered Outcomes Research Institute (PCORI) have launched the PCORnet initiative to support an effective, sustainable national research infrastructure that advances data collection from very large study populations and the sharing and use of electronic health data in comparative effectiveness research (CER) and other evidence-based practice/medicine research [22][23][24][25].
In education, Big Data is gradually moving higher education from a data-poor to a data-rich domain and from hypothesis-driven to data-driven work, and online or web-based education, acting as "Wind Tunnels", is encouraging more students worldwide to get involved in learning Big Data science. For example, at the University of London, UK, the Big Data Society forum, a related journal, and a Big Data school certificate that trains the next generation of Big Data science researchers have been established [26].
Big Data science has gradually been recognized as an emerging field and discipline and could be one of the most valuable assets not only in the life sciences, such as medicine and healthcare, but also in other domains including education, government, the social sciences, the financial industry, and business [4][5][6][27][28][29][30][31][32][33][34]. The lessons learned in these related domains could potentially be applied to the healthcare and medical fields. From business, for example, come lessons on lower cost, improved quality outcomes (fewer medical errors and readmissions), and increased efficiency, productivity, effectiveness, and performance of healthcare providers and associated systems.

Big Data Science Features and Infrastructure
Big Data science refers to the massive amounts of multiple digital data sets that are captured, collected, integrated, and analyzed. The important features of Big Data include: 1) size/scale in terms of Volume, Velocity, and Variety (known as the three V's), with the mass of measures increasing from petabytes to exabytes, zettabytes, and yottabytes; 2) an evolving, varied, distributed, timely, and dynamic character (real time rather than static); 3) complexity and heterogeneity (structured, unstructured, and semi-structured data); 4) data sharing and privacy [7][35][36][37][38][39].
Because of these unique properties, maximizing the potential of Big Data for knowledge discovery, and making it actionable and operational for better life science solutions, requires that the Big Data science infrastructure, intelligent analytical tools, and advanced computational approaches that can conceptualize, theorize, and model Big Data with the grounded theory method be established, understood, and made available to both data analysts and domain researchers [40,41]. Therefore, a top-layer question for Big Data scientists is what framework is needed for good Big Data governance and implementation, so that Big Data becomes actionable and operational while blending top-down data/project management with bottom-up technical innovation and creativity. There are four critical hierarchical domains/levels in the infrastructure of Big Data governance [42].
First, in the software, hardware, and physical capacity domain, from the hardware, network, and platform perspective, Big Data requires parallel and distributed architectures with high-performance multicore, cloud, and cluster computing platforms that can access hundreds or even thousands of processors. The Hadoop system is an example: it is a distributed computing environment that uses a Map-Reduce framework. Hadoop tools and related software, including the HDFS distributed file system, provide storage, backup, and computing resources for complex workloads [43][44][45][46][47][48][49]. A software-defined data center or software-defined network uses OpenFlow application programming interfaces or a virtual network overlay for controlling [1], understanding, and dealing with Big Data, which could also create agility and automation with a centrally programmable network [50,51]. From the software perspective, a few examples include: i) the open source R statistical language and related packages such as Bioconductor, which have been widely used over the past decades for analyzing Big genomic data [52]; ii) the open source pbdR software, a series of R packages and an environment for statistical computing and programming with Big Data in R [53,54]. Note that the difference between pbdR and R is that R focuses on data analysis on single multi-core machines in an interactive mode, such as through a GUI, while pbdR focuses on distributed memory systems, where data are distributed across several processors and analyzed in batch mode, and communication between processors is used in large high-performance computing (HPC) systems; iii) Revolution Analytics, which offers free and premium software and services that bring high performance, productivity, and ease of use to R and enable data scientists to derive greater meaning from large sets of critical data in record time; iv) Tableau Software (Tableau Desktop and Tableau Server), which uses visual analytics, an easy-to-use approach, and flexible connections to live data to perform rapid-fire visual analysis.
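The Map-Reduce idea mentioned above can be illustrated with a minimal single-process sketch. It is not Hadoop code and uses no Hadoop API. The record layout, the diagnosis codes, and all function names are hypothetical and chosen only for illustration. A real cluster would run the map and reduce phases in parallel across many nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[dict],
              mapper: Callable[[dict], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all values that share a key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each group of values into a single result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_codes(record: dict) -> Iterator[Pair]:
    """Hypothetical mapper: emit (diagnosis_code, 1) for each code on a record."""
    for code in record["diagnosis_codes"]:
        yield code, 1


def sum_counts(key: str, values: List[int]) -> int:
    """Hypothetical reducer: total the counts emitted for one code."""
    return sum(values)


# Toy, made-up records standing in for a very large distributed dataset.
records = [
    {"patient": "p1", "diagnosis_codes": ["E11", "I10"]},
    {"patient": "p2", "diagnosis_codes": ["I10"]},
    {"patient": "p3", "diagnosis_codes": ["E11", "I10", "J45"]},
]

counts = reduce_phase(shuffle(map_phase(records, count_codes)), sum_counts)
print(counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both phases can be split across processors without changing the result, which is the property that distributed platforms exploit.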
Second, at the database level/domain, managing large-volume, unstructured (e.g., text content in an electronic health record (EHR) system), real-time data that cannot be handled by standard database management systems such as DBMS, RDBMS, or ORDBMS requires an innovative database structure that streamlines the data, eliminates redundancy and inaccuracy, and provides a single version of the truth. One fundamental issue in working with very large healthcare data, e.g., in the terabyte or petabyte range, is that small inefficiencies in storing data can have a large effect on the ability to retrieve and process these data for other analyses.
Third, in the knowledge/data process and logical capacity domain, the traditional operational focus needs to shift to a more analytic focus that can manipulate and convert various types of unstructured data and metadata into informational context and actionable knowledge [55,56]. Last, but not least, in the resources domain and from the culture perspective, an integrative level has to be reached by shifting from a personal/individual level to an organizational and systematic approach in which data is viewed as an asset, with an analytical culture and high predictive value. Network-based and systematic approaches to genomic research are a good example. Note that these four hierarchical levels of infrastructure make Big Data science a connecting and systematic science that merges and integrates cutting-edge, diverse, multidisciplinary fields for better informed and shared decision-making.

Big Data Science Debates, Challenges, and Opportunities
Big Data science is now considered as "interdisciplinary fields work principally in the social sciences, humanities and computing and their intersections with the natural sciences about the implications of Big Data for societies [26]". Because of its real-time nature and the rich information enabled by new technologies, Big Data science has the potential to offer a higher form of intelligence and knowledge with the aura of truth, objectivity, and accuracy [57,58]. Currently, it is well understood that addressing researchers' subjectivity with Big Data science could make research more scientific, robust, and ethical. However, how real-time features shape researchers' use of Big Data during the gathering, manipulation, analysis, and visualization process could be a challenging issue and needs to be examined.
External factors can also distort the picture. For example, in social media content on health-related issues, streaming, unstructured, user-generated, text-based qualitative data derived from subjective perceptions and personal experience may interfere and paint a misleading picture, so that, in the end, what is quantified does not necessarily have a closer claim on objective truth. Therefore, conceptual models grounded in complex and unstructured data, developed from the qualitative research perspective to detect the subjectivity, external factors, and abnormality of Big Data that may affect outcomes, are much needed and might offer new research opportunities [35]. Moreover, since Big Data is not a random sample but contains all data, the explosion of 'The Age of Big Data' raises debates and challenges regarding the need for new scientific computational methods and the value of the traditional statistical inference theories that have prevailed for centuries in the data sciences but now might be outdated [59][60][61][62]. The Big Data era is commonly taken to call for exhaustive analysis of the full (plenary) data, unlike traditional statistical approaches based on random sampling. Should the best analytical approach in this new Big Data era be exhaustive use of the full data with more intelligent (specifically, artificial intelligence or machine learning based) methods rather than random sampling of the Big Data?
To see why exhaustive use of the full data might be more valuable, consider an example from evidence-based practice/medicine. According to a BMJ online forum, seventy-five percent of doctors believe that adverse consequences have moved evidence-based practice/medicine toward collapse, and that the real challenge is not the evidence-based medical system itself but its improper use: most patients do not meet clinical study inclusion criteria, and most real cases are treated as outliers.
Note that Big Data and traditional sampling-based inference in medical science share common ground: 1) as the sample/data size grows larger, the science gets stronger; 2) the longer the (real) follow-up time, the closer the results are to clinical reality and the greater their clinical significance and usefulness. It is also known that statistical significance does not imply clinical significance, and that correlation does not establish a causal relationship.
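The gap between statistical and clinical significance can be made concrete with a small numerical sketch. The numbers below are assumptions made up for illustration: a 0.5 mmHg difference in mean blood pressure between two groups, a common standard deviation of 10 mmHg, and a two-sample z-test with known variance. The function name is hypothetical.

```python
import math


def two_sided_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test with equal group sizes.

    Assumes a known common standard deviation `sd` in both groups.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# The same clinically negligible effect, tested at growing sample sizes.
p_by_n = {n: two_sided_p_value(0.5, 10.0, n) for n in (100, 10_000, 1_000_000)}

for n, p in p_by_n.items():
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The effect size never changes, yet the p-value falls below any conventional threshold once the groups are large enough. With very large data, a small p-value therefore says little about whether a difference matters to patients.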
Therefore, as an important and inevitable complement, Big Data science may overcome some challenges in the evidence-based medical system (practice or medicine), and should be emphasized from research and clinical perspectives with better data sharing plans, transparency, and integrity. This is because Big Data science not only allows researchers to study treatment effectiveness and patient heterogeneity, but also addresses the need for treatments to be allocated by randomization with continuously arriving new samples. In addition, through the integration of large data from the published literature and meta-analysis, conclusions drawn from the secondary literature, as a use of scientific methods to guide clinical practice, could themselves have important clinical significance and scientific value.
On the other hand, from the traditional statistical inference perspective, an important merit of Big Data science is that, because of its real-time, evolving, and dynamic nature, it allows continuous refinement of the computational or statistical model and the associated assumptions as new data continuously arrive, leading to more accurate outcomes and better informed decision making. More importantly, it allows predictive analytics to be applied to understand not only what has happened and what is currently happening, but also what will happen in the future. The key challenge researchers face today in the area of Big Data is still the ability to locate, analyze, integrate, and interact with all real-time data and associated software, owing to the lack of adaptive intelligent tools, accessibility, and appropriate training at the current stage [63][64][65].
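As a minimal sketch of refining an estimate while data keep arriving, the class below updates a running mean and variance one observation at a time using Welford's online algorithm. This technique is chosen here purely for illustration and is not named in the text. The class name and the sample readings are hypothetical.

```python
class RunningStats:
    """Running mean and sample variance, updated one observation at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the estimates (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Unbiased sample variance; undefined (NaN) for fewer than two points."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


# Made-up stream of readings arriving one by one.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
    print(f"n={stats.n} mean={stats.mean:.3f} variance={stats.variance:.3f}")
```

Each update costs constant time and memory, so the estimate can follow a stream without storing or rescanning earlier observations. The same pattern underlies more elaborate models that are refitted incrementally.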

Big Data Analytic Approaches
Ultimately, the value of Big Data is not in the data itself but in how to turn Big Data into good research problems, questions, and hypotheses, and then transform these into valuable solutions that benefit society [66,67]. Applications make this concrete: for instance, the rapid advance of EHRs, mHealth, eHealth, Smart and Connected Health, and telehealth devices, merging with social and behavioral science, genomics, and economics, has led to the development of new infrastructure and the transformation of health care systems for precision medicine and better individualized patient care.
Important questions for Big Data scientists to ask include: 1) How can some 300 billion data points be transformed into quantitative statistical evidence for diagnostics, therapeutics, and new insights into population health, disease, and treatment? 2) What are the best approaches? Do the traditionally used inference techniques continue to play a role? For instance, should the approach be experimental or computational, hypothesis driven or data driven, traditional statistical modeling or data mining and artificial intelligence?
To make the overwhelming volume of Big Data actionable and its analytics operational, several key issues in how we process and analyze the data require special attention. First, the bottleneck of Big Data is analysis tools: advanced statistical and computational techniques need to be developed with pipelines that can easily scale up with the three V's (Volume, Velocity, Variety) and the complexity of the data.
These tools make high-powered methods available not only to professional statisticians but also to casual users. Second, the creator of Big Data value is the integration and linkage of heterogeneous Big Data, which poses formidable logistical and analytical challenges. Third, validation, interpretation, and visualization are crucial to extracting actionable knowledge for decision making, which requires Big Data analysts to collaborate closely with domain experts. Therefore, transforming billions of data points into valuable and actionable solutions requires deeper learning and data analysis at both fundamental and advanced levels [25][68][69][70]. Fundamental level analysis includes: 1) basic online real-time queries, pipelines, flows, and analysis tools; 2) data pre-processing or Big Data reduction: detecting missing data, errors, and outliers; the extracting, transforming, and loading part of data pre-processing; and automated filtering of non-useful data, redundancy, and correlations; 3) computational techniques for summarizing qualitative and quantitative results, unveiling trends and patterns, and generating reports; 4) data automation and generation of metadata, e.g., computer-automated analysis of blog postings; 5) visualization tools with simple and easy models for interpreting and making sense of the data.
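The pre-processing step in item 2 above (detecting missing data and outliers) can be sketched as follows. The rule used, a modified z-score based on the median absolute deviation with a cutoff of 3.5, is one common convention and is an assumption here, not a method taken from the text. The function name and the heart-rate values are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Optional


def clean_measurements(values: List[Optional[float]],
                       cutoff: float = 3.5) -> Dict[str, object]:
    """Separate missing values and outliers from usable measurements.

    Missing means None or NaN. Outliers are flagged with the modified
    z-score 0.6745 * |x - median| / MAD, which is robust to extreme values.
    """
    present = [v for v in values if v is not None and not math.isnan(v)]
    n_missing = len(values) - len(present)
    if not present:
        return {"kept": [], "outliers": [], "n_missing": n_missing}

    median = statistics.median(present)
    mad = statistics.median([abs(v - median) for v in present])

    kept, outliers = [], []
    for v in present:
        # If MAD is zero the score is undefined, so nothing is flagged.
        score = 0.6745 * abs(v - median) / mad if mad > 0 else 0.0
        (outliers if score > cutoff else kept).append(v)
    return {"kept": kept, "outliers": outliers, "n_missing": n_missing}


# Made-up heart-rate readings with two gaps and one implausible value.
heart_rates = [72, 75, None, 71, 74, 300, 73, float("nan"), 70]
result = clean_measurements(heart_rates)
print(result)
```

Flagged values are returned rather than silently dropped, so a domain expert can decide whether an extreme reading is an error or a real clinical event.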
Advanced level data analysis uses systems-based and network approaches for data integration with more sophisticated models, including but not limited to: 1) real-time analytics and meta-analysis that integrate multiple data sources, including bedside healthcare streaming data; 2) hierarchical or multi-level models for spatial (state and national) data, and longitudinal and mixed models for real-time or temporally dynamic data rather than static data; 3) data mining and pattern recognition for trend and pattern detection; 4) natural language processing for text data mining, and machine learning, statistical learning, and Bayesian learning with auto-extraction of data and variables; 5) artificial intelligence with deep learning and related models (e.g., neural networks, support vector machines, dynamic state space models), automatic ensemble techniques, and intelligent agents for automated analysis and information retrieval; 6) causal inference and Bayesian approaches with probabilistic interpretations.
Comparing fundamental level analytics with advanced level analytics in Big Data science, fundamental analytics, including descriptive analytics, serves to summarize "what has happened" (in its simplest form, breaking Big Data down into smaller, more useful pieces of information) and focuses on the insight gained from historical data to provide trending information on past or current events (e.g., it looks at data and information to describe the current situation in a way that makes trends, patterns, and exceptions apparent). In contrast, the advanced level computational tools listed above focus on predictive analytics, which aims to determine patterns and predict future outcomes and trends, answering "what could happen", and on prescriptive analytics, which answers "what should we do?" by quantifying the effects of future decisions in order to advise on possible outcomes. Prescriptive analytics functions as a decision support tool by exploring a set of possible actions and suggesting actions based on descriptive and predictive analyses of complex data. It also conducts real-time analytics by analyzing data at the point of care to present immediate and actionable information to providers.
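The distinction between descriptive and predictive analytics can be shown on a toy series. The sketch below first summarizes past values ("what has happened") and then fits an ordinary least-squares line to project the next value ("what could happen"). The monthly readmission counts are invented, the function names are hypothetical, and a straight-line trend is only the simplest possible predictive model.

```python
from statistics import mean
from typing import Dict, List, Tuple


def describe(values: List[float]) -> Dict[str, float]:
    """Descriptive analytics: summarize what has already been observed."""
    return {"n": len(values), "mean": mean(values),
            "min": min(values), "max": max(values)}


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up monthly readmission counts for months 1 through 6.
months = [1, 2, 3, 4, 5, 6]
readmissions = [12, 15, 14, 18, 20, 23]

summary = describe(readmissions)                  # what has happened
slope, intercept = fit_line(months, readmissions)
forecast_month_7 = intercept + slope * 7          # what could happen

print(summary)
print(f"trend: {slope:.2f} per month, forecast for month 7: {forecast_month_7:.1f}")
```

A prescriptive step would go one stage further, comparing forecasts under alternative actions and recommending one of them. That stage is not shown here.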

Human Genomics Application and Example
Human genomics in personalized medicine research is an important application and a good example of Big Data science in the medical fields. Figure 1 illustrates this translational research scheme/process, from the instruments and technology that generate Big genome data through the analytical pipeline, procedures, and approaches used to obtain and transform 300 billion data points of disease data into diagnostics, therapeutics, and new insights into population health and disease treatment. It suggests that high-performance computational analytic tools may be more cost-effective than experimental approaches in the Big Data world.
To be more specific, identifying from thousands of genes the handful that respond to a drug over time, and that could be potential drug targets, becomes a computational problem related to the "curse of dimensionality" in a temporal setting. Various statistical learning and data mining techniques or statistical testing approaches could be considered and applied to address this problem and to examine reproducibility issues, including: 1) data driven (mining) versus hypothesis driven (testing); 2) unsupervised learning (clustering) versus supervised learning (classification); 3) optimization versus sequential or recursive feature reduction with multiple testing: i) linear versus nonlinear models; ii) parametric, nonparametric, and semi-parametric statistical models with L-norm regularization techniques; iii) univariate versus multivariate methods; iv) Bayesian approaches with prior knowledge/distributions versus non-Bayesian/classical statistical approaches; v) hierarchical Bayesian modeling with shrinkage versus Automatic Relevance Determination in neural networks.
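When thousands of genes are tested at once, some correction for multiple testing is needed, as noted above. The sketch below implements the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. It is a widely used choice shown here as an assumption for illustration, not a method the text attributes to any study. The p-values are made up.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr: float = 0.05) -> List[int]:
    """Return the indices of hypotheses rejected at the given FDR level.

    Step-up rule: find the largest rank k (1-based, p-values ascending)
    with p_(k) <= k * fdr / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            largest_k = rank
    return sorted(order[:largest_k])


# Made-up p-values for five genes (index = gene position in the list).
gene_p_values = [0.039, 0.001, 0.216, 0.008, 0.035]
selected = benjamini_hochberg(gene_p_values, fdr=0.05)
print(selected)
```

In this toy input the gene with p = 0.035 misses its own threshold (0.03) but is still selected, because a larger p-value further down the ranking (0.039 against 0.04) passes. That behavior is what distinguishes a step-up procedure from testing each gene against a fixed cutoff.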
Here we briefly present a simplified example comparing various statistical methods for multiple sclerosis (MS) disease studies in human genomics [71]. The genome data set contained gene expression data from 14 MS patients given a 30 μg dose of intra-muscular IFNβ-1a, with gene expression data available for 10 time points: before treatment, 1 h (hour), 2 h, 4 h, 8 h, 24 h, 48 h, 5 d, 7 d, and 3 months. After data preprocessing and filtering from millions of gene measures, 4324 genes measured at 10 time points on 14 patients, for a total of 4324 × 10 × 14 = 605,360 measures or data points, were included for further data analysis. The key biological questions of this study are 1) identifying significantly differentially expressed genes responding to the treatment, and 2) characterizing the dynamics and changes of gene expression to determine the trajectories of significantly regulated genes in response to the treatment.
For comparison purposes, we present the following six computational methods for the "curse of dimensionality" issue in a temporal setting, with the aim of identifying from thousands of measures the handful of genes that responded to the drug over time: 1) a parametric method, analysis of variance (ANOVA) with bootstrap resampling; 2) a semi-parametric method with class dispersion; 3) a nonparametric method, Pareto with permutation; 4) a mixed effects model (non-Bayesian) with bootstrap; 5) a Bayesian linear correlated/multivariate model; 6) a Bayesian nonlinear model. Figure 2 provides the condensed results of each method to demonstrate their differences. Note that all are adequate in capturing and identifying the significant/relevant genes responding to the treatment and disease progression.
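To give a flavor of the resampling idea behind permutation-based methods, the sketch below runs a generic paired sign-flip permutation test for a single gene. It is not the Pareto-with-permutation method of the cited study, whose details are not given here. It assumes each patient contributes one difference (expression after treatment minus baseline). The function name and all values are hypothetical.

```python
import random
from typing import List


def sign_flip_p_value(diffs: List[float], n_perm: int = 2000, seed: int = 0) -> float:
    """Monte Carlo p-value for the null of no change in paired differences.

    Under the null, each patient's difference is equally likely to be
    positive or negative, so signs are flipped at random and the absolute
    mean of the flipped data is compared with the observed absolute mean.
    """
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    hits = 0
    for _ in range(n_perm):
        flipped_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped_sum) / n >= observed:
            hits += 1
    # Adding one to both counts keeps the estimate strictly above zero.
    return (1 + hits) / (1 + n_perm)


# Made-up differences for 14 patients: one responsive gene, one flat gene.
responsive_gene = [1.2, 0.9, 1.5, 0.7, 1.1, 1.3, 0.8, 1.0, 1.4, 0.6, 1.2, 0.9, 1.1, 1.0]
flat_gene = [0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.4, -0.3, 0.1, -0.2, 0.6, -0.5, 0.3, 0.1]

p_responsive = sign_flip_p_value(responsive_gene)
p_flat = sign_flip_p_value(flat_gene)
print(f"responsive gene: p = {p_responsive:.4f}")
print(f"flat gene:       p = {p_flat:.4f}")
```

The test makes no distributional assumption about expression levels, which is one reason resampling methods are attractive for small patient groups. Applied to thousands of genes, the resulting p-values would still need a multiple-testing correction.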
Among the parametric methods, mixed models proved to be more conservative. The semi-parametric method with class dispersion and the nonparametric Pareto method are appropriate for capturing variation from time point to time point, making them more suitable for investigating significant monotonic changes and trajectories of dynamic changes. Simulation studies showed that the semi-parametric method with class dispersion performs best in terms of robustness of hypothesis rejection across different significance (alpha) levels, while parametric ANOVA and nonparametric Pareto perform similarly. Comparing the Bayesian models, the nonlinear model is more conservative but more robust and performs better across different type I error rates, while the linear model showed better goodness of fit than the nonlinear model. Moreover, post clustering and path analysis can not only identify the genes that are over-expressed, under-expressed, or not expressed, but also isolate trajectories of genes whose regulation appears to be interdependent, inferring possible inter-gene dependence pathways and networks with early, intermediate, and late gene clusters to better understand the treatment effect. In short, combining these various approaches provides a more comprehensive picture of the solutions and more reliable results, illustrating the value and role of advanced computational tools in transforming thousands of Big Data points into quantitative statistical evidence for diagnostics, therapeutics, and new insights into disease, population health, and treatment [72][73][74][75][76][77][78][79]. Health/nursing and medical researchers could employ these advanced analytical tools in big genome research that is either disease specific (e.g., neurological conditions, cancer, cardiovascular diseases) or domain specific, such as pain, fatigue, physical functioning, or multiple chronic health conditions.

Conclusions
Big Data has the potential to impact various fields, from social science to political science, from the financial industry to business, from medical science to public health, from health care to genetics, and from personalized medicine to patient/customer-centered outcomes. It touches various levels of human life, from individuals to communities and from industry to universities to government. The emerging field of Big Data science and its associated practices offers promising new opportunities, but it comes with many challenges in all fields, especially the biomedical and health sciences, where it makes an improved understanding of human life, health, disease, and behavior possible. Collaborative networks, nurturing environments, and an interdisciplinary team-science approach that combines highly trained computational skills with domain/disease expertise are crucial, while adaptive, intelligent, and evolving analytic tools and the smart use of open resources are key to enhancing the true value of real-time Big Data for actionable healthcare decision making and better informed patient outcomes.

Figure 1: Big data science in fields of biomedical research: Transforming big genome data into diagnostics, therapeutics, and new insights into population health, disease treatment.


Figure 2: Comparison of gene selection/filtering methods in time course gene expression data: Identifications of significant differentially expressed genes (top); characterizing the dynamics and changes of gene expression to determine the trajectories of significantly regulated genes (bottom).