J Korean Med Sci. 2024 Mar 11;39(9):e92. English.
Published online Feb 21, 2024.
© 2024 The Korean Academy of Medical Sciences.
Review

Dark Data in Real-World Evidence: Challenges, Implications, and the Imperative of Data Literacy in Medical Research

Hun-Sung Kim1,2
    • 1Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Seoul, Korea.
    • 2Division of Endocrinology and Metabolism, Department of Internal Medicine, Seoul St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Korea.
Received September 04, 2023; Accepted February 01, 2024.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Randomized controlled trials (RCTs) and real-world evidence (RWE) studies are crucial and complementary in generating clinical evidence. RCTs provide controlled settings to validate the clinical effect of specific drugs or medical devices, while RWE integrates extrinsic factors, encompassing external influences affecting real-world scenarios, thus challenging RCT results in practical applications. In this study, we explore the impact of extrinsic factors on RWE outcomes, focusing on “dark data,” which refers to data that are collected but not used, or that are excluded from the analyses. Dark data can arise in many ways during the research process, from selecting study samples to data collection and analysis. However, even unused or unanalyzed dark data hold potential insights, providing a comprehensive view of clinical contexts. Extrinsic factors lead to RWE outcomes that can diverge from RCT results beyond the scope of statistical correction. Two main types of dark data exist: “known-unknown” and “unknown-unknown.” The distinction between these dark data types highlights RWE’s complexity. The transformation of unknown into known depends on data literacy, that is, the capability to use data effectively and interpret them on the basis of medical expertise. Shifting the focus to excluded subjects or unused data in real-world contexts reveals unexplored potential. Understanding the significance of dark data is vital in reflecting the complexity of clinical settings. Connecting RCTs and RWE requires medical data literacy, enabling clinicians to decipher meaningful insights. In the big data and artificial intelligence era, medical staff must navigate data complexities while promoting the core role of medicine. Prepared clinicians will lead this transformative journey, ensuring data value shapes the medical landscape.

Keywords
Big Data; Dark Data; Data Literacy; Real-World Data; Real-World Evidence; Randomized Controlled Trials

INTRODUCTION

In medical research, diverse research designs contribute to establishing clinical evidence. Among these, randomized controlled trials (RCTs) are the most prevalent approach for verifying the “clinical effect” of specific drugs or medical devices. Characterized by the random division of selected subjects into experimental and control groups, this design ensures equal conditions for both groups through “randomization.”1, 2 By contrast, real-world evidence (RWE), which uses real-world data to derive scientific evidence, includes various “extrinsic factors” that affect whether RCT results are reproduced in actual clinical practice (Table 1).3, 4 “Extrinsic (or hidden) factors” are factors that cannot be discerned through laboratory tests alone in hospital-based clinical research (or that cannot be obtained from hospitals) and that are generated in various environments outside the hospital (RWE = RCT + Extrinsic factor).5 These factors sometimes resemble confounding factors in statistics. Despite the importance of these extrinsic factors in interpreting RWE, a notable gap persists in the understanding of how they affect the effectiveness of drugs or other interventions in real-world scenarios.

Table 1
Factors influencing discrepancies in glucose-lowering effects of insulin and GLP-1 RA between RCTs and RWE

Noteworthy among extrinsic factors that influence RWE research are patient compliance or adherence and socioeconomic characteristics such as education status and income.6, 7 The use of glucagon-like peptide-1 receptor agonists (GLP-1 RAs) in patients with type 2 diabetes is a good example of the influence of extrinsic factors in real clinical practice. Although GLP-1 RAs have demonstrated superior glucose-lowering efficacy to oral hypoglycemic agents in various RCTs, in real-world scenarios, blood glucose levels often do not improve, even when the prescription is changed to GLP-1 RAs.8 Insulin faces comparable challenges. As it is an injectable drug, doses are frequently skipped owing to limited injection sites, inconvenience, pain, and dosage complexity, leading to a smaller improvement in blood glucose in real-world situations.9 Thus, RCTs objectively validate the glucose-lowering potential of insulin or GLP-1 RAs, but “extrinsic factors” such as compliance limit the glucose-lowering efficacy of these drugs in RWE.10 In real-world situations, various clinical factors such as the need for injections, dosing frequency, side effects, pill size, meal timing, or even cost reduce patient compliance, so results may be poorer than in RCTs.11 Obesity drugs share analogous predicaments, often exhibiting efficacy below RCT outcomes because of concerns regarding cost, side effects, and aversion to injections.12, 13 Owing to the complexity of these extrinsic factors, simple statistical adjustments are not sufficient to reconcile the RWE and RCT results. Instead, this is an area that can be approached only with thorough knowledge and understanding of clinical situations and medical data. In addition, proactive consideration of extrinsic factors and their incorporation into study design can enhance RWE outcomes and deepen the insights for practical settings. RWE outcomes gain novel significance by identifying external influences. The capacity to discern extrinsic factors within RWE is vital, as actual outcomes wield the potential to reshape clinical applications.

SHIFTING FOCUS: HIGHLIGHTING EXCLUDED SUBJECTS AND UNUTILIZED VARIABLES IN REAL-WORLD CONTEXTS

As RCTs must secure indications for “clinical effects” unique to specific drugs or medical devices, setting inclusion criteria is critical.14 While the exclusion criteria in RCTs are established primarily to exclude situations that could compromise patient safety, a pivotal question arises for RWE: whether the distinctive effects exhibited by specific interventions in RCTs can be replicated in everyday life beyond the hospital setting.4, 15, 16 Consequently, the role of exclusion criteria in RWE becomes significant, prompting consideration of subjects not analyzed in the literature so far. In other words, both inclusion and exclusion conditions are essential, but the exclusion conditions must be examined more closely in RWE (Fig. 1). Contemporary RWE studies frequently focus on analyzing inclusion criteria, often emphasizing the advantage of swiftly amassing voluminous datasets.16 However, the potency of RWE resides not in data quantity but in uncovering latent data value to obtain information.17 While initial data collection is extensive, actual research frequently relies on only a fraction of these data (Figs. 2 and 3). This reduction results from the various “operational definitions” and “data quality management” (DQM) steps applied to overcome RWE limitations and strengthen analysis reliability.17 However, while operational definitions and DQM enhance research reliability, they do not work as mechanisms to extract value from data. Consequently, simultaneous investigation of the exclusion criteria, which often result in data deletion, is essential.
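One way to keep excluded subjects available for later investigation, rather than silently deleting them, is to record every exclusion together with its reason. The following is a minimal Python sketch under stated assumptions: the records, the field names (`hba1c`, `egfr`, `on_other_drug`), and the thresholds are hypothetical illustrations and are not taken from any study cited here.

```python
from collections import Counter

# Hypothetical patient records; field names and values are illustrative assumptions.
records = [
    {"id": 1, "age": 54, "hba1c": 8.1, "egfr": 75, "on_other_drug": False},
    {"id": 2, "age": 67, "hba1c": None, "egfr": 80, "on_other_drug": False},
    {"id": 3, "age": 71, "hba1c": 9.0, "egfr": 25, "on_other_drug": False},
    {"id": 4, "age": 49, "hba1c": 7.4, "egfr": 90, "on_other_drug": True},
    {"id": 5, "age": 60, "hba1c": 8.8, "egfr": 65, "on_other_drug": False},
]

# Ordered exclusion rules: (reason, predicate returning True when the record is excluded).
# The eGFR cut-off of 30 is an assumed example threshold.
EXCLUSION_RULES = [
    ("missing HbA1c", lambda r: r["hba1c"] is None),
    ("impaired kidney function", lambda r: r["egfr"] < 30),
    ("concurrent alternative drug", lambda r: r["on_other_drug"]),
]


def split_cohort(rows, rules):
    """Return (included, excluded); each excluded row keeps its first matching reason."""
    included, excluded = [], []
    for row in rows:
        reason = next((name for name, rule in rules if rule(row)), None)
        if reason is None:
            included.append(row)
        else:
            excluded.append({**row, "exclusion_reason": reason})
    return included, excluded


included, excluded = split_cohort(records, EXCLUSION_RULES)
reason_counts = Counter(row["exclusion_reason"] for row in excluded)
used_fraction = len(included) / len(records)

print(f"analyzed: {len(included)} of {len(records)} ({used_fraction:.0%})")
for reason, count in reason_counts.items():
    print(f"excluded ({reason}): {count}")
```

Because the `excluded` list is retained alongside the analyzed cohort, the excluded patients can later be examined as a subgroup instead of disappearing from the dataset.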

Fig. 1
Contrasting study designs of RWE and RCT. The significance of inclusion criteria is widely acknowledged, whereas RWE places greater emphasis on exclusion criteria because of the pivotal role of the circumstances and reasons for exclusion. In RCTs, research is conducted on groups selected through inclusion criteria; the study population is therefore homogeneous. In RWE, however, the excluded patients outside the inclusion group are also important. The population in this case is mainly heterogeneous and actively reflects the actual clinical field.
RWE = real-world evidence, RCT = randomized controlled trial.

Fig. 2
Various research designs in clinical fields: RCT and RWE. (A) Exemplary RCT design. This conventional design involves a homogeneous sample, and the findings are extrapolated to represent the entire population. (B) Archetypal RWE design. Although the sample is heterogeneous, the similarity in sample size to the total population facilitates a representative reflection of the larger population. (C) Inadequate RWE design. Characterized by a small, heterogeneous sample, this design combines the limitations of both RCT and RWE. Such situations often arise owing to rigorous research design, leading to substantial data exclusion. (D) Ideal RWE design. This design aims to minimize exclusions and employs subgroup analysis on heterogeneous samples. It strives for personalized treatment tailored to individual patients.
RCT = randomized controlled trial, RWE = real-world evidence.

Fig. 3
Illustrations of “known-unknown” dark data. (A) Examples of the large volume of data excluded by the clinical study design. Often, a mere 10% of extracted data are used in research. (B) Various examples of data excluded by exclusion criteria. Exclusions based on diagnosis and operational definition discrepancies encompass a range of scenarios. The accuracy of data extraction becomes questionable, potentially impacting research outcomes. Notably, patients excluded owing to alternative drug use or impaired liver/kidney function frequently use the study drugs in real-world practice. Unexpected termination or alteration of medication constitutes a change in the patient’s condition, which could align with the study’s objective.

This introspection unveils data that are collected with specific intentions through diverse processes yet left unused in the actual analysis, called “dark data.”18, 19 These forms of data may not be apparent in clinical research, but they exist in actual clinical contexts and complement overall clinical results. Broadening the scope reveals data that are generated throughout patient treatment and hospital visits but remain unused for treatment purposes. For instance, electronic medical records at a hospital include information such as appointment times, treatment durations, and wait times. This information is collected, reflects the medical care environment, and may affect treatment outcomes. However, such variables are often not used in analyzing the results.20 From a healthcare perspective, all of these fall into the dark data category, even though appointment time, treatment time, and patient discomfort during reception may also affect the patient’s health. Under this assumption, the crux of dark data significance lies in its potential to effectively manage the health of previously overlooked patients through meticulous utilization and analysis.21

WHAT DARK DATA MEANS: KNOWN-UNKNOWN AND UNKNOWN-UNKNOWN

The essence of dark data’s potency lies in its diverse nature.18, 19, 20 As data volume expands and complexity deepens, analytical tools’ efficacy increases, yielding more accurate insights from dark data. Dark data constitute information that has not been used for analysis, which even experts may have overlooked, but hold the potential to unearth novel insights.21 This exploration can lead to unexpected revelations in real-world treatment contexts, greatly enhancing the treatment process.19, 22 Dark data predominantly emerge during the construction of clinical research datasets and study designs driven by factors such as the inclusion of unsuitable target groups, unavoidable exclusion of unstructured data domains, and processing difficulties. Consequently, up to 80–90% of actual data could remain untapped.19

Dark data are of various types, with the most representative types being “known-unknown” and “unknown-unknown.”19, 23 “Known-unknown” denotes instances where researchers anticipate data gaps before study commencement, encompassing cases involving operational definitions and data removal through DQM. These scenarios allow researchers to proactively devise remedies for missing data (Fig. 3). However, even when cognizant of the existence of dark data, gaining access to and comprehending it often demands expert evaluation.24 Moreover, dark data may also stem from tool errors, deliberate concealment, or asymmetrically provided information.19, 25 Therefore, the effective utilization of dark data necessitates engagement by medical experts, bridging the realms of statistics, analysis, and ethics.26 The crux of perceiving dark data lies in identifying the extrinsic factors that bridge the gap between RCT and RWE; however, this task heavily relies on the clinical insight of medical professionals, who possess the most profound understanding of medical data.17, 24

“Unknown-unknown,” another type of dark data, refers to dark data whose absence goes unrecognized, and it presents a more profound challenge than known-unknown. Given the predominantly unstructured nature of these data, even experts are prone to errors owing to limitations in using them. If the source and authenticity of data are not specified and suitably managed, the data are exposed to problems of contamination, trust, and security. Furthermore, managing errors in such cases is arduous, and deliberate malicious approaches may arise (Fig. 3).27 Presently, clinical data standardization and accessibility have advanced, and most artificial intelligence (AI) models for medical data follow analogous methodologies, rendering them vulnerable to white-box attacks. Malicious exploitation of dark data may involve intentionally crafted or manipulated synthetic data, potentially deviating from clinical objectives. Even minor confounding signals (perturbations) can trigger unintended outputs, ultimately yielding adverse research results.28

Considering the importance of dark data in actual clinical contexts, being cognizant of these challenges and developing the capacity to handle dark data rooted in medical knowledge is critical. Inadequate medical knowledge can even transform “known-unknown” into “unknown-unknown” dark data. The key lies in harnessing medical expertise to convert “unknown-unknown” into “known-unknown” dark data. This process mirrors the drive toward transforming a complex black-box-type AI that experts have difficulty understanding into explainable AI.29 A technology-oriented methodology alone cannot transform numerous “unknown-unknown” into “known-unknown.” Rather, data literacy plays a pivotal role in mitigating dark data biases, and continuous professional training for medical personnel is essential for bolstering data literacy.17, 24 As such, data literacy refers to the ability to decipher the meaning hidden in data,30 an inherently specialized domain that eludes general access. Without expert training, it is an area where identical results can yield opposing interpretations.

DATA LITERACY FOR CLINICIANS: INTERPRETING AND APPLYING INFORMATION IN REAL PRACTICE

Converting “medical data” to “medical information” requires several steps. First, the collected data need to be analyzed using relevant statistical analysis methods to obtain results, and these results must then be interpreted in clinical contexts (Fig. 4).31, 32, 33 Professional data scientists have an advantage in data analysis, but it is often enough to apply simple statistical tools rather than complex expertise. Therefore, the technical process of extracting “information” from “data” is not too difficult, given the technological and statistical advances used to organize, structure, categorize, and calculate data.33 However, the pivotal point is that, to solve the specific clinical problem targeted through data analysis, the results must be interpreted based on medical concepts and experience before they are applied to actual clinical practice. This progression of clinical data literacy transforms information into clinically deployable “knowledge” through expert concepts, ideas, learning, and discussion.33 Presently, numerous big data or AI studies yield promising results; however, in several cases clinical commercialization is limited because the studies stopped at information extraction without clinical interpretation. Such studies often fail to progress to the knowledge stage owing to inadequate reflection of medical experts’ experience and treatment concepts.17 This juncture presents a critical decision for the role of big data in the clinical domain: whether it remains a “decision-support” tool or advances to the role of a “decision-maker” from a medical personnel perspective. Unfortunately, a phenomenon prevails wherein simple information leads to forced medical explanations, breeding misinformation and inconsistent study outcomes. Therefore, medical personnel must actively intervene in all processes, from data to information to knowledge, and must possess the ability to perform such tasks. Failure to do so could lead to data misuse or underutilization; that is, in the process of extracting information from data and turning it into knowledge, data literacy based on clinical experience is critical.

Fig. 4
Perspectives on data, information, knowledge, and wisdom/theory.33

Although medical professionals have historically applied empirical treatments based on data, they often lack the skills to effectively manage accumulated medical data. While data collection and analysis were conventionally the domain of data scientists or other experts, the first step toward data-driven decision-making entails self-engagement with research data.17 The easiest method for medical staff to build data literacy is not to learn data analysis techniques but to increase their understanding of the data itself. Before the operational definition or DQM, a crucial step involves discerning the possibilities and limitations inherent in the original data. Identifying and solving problems that are needed in the medical field requires a high level of understanding of the data. Here, medical data literacy mandates creating a structure that can inform data to enable problem-solving.24 A medical understanding of which data to employ, include, or exclude in studies becomes pivotal in bridging the gap between RCTs and RWEs. This endeavor is impossible without the researcher’s profound understanding of the data. Researchers must evaluate if their data align with their research objectives or if additional collection is warranted. To properly utilize data for research, the ability to structure and refine data into a form that can be analyzed is important. Once DQM is executed, statistical analysis methods appropriate for the intended purpose upgrade data to information, which, with proper translation, culminates in clinically applicable knowledge.
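As a concrete illustration of examining the original data before any operational definition or DQM step, a researcher could profile how complete each collected variable is and flag the variables that the analysis plan never touches. The Python sketch below is only an example under stated assumptions: the rows, the variable names, and the `planned_variables` set are hypothetical, and an empty string is assumed to mean “not recorded.”

```python
# Hypothetical raw extract from a medical record system; names and values are assumptions.
raw_rows = [
    {"hba1c": 8.1, "weight_kg": 82.0, "wait_minutes": 35, "visit_note": "c/o nausea"},
    {"hba1c": None, "weight_kg": 77.5, "wait_minutes": 50, "visit_note": ""},
    {"hba1c": 7.2, "weight_kg": None, "wait_minutes": 20, "visit_note": "skipped doses"},
    {"hba1c": 9.4, "weight_kg": None, "wait_minutes": 65, "visit_note": ""},
]

# Variables the (hypothetical) analysis plan intends to use.
planned_variables = {"hba1c", "weight_kg"}


def is_missing(value):
    """Treat None and empty strings as missing (an assumption about this extract)."""
    return value is None or value == ""


def profile(rows, planned):
    """Return per-variable completeness and whether the analysis plan uses the variable."""
    report = {}
    for name in rows[0]:
        n_present = sum(not is_missing(row.get(name)) for row in rows)
        report[name] = {
            "completeness": n_present / len(rows),
            "planned": name in planned,
        }
    return report


report = profile(raw_rows, planned_variables)

# Collected but unplanned variables are candidates for "known-unknown" dark data.
unused = sorted(name for name, info in report.items() if not info["planned"])

for name, info in report.items():
    status = "planned" if info["planned"] else "collected but unused"
    print(f"{name}: {info['completeness']:.0%} complete, {status}")
```

A profile of this kind does not decide what to include; it only makes the possibilities and limitations of the raw data visible so that the clinical judgment about inclusion and exclusion can be made deliberately.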

CONCLUSIONS

To obtain significant insights from vast datasets, a process of statistical analysis is required, often supplemented by various AI techniques owing to evolving data formats and sizes. However, medical professionals might not be proficient in statistical techniques or AI principles or possess the ability to develop AI algorithms for effective clinical application. Rather, their focus should be on interpreting data-derived information through the lens of their clinical experience and skillfully integrating it into practical medical care. Although statistical knowledge, AI principles, and development skills are pivotal, an even greater priority lies in augmenting comprehension of the underlying data.

The expanse of medical data is both vast and deep. Given its scale and diversity, understanding and managing dark data is now a key factor for the proper use of big data. Interpreting dark data necessitates a strategic perspective from experts. The essence is not solely about amassing substantial data but harnessing it for targeted objectives. The goal is not only to find the right solution but also to efficiently apply data, including dark data, to clinical practice. The clinical use of big data depends on the continual generation of data. Medical personnel are constantly generating data in the clinical field. If the role of medical staff were to stop at some point, data would no longer be created. At that moment, the clinical use of big data, including AI, would become stagnant and redundant. As such, medical professionals are already leveraging medical data in the treatment realm, forming the bedrock of their practices. Competently utilizing data without falling victim to misinformation is an essential skill, as neglecting unknown data, dark data, and potentially hidden insights can lead to misleading consequences. The capabilities of the medical staff must be honed amid the abundance of data while understanding nuances and changes within the medical landscape. Ultimately, proficiency in comprehending data and navigating the surrounding medical environment is crucial. Consequently, the significance does not lie solely in the data handling skills of data scientists but also in the medical data literacy that unravels the medical world’s intricacies. Given their unparalleled familiarity with data and its consistent creation, medical professionals are poised to shape the value of data. In this regard, the need for medical staff with data knowledge and skills will be ever-present in the big data era for both research and clinical applications. Ultimately, clinicians must be able to make informed decisions about the acquisition and use of data-related resources, accurate analysis, and correct interpretation. For this, clinical knowledge and experience must be the basis, and therefore, even in the era of big data, medical personnel will still play a central role in clinical care. This direction delineates the course for medical professionals in the big data era, thus expanding their roles and responsibilities manifold. Prepared clinicians will seize these opportunities, while the unprepared ones may be left behind.

Notes

Funding: This research was supported by the Medical AI Education and Overseas Expansion Support Research Program through the National IT Industry Promotion Agency (NIPA) funded by the Ministry of Science, ICT & Future Planning.

Disclosure: The author has no potential conflicts of interest to disclose.

References

    1. Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. In: Evidence-Based Medicine: How to Practice and Teach EBM. 2nd ed. Edinburgh, UK: Churchill Livingstone; 2000. pp. 173-177.
    2. Stanley K. Design of randomized controlled trials. Circulation 2007;115(9):1164–1169.
    3. Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, et al. Real-world evidence - what is it and what can it tell us? N Engl J Med 2016;375(23):2293–2297.
    4. Kim HS, Kim JH. Proceed with caution when using real world data and real world evidence. J Korean Med Sci 2019;34(4):e28.
    5. Klonoff DC. The expanding role of real-world evidence trials in health care decision making. J Diabetes Sci Technol 2020;14(1):174–179.
    6. Jager KJ, Zoccali C, Macleod A, Dekker FW. Confounding: what it is and how to deal with it. Kidney Int 2008;73(3):256–260.
    7. Grimes DA, Schulz KF. Bias and causal associations in observational research. Lancet 2002;359(9302):248–252.
    8. Edelman SV, Polonsky WH. Type 2 diabetes in the real world: the elusive nature of glycemic control. Diabetes Care 2017;40(11):1425–1432.
    9. Brod M, Rana A, Barnett AH. Adherence patterns in patients with type 2 diabetes on basal insulin analogues: missed, mistimed and reduced doses. Curr Med Res Opin 2012;28(12):1933–1946.
    10. Collins R, Bowman L, Landray M, Peto R. The magic of randomization versus the myth of real-world evidence. N Engl J Med 2020;382(7):674–678.
    11. Andrade S. Compliance in the real world. Value Health 1998;1(3):171–173.
    12. Ard J, Cannon A, Lewis CE, Lofton H, Vang Skjøth T, Stevenin B, et al. Efficacy and safety of liraglutide 3.0 mg for weight management are similar across races: subgroup analysis across the SCALE and phase II randomized trials. Diabetes Obes Metab 2016;18(4):430–435.
    13. Park JH, Kim JY, Choi JH, Park HS, Shin HY, Lee JM, et al. Effectiveness of liraglutide 3 mg for the treatment of obesity in a real-world setting without intensive lifestyle intervention. Int J Obes 2021;45(4):776–786.
    14. Zabor EC, Kaizer AM, Hobbs BP. Randomized controlled trials. Chest 2020;158(1S):S79–S87.
    15. Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med 2000;342(25):1878–1886.
    16. Kim HS, Lee S, Kim JH. Real-world evidence versus randomized controlled trial: clinical research based on electronic medical records. J Korean Med Sci 2018;33(34):e213.
    17. Kim HS, Kim DJ, Yoon KH. Medical big data is not yet available: why we need realism rather than exaggeration. Endocrinol Metab (Seoul) 2019;34(4):349–354.
    18. Harrington L. New data of the digital age: big, dark, and deep. AACN Adv Crit Care 2017;28(3):239–242.
    19. Hand DJ. In: Dark Data: Why What You Don’t Know Matters. Princeton, NJ, USA: Princeton University Press; 2020.
    20. Zhang C, Shin J, Ré C, Cafarella M, Niu F. Extracting databases from dark data with deepdive. Proc ACM SIGMOD Int Conf Manag Data 2016;2016:847–859.
    21. Suto Y. In: Stepp LM, Gilmozzi R, Hall HJ, editors. Unknowns and unknown unknowns: from dark sky to dark matter and dark energy; Proceedings of the SPIE Astronomical Telescopes + Instrumentation, Ground-based and Airborne Telescopes III, Vol 7733; 2010 June 27-July 2; San Diego, CA, USA. Bellingham, WA, USA: SPIE; 2010. pp. 1-11.
    22. Faurholt-Jepsen M, Busk J, Frost M, Vinberg M, Christensen EM, Winther O, et al. Voice analysis as an objective state marker in bipolar disorder. Transl Psychiatry 2016;6(7):e856.
    23. Truesdell AG, Sauer AJ, Alasnag M. Known knowns, known unknowns, and unknown unknowns. Cardiovasc Revasc Med 2020;21(12):1472–1473.
    24. Koltay T. Data governance, data literacy and the management of data quality. IFLA J 2016;42(4):303–312.
    25. Shin SI, Kwon MM. Dark data: why what you don’t know matters. J Inf Technol Case Appl Res 2023;25(2):112–118.
    26. Perini DJ, Batarseh FA, Tolman A, Anuga A, Nguyen MA. 16 - Bringing dark data to light with AI for evidence-based policymaking. In: Batarseh FA, Freeman LJ, editors. AI Assurance: Towards Trustworthy, Explainable, Safe, and Ethical AI. 1st ed. Cambridge, MA, USA: Academic Press; 2022. pp. 531-557.
    27. Qiu S, Liu Q, Zhou S, Wu C. Review of artificial intelligence adversarial attack and defense technologies. Appl Sci 2019;9(5):909.
    28. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions; Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 June 7-12; Boston, MA, USA. New York, NY, USA: IEEE; 2015. pp. 1-9.
    29. Zednik C. Solving the black box problem: a normative framework for explainable artificial intelligence. Philos Technol 2021;34(2):265–288.
    30. Koltay T. Data literacy for researchers and data librarians. J Librarian Inform Sci 2017;49(1):3–14.
    31. Lee JA. Data, information, and knowledge. Lancet Oncol 2002;3(6):384.
    32. Georgiou A. Data information and knowledge: the health informatics model and its role in evidence-based medicine. J Eval Clin Pract 2002;8(2):127–130.
    33. Hänsel K, Dudgeon SN, Cheung KH, Durant TJS, Schulz WL. From data to wisdom: biomedical knowledge graphs for real-world data insights. J Med Syst 2023;47(1):65.
