Real world big data for clinical research and drug development

The objective of this paper is to identify the extent to which real world data (RWD) is being utilized, or could be utilized, at scale in drug development. Through screening peer-reviewed literature, we have cited specific examples where RWD can be used for biomarker discovery or validation, gaining a new understanding of a disease or disease associations, discovering new markers for patient stratification and targeted therapies, new markers for identifying persons with a disease, and pharmacovigilance. None of the papers meeting our criteria was specifically geared toward novel targets or indications in the biopharmaceutical sector; the majority were focused on the area of public health, often sponsored by universities, insurance providers or in combination with public health bodies such as national insurers. The field is still in an early phase of practical application, and is being harnessed broadly where it serves the most direct need in public health applications in early, rare and novel disease incidents. However, these exemplars provide a valuable contribution to insights on the use of RWD to create novel, faster and less invasive approaches to advance disease understanding and biomarker discovery. We believe that pharma needs to invest in making better use of electronic health records (EHRs), and we see the need for more precompetitive collaboration to grow the scale of this 'big denominator' capability, especially given the needs of precision medicine research.


Introduction
Access to large-scale real-world data (RWD) to support basic and translational science in clinical research and development is both a significant opportunity and a challenge for life sciences and the pharmaceutical industry. It is well recognized that randomized, controlled trials provide high-quality data on restricted patient populations (little co-morbidity, not including older patients, etc.) and that high-volume, routinely collected data have the potential to provide insights into the health situation and treatment effectiveness in a more representative diversity of patients, as well as to permit hypothesis generation regarding rare conditions, rare effects and rare biomarkers. Population health data have been used as a source of knowledge discovery for decades, so using RWD in analysis is not new. Numerous population cohorts, usually operating on a national or regional basis, and usually with many thousands of patients each, have generated a vast body of epidemiological literature. Disease, procedure and other health registries, often curated at national or regional levels, have similarly resulted in an expansive volume of scientific literature. At the opposite end of the RWD spectrum, individual care organizations such as hospitals and general practitioners (GPs) have long used their locally held data for quality and safety monitoring (e.g., via audit), and many have established clinical data warehouses for internal research use. The incorporation of electronic health record (EHR) data for research at a care site level is now well recognized and supported [1,2]. Claims databases are widely available on a large scale and are used for population health research, but are subject to recognized selection and up-coding biases that call into question the scientific validity of real world evidence (RWE) derived from them [3,4]. The focus of this review is on the novel large-scale use of routinely collected health record data; therefore, claims databases were not included.
Aggregations of data across multiple organizations also exist in practice, and this has proven to be a valuable scientific approach to working with RWD. One of the best known in Europe is the Clinical Practice Research Datalink (CPRD) [5], a governmental, not-for-profit research service, jointly funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA) in the UK. CPRD has provided anonymized primary care records for public health research since 1987, and research using CPRD data has resulted in >1700 publications in drug safety, best practice and clinical guidelines [6]. As another example, the Italian Medicines Agency, Agenzia Italiana del Farmaco (AIFA), has set up a system of registries for RWD collection as part of the reimbursement and pricing process to ensure that licensed medicines meet pre-agreed effectiveness targets [7]. According to this publication, there are currently >120 registries, through which >80 medicines are monitored in >50 therapeutic indications. Different stakeholders have different access rights to the system across 21 regions, >1000 hospitals, >24 000 clinicians, 1500 pharmacists and 32 marketing authorization holders.
A further important contribution to the conduct of research on big health data is the Observational Health Data Sciences and Informatics (OHDSI) collaboration, which enables the scaling up of research through the adoption of a common data model and tools. For example, Hripcsak et al. used OHDSI to combine data from 11 data sources, a total of 250 million patients, to examine treatment pathways in type 2 diabetes mellitus, hypertension and depression [8]. Substantial R&D investments are currently being made to develop tools, platforms and governance processes to enable the distributed analysis of multiple EHR systems [9,10]. One of the most ambitious projects is the 5-year, €56 million EU-funded European Medical Information Framework (EMIF), which is a multi-stakeholder platform creating an EU technology and governance framework that will enable the re-use and management of existing health data [11].
The objective of this paper is to identify the extent to which RWD is being utilized, or could be utilized, at scale in drug discovery, such as the identification and targeting of novel therapeutic areas. Through a comprehensive screening of peer-reviewed literature, we have cited specific examples where RWD can be used for biomarker discovery or validation, gaining a new understanding of a disease or disease associations, discovering new markers for patient stratification and targeted therapies, new markers for identifying persons with a disease and pharmacovigilance. In the context of this article, the term 'real world data' is used to describe data sources that are collected or measured outside of the randomized, controlled trial, and reflective of clinical management or naturalistic care [12]. These can include cohort studies, patient registries and data generated by patients directly [13,14]. The growing body of data held within high quality EHR systems, and the adoption of interoperability standards and harmonization methodologies, has made the large-scale analysis of EHR data more attractive and viable to aid in the development of needed new therapies [15]. To our knowledge, there has been no formal literature review examining the use of RWD sources to successfully generate new evidence in support of drug development. There is no consensus definition of big health data and, in the context of this article, we have opted to define 'big' as analyzing the data on 1 million or more subjects within RWD sets, either in one dataset or distributed over several datasets, to profile a relevant subpopulation for the published research (we later discuss the limitations of this definition).
We have sought empirical studies that have required a large population denominator to identify sufficient relevant patient numbers to generate robust results. We recognize that this inclusion criterion is not based on any authoritative or widely adopted definition or convention, and we hope that, by examining the success of evidence generation based on this definition, we will stimulate community debate on how the use of RWD for clinical research and drug development should best be characterized and differentiated from well-established epidemiological and health services research uses of data sources such as registries and CPRD. We reviewed the literature using the search string with different inclusion terms and a keyword: 'million', which was hard-wired into the title or abstract to capture the scale of study that we would consider to be suitably 'big' (Fig. 1). It was a deliberate decision to search for specific mention of a large dataset size, because our preliminary exploration of the literature revealed many studies that were conventional in scale but utilized terms like 'big data' rather indiscriminately. However, we recognize that the RWD community still needs to agree on a precise term that could be used for future literature reviews of this kind, as discussed later, especially when considering rare conditions where a large population database is required to find relatively small numbers of precisely specified patients. Without setting any date limitations, the above search string identified 534 publications in PubMed. However, as can be seen in Fig. 2, the clear majority of the publications that were screened were published within the past 10 years. We adopted a manual title and abstract screening to characterize the retrieved results against those kinds of evidence that are most relevant to clinical research and development, and the subject of this literature review. 
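As a minimal sketch of how a search string of this kind might be assembled, the Python snippet below joins a set of health record source terms with OR and appends the keyword 'million' restricted to the title or abstract. The function name and the two example terms are hypothetical and are not the inclusion terms used in this review (those are shown in Fig. 1); the only element taken from PubMed itself is the `[Title/Abstract]` field tag.

```python
def build_query(source_terms, scale_keyword="million"):
    """Combine source terms with OR, then require the scale keyword
    in the title or abstract (PubMed field tag [Title/Abstract])."""
    if not source_terms:
        raise ValueError("at least one source term is required")
    # Quote each term so multi-word phrases are searched as phrases.
    sources = " OR ".join(f'"{term}"' for term in source_terms)
    return f"({sources}) AND {scale_keyword}[Title/Abstract]"


# Hypothetical inclusion terms, for illustration only.
EXAMPLE_TERMS = ["electronic health record", "real world data"]

if __name__ == "__main__":
    print(build_query(EXAMPLE_TERMS))
```

Keeping the scale keyword as a separate parameter would make it straightforward to repeat the search with an alternative size term once the community agrees on one.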
A total of 32 publications were retained, and subjected to independent full-paper review by three authors. Given our focus on RWE supporting clinical research and development, our full-paper screening sought to verify our initial inclusion criteria and select only those papers with novel techniques and findings that were directly applicable to current needs of biopharmaceutical R&D on the basis of five kinds of evidence, under which our findings are grouped. Twenty publications were retained for inclusion in this review; see Fig. 3 for the PRISMA diagram [16]. Most of the publications we eliminated in screening described the potential of RWD as an opportunity, sometimes with examples of knowledge gaps that might be filled through RWD. However, this large body of publications did not offer any actual findings from data. Our screening clearly demonstrated a diminishing number of studies reporting the practical application of RWD to drug development as one went back in time. The greatest alignment to the objectives of the study was found in publications from the past five years, to which we limited our screening for final inclusion.
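The screening counts reported above can be tallied as a simple flow. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes that no records were removed as duplicates or added from other sources, which the PRISMA diagram in Fig. 3 may show differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningFlow:
    identified: int          # records returned by the PubMed search
    full_text_reviewed: int  # records retained after title/abstract screening
    included: int            # records included in the final review

    def __post_init__(self):
        # Each stage can only keep a subset of the previous stage.
        if not (self.identified >= self.full_text_reviewed >= self.included >= 0):
            raise ValueError("counts must not increase between stages")

    @property
    def not_retained_at_title_abstract(self) -> int:
        return self.identified - self.full_text_reviewed

    @property
    def excluded_at_full_text(self) -> int:
        return self.full_text_reviewed - self.included

    @property
    def inclusion_rate(self) -> float:
        return self.included / self.identified if self.identified else 0.0


# Counts as reported in the text: 534 identified, 32 reviewed in full, 20 included.
flow = ScreeningFlow(identified=534, full_text_reviewed=32, included=20)

if __name__ == "__main__":
    print(flow.not_retained_at_title_abstract)  # 502
    print(flow.excluded_at_full_text)           # 12
    print(f"{flow.inclusion_rate:.1%}")         # 3.7%
```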
During the manual title and abstract screening, we excluded publications that described the needs, opportunities or challenges of using big data, or that described databases that had the potential to be used for big data research but did not include any concrete empirical research findings. We also excluded editorials and publications describing big data research methodologies without offering any empirical findings. Studies that were literature reviews, not empirical studies, were used for source material, and relevant articles that fitted our criteria were incorporated into the abstract search and screening, but the review articles themselves were not included unless they reported original empirical findings. The 20 selected studies fall into five basic applications of RWD for clinical research and development (Fig. 4). For any paper on which the reviewers' decision was not unanimous, consensus was reached at a dedicated face-to-face meeting. Details about the size and source of the datasets are presented in Table 1. Below, we have included papers that provide a concrete example of a use of RWE that could be harnessed and repurposed for drug discovery. For each application, we give an in-depth background on the one example that we feel best exemplifies a use case for RWE, and also highlight several other examples that met our criteria.

Case study
Biomarker discovery or validation
Only one publication highlighted novel uses in the development or application of RWD techniques in the identification or validation of new biomarkers. Of particular interest was Zodiac, which uses a Bayesian model to create a more effective map of cancer outcomes based on the analysis of genetic interactions as biomarkers found in The Cancer Genome Atlas (TCGA) database of 200 million patient records [17].
A new understanding of a disease or a disease association
Six of our selected papers demonstrate how novel uses of RWD can foster new understandings of disease associations and/or comorbidities that would be particularly useful when trying to target new populations or indications for research. Of note was a study that used the Taiwan National Health Insurance Database of >782 million outpatient visits to develop the Cancer Associations Map Animation (CAMA). By tracking previously unmapped cancer-disease associations across ages and genders, CAMA can effectively detect cancer comorbidities earlier than is possible by manual inspection and identify potential effect modifiers or new risk factors [18].
Other studies: An analysis of 25 million patient records of the US Veterans Administration discovered that those with periodontal disease were more likely to have rheumatoid arthritis [19]. The long-held belief that long-acting β2-agonist (LABA) monotherapy in the treatment of asthma increases the risk of mortality could not be proven in a RWD study of a cohort of 994 627 patients [20]. Primary open angle glaucoma (POAG) is associated with a moderate increase in risk for vascular dementia. Further, the likelihood of a hospital record of POAG following Alzheimer's disease or vascular dementia was very low [21]. Linking EHRs in CALIBER (cardiovascular research using linked bespoke studies and electronic health records) showed that the assumptions that blood pressure has an impact on all cardiovascular diseases, and that diastolic and systolic associations are concordant, are not supported by outcomes data [22]. By modeling temporal relationships between 41.2 million time-stamped International Classification of Diseases (ICD) codes in 1.6 million patients, researchers discovered that diabetes usually preceded the diagnosis of Helicobacter pylori (a bacterium linked to ulcers), leading to questions of cause and effect of the two conditions [23].

Figure 1. Graphic representation of search strategy. Literature was reviewed by combining a set of inclusion criteria for kinds of health record source, combined only with the keyword 'million' to generate the actual search string. The term 'million' was hard-wired into the title or abstract as an indication of the scale of study that we would consider to be suitably 'big'.

Figure 3. Preferred reporting items for systematic reviews and meta-analyses (PRISMA) diagram for this review.

Discovering or validating new markers for patient stratification and targeted therapies
With the continuing push toward stratified, targeted therapies, the use of RWD has immediate implications for drug development and efficacy, and our research identified four examples where the use of large datasets created novel approaches to stratification of patient populations. One of the selected studies developed a novel approach that could be generalized across multiple disease areas. By using a flexible framework called generalized low rank models (GLRM), the researchers could successfully capture known and putative phenotypes using vastly different datasets including text from physician notes [24].
Other studies selected included: The use of RWD to inform and verify the use of concomitant corticosteroid in the treatment of patients with metastatic castration-resistant prostate cancer [25]. A study of 27 million patient records that accurately determined individual risk factors post knee arthroplasty [26]. An analysis of EHR data taken from multiple healthcare systems over the period 1999-2011 found that patient weight had more effect than height on venous thromboembolic events [27].
New markers for identifying persons with a disease (e.g., formerly undiagnosed patients)
Drug development will increasingly require identification of new disease markers that can better identify previously undiagnosed patients, and our research found five examples of this. Specifically, given the lack of treatments in neurological disorders, the use of algorithms in the identification of new patients is a pressing need for the biopharmaceutical sector. One study outlined the effective application of semiautomated mining of EHRs to ascertain bipolar disorder patients and control subjects with high specificity and predictive value when compared with diagnostic interviews [28]. This technique could have broad applicability across many research areas in neurology.
Other studies selected included: The outcomes of 2.8 million data points taken from the real world pragmatic use of the therapy ranibizumab in the treatment of age-related macular degeneration when compared with the results of the randomized clinical trial [29]. The risk of lung cancer patients developing pulmonary embolism when compared with cancer-free controls when analyzing 3 million Dutch hospitalizations [30]. A nationwide cohort study of the incidence and mortality of acute and chronic pancreatitis in The Netherlands found that disease burden and healthcare costs will probably increase, linked to the ageing Dutch population [31]. An algorithm developed at Vanderbilt University that enabled the rapid searching of an EHR database of 2.5 million subjects to accurately identify systemic lupus erythematosus [32]. The ability to use algorithms and large datasets to rapidly identify previously undiagnosed and unknown patient populations would not only have a direct impact on lupus research but also has the potential to be applicable to autoimmune disorders more broadly.

Drug safety studies
Several examples of the use of RWD can be found in drug safety. Naturally, speed and accuracy of discovery are of vital importance in safety-related situations, and an accurate understanding of adverse events would be of enormous benefit to regulators, patients and industry. We selected four papers as exemplars of best practice. The ability to utilize social media to automate drug safety monitoring could radically reduce its costs and accelerate results. The most distinctive of the four papers explored the practical use of 11 million 'Tweets' to determine the frequency of prescription drug and polydrug abuse using unsupervised machine learning. The study concluded that social media could be a viable methodology for drug abuse surveillance [33]. Other studies selected: Aggregated, de-identified EHR data for multivariate pharmacosurveillance of 10 million individuals could provide sufficient insight and statistical power to detect potential patterns of medication side-effect associations [34]. Claims-based surveillance of >14 million vaccinations did not indicate a statistically significant elevated Guillain-Barré syndrome rate following seasonal or H1N1 influenza vaccination [35]. Nine million clinical notes for >1 million patients were used to detect statistically significant drug safety signals at co-occurrences of drug-disease mentions [36].

Discussion
Although the selected papers all address distinct areas of clinical application, they share several overarching themes. First, the use of large datasets broadly enables a far better understanding of treatment pathways in diagnosis and efficiency of treatment, as well as drug safety. Second, with the ability to harness multiple EHR systems, it is now possible to sift for rare indications and develop unique algorithms to find therapeutic 'diamonds in the rough', as well as to uncover early or previously missed indications of disease incidents that might have been undetectable without the judicious harnessing of RWE. Although there is a tsunami of sky-high rhetoric about big data, our selected papers show that this work is still in an early phase of practical application, and is being harnessed broadly where it serves the most direct need in public health applications in early, rare and novel disease incidents. RWE is delivering results, but it is not yet ubiquitous outside of a few areas in public health. Additionally, one of the key propositions this paper set out to test, that RWE can be used to assist in the targeting of novel therapeutic areas in drug development, has yet to be supported by the papers we have selected. None of the papers we finally identified was specifically geared toward novel targets or indications in the biopharmaceutical sector. The majority of the studies were focused more generally on the area of public health, often sponsored by the universities themselves, insurance providers or in combination with public health bodies such as national insurers. Given that large public health data are currently often owned at the hospital system or national level, this makes sense in hindsight. Much of the usable RWD is housed in large EHRs owned by public health bodies or insurance organizations responsible for reimbursement.
It stands to reason that the goals of most public EHR owners are not currently focused on the discovery and development of new molecular entities in the pharmaceutical sector, which could be a reason why our initial goal of finding best-case examples for drug targets has gone unmet by this exercise. In addition, because our search strategy focused on publicly listed peer-reviewed literature, studies that were business-driven and pharma-sponsored could be largely unpublished and treated as commercially confidential intelligence, even if the source data are widely accessible. Because many of the contributors to this paper are currently collaborating with several industry partners in the use of RWE for drug discovery applications, we do know that the research is occurring, but it apparently is not yet appearing in the open body of knowledge as peer-reviewed literature. Given this lack of specific drug discovery examples in the final papers of our screening, we chose studies that clearly demonstrated uses of RWD techniques or applications that can be re-appropriated or reverse-engineered for commercial, unmet medical needs in clinical research and drug development, often in areas on which drug companies are currently focused; namely, oncology, neurological disorders, cardiovascular disease (CVD) and autoimmune diseases. As highlighted in our results, any discoveries employing large datasets will need to be investigated further to minimize confounding variables and establish their clinical validity for pharmaceutical applications. We noted that data quality was rarely discussed, although it is an important consideration for RWD. Nevertheless, we are confident that these exemplars provide a valuable contribution to insights on the use of RWD to advance disease understanding and biomarker discovery.

Strengths and limitations
There are several important limitations to this literature review that might have impacted the findings. Owing to the rapidly evolving nature of technology, we limited our results to papers from 2012 to 2016 to capture the state of the art; this could exclude some relevant earlier examples. Our EHR search is limited to health records, and excludes other databases such as genomic data, immunochemistry and claims databases. As well, our search only sought evidence within the peer-reviewed literature and there could be examples currently being investigated privately within industrial R&D that are considered proprietary. We therefore recognize that RWD work published in journals listed in PubMed is no more than the tip of the iceberg regarding the use of RWD for drug R&D.
In specifically seeking out large-scale uses of RWD (i.e., big data), we limited our data sources to those that had the keyword 'million'. Although we assume that 'million' will capture large datasets, we are aware that this might not be the case: the Italian Government and OHDSI examples cited in our introduction were not captured, even though they are best-practice examples in the application of RWD. There is a need for the field to agree on what defines big data and RWD to facilitate consistent empirical research in this area going forward. Discussion is also needed on other measures of signals that should be considered when evaluating a RWD-based study, including effect size, number needed to treat and sample size. Consequently, we accept that this literature review offers an illustrative insight rather than a comprehensive examination of the present state of empirical research in the field. We have deliberately not claimed this to be a systematic literature review. We recognize that smaller-scale RWD could also be useful [37].

Implications of this work to pharma
Despite these limitations, a methodology exists for focused literature review that can provide insights for clinical research and drug development pathways utilizing RWD. Targeting of real world studies can elucidate possible partners and collaborators with whom pharmaceutical companies could explore the opportunity to work together on gathering real world insights from their external data sources. When a study is identified, only its empirical results are known, and a pharmaceutical researcher will need to establish a partnership to have, at a minimum, an opportunity to review the original data (within the ordinary and current constraints of such a task). Beyond this, working within a quid pro quo relationship, researchers and the original real world study authors have an opportunity to support drug development and work beyond the remit of the original study. We envisage that this might form the basis of a mutually beneficial collaboration.
By its very definition, RWD is not necessarily accessible to pharma, because local provenance and governance must be protected. Remote connectivity to RWD, whether through federated data networks, common data models as intermediaries or indirect analytical outputs, might be the only arrangement with pharma that is acceptable to the data custodians of studies such as those described in this manuscript. Such an undertaking involves a very different relationship with data from the one pharma is accustomed to, for instance with its own randomized clinical trial data. Longitudinal collaborations must rest on a mutual relationship of trust, with transparency of intended use paramount to successful research. To reciprocate with regard to transparency, we call on pharma to add to the body of evidence in this domain via peer-reviewed publications. As the use of RWE within R&D increases in prominence, evidence of its veracity will be required not only by pharma but also by regulatory authorities and others who need to understand the role of RWE in 21st century drug development.

Concluding remarks
We have observed a steady, almost exponential, increase in the publication of empirical research that is within the scope of our review. From the early 2000s we have seen a steady growth in papers about the opportunity of using big health data, methodology papers and papers describing various solutions such as data warehouses and analytics platforms. During the past 5 years, and especially over the past 3 years, we have seen a growing number of actual empirical findings from using big health data relevant to clinical research and drug development. We anticipate continued growth in the quantity, sophistication and scale of this research area.
To accelerate the generation of RWE relevant to clinical research and drug development, we believe that pharma needs to invest in making better use of EHRs and their linkage to molecular databases (within the right governance and technology frameworks). We see the need for more precompetitive collaboration to grow the scale of this 'big denominator' capability, especially given the needs of precision medicine research. We also foresee the need for richer academic-industry-government partnerships, which will depend upon the willingness of governments to provide industry with access to anonymized health data and work collaboratively across academic centers, to reach the necessary population scale. Finally, the authors hope that these opportunities to scale up RWE will help to stimulate improvements in the data quality and interoperability of RWD sources across healthcare and academia.