Feasibility of Using Oncology-Specific Electronic Health Record (EHR) Data to Emulate Clinical Trial Eligibility Criteria †

Abstract: We examined eligibility criteria from recent oncology clinical trials to see whether real-world data (RWD) from electronic health records (EHRs) could be used to create external control groups for clinical trials. Trials were identified from the Aggregate Analysis of ClinicalTrials.gov database; the selected trials were for oncology drugs approved by the FDA in 2020. Verbatim text from trial inclusion and exclusion criteria was qualitatively assessed by an expert panel to determine if criteria could be ascertained from structured and unstructured EHR data. Identified criteria were categorized (cancer-related, comorbidity-related, demographic, functional status, and trial operations) and subcategorized. Among 53 identified trials, 20 met the requirements for study inclusion; together these trials included 463 eligibility criteria. Percentages of criteria by category were as follows: cancer-related factors (46%), comorbidities (20%), functional status (18%), trial operations (14%), and demographics (2%). For 18 of the 20 trials, at least 80% of the eligibility criteria could be ascertained with RWD; for 4 of the 20, all criteria (100%) could be ascertained. When trial operation-specific criteria were excluded, all 20 met the 100% threshold. Our study indicates that both structured and unstructured data from community-based oncology-specific EHRs can be used for determining patient eligibility for external control arms for clinical trials.


Introduction
The 21st Century Cures Act of 2016 established the legislative mandate to use real-world evidence (RWE) for regulatory decision making [1,2]. Real-world data (RWD), which are used to generate RWE, are often defined as data collected in routine healthcare delivery versus data collected specifically for research purposes. The use of RWD vs. traditional clinical trials can provide faster, usually more generalizable, and lower-cost assessments of the utility and risk-benefit profile of a medical intervention, potentially enabling accelerated medical product development and advances to patients. However, because RWD collection is not typically customized for any specific research question, there are nuances related to how these data are best used for decision making.
One promising application of RWD involves the construction of external control groups for clinical trials [3]. While prior studies have reported challenges related to the feasibility of emulating trial eligibility criteria with "claims and/or structured" RWD [4], few, if any, have focused on assessing the utility of accessing data from oncology-specific electronic health records (EHRs) using both structured (e.g., standardized fields) and unstructured (e.g., free text physician notes and radiographic reports) data. Although structured data in EHR systems can be readily extracted for research, richer clinical data relevant to external control group construction often exist in the unstructured data fields from the EHR. These data can be curated using natural language processing technologies or manual chart abstraction and therefore should be viewed as a significant potential source of deeper RWD even though more resources are required for curation.
The purpose of this study was to examine the specific use case of eligibility criteria from recent oncology clinical trials to assess the degree to which RWD from an oncology-specific EHR in a community setting can be reasonably used to retrospectively access the information needed for emulation of inclusion/exclusion (I/E) eligibility criteria for external control arms in oncology clinical trials. The study only reviews I/E criteria to specifically address the conclusions of Wallach et al. [4]. Future research should examine the ability to emulate other aspects of the RCT, such as clinical outcomes, using RWD.

Results
The categorization of each criterion into categories and subcategories was complex; actual examples of each selected individual criterion and their placement into categories and subcategories by the panel are detailed in Table 1. Percentages of criteria by category were cancer-related factors (46%), comorbidities (20%), demographics (2%), functional status (18%), and trial operations (14%) (Table 2) [5,6].
Eighteen of the 20 trials met the 80% threshold for eligibility criteria likely to be ascertainable with RWD, while 4 of 20 trials met the 100% threshold when all criteria were considered. Trial-specific criteria are criteria that may be essential to the operation of the clinical trial but that do not necessarily correspond to data collected within the context of routine care and documentation in the real-world setting. For example, "signed informed consent" to participate in the clinical trial would not be a part of routine care and therefore would not be in the EHR. Documentation of the use of multiple birth control methods for female patients of child-bearing age may be required to ensure patient and fetal safety in the clinical trial setting. However, documentation of this in the real-world setting is not necessarily required. When trial-specific criteria were removed from the assessment, all 20 trials met the 100% threshold for ascertainable trial criteria (Figure 1).
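The threshold comparison described above can be illustrated with a minimal Python sketch. This is not the SAS analysis used in the study; the trial identifiers, the toy records, and the function names `ascertainable_fraction` and `count_meeting` are hypothetical and serve only to show how a per-trial ascertainable fraction might be computed with and without trial-operations criteria.

```python
from collections import defaultdict

# Hypothetical toy records: (trial_id, category, scored "likely" ascertainable).
# These are illustrative values, not data from the study.
criteria = [
    ("T1", "cancer-related", True),
    ("T1", "comorbidity", True),
    ("T1", "functional status", True),
    ("T1", "demographic", True),
    ("T1", "trial operations", False),
    ("T2", "cancer-related", True),
    ("T2", "trial operations", False),
    ("T2", "trial operations", False),
    ("T3", "cancer-related", True),
    ("T3", "comorbidity", True),
]


def ascertainable_fraction(records, exclude_categories=()):
    """Return {trial_id: fraction of criteria scored likely ascertainable}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for trial_id, category, likely in records:
        if category in exclude_categories:
            continue  # e.g. drop trial-operations criteria from the denominator
        totals[trial_id] += 1
        hits[trial_id] += int(likely)
    return {trial_id: hits[trial_id] / totals[trial_id] for trial_id in totals}


def count_meeting(fractions, threshold):
    """Count trials whose ascertainable fraction is at least `threshold`."""
    return sum(1 for fraction in fractions.values() if fraction >= threshold)


all_criteria = ascertainable_fraction(criteria)
without_ops = ascertainable_fraction(criteria, exclude_categories=("trial operations",))

print(count_meeting(all_criteria, 0.80), count_meeting(all_criteria, 1.00))
print(count_meeting(without_ops, 1.00))
```

In this toy example, excluding the trial-operations category removes the only criteria scored "not likely", so every trial reaches the 100% threshold, mirroring the pattern reported in Figure 1.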

Figure 1. Trial emulation rate before and after removing trial-specific criteria. Trial-specific criteria are criteria that may be essential to the operation of the clinical trial, but not necessarily data collected within the context of routine care and documentation in the real-world setting. For example, "signed informed consent" to participate in the clinical trial would not be a part of routine care and therefore would not be in the EHR. Other examples include life expectancy, agreement to be abstinent, and donor status.

Discussion
Building on previously cited reports [4], the inclusion of unstructured data along with structured data from oncology-specific EHRs vastly improves the proportion of eligibility criteria that are likely to be ascertainable for clinical trial emulation. Rich unstructured data are accessible in patient charts and should not be neglected as a potential source of information for evaluating the eligibility of patients, especially as opportunities for implementing scalable technological solutions for abstracting data become more tangible. Our study indicates that RWD-based external controls constructed from oncology-specific EHR data are a conceivable solution.
In the context of phase 3 clinical trials, when external control arms are being designed for regulatory purposes, sponsors and regulators should perhaps carefully consider crafting the inclusion and exclusion criteria to be applicable, as much as possible, in both the clinical trial setting and in real-world practice. We would recommend starting with data available in RWD and, when applicable, using these data to generate inclusion/exclusion criteria. Further, specific operational criteria that do not impact study validity should be identified and clearly noted when external control arms are envisioned. The judicious use of eligibility criteria to identify real-world external controls would maximize external validity while preserving internal validity. Inclusivity in clinical trials is a major topic in the current literature [7], and the use of RWD to construct external control arms can aid in improving inclusivity and generalizability. Indeed, restrictive clinical trial eligibility criteria have been cited as one of the major barriers to the participation of a more diverse population in trials, and "revised criteria may improve participant diversity, without compromising safety or study results" [8]. Therefore, prioritization of key criteria and the relaxation of any criteria that are largely unrelated to internal validity would optimize opportunities for inclusion. As stated earlier, variables found in structured vs. unstructured data will vary from EHR system to EHR system, and structured data often need to be supplemented with other data sources [9]. However, because unstructured data are the "gold standard," the best way to emulate an RCT is to use structured data supplemented with unstructured data, and any assessment of agreement between these two sources would treat the unstructured data as the reference. Unstructured data remain an under-appreciated and poorly researched resource.
An empirical test of emulating inclusion/exclusion criteria (as well as trial process and outcome measures) with RWD would be an obvious next step in research.

Methods
In accordance with 45 CFR §46, institutional review board approval was not required for this study because public information was used (no patient data were utilized and informed consent was not required).
FDA approvals of oncology drugs in 2020 were identified and matched to trial data from the Aggregate Analysis of ClinicalTrials.gov (AACT) database (accessed 17 December 2021). Trial data from phase 3 clinical trials with outcome information were included. The verbatim text from the reported inclusion and exclusion eligibility criteria in these trials was downloaded from AACT for evaluation. Reported criteria that contained multiple components were separated into discrete, individual criteria. For example, if a criterion was stated as "women over age 35", then the components were listed as separate entries for sex (women) and age (over age 35).
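The separation of compound criteria into discrete entries can be pictured as a simple record structure. The following Python sketch is illustrative only; the `Criterion` fields, the `split_criterion` helper, and the identifier "TRIAL-001" are hypothetical and are not taken from AACT or from the study's workflow. The split itself is assumed to be a reviewer judgment supplied by hand, not something parsed automatically from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    trial_id: str   # hypothetical trial identifier
    kind: str       # "inclusion" or "exclusion"
    component: str  # what the entry describes, e.g. "sex" or "age"
    text: str       # wording of the discrete criterion


def split_criterion(trial_id, kind, components):
    """Expand one reported criterion into one record per component.

    `components` maps a component name to its wording and is supplied
    by a human reviewer.
    """
    return [
        Criterion(trial_id, kind, name, wording)
        for name, wording in components.items()
    ]


# "women over age 35" becomes two discrete entries: sex and age.
rows = split_criterion(
    "TRIAL-001", "inclusion", {"sex": "women", "age": "over age 35"}
)
for row in rows:
    print(row)
```

Keeping one record per component means each entry can later be categorized and scored independently.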
These were qualitatively assessed by an expert panel (the authors) representing expertise in medicine, pharmacy, epidemiology, nursing, and chart abstraction. The panel has expertise and experience accessing RWD from an oncology EHR (iKnowMed) for research purposes; iKnowMed is the EHR for The US Oncology Network, a network of community oncology practices covering 40 states of the United States with 1.2 million newly diagnosed cancer patients yearly; iKnowMed is used by over 2700 oncology providers [10]. There are many EHR systems available, and the degree to which specific data are included in structured versus unstructured fields will vary from system to system and even within different implementations of the same system. The panel assessed each criterion for its inclusion in the medical record as part of routine patient care (versus clinical trial purposes), and each criterion was scored as "likely" or "not likely" to be ascertainable in structured and/or unstructured EHR data. Any discordance in scores was adjudicated through discussion.
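As an illustration of the scoring step, the sketch below flags criteria on which individual panel scores disagree so that they can be taken to discussion. It assumes, hypothetically, that each panelist's score is recorded per criterion; the function name `adjudication_queue`, the criterion ids, and the example scores are invented for illustration and do not describe the panel's actual tooling.

```python
ALLOWED_SCORES = {"likely", "not likely"}


def adjudication_queue(scores):
    """Return the ids of criteria whose panel scores are discordant.

    `scores` maps a criterion id to the list of individual panel scores,
    each either "likely" or "not likely".
    """
    queue = []
    for criterion_id, votes in scores.items():
        if not votes or not set(votes) <= ALLOWED_SCORES:
            raise ValueError(f"unexpected scores for {criterion_id}: {votes}")
        if len(set(votes)) > 1:  # more than one distinct score means discordance
            queue.append(criterion_id)
    return queue


# Hypothetical panel scores for three criteria.
panel_scores = {
    "C1": ["likely", "likely", "likely"],
    "C2": ["likely", "not likely", "likely"],
    "C3": ["not likely", "not likely"],
}
discordant = adjudication_queue(panel_scores)
print(discordant)
```

Only criteria with mixed scores enter the queue; unanimous "likely" or "not likely" scores are taken as final.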
The resulting tabulation of criteria was then organized by the panel into categories and subcategories to facilitate assessment. Many criteria could be worded as a positive (e.g., adequate hepatic function) or a negative (inadequate hepatic function), affecting placement as an inclusion or exclusion criterion. Either distinction would be classified as a single subcategory by the panel.
The criteria were classified into 5 categories and 29 subcategories. Under the trial-specific category, a subcategory of trial operations was created for criteria related to logistical and operational elements of running a clinical trial, such as informed consent, pre-randomization pregnancy testing, and behavioral restrictions (such as a promise to abstain from sex). The proportion of trials with at least 80% and those with 100% of eligibility criteria judged likely to be ascertainable with RWD [4] was determined. The data analysis for this paper was generated using SAS software v9.4 (SAS Institute Inc., Cary, NC, USA. Accessed on 1 January 2022 to 21 March 2023. https://www.sas.com/en_us/legal/editorial-guidelines.html).

Institutional Review Board Statement:
Institutional Review Board and Compliance/Privacy approvals were not needed as no patient data were used.

Informed Consent Statement:
Patient consent is not relevant to this study as no patient data were used.

Data Availability Statement:
The study data are available to the public from the Aggregate Analysis of ClinicalTrials.gov database and from the Food and Drug Administration website.