Empowering Clinical Trials with Natural Language Processing Models and Real-World Data: A Feasibility Study to Optimize Clinical Trial Eligibility Design with Data-driven Simulations

Background: Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocols, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) shows potential for achieving these objectives. Objective: To assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records (EHRs) using deep-learning-based NLP technologies. Methods: We obtained 3,281 industry-sponsored, interventional phase 2 or 3 clinical trials recruiting patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn's disease from ClinicalTrials.gov, spanning 2013 to 2020. A customized bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of non-small cell lung cancer patients (N=2,775), curated from Mount Sinai Healthcare System, as a pilot study. Results: We manually annotated the clinical trial eligibility corpus (N=485 trials) and constructed an eligibility criteria-specific ontology. Our customized NLP pipeline, developed based on the eligibility-specific ontology we created through manual annotation



OBJECTIVE:
To assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records (EHRs) using deep-learning-based NLP technologies.

METHODS:
We obtained 3,281 industry-sponsored, interventional phase 2 or 3 clinical trials recruiting patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn's disease from ClinicalTrials.gov, spanning 2013 to 2020. A customized bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of non-small cell lung cancer patients (N=2,775), curated from Mount Sinai Healthcare System, as a pilot study.

RESULTS:
We manually annotated the clinical trial eligibility corpus (N=485 trials) and constructed an eligibility criteria-specific ontology. Additionally, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients.

Introduction
Clinical trials are crucial for developing new therapies, but they require significant resources and can introduce delays in drug development, leading to increased costs [1,2]. Complex and restrictive eligibility criteria hinder patient enrollment, affecting target goals, timelines, and ultimately patient well-being [3][4][5].
This issue is particularly notable in cancer trials, which have poor recruitment and high failure rates [6][7][8]: over 80% fail to meet their initial target accruals and timelines [9,10]. Additionally, overly restrictive criteria limit the representation of the broader patient population, reducing real-world applicability and treatment impact [11][12][13][14]. Nonetheless, reusing complicated criteria without a clear rationale is common [15], despite their minimal impact on trial outcomes [16]. Liu et al demonstrated that broadening criteria using a data-driven approach can benefit initially excluded patients [16]. A comprehensive and standardized eligibility criteria knowledge base that is compatible with real-world data can address these challenges. Such a knowledge base optimizes trial protocol design, improves patient enrollment, enhances the reliability and applicability of evidence synthesis, and fosters efficient development of new therapies. Furthermore, it enables opportunities such as generating synthetic control arms (SCAs) for single-arm clinical trials using electronic health records (EHRs) [17][18][19].
Datasets like "Chia" [28] and "Leaf Clinical Trials" [29] improve NLP models. An NLP interface, Criteria2Query, enables computable queries for eligible cohort identification using EHRs [30]. This tool supports the development of a clinical trial knowledge base, enhancing EHR interoperability and scalability for efficient eligibility criteria knowledge engineering [31]. Despite significant progress in bridging the gap between eligibility criteria and EHRs, limitations persist in accurately representing the granularity of eligibility criteria and in checking the number of eligible patients in real time [21,32,33]. Establishing a standardized eligibility criteria knowledge base by transforming ambiguous hypernym concepts into computable hyponyms can improve both the optimization of trial protocol design and the identification of eligible patients through seamless integration with EHR data.
In this study, we aim to create a standardized eligibility criteria knowledge base that seamlessly integrates with EHRs. Using deep learning-based NLP techniques, hypernym concepts in eligibility criteria will be converted to their EHR-compatible hyponyms with corresponding values. Additionally, an interactive application prototype will be developed as a pilot study, enabling data-driven optimization of clinical trial protocols and identification of eligible patients through the integration of the eligibility criteria knowledge base and EHRs. For the development of the prototype interface, we selected a subgroup of 2,775 patients diagnosed with NSCLC from the previously curated lung cancer cohort. This cohort was established using data from the Mount Sinai/Sema4 Healthcare System [34], and the patient information was de-identified for the purpose of this study.

Deep learning-based NLP pipeline development
Our NLP pipeline consists of three modules: ontology construction, manual annotation/model training, and application.
Ontology construction: We randomly selected 425 eligibility criteria from diverse cancer trials and manually analyzed entities and relations. Entities were categorized into primary and modifier groups.
Relations between entities were defined. The applicability of the ontology was tested on 60 ulcerative colitis (UC) and Crohn's disease (CD) trials.

Manual annotation and model training:
We manually annotated 246 eligibility criteria from NSCLC trials using the Clinical Language Annotation, Modeling, and Processing (CLAMP) NLP toolkit [35].

Application: The fully trained NER and relation models were integrated and applied to annotate the remaining eligibility criteria for the four types of cancer studied. The output data included sentences, tokens, parts of speech, entities, negations, and relations.

Construction of standardized eligibility criteria knowledge base table
The standardized knowledge base was constructed in an "EntityGroup-AttributeName-Value" format, involving two key steps: attribute normalization and transformation of hypernyms into hyponyms.
Attribute normalization: To normalize attributes, we employed a three-step approach. First, we assigned a Unified Medical Language System (UMLS) concept unique identifier (CUI) to map synonyms of an entity; for example, "estrogen receptor-positive", "ER-positive", and "ER+" were mapped to the UMLS CUI "C0279754". Second, we developed a set of rules (Table 1) to map abbreviations (e.g., CrCl to creatinine clearance) and different phrases with the same meaning (e.g., ">=1.5xULN", "greater than or equal to 1.5x ULN", ">=1.5xupper limit of the normal range") back to their original text. Lastly, two domain experts manually curated unnormalized entities.
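The rule-based step can be pictured as a lookup from surface variants to one normalized form. The following Python sketch is illustrative only and is not the pipeline used in this study: the `RULES` dictionary holds a few rows echoing Table 1 plus the CrCl example, and the names `RULES`, `LOOKUP`, and `normalize_attribute` are hypothetical. It assumes matching is case insensitive and ignores spaces, as listed under the miscellaneous rules of Table 1.

```python
import re

# Hypothetical rule table: normalized form -> surface variants.
# Rows are a small illustrative subset, not the study's full rule set.
RULES = {
    "ANC": ["ANC", "absolute neutrophil count", "absolute neutrophil counts",
            "neutrophil count", "neutrophil counts"],
    "WBC": ["WBC", "white blood cells", "white blood cell", "WBC count",
            "white blood cell count", "leucocytes"],
    "creatinine clearance": ["CrCl", "creatinine clearance"],
    "within 3 months": ["last 3 months", "past 3 months",
                        "within 3 months", "within three months"],
}


def _key(text):
    """Build a comparison key: lowercase with all whitespace removed."""
    return re.sub(r"\s+", "", text).lower()


# Flatten the rules into a variant-key -> normalized-form lookup.
LOOKUP = {_key(variant): canonical
          for canonical, variants in RULES.items()
          for variant in variants}


def normalize_attribute(text):
    """Return the normalized form, or the input unchanged if no rule matches."""
    return LOOKUP.get(_key(text), text)
```

Attributes that no rule covers pass through unchanged, which mirrors the final step above in which unnormalized entities are left for manual curation.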

Creation of a prototype interface for enhancing trial protocol design optimization
We developed a prototype interface using the R programming language and the shiny package to enhance trial protocol design optimization. The interface allows users to simulate the number of eligible patients based on a single criterion or a combination of criteria, including histology, stage, lab test values, performance scores, number of prior lines of therapy, and comorbidities. For this pilot study, a subset of NSCLC patients (N=2,775) was selected and de-identified. To ensure consistency and accuracy, we standardized the sample entities found in both the eligibility criteria knowledge base and EHRs using concept codes such as ICD-10, LOINC, and RxNorm codes. Additionally, each patient's absolute lab test values were converted to multiples of the upper or lower limit of normal (ULN or LLN) based on the provided normal range for each specific test.
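As a rough illustration of the lab value conversion, the Python sketch below expresses an absolute lab result as a multiple of the upper limit of normal so that it can be compared with a criterion such as "<=2.5xULN". It is a sketch under stated assumptions, not the prototype's R code: the function names are hypothetical, and the normal-range value in the usage note is an invented example, since real ranges come from each laboratory.

```python
def to_uln_multiple(value, upper_limit_of_normal):
    """Express an absolute lab value as a multiple of the upper limit of normal."""
    if upper_limit_of_normal <= 0:
        raise ValueError("upper limit of normal must be positive")
    return value / upper_limit_of_normal


def meets_uln_criterion(value, upper_limit_of_normal, max_multiple):
    """Check a criterion of the form '<= max_multiple x ULN'."""
    return to_uln_multiple(value, upper_limit_of_normal) <= max_multiple
```

For instance, assuming a hypothetical ALT upper limit of 40 U/L, a result of 100 U/L corresponds to 2.5xULN, which satisfies a "<=2.5xULN" criterion but not a "<=1.0xULN" criterion.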

Development of eligibility criteria-specific ontology
Our analysis of cancer clinical trials revealed that hormone therapy was the most frequently tested modality (47.37%), primarily in breast cancer (BCA) and prostate cancer (PCA) trials, followed by targeted therapy (25.35%) and immunotherapy (23.26%). Chemotherapy alone was tested in less than 4% of clinical trials. We developed an eligibility criteria ontology applicable to all cancer trials by manually analyzing 425 trials (Figure 1). Entities were categorized into ten primary groups (inside the blue dotted line) and nine modifier groups based on semantic types and relations. Entities falling outside the primary groups were classified as "other observation". Inclusion criteria mainly involved entities in the "demographic, diagnosis, lab test, and vital" groups, while exclusion criteria commonly included entities in the "comorbidity, procedure, and other medications" groups. Entities in the "biomarker, prior therapy, and clinical status" groups appeared in both inclusion and exclusion criteria.
Relationships originated from primary groups and terminated in modifier groups, except for the "has outcome" relationship, which started and ended in the primary group (Figure 1). In Figure 1, modifier entities are placed outside the blue dotted box, and each relationship between a primary entity and a modifier entity starts at the primary entity and ends at the modifier entity. To assess the applicability of the cancer eligibility criteria ontology in a different disease context, we conducted a manual analysis of 60 trials related to UC and CD. For reference, the computable format of the manually annotated 485 trials can be found in Supplement Multimedia Appendix Table 1-

NLP pipeline quality metrics
To evaluate the quality of our NLP pipeline, we computed precision, recall, and F1 measures. For primary group entities, the average scores were 0.91 (precision), 0.79 (recall), and 0.83 (F1). Table 3 presents the range of precision, recall, and F1 values of 17 primary group entities.
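For readers who want to reproduce this kind of scoring, the Python sketch below computes precision, recall, and F1 from true positive (TP), false positive (FP), and false negative (FN) counts using the standard definitions, and then averages the per-entity scores. Averaging each metric across entities (macro-averaging) is an assumption here; the exact averaging scheme is not stated in the text. It would, however, be consistent with the reported figures, because an F1 computed directly from precision 0.91 and recall 0.79 would be about 0.85 rather than 0.83. The function names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_entity_counts):
    """Average precision, recall, and F1 over entities.

    per_entity_counts is a list of (tp, fp, fn) tuples, one per entity type.
    Macro-averaging is an assumption, not the documented method.
    """
    if not per_entity_counts:
        raise ValueError("at least one entity is required")
    scores = [precision_recall_f1(*counts) for counts in per_entity_counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))
```

Because F1 is not linear in precision and recall, the mean of per-entity F1 scores generally differs from the F1 of the mean precision and mean recall.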

Eligibility criteria attribute extraction and classification
The integrated NER and relation model extracted a total of 9,090 NSCLC, 7,427 PCA, 10,217 BCA, 6,803 MM, 1,565 CD, and 1,586 UC entities along with their attribute relations. After normalization and manual curation, the eligibility criteria knowledge base for each disease type was established in the "EntityGroup-AttributeName-Value" format (Supplement Multimedia appendix Table 6-11). Figure 2 and Table 4 show the distribution of "EntityGroup-AttributeName-Value" combinations in each primary group from different diseases and provide examples. The lab test, prior therapy, and comorbidity groups exhibited a high number of "EntityGroup-AttributeName-Value" combinations, followed by the biomarker and other medication groups.
Variations were observed between solid cancers and hematologic cancer, with higher "EntityGroup-AttributeName-Value" numbers in solid cancer types for prior therapy and biomarker, while lab tests and comorbidity were comparable. The diagnosis group exhibited varying entity-attribute numbers across all four cancer types. "EntityGroup-AttributeName-Value" combinations in the biomarker, diagnosis, and prior therapy groups were specific to each indication, while shared "EntityGroup-AttributeName-Value" combinations were found in other primary groups.

Transformation of umbrella terms into computable attributes with representative values
Hypernym concepts were converted into computable attributes with representative values.
Table 5 provides some examples of converted attributes and their representative values for each hypernym.
The complete lists can be found in the knowledge base (Supplement Multimedia appendix Table 6-11).

Adequate organ function: Adequate organ function criteria were defined using various lab tests.
Normal ranges and eligible values for alanine transaminase (ALT)/aspartate aminotransferase (AST), total bilirubin, serum creatinine, CrCl, absolute neutrophil counts (ANC), platelets, and hemoglobin were determined. Representative values for "adequate organ/hematologic function" included <=2.5x upper limit of normal range (ULN) for ALT/AST, <=1.5xULN for total bilirubin/serum creatinine, >=1,500 cells/ul for ANC, >=100,000 cells/ul for platelets, and >=9 g/dL for hemoglobin. Figure 3 A-H displays the lab test value range and trial counts for each value in BCA and NSCLC clinical trials. The trends observed are similar in both cancer types.
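One simple way to arrive at a representative value is to count how many trials state each threshold for a given lab test and take the most frequent one. The Python sketch below illustrates that idea; it is an assumption about how a "prevalent" value could be derived, not a description of the authors' exact procedure, and the function name and the example thresholds in the usage note are hypothetical.

```python
from collections import Counter


def prevalent_threshold(thresholds_per_trial):
    """Return (threshold, trial_count) for the most frequently stated value.

    thresholds_per_trial holds one normalized threshold string per trial,
    for a single lab test within a single cancer type.
    """
    if not thresholds_per_trial:
        raise ValueError("no thresholds supplied")
    counts = Counter(thresholds_per_trial)
    return counts.most_common(1)[0]
```

Given an invented list such as `["<=2.5xULN", "<=3xULN", "<=2.5xULN", "<=1.5xULN"]`, the function returns the "<=2.5xULN" threshold together with the number of trials that use it; the full `Counter` also gives the per-value trial counts of the kind plotted in Figure 3.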

Comorbidity: The presence of comorbidities is a common exclusion criterion in clinical trials, but natural language descriptions of comorbidities, such as "uncontrollable cardiovascular diseases", "pulmonary diseases", and "autoimmune diseases", can be ambiguous and require domain knowledge to interpret. We analyzed the hypernyms and their corresponding hyponyms used in BCA trial eligibility criteria. Figure 4 shows the collected hyponyms for each comorbidity class. Second primary malignancies were an exclusion criterion in almost all trials.

Prior therapy, Other medication, and Biomarkers: By combining all examples of each hypernym, we broke down these hypernyms into actual medication and mutation hyponyms. For instance, we collected "procainamide" or "propafenone" for "current usage of antiarrhythmic medication". Similarly, we collected EGFR Exon 20 "T790M", "T797S", "S768I", or "insertion" for "EGFR mutations resistant to EGFR inhibitors".

Development of a prototype interface for the optimization of protocol design
Our study investigated the impact of various criteria on the number of eligible patients. We developed a prototype interface that utilizes real-world patient information. Using a subset of de-identified NSCLC patient cohorts (N=2,775), we deployed the eligibility criteria knowledge base we constructed in the interface.
Figure 5A displays the selected criteria list and Figure 5B shows the corresponding patient number. Figure 5C illustrates the distribution of patient numbers in each group.
Sequentially applying the "non-squamous histology" and "stage III and IV" criteria identified 2,166 and 426 eligible patients, respectively, from the total pool of 2,775 NSCLC cases. Adding the AST and ALT <=2.5xULN criterion yielded 363 eligible patients. Tightening AST and ALT to <=1.0xULN reduced the number of eligible patients to 315 (Figure 5D). Additionally, we explored the influence of ECOG performance status as an additional criterion. With histology, stage, and ALT/AST lab values (<2.5xULN) as fixed criteria, the introduction of either ECOG 0-2 or 0-1 identified 194 and 151 eligible patients, respectively (Figure 5E).

format using prevailing values across different cancer types and modality therapies. We believe our EHR-interoperable standardized eligibility criteria knowledge base and interface, integrating real-world EHR data, have the potential to improve automatic screening systems and identify eligible patients. This can increase patient trial enrollment, ultimately improving the overall success rate of trials. Notably, patients given the option to participate in a trial by their physicians demonstrated a significantly higher participation rate of 55% [39] compared to the current average of 5-8% among cancer patients [40,41].
Certain criteria, such as histology, stage, previous treatment, or biomarkers, are difficult to modify, while others, including vital signs or lab test values, can be adjusted during protocol design [16]. Our study showed that modifying lab test values while keeping other criteria constant changes the number of eligible patients. Our findings, which report both the number of trials for different lab value ranges and the corresponding eligible patient numbers, offer insights for optimizing future protocol design and refining patient selection criteria. Our eligibility criteria knowledge base can also be leveraged for generating SCAs using EHRs.
SCAs, derived from real-world evidence, are regarded as substitutes for experimental control arms in trials [17,18,42]. Integrating SCAs into single-arm trial data or replacing traditional control arms with SCAs can alleviate the burden of target accrual in trials with low eligible patient numbers, such as rare disease trials or oncology trials with specific biomarkers. The FDA's approval of palbociclib for male metastatic breast cancer patients based on real-world evidence demonstrates the potential and relevance of SCAs in improving trial design and outcomes [43].

Limitations
Our study has several limitations to consider. First, we focused on a limited scope, analyzing only four common cancer types and exploring extensibility in the context of inflammatory bowel disease. Future studies should encompass a wider range of cancer types and disease domains for a more comprehensive analysis. Second, while most attributes were well defined, some umbrella terms lacked clear examples in other cancer types, potentially affecting result accuracy. Further manual annotation using knowledge bases could enhance the precision of attribute tables. Third, our dataset may be biased because we included only industry-sponsored trials, potentially limiting the generalizability of our findings. Fourth, we did not address entity logic; establishing the logic between entities would enhance cohort definition accuracy. Lastly, our interface feasibility testing was limited to small NSCLC sample cohorts, and the generalizability of our findings to other populations or disease conditions may vary. Furthermore, we did not perform a quantitative evaluation of the accuracy of matched patients, although domain experts manually checked whether the patient information matched the criteria.

Figures
specific ontology. Our customized NLP pipeline, developed based on the eligibility-specific ontology we created through manual annotation, achieved high precision (0.91), recall (0.79), and F1 scores (0.83), enabling efficient extraction of granular criteria entities and relevant attributes from 3,281 clinical trials. A standardized eligibility criteria knowledge base, compatible with EHRs, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values.

Dataset
We obtained the data from ClinicalTrials.gov (https://clinicaltrials.gov), specifically industry-sponsored phase II or III interventional clinical trials initiated between January 2013 and May 2020. A total of 3,281 trials were identified, including 817 trials for non-small cell lung cancer (NSCLC), 649 trials for prostate cancer (PCA), 1,057 trials for breast cancer (BCA), 447 trials for multiple myeloma (MM), 160 trials for ulcerative colitis (UC), and 151 trials for Crohn's disease (CD).
Annotated criteria were used to train (80%) a conditional random field (CRF)-based named entity recognition (NER) model and a long short-term memory (LSTM)-based relation model. Model performance was evaluated on a separate validation set (20%) using precision, recall, and F1 scores, where TP, FP, and FN denote true positives, false positives, and false negatives: precision = TP / (TP + FP); recall = TP / (TP + FN); F1 score = 2 x precision x recall / (precision + recall). The process was repeated with additional annotations until the F1 score exceeded 0.8 (Supplement Figure 1). A pre-annotation method using the NSCLC pipeline was implemented for PCA, BCA, and MM, and specific eligibility criteria such as biomarkers and treatments were manually annotated for each cancer type: PCA (124 trials), BCA (73 trials), and MM (60 trials).
Table 1 (continued). Each row lists text variants separated by "|", followed by "->" and the normalized form.
last 3 months | past 3 months | within 3 months | within three months -> within 3 months
within 2 years | last 2 years | past 2 years -> within 2 years
within 3 years | last 3 years | past 3 years -> within 3 years
within 5 years | last 5 years | past 5 years -> within 5 years
10⁹/L | 10^9/L | 10³/ul | 10³/microliter | 1000/ul | 1000/microliter | K/microliter | 10³/mm³ -> 10³/ul
Other miscellaneous rules: case insensitive; remove spaces.

Transforming hypernyms to hyponyms with corresponding values: To formalize hypernyms identified in primary groups such as lab tests, comorbidity, prior therapy, and other medication, we employed the following approaches: (1) For "adequate organ function" lab test values, we determined prevalent lab values by analyzing the unique lab values for each test across the trials of the same cancer type that defined normal organ function. (2) For comorbidity, biomarker, prior therapy, and other medication hypernyms, we collected all example hyponyms described across the trials of the same cancer type.

Figure 1. Clinical trial eligibility criteria ontology. Primary entities are grouped inside the blue dotted box.

Figure 2. Distribution of attributes in the ten primary groups and the "other observation" group extracted from eligibility criteria of four different cancer types and two different autoimmune diseases. Y-axis: number of unique "EntityGroup-AttributeName-Value" combinations, X-axis (top): primary groups, X-axis (bottom): diseases. BCA: breast cancer, NSCLC: non-small cell lung cancer, PCA: prostate cancer, MM: multiple myeloma, UC: ulcerative colitis, CD: Crohn's disease.

Figure 4. The heat map graph illustrates the number of clinical trials with each example hyponym for the hypernym comorbidities. The Y-axis on the left represents the hyponym disease names, while the Y-axis on the right indicates the number of trials. The X-axis represents the comorbidity class. Note: The exception of "Atopy" is mentioned as an autoimmune disease. The group does not include exceptions of other malignancies such as in situ cervical cancer, noninvasive bladder cancer, curative basal or squamous in-situ prostate cancer, in-situ breast cancer, or resected skin cancer other than melanoma.

Figure 5. Screenshots from a prototype interface are shown. A-B) The selected criteria list and the corresponding number of patients. C) The distribution of patient numbers in each group. D) Displayed eligible patient numbers after sequentially incorporating criteria such as "non-squamous histology" and "stage III and IV," with the further inclusion of AST and ALT lab values either <=2.5xULN or <=1.0xULN. E) The influence of ECOG performance status as an additional criterion. Displayed eligible patient numbers by introducing either ECOG 0-2 or 0-1, with histology, stage, and ALT/AST lab values (<2.5xULN) as fixed criteria.

Table 1. Rules for attribute normalization. Each row lists text variants separated by "|", followed by "->" and the normalized form.
ANC | absolute neutrophil count | absolute neutrophil counts | neutrophil count | neutrophil counts | absolute neutrophil -> ANC
WBC | white blood cells | white blood cell | WBC count | white blood cell count | white blood count | leucocytes -> WBC
platelets | platelet | platelet count | platelet counts

Table 2 presents some examples of normalized concepts and their codes. The interface utilizes a rule-based algorithm to match patients' EHR data with the specified criteria. Users can specify different criteria and combinations, such as different lab test values with specific comorbidities like "no brain metastasis", to determine the number of qualified patients. The algorithm matches each patient's EHR data with the selected criteria and calculates the number of matched patients for each criterion. The performance of the interface was evaluated by comparing it to the manual patient selection process conducted by experienced clinical domain experts.
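The matching logic can be sketched as a chain of filters in which each criterion is a predicate over a patient record and the remaining cohort size is recorded after every step. The Python sketch below is a minimal illustration under stated assumptions and is not the R/shiny implementation: the record fields (`histology`, `stage`, `alt_xuln`), the function name, and the four toy patients are all invented for illustration and do not come from the study cohort. It applies criteria cumulatively; counting each criterion independently would be an equally plausible reading of the description.

```python
def count_eligible_sequentially(patients, criteria):
    """Apply criteria one after another and record the remaining cohort size.

    patients: list of dict records (hypothetical schema).
    criteria: list of (label, predicate) pairs, applied in order.
    Returns a list of (label, number_of_patients_still_eligible).
    """
    remaining = list(patients)
    counts = []
    for label, is_met in criteria:
        remaining = [patient for patient in remaining if is_met(patient)]
        counts.append((label, len(remaining)))
    return counts


# Toy records invented for illustration only; not study data.
TOY_COHORT = [
    {"histology": "non-squamous", "stage": "IV", "alt_xuln": 0.8},
    {"histology": "non-squamous", "stage": "II", "alt_xuln": 1.2},
    {"histology": "squamous", "stage": "III", "alt_xuln": 0.9},
    {"histology": "non-squamous", "stage": "III", "alt_xuln": 3.0},
]

# Criteria expressed as labelled predicates over the hypothetical schema.
TOY_CRITERIA = [
    ("non-squamous histology", lambda p: p["histology"] == "non-squamous"),
    ("stage III and IV", lambda p: p["stage"] in {"III", "IV"}),
    ("ALT <=2.5xULN", lambda p: p["alt_xuln"] <= 2.5),
]
```

Calling `count_eligible_sequentially(TOY_COHORT, TOY_CRITERIA)` yields one count per criterion, so the effect of tightening or relaxing a single threshold can be checked by swapping one predicate and rerunning.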

Table 2. Examples of normalized codes for each concept and normal range of each lab test.

Table 3. Performance scores of customized NLP pipeline for each entity in primary groups.

The unique "EntityGroup-AttributeName-Value" combinations varied across disease types, with 494 from 817 NSCLC trials, 471 from 649 PCA trials, 525 from 1,057 BCA trials, 389 from 447 MM trials, 231 from 160 UC trials, and 230 from 151 CD trials. Notably, UC and CD trials had a smaller number of unique "EntityGroup-AttributeName-Value" combinations compared to cancer trials, indicating the presence of more complicated eligibility criteria in cancer trials.

Table 4. The number of attributes for ten primary groups and examples.

Table 5. Examples of hypernym concepts (entity and subgroup entity in eligibility criteria) used in eligibility criteria and converted hyponyms with their corresponding values.