Generate Analysis-Ready Data for Real-world Evidence: Tutorial for Harnessing Electronic Health Records With Advanced Informatic Technologies

Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. 
In addition to the workflow proposed in our pipeline, we also develop a reporting guideline for RWE that covers the necessary information to facilitate transparent reporting and reproducibility of results. Moreover, our pipeline is highly data driven, enhancing study data with a rich variety of publicly available information and knowledge sources. We also showcase our pipeline and provide guidance on the deployment of relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer. We also draw on existing literature on EHR emulation of RCTs together with our own studies with the Mass General Brigham EHR.

Nonspecific terms, such as "disease" and "acute", as well as any terms with 3 letters or fewer, were excluded from the dictionary.
1. Only features whose descriptions achieved a matching score of at least 0.4 were kept.
2. Matching to groupings (PheCode and CCS) was preferred over matching to base codes (ICD-9, ICD-10, and CPT). If a concept was matched to a grouping and some other base codes, we kept the grouping along with any base code with a strictly higher score than the grouping.
3. Matching from parent terms was preferred over matching from child terms. For example, we may recognize the parent term "acute appendicitis" (C0085693) along with the child term "appendicitis" (C4553526). If matching of the parent term is successful, none of its child terms are considered.
4. For concepts without a direct mapping, we used the KESER knowledge network to find codes with the top cosine similarities [31]. A Shiny interface is available at https://dev.parsehealth.org/shiny/ARCH/.
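Rules 1 and 2 above can be sketched as a small filtering function. This is an illustrative sketch, not the authors' implementation; the candidate-match layout (dictionaries with `type` and `score` keys) is a hypothetical representation.

```python
# Sketch of mapping-filter rules 1-2 (hypothetical data layout:
# each candidate match is a dict with "type" and "score").
GROUPINGS = {"PheCode", "CCS"}           # code groupings, preferred
BASE_CODES = {"ICD-9", "ICD-10", "CPT"}  # base codes

def filter_matches(candidates, min_score=0.4):
    """Rule 1: drop matches scoring below `min_score`.
    Rule 2: if any grouping matched, keep the groupings plus only those
    base codes scoring strictly higher than the best-matched grouping."""
    kept = [c for c in candidates if c["score"] >= min_score]  # rule 1
    groupings = [c for c in kept if c["type"] in GROUPINGS]
    if not groupings:
        return kept
    best_grouping = max(c["score"] for c in groupings)
    return groupings + [
        c for c in kept
        if c["type"] in BASE_CODES and c["score"] > best_grouping
    ]
```

For example, a PheCode match at 0.6 suppresses an ICD-9 match at 0.5 but retains an ICD-10 match at 0.7.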
The final mapping is presented in Table S1. Compared with the expert-guided approach in the prior emulation of the COST Study Group Trial [28], the scalable approach recovered mappings of key eligibility criteria on colon cancer, concurrent cancer, transverse colon cancer, rectal cancer, Crohn disease, bowel obstruction, metastasis, familial polyposis, and perforated colon to structured codes. In addition, the scalable approach suggested a list of CUIs as potential alternatives to structured data. The identification of treatment procedures and tumor locations, however, requires further curation in Modules 2 and 3.

Construct the disease cohort. Many phenotyping algorithms require silver-standard labels, often the total counts of associated PheCodes or CUIs. Some also require a feature that serves as a proxy for healthcare utilization to account for heterogeneity in the dataset. A set of gold-standard labels should be generated to validate the performance of the phenotyping algorithm. Additional gold-standard labels are needed for training supervised or semisupervised phenotyping methods. Our example is based on MAP [28,36].
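The counting conventions recommended in the steps below (distinct days with the disease PheCode, raw counts of disease CUI mentions, and distinct days with any ICD code as the utilization proxy) might be computed as follows. This is a minimal sketch assuming a hypothetical long-format table of patient codes; column names and code prefixes are illustrative.

```python
import pandas as pd

def silver_labels(codes: pd.DataFrame, phecode: str, cui: str) -> pd.DataFrame:
    """Silver-standard labels and utilization feature per patient.
    `codes` is a hypothetical long-format table with columns
    patient_id, date, code. Convention: distinct days for structured
    codes, raw mention counts for note CUIs."""
    is_phe = codes["code"] == phecode
    is_icd = codes["code"].str.startswith("ICD")  # any ICD code (illustrative prefix)
    is_cui = codes["code"] == cui
    out = pd.DataFrame({
        # total distinct days with the disease PheCode
        "phecode_days": codes[is_phe].groupby("patient_id")["date"].nunique(),
        # total NLP mentions of the disease CUI in notes
        "cui_count": codes[is_cui].groupby("patient_id").size(),
        # utilization proxy: distinct days with any ICD code
        "util_days": codes[is_icd].groupby("patient_id")["date"].nunique(),
    }).fillna(0).astype(int)
    return out
```

Counting distinct days rather than raw code counts implements the rationale given in step 1: repeated same-day codes mostly reflect administrative patterns, not additional clinical evidence.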
1. Extract the silver-standard labels and the healthcare utilization feature for patients in the data mart. We recommend using the total number of days with the disease PheCode and the total number of disease CUI mentions as silver-standard labels and the total number of days with any ICD code as the healthcare utilization feature. Our recommendation stems from the observation that multiple codes in one day often merely reflect the administration pattern (fewer for integrated providers and more for segmented providers), whereas multiple mentions of the disease in medical notes usually indicate a higher likelihood of disease onset. Manual chart review to obtain the gold-standard labels for a random subset (eg, 59 patients as in Module 4) should also be done in parallel.
2. Apply an unsupervised phenotyping method (eg, MAP) and validate its performance with the gold-standard labels. If the numeric prediction is reasonable (area under the receiver operating characteristic curve [AUROC] >0.9), choose the threshold attaining 0.95 specificity and construct the disease cohort from patients whose numeric prediction exceeds the threshold. Otherwise, go to the next step.

2. Patients with multiple surgical codes around the first colectomy CPT code (3 days before to 3 days after) were excluded, as they likely underwent more complex procedures.
3. Patients were required to have a recent radiological test (within 42 days of registration) and undergo a colectomy within 21 days, so we interpreted the requirement as the implicit eligibility criterion that a colectomy must be done within 90 days of the colorectal cancer diagnosis.
A further refinement of the treatment arms will be done in Module 3 with the curated data for eligibility criteria.
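The specificity-based threshold selection described in step 2 of the cohort construction might be sketched as below. This is an illustrative sketch, not the MAP implementation; it assumes higher phenotyping scores indicate cases and that scores are untied.

```python
import math
import numpy as np

def threshold_at_specificity(scores, labels, spec=0.95):
    """Smallest threshold t among the gold-standard controls such that
    at least a fraction `spec` of controls score at or below t; patients
    whose score exceeds t are then placed in the disease cohort.
    Sketch; assumes no tied scores."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    neg = np.sort(scores[labels == 0])            # gold-standard controls
    k = math.ceil(spec * len(neg) - 1e-9) - 1     # 1e-9 guards float error
    return neg[k]
```

With the threshold in hand, the cohort is simply the patients whose MAP score exceeds it, eg, `cohort = patient_ids[map_scores > t]`.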

While the indication for the COST Study Group Trial is relatively straightforward, indications for other studies may require a more delicate learning process. For example, the indication "first-line therapy for metastatic cancer" would be more challenging to identify because the indication allows prior therapy before metastasis as adjuvant therapy but excludes other therapies between metastasis and the therapies of interest. For such studies, temporal phenotyping may be required.

Among 100 labels on terminal status annotated by a clinically trained abstractor, we observed 9 missing death records, while the terminal score was highly predictive of the death/terminal status at the end of follow-up, with an area under the receiver operating characteristic curve (AUROC) of 0.954. We selected a threshold of 0.5 to approximate the death/terminal status rate in the 100 labeled patients, ie, the event status is set to 1 if the terminal score is larger than 0.5.
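The validation of the terminal score against abstractor labels amounts to computing an AUROC and then binarizing at the chosen threshold. A minimal sketch (rank-based AUROC, equivalent to the Wilcoxon statistic; assumes no tied scores):

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC of `scores` against binary `labels`
    (sketch; assumes no tied scores)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = (labels == 0).sum()
    # Mann-Whitney U statistic normalized to [0, 1]
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Binarize: event status 1 if the terminal score exceeds 0.5
event = (np.array([0.2, 0.7, 0.9]) > 0.5).astype(int)  # -> [0, 1, 1]
```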

Section S4. Details and examples for Module 4
Validation sample size calculator. We derive the validation size calculator under two assumptions: (1) patient data are independent across individuals, and (2) the validation sample is a relatively small subset of the large study cohort. Under this setting, the number of errors detected among a validation sample of size $n$ can be modeled by a binomial distribution with size equal to the validation sample size and probability equal to the overall error rate $p$. The probability of detecting at least one error is therefore
$$1 - (1 - p)^n = \beta,$$
where $\beta$ is the nominal error detection chance. Solving the equation for the validation size, we get the formula
$$n = \frac{\log(1 - \beta)}{\log(1 - p)}.$$
We obtain the lower bound for the validation size according to the following facts:
1. Fixing the error rate, increasing the validation size results in a higher detection chance. To ensure an integer validation size, we relax the equation to an inequality, which guarantees a higher-than-nominal error detection chance:
$$n \geq \left\lceil \frac{\log(1 - \beta)}{\log(1 - p)} \right\rceil.$$
2. The formula is monotone decreasing in the error rate. Since the true error rate is unknown in practice, we set an error tolerance $p_0$ for the calculation of the validation size, which guarantees a higher-than-nominal error detection chance whenever the true error rate is higher than the error tolerance ($p \geq p_0$):
$$n = \left\lceil \frac{\log(1 - \beta)}{\log(1 - p_0)} \right\rceil.$$
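Under the binomial model, requiring $1 - (1 - p)^n \geq \beta$ for all error rates $p$ at or above the tolerance $p_0$ yields $n = \lceil \log(1-\beta)/\log(1-p_0) \rceil$. A minimal sketch of the calculator:

```python
import math

def validation_size(error_tol: float, detect_prob: float = 0.95) -> int:
    """Smallest n with 1 - (1 - p)^n >= detect_prob for every error
    rate p >= error_tol, ie, the validation sample detects at least one
    error with at least the nominal chance whenever the true error rate
    exceeds the tolerance."""
    return math.ceil(math.log(1 - detect_prob) / math.log(1 - error_tol))
```

For example, a 5% error tolerance at a 95% detection chance gives n = 59, which matches the 59-patient validation subset mentioned in Module 2.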