Cerner real-world data (CRWD) - A de-identified multicenter electronic health records database

Cerner Real-World Data™ (CRWD) is a de-identified big data source of multicenter electronic health records. Cerner Corporation secured appropriate data use agreements and permissions from more than 100 health systems in the United States contributing to the database as of March 2022. A subset of the database was extracted to include data from only patients with SARS-CoV-2 infections and is referred to as the Cerner COVID-19 Dataset. The December 2021 version of CRWD consists of 100 million patients and 1.5 billion encounters across all care settings. There are 2.3 billion, 2.9 billion, 486 million, and 11.5 billion records in the condition, medication, procedure, and lab (laboratory test) tables, respectively. The 2021 Q3 COVID-19 Dataset consists of 130.1 million encounters from 3.8 million patients. The size and longitudinal nature of CRWD can be leveraged for advanced analytics and artificial intelligence in medical research across all specialties, and make it a rich source of novel discoveries on a wide range of conditions including but not limited to COVID-19.

Subject: Big Data Analytics
Specific subject area: Multicenter electronic health records database
Type of data: Electronic health records
How data were acquired: Data use agreements and permissions were obtained from individual health systems that are clients of Cerner across the United States. Data from each health system were combined and de-identified into a single database.
Data format: Parquet tables
Parameters for data collection: Electronic health records from each health system that fit into a Structured Query Language (SQL) tabular format, excluding most free-text entries, clinical notes, and images.
Description of data collection: To create CRWD, each contributor's HealtheIntent data (a copy of the EHR) is retrieved for processing and merged into a data warehouse, which is then processed to help reduce duplication of identifiers between contributors. After de-duplication, the data are de-identified at the individual patient level by removing fields that contain personally identifiable information (PII) and date-shifting all date/timestamp values. Unique identifiers masking the health systems were created, in addition to the corresponding U.S. census regions.
Data source location: Cerner Corporation, North Kansas City, MO, USA
Data accessibility: Readers may request access to Cerner Real-World Data by (1) licensing the database for a research project that is granted approval by the Cerner Learning Health Network Governance Council. (2) Access is also available to organizations that contribute data to CRWD. For inquiries about CRWD, including information on data use agreements, contact realworlddata@cerner.com; inquiries about the COVID-19 dataset can be sent to COVIDDataLab@cerner.com.

Value of the Data
• Cerner Real-World Data™ (CRWD) is designed to help users answer deep and complex research questions using data from multiple health systems and heterogeneous patient groups. It reduces the bias that homogeneous populations may introduce in single-center studies, and it provides larger sample sizes for rare disease studies. The larger sample sizes also make it feasible to model most conditions using machine learning.
• All researchers, including academic, health system, or life sciences investigators, can access CRWD if their healthcare organization is contributing de-identified data to the dataset or by contracting with Cerner through a Learning Health Network (LHN) for access to HealtheDataLab (a cloud-parallel distributed learning framework) to conduct an approved research project. Other interested researchers or organizations can apply for access to CRWD, pending approval.
• With this longitudinal database, researchers can analyze detailed sets of de-identified clinical data at the patient level and develop statistical and machine learning models that may be implementable in various healthcare settings.
The tables in CRWD include encounters, demographics, conditions, immunizations, medications, medication administrations, order lists, procedures, and results. The encounter table consists of pertinent information regarding a patient's episode of care (or encounter with the health system). It comprises data on the encounter class (inpatient, outpatient, emergency department, etc.), datetime fields (service date, hospitalization date, discharge date, etc.), insurance presented during the visit (Medicare, Medicaid, Commercial, etc.), and unique identifiers for the patient, the encounter, and the health system visited, among others. The demographics table includes data on birth sex, gender, date of birth, race, ethnicity, tenant (health system), and one-digit zip code for the tenant.
Information on patients' diagnoses is captured within the conditions table, including diagnosis rank, condition coding system identifiers (International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM), International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), Systematized Nomenclature of Medicine - Clinical Terms (SNOMED), etc.), and the class of condition (admitting, working, discharge, final, etc.). The labs, measurements, and clinical events tables keep track of clinical events data including vital signs, results of clinical assessments, and laboratory test results. Data on immunizations, medication orders, and procedures are captured within the corresponding tables and include the information required for research with the data.

Data Description
The 2021 Q3 COVID-19 Dataset is a subset of CRWD, and Table 1 identifies the eight core data tables included in the COVID-19 database and which CRWD tables were used to populate it. The December 2021 version of CRWD consists of data from 117 health systems across the United States. It includes data from 100 million patients and more than 1.5 billion encounters across all care settings. The geographical distribution of encounters is shown in Figure 2. An overview of the contents (table, item) and size (numbers of patients and encounters) of the database tables in CRWD is given in Table 2. Counts are calculated using distinct person IDs, which leverage a multipoint match algorithm to account for and remove duplicates within a single health system; patients who have visited multiple health systems may appear more than once in the data.
There is usually a 4-month lag between the release of CRWD and the release of the corresponding COVID-19 dataset. As a result, the version of the COVID-19 dataset available at the time of writing was the 2021 third-quarter version. The 2021 Q3 COVID-19 Dataset consists of 3.8 million patients and 130.1 million encounters from 110 health systems in the United States. Among these, 2.1 million patients had 2.7 million inpatient or emergency department encounters with infections with SARS-CoV-2 virus. Additional de-identification processes resulted in combining pediatric patients who are 17 years or younger into a single age group, which obscures the pediatric age distribution. Both the pediatric and adult age distributions of the COVID-19 dataset are shown in Fig. 3. A description of the contents (table, item) and size (numbers of patients and encounters) of the database tables in the COVID-19 dataset is shown in Table 3.

Table 1
Correspondence between CRWD and the COVID-19 database.
COVID-19 database table    Primary CRWD source table(s)
allergy                    allergy
allergy_reaction           allergy
clinical_event             clinical_event
condition                  condition
covid_labs                 lab
demographics               demographics
encounter                  encounter, demographics
immunization               immunization
lab                        lab
measurement                measurement
med_rec_compliance         medication, order_list
medication                 medication
procedure                  procedure

Experimental Design, Materials and Methods
The development of CRWD was initiated in 2019. As of December 2021, 117 health systems in the United States (U.S.) have formally agreed to contribute deidentified patient data to the database in exchange for benefits which include access to the entire multicenter database. The number of participating health systems is expected to grow.
Table 2
Description of contents of the December 2021 version of CRWD.

CRWD has its roots in HealtheIntent [1, 2, 3], a Cerner EHR-agnostic population health management platform that aggregates and standardizes all clinical data sources at each health system regardless of EHR vendor. To create CRWD, each contributor's HealtheIntent℠ data is transferred into the Cerner Amazon Web Services (AWS) environment for processing. Data from each contributor is merged into an aggregated data warehouse, which is then processed to help reduce duplication of identifiers (e.g., person IDs or encounter IDs) between contributors. After de-duplication, the data is de-identified at the individual patient level by removing fields that contain personally identifiable information (PII) (e.g., first name, last name, address, phone number, unstructured information) as well as date-shifting all date values. Date shifting is done at the patient level by assigning each patient a random date-shift value that is a multiple of ±7. Each patient's dates are then shifted by this value, which preserves the day of the week for all data captured on the same patient as well as the temporal relationship between clinical events. Data is also de-identified at the health system level by removing fields that contain health system-identifying information, such as the name or address of the contributing health system. Additional details on the U.S. policy on de-identification can be found on the U.S. Department of Health and Human Services website [4].
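
The patient-level date shifting described above can be illustrated with a minimal sketch. It is not Cerner's implementation: the function names, the 52-week bound on the shift, the exclusion of a zero shift, and the assumption that the shift is measured in days are all hypothetical choices made for illustration only.

```python
import random
from datetime import date, timedelta


def assign_shift_days(rng, max_weeks=52):
    """Pick a random shift that is a positive or negative multiple of 7 days.

    max_weeks is a hypothetical bound; the source does not state the range.
    """
    weeks = rng.randint(1, max_weeks)  # assumption: a zero shift is never used
    sign = rng.choice((-1, 1))
    return sign * weeks * 7


def shift_patient_dates(events, shifts):
    """Shift every (patient_id, date) pair by that patient's own offset."""
    return [(pid, d + timedelta(days=shifts[pid])) for pid, d in events]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
events = [
    ("p1", date(2021, 3, 1)),
    ("p1", date(2021, 3, 15)),
    ("p2", date(2021, 3, 1)),
]
# One offset per patient, reused for all of that patient's dates.
shifts = {pid: assign_shift_days(rng) for pid in sorted({pid for pid, _ in events})}
shifted = shift_patient_dates(events, shifts)
```

Because one offset is applied to all of a patient's dates, the interval between any two of that patient's events is unchanged, and because the offset is a whole number of weeks, the weekday of each event is unchanged as well.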
With this longitudinal database, researchers can analyze detailed sets of de-identified clinical data at the patient level. Researchers can access CRWD if their healthcare organization is contributing de-identified data to the dataset or by contracting with Cerner through a Learning Health Network (LHN) for access to conduct an approved research project. Currently, access is limited to U.S. researchers; however, CRWD may be available to select international researchers in the future. Interested researchers or organizations can apply for access to CRWD, pending approval of the proposed research by a data governance council within the LHN and appropriate data use agreements.
All researchers, including academic, health system, or life sciences investigators, who wish to gain access to CRWD must submit a standard data access proposal to Cerner via the LHN (which is described in detail in the next subsection). The proposal provides information about study objectives, study populations, data elements and outcomes of interest, methods, and ultimate use of the analysis. Data access proposals are blinded of all identifying information before review and decision on approval by a data governance council. The LHN governance council comprises representatives from many contributing health systems, academic researchers, and a privacy domain expert. This governance council reviews the data access proposals of all researchers regardless of LHN membership.
For all CRWD access, a data use agreement is required, which governs the rules of interacting with the data. The rules prohibit exporting or downloading the data from the secure cloud-based data ecosystem and attempting to re-identify any data. Analysis of CRWD is conducted in HealtheDataLab™ [1], the Cerner cloud-based data science ecosystem for analyzing data and developing predictive models. HealtheDataLab is built and deployed by AWS and is designed to help users answer deep and complex research questions using statistical and data-science oriented tools that query data, extract and transform datasets into research-ready formats, build complex models and algorithms, and validate findings. Available open-source tools in HealtheDataLab include Apache Spark™, Jupyter™, Python®, SparkR, and Spark SQL.
Researchers from non-contributing health systems, universities, or organizations can apply for access to CRWD to conduct approved research projects. Accessing and working with CRWD carries financial costs because of the size of the database and the state-of-the-art cloud computing resources required for hosting and managing it. These researchers (from non-contributing health systems) can gain access to CRWD via a contractual agreement with the LHN that covers both the need to protect the privacy of contributing health systems and the cost of data access and preprocessing using cloud computing.

The Learning Health Network (LHN)
The Cerner Learning Health Network℠ (LHN) is a collaboration of healthcare organizations that leverage EHR data for research and to improve clinical care. As of March 2022, 81 U.S. health systems in 39 states and the District of Columbia have joined the LHN, and together they comprise over 45,000 hospital beds and more than 2,222 facilities. Member organizations agree to allow Cerner to map their institutions' de-identified, site-anonymized patient data for use in approved research. In return, contributing organizations receive benefits including complimentary access to CRWD and HealtheDataLab, as well as opportunities to participate in a variety of federal and industry-sponsored research studies, share learnings and collaborate with other members in the network, and propose their own research ideas. Any healthcare organization, whether an academic, provider-focused, rural, or community-based health system, can opt in to join the LHN by signing a data network agreement to contribute its de-identified patient data to CRWD.
While data sharing is inherent to the LHN, a key focus is on operationalizing research tools to support clinical studies and create opportunities for members to participate in them. Study tools include patient recruitment, data capture and quality, chart review, patient adherence, and risk calculation. Designed to be a continual data quality improvement loop, the LHN pushes further cleansed data back into the network for members to leverage for research purposes.

Curating the CRWD COVID-19 De-identified Database
The COVID-19 pandemic presented an immediate need for data that could be used for research on risk factors, conditions, outcomes, and potential therapies. The Cerner COVID-19 database, which launched in April 2020, is a curated dataset of patients with possible SARS-CoV-2 infection created from de-identified data obtained from CRWD. To qualify a patient for inclusion, an encounter must have a service date of December 1, 2019 or later; an encounter type of emergency, inpatient, admitted for observation, inpatient hospice care, or urgent care; and either a diagnosis related to COVID-19 from the CRWD condition table or a positive result from a qualifying laboratory code (CRWD result table). In summary, patients qualified for the database with a COVID infection code or COVID exposure code, and patients with a negative COVID-19 test qualified for the cohort if the test was completed on one of the qualifying encounters.
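
As a rough illustration of the three encounter-level criteria stated above (service date, encounter type, and diagnosis or positive laboratory result), the check could be expressed as follows. This is a sketch under stated assumptions: the `Encounter` class, its field names, and the encounter-type labels are hypothetical and do not reflect the actual CRWD schema or code lists.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical labels standing in for the qualifying encounter types.
QUALIFYING_TYPES = {
    "emergency",
    "inpatient",
    "observation",
    "inpatient_hospice",
    "urgent_care",
}
EARLIEST_SERVICE_DATE = date(2019, 12, 1)


@dataclass
class Encounter:
    person_id: str
    service_date: date
    encounter_type: str
    has_covid_diagnosis: bool = False     # COVID-19-related code in the condition table
    has_positive_covid_lab: bool = False  # positive result on a qualifying lab code


def is_qualifying(enc):
    """Apply the date, encounter-type, and diagnosis-or-lab criteria."""
    return (
        enc.service_date >= EARLIEST_SERVICE_DATE
        and enc.encounter_type in QUALIFYING_TYPES
        and (enc.has_covid_diagnosis or enc.has_positive_covid_lab)
    )


encounters = [
    Encounter("p1", date(2020, 4, 1), "inpatient", has_covid_diagnosis=True),
    Encounter("p2", date(2019, 11, 30), "inpatient", has_covid_diagnosis=True),
    Encounter("p3", date(2020, 4, 1), "outpatient", has_positive_covid_lab=True),
    Encounter("p4", date(2020, 4, 1), "urgent_care"),
]
qualifying_person_ids = {e.person_id for e in encounters if is_qualifying(e)}
```

In this toy data only the first encounter meets all three criteria: the second is too early, the third has a non-qualifying encounter type, and the fourth has neither a diagnosis nor a positive laboratory result.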
Although there is a direct correspondence between data tables across the two databases, the structure of the COVID-19 database has been simplified to facilitate more efficient analysis and help reduce the burden on end users. For example, some CRWD tables include data nested within complex data structures, such as structs and arrays, that are not intuitive for new users of Apache Spark. In preparing the COVID-19 data model, the nested data structures were re-engineered to accommodate users who are accustomed to working with flat files and basic Structured Query Language (SQL) tabular data. The steps taken to derive the COVID-19 database include the following:
1. Identify all encounters that would qualify a patient for inclusion in the database using confirmed laboratory findings for SARS-CoV-2 and diagnosis codes corresponding to COVID-19.
2. Once all encounters that would qualify a patient have been identified, create a table that lists the unique person ID values represented. This table is the cohort and serves as a list that can be joined to the CRWD tables to extract the full range of data to be included in the COVID-19 database.
3. For each person in the cohort, obtain data from all CRWD encounters, conditions, and medications having a service date on or after 1/1/2015.
4. Calculate derived variables (e.g., age at time of encounter), transform the data to fit the COVID-19 data model, and add metadata. Examples of metadata elements include binary indicators to help identify qualifying encounters and various record counts, which are intended to offer insight into the availability of data for each patient and assist with the selection of study cohorts.
5. Standardize qualitative lab results. Variability in the way health systems report qualitative lab results has led to the need for some standardization, which is reflected in the "covid_labs" table.
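
The cohort-building, extraction, and derived-variable steps described in this section can be sketched on toy records as follows. This is an illustrative outline only: the dictionary keys, helper names, and sample values are hypothetical, and the production pipeline runs on Apache Spark against the CRWD tables rather than on in-memory Python lists.

```python
from datetime import date

LOOKBACK_START = date(2015, 1, 1)  # service-date cutoff stated in the text


def build_cohort(qualifying_encounters):
    """The cohort is the set of unique person IDs on qualifying encounters."""
    return {enc["person_id"] for enc in qualifying_encounters}


def age_in_years(birth_date, on_date):
    """Completed years between birth_date and on_date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def extract_for_cohort(cohort, encounters, birth_dates, qualifying_ids):
    """Join encounters to the cohort, apply the cutoff, add derived fields."""
    rows = []
    for enc in encounters:
        if enc["person_id"] in cohort and enc["service_date"] >= LOOKBACK_START:
            rows.append({
                **enc,
                "age_at_encounter": age_in_years(
                    birth_dates[enc["person_id"]], enc["service_date"]
                ),
                # Binary indicator flagging the encounters that qualified the patient.
                "is_qualifying": enc["encounter_id"] in qualifying_ids,
            })
    return rows


encounters = [
    {"encounter_id": "e1", "person_id": "p1", "service_date": date(2014, 6, 1)},
    {"encounter_id": "e2", "person_id": "p1", "service_date": date(2020, 4, 10)},
    {"encounter_id": "e3", "person_id": "p2", "service_date": date(2020, 5, 1)},
]
birth_dates = {"p1": date(1980, 4, 11), "p2": date(2000, 1, 1)}
qualifying = [encounters[1]]  # assume e2 was identified as a qualifying encounter

cohort = build_cohort(qualifying)
rows = extract_for_cohort(cohort, encounters, birth_dates, {"e2"})
```

Here `rows` holds a single record for encounter e2: e1 falls before the 2015 cutoff and e3 belongs to a person outside the cohort.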

Ethics Statement
The EHR data were collected as part of routine treatment and were not originally collected for research. The data are fully de-identified and therefore do not require patient consent. Furthermore, there are more than 100 million patients in the database, making consent impractical. Research has been carried out in accordance with The Code of Ethics of the World Medical Association (Declaration of Helsinki). Conducting research with de-identified data does not meet the regulatory criteria for research involving human subjects; therefore, human subjects research regulations regarding informed consent do not apply. All HIPAA guidelines were followed.

CRediT Author Statements
Louis Ehwerhemuepha, Kimberly Carlson, Ryan Moog, Ben Bondurant, Cheryl Akridge, Tatiana Moreno, Gary Gasperino, and William Feaster: All authors contributed to the design, acquisition of data, interpretation of data/results, drafting and revising the article critically for important intellectual content and approved the manuscript. Louis Ehwerhemuepha conceived of the study and led it.