Organizational Breast Cancer Data Mart: A Solution for Assessing Outcomes of Imaging and Treatment

PURPOSE In the United States, a comprehensive national breast cancer registry (CR) does not exist. Thus, care and coverage decisions are based on data from population subsets, other countries, or models. We report a prototype real-world research data mart to assess mortality, morbidity, and costs for breast cancer diagnosis and treatment. METHODS With institutional review board approval and Health Insurance Portability and Accountability Act (HIPPA) compliance, a multidisciplinary clinical and research data warehouse (RDW) expert group curated demographic, risk, imaging, pathology, treatment, and outcome data from the electronic health records (EHR), radiology (RIS), and CR for patients having breast imaging and/or a diagnosis of breast cancer in our institution from January 1, 2004, to December 31, 2020. Domains were defined by prebuilt views to extract data denormalized according to requirements from the existing RDW using an export, transform, load pattern. Data dictionaries were included. Structured query language was used for data cleaning. RESULTS Five-hundred eighty-nine elements (EHR 311, RIS 211, and CR 67) were mapped to 27 domains; all, except one containing CR elements, had cancer and noncancer cohort views, resulting in a total of 53 views (average 12 elements/view; range, 4-67). EHR and RIS queries returned 497,218 patients with 2,967,364 imaging examinations and associated visit details. Cancer biology, treatment, and outcome details for 15,619 breast cancer cases were imported from the CR of our primary breast care facility for this prototype mart. CONCLUSION Institutional real-world data marts enable comprehensive understanding of care outcomes within an organization. As clinical data sources become increasingly structured, such marts may be an important source for future interinstitution analysis and potentially an opportunity to create robust real-world results that could be used to support evidence-based national policy and care decisions for breast cancer.


INTRODUCTION
Breast cancer is the most common cancer diagnosed in women and the leading cause of cancer death in women worldwide, 1 with estimated global macroeconomic cost of $2 trillion international dollars for 2020-2050, using 2017 prices. 2 In the United States, breast cancer treatment accounts for 14% of all cancer costs with 2020 annual expenditure of $29.8 billion.The highest care cost occurs in the final year of life, estimated to be $76,100 per patient. 3Despite the magnitude of expenditures on diagnosis and treatment of this disease, an estimated 43,000 women will die in 2023 in the United States from it. 4Thus, identifying real-world optimal diagnosis and care strategies is critical to reduce individual and societal impacts.Such determinations are difficult because of lack of real-world data comparing different approaches for detection, diagnosis, and treatment.For a comprehensive analysis of breast cancer care in the United States, large data sets are needed, and, as outcomes are dependent on regional variances that include delivery of care and population differences, data sets from multiple institutions and regions are required to understand the status nationally.
In several countries, population-based national registries collect detailed screening, treatment, and outcome data.This enables direct understanding of benefits and costs of screening and treatments in those populations.However, in the United States, screening is voluntary, and no such comprehensive registry exists.Federal law requires states maintain a cancer registry (CR).The Centers for Disease Control (CDC) oversees the National Program of Cancer Registries (NPCR), which encompasses approximately 97% of the population.The National Cancer Institute SEER program includes approximately 48% of the population.Both registries collect detailed information regarding stage at diagnosis, treatments, and outcomes, as does the American Surgical Society's Commission on Cancer (CoC) registry.None collect imaging information.The American College of Radiology's registry, the National Mammography Database, collects detailed imaging history but only has cursory data on stage at diagnosis and no outcome data.
The Breast Cancer Surveillance Consortium (BCSC) includes six active SEER registries collecting screening information and SEER data so is an important comprehensive database.The US Preventative Services Task Force guidelines, updated in 2009 5 and 2016, 6 have been based in part on BCSC data with estimations for population-level outcomes using NCI's Cancer Information and Modeling Network (CISNET).Critics have raised concern that because BCSC registries are in regions with relatively poor care delivery, the data do not reliably predict outcomes of screening, but instead reflects a fractured health care delivery system.Merging of SEER, NPCR, NMD, and BCSC is possible but challenging because of significant privacy concerns regarding data sharing, costs in creating and maintaining a new database, and other factors.
Institutions have invested significantly in electronic storage of medical information in the past several decades.Most have multiple products intended to collect portions of information such as radiology (RIS), pathology (PD), and medical oncology (MD) information systems, electronic health records (EHRs), and CRs.Although each serves a particular purpose for patient care, historically these have been siloed.Recently, institutions have begun to realize the powerful clinical care, quality improvement, and clinical research benefits that internal data warehouses can provide.Large-scale initiatives such as from the Office of the National Coordinator for Health Information Technology (ONC) also are moving health care vendors and organizations toward ever increasing standardization to facilitate information exchange for optimal patient care.This initiative and others will ultimately facilitate easier sharing across institutional data marts for other purposes such as research.
Herein, we present the process involved in creation of a research institutional breast data mart.The mart's purpose is to study innumerable questions related to breast health and care delivery and to be a template for other organizations.

Mart Construction
Institutional review board approval was obtained to create and study a Health Insurance Portability and Accountability Act (HIPPA)-compliant research breast data mart within our preexisting research data warehouse (RDW), Neptune.The RDW was the resource for cohort identification and data extraction.It contains atomic layer data stored close to source structures, containing only transformations for deidentification requirements, and extracted from our EHR systems, health insurance claims, and research data.Data are extracted monthly from the source structures for transformation and loading into the RDW, the construction of which has been previously published. 7This design allows the same high level of granularity as the source systems, thereby creating a gold standard that would be lost if the data were manipulated or rolled up in any way.In cases where domain data are pulled from multiple source systems, the domain data are stored in a table specific to that domain and source system, thereby permitting researchers to select the data from the preferred source system.As new elements are adopted by source structures and added to RDW, the mart will be updated as well.Thus, the mart represents an evolving clinical database that will stay current with clinical systems.
A working group of radiology, pathology, surgical, medical, and radiation oncology physicians, the chief informatics research officer, computer scientists, and the cancer registrar itemized a comprehensive list of data elements and their source information system(s) related to patient demographics, imaging, diagnosis, treatments, costs, and outcomes.
At outset, the RDW did not contain RIS (ImageCast, General Electric, Waukesha, WI) and CR (METRIQ, Eleckta Inc, Stockholm, Sweden) content.Appropriate leaders were interviewed to understand their databases and data transfer agreements established.For the CR, all persons with an International Classification of Diseases (ICD)-9 or ICD-10 code indicating a breast cancer diagnosis (IDC-9: 174.0-174.9,198.81,233 and ICD-10: C50.011-50.919,C79.81, D05.90, D05.91, D05.92) who underwent initial therapy at our primary breast care facility (Magee Womens Hospital of University of Pittsburgh Medical Center) were included for initial trial mart construction.North American Association of Central Cancer Registries (NAACCR)-required elements and several site-specific elements were included (Appendix Table A1).All RIS elements in the source structure were incorporated.Data dictionaries were collected for both the CR and RIS.PD (CoPath Plus, Sunquest, Tucson, AZ) and MD (Aria, Varian Inc, Palo Alto, CA) information were in the RDW before this effort; however, the data exist as free text.For example, synoptic surgical pathology reports that contained detailed information on every cancer had not been discretely identified nor evaluated for potential mining previously.To improve the RDW search for these elements, interviews with the pathology informatics and physician leaders clarified the report constructs.Then, an RDW analysis identified these reports, determined what structured data were available, and what data existed as free text.

Data Collection
In an iterative process, every possible source location for each element was identified and evaluated.When possible, the primary element source was the one most frequently completed, robust, and extractable.As examples, EHR and RIS included family history.We compared element fill rates and specific relationship information.The EHR was selected because of better granularity (eg, EHR had paternal uncle, whereas RIS only contained uncle).In another example, cancer immunohistochemistry elements (ie, estrogen receptor, progesterone receptor and Her-2-neu) were in the PD and CR.The CR was selected as primary as it contained discrete site-specific data elements, whereas the PD contained free text only in the synoptic surgical pathology reports.To mine free text, an algorithm such as natural language processing (NLP) is necessary.All available sources for every element were included in the mart regardless of which was denoted as primary, then elements were sorted and labeled into related groups.

Mart Organization
The RDW was queried for all patients with at least one breast imaging examination (on the basis of RIS examination codes) from January 1, 2004, through December 31, 2020.Each patient was assigned a unique research identifier.Oracle (Oracle, Inc, Austin, TX) was used for the mart construction.Data domains were defined by prebuilt views of grouped elements to extract data denormalized according to RDW requirements.The views were used to add data using an export, transform, load pattern from the RDW to the mart.
The mart was structured into two cohorts so that domains (except the CR) contained two views.Cohort" view included everyone with an ICD-9 or ICD-10 diagnosis code for breast cancer; control view included everyone with a RIS examination code specific for a breast imaging examination and without these ICD-9/-10 codes.Each Oracle view, imported element, source structure name, and comments from the RDW building team (if needed) were cataloged in the mart for researcher reference.Figure 1 depicts the data flow.Appendix Table A1 lists all mart data elements and views.

Initial Mart Quality Analysis
Structured query language searches sorted the number of examinations by year and compared results against the historic number of examinations performed in the organization to filter and consolidate duplicate records.Valid mammogram dates (ie, January 1, 2004, to January 1, 2021) linked an examination to a CR entry to determine screening interval.Examinations with invalid or unknown dates were removed.Remaining examinations were then sorted by patient research ID.Screening examinations were identified by screening indication examination codes or examinations more than 260 days from the most recent previous examination date.Screen interval of a minimum of 260 days was selected to avoid inclusion of examinations that may have been performed for 6-month short-interval follow-up from the last screen.This generated an imaging history for each patient.Nonduplicates were concatenated, separated with a semicolon then joined to the CR view using patient ID and examination date as links with the CR element date of first contact.Results were pivoted to arrange all examinations for each patient in chronological order, with the associated columns for each examination.

Data Collection, Data Elements, and Domain Mapping
RDW query identified 497,218 unique patients with at least one breast imaging examination and a total of 2,967,364 breast imaging examinations.Electronic medical record query identified 39,860 patients with a diagnosis of breast cancer, based on ICD-9/-10 codes (Table 1).CR query revealed 15,619 patients who received their initial breast cancer treatment at our primary breast care facility.
The working group defined 331 unique data elements and mapped them to 27 data domains.Because each domain, except the CR, included control and cohort views, the mart contains 589 total elements mapped to 53 Oracle views with an average of 12 elements per view (range, 4-67; Appendix Table A1).Source system contributions included EHR: 17 domains, 34 views, and 311 elements; RIS: nine domains, 18 views, and 211 elements; and CR: one domain, one view, and 67 elements.

Data Curation
Data were curated by consolidating records for a given element that appeared in multiple sources and/or data were missing.For example, the RIS and EHR had menopausal status.The RIS element was the most complete, thus used when available, but when empty, we implemented a classification strategy to establish menopausal status at the time of each imaging examination.The order of analysis was used as a hierarchical decision tool with the first considered most robust and the 6 th the least robust.When elements conflicted, related fields were searched to attempt to establish truth.Occasionally, menopausal status was recorded as postmenopausal then later as premenopausal.In this situation, the related RIS element last menstrual period was reviewed and if blank, then EHR drug lists were used to determine if perhaps the patient was on birth control (implying premenopausal status) or prescribed drugs used for cancer treatment in some postmenopausal women (eg, aromatase inhibitors).Thus, all records except 39,123/2,967,364 (1.3%) were assigned premenopausal or postmenopausal with the residual labeled as perimenopausal.Table 2 lists steps and results.

DISCUSSION
This effort builds on earlier proof-of-concept work in which we matched RIS and CR records of 1,316 patients with breast cancer. 8Although CDC, SEER, BCSC, CoC, and insurance databases such as Optimum each maintain some elements of the patient record, none contain the detail and breadth described herein.By linking patient information from the EHR, CR, RIS, PD, and MD, we created a breast cancer data mart that exists within our organizational RDW.This robust mart has advantages over other existing marts because information on breast imaging procedures and diagnoses are coupled with patient clinical data, treatments, and outcomes.
A strength of real-world data is that they include the source systems with every detail of the medical record.Therefore, evolving treatments, imaging modalities, etc. can be captured at the patient and encounter level to observe outcomes effects and answer important clinical questions.For example, racial disparities in outcomes can be examined at a more granular level and risk assessment can be modeled using a variety of data.In radiology, questions regarding the frequency and modality of screening and the ideal age to start and stop screening can be examined in terms of patient outcomes and economic cost-benefits.In pathology, rare breast cancer subtypes can be extracted for study of the clinical-pathologic features and outcomes to better define these entities.One could also extract granular pathology information including semiquantitative receptor data and automatically compute multivariable models such as Magee Equations to assess clinical outcomes.In breast surgical oncology, the effect of aggressive surgical intervention versus de-escalation could be carefully studied.For example, identification of populations for which surgery or axillary staging could be safely omitted might be possible, including in those receiving neoadjuvant therapies.These can be a cost-effective way of providing outstanding patient care, which can be of interest to many integrated health systems.
Although significant standard structure exists in clinical data sources on the basis of existing standards (eg, HL-7, NAACCR) and federal initiatives (eg, ONC), interpretation of standards, implementation of optional components, robust methodology to establish truth when conflicting source data are present, and a vast amount of information existing as free text all pose future challenges to navigate.For example, because our institution is within an NPCR state, the CR collects all NAACCR-required elements but not all additional SEER-required elements.For example, SEER but not NPCR requires the NAACCR element PR summary.The information is in our pathology data source as free text, thus can be identified using word search or NLP and extracted into the mart.
There are three published reports of breast cancer marts.
Nelson and Weerasinghe 9 collected data from 2008 to 2011, accumulating 250,968 mammograms for a quality improvement project.Two more recent efforts were created to facilitate research using artificial intelligence.GENERATOR supports breast cancer pathways of care at Gemelli University Hospital in Rome, Italy. 10 This mart does not include RIS information, so it is not possible to determine the effects of imaging history or method of cancer detection on patient outcome.The Diagnosis Data Archive at Salah Azaiez University Hospital in Tunisia does contain mode of detection. 11he status of that effort is unclear as no detailed information, such as the number of cases, is reported yet.
Future expansion to include all images, CR from every facility, and financial and genetics information is planned.
Given the amount of unstructured clinical data (eg, provider notes, procedural notes, and pathology reports) available across source systems, future work will include incorporating more unstructured data into the mart.This could be accomplished using a tool, such as MetaMap, 12 to recognize clinical concepts in unstructured data and map these concepts to the UMLS Metathesaurus. 13Such an approach will allow for the codification of unstructured data, potentially expanding the scope and amount of data in the mart and aiding in the data curation.The size of our mart facilitates use of artificial intelligence, in particular deep learning, to discover new knowledge such as identifying imaging biomarkers to predict tumor response to different treatments.
We experienced several unexpected challenges.Our RIS is based on an older software program, which is harder to mine.We overcame this through a series of meetings between the RDW and RIS leaders to understand data structure details and how to best integrate them.The CR is a network of facility-level registries that migrated to the METRIQ platform at various times over the past several decades.We chose to focus, in this trial analysis, on our primary breast cancer facility as it was comprehensive of the date range and a test for incorporation of the CR more comprehensively into the warehouse.This restriction reduced available cancers for initial analysis to 39% total in the organization.
A limitation of this work is that it comes from a single health care enterprise.Our patient population does not necessarily reflect the distribution of patients across the country in terms of ethnicity, race, and social determinates of health.Nevertheless, our institution is a mixture of academic and community facilities in rural, suburban, and urban settings.It is our hope that our mart will inform other health care enterprises to develop their own mart, and as such, consensus might be achieved on elements for inclusion, thereby facilitating a collection of data marts, which, in turn, may support existing established and burgeoning national endeavors such as NCI's SEER and BOLD, so that real-world data can be studied to inform clinical decisions and national standards.This effort represents only initial internal steps.Much additional work and collaboration is needed within and across institutions and organizations to accomplish such lofty goals.
To understand real-world benefits, efficiencies and harms of breast cancer screening and treatment in the United States, patient-level linkage of demographic data, imaging history, and results with treatment, cost, and outcomes is needed.
Institutions have this information disparately located in many databases.We have demonstrated the creation of a curated data set from disparately located clinical sources into a mart is possible, and we have given a detailed description that should enable any institution to replicate our data mart using their own electronic medical records.Institutional marts may play an important role eventually in understanding real-world outcomes for breast care.

FIG 1 .
FIG 1. Data flow diagram.Neptune is the name of our research data warehouse.EMR, electronic medical record; RDW, research data warehouse; SQL, structured query language.

TABLE 1 .
Demographics of Data Mart Population

TABLE 2 .
Menopausal Status a Hormonal treatments and menstrual data.b Premenopausal if tamoxifen or raloxifene in medication list at the time of examination.Postmenopausal if aromatase inhibitors in medication list at the time of examination.

TABLE A1 .
Cancer Registry Data Elements

TABLE A1 .
Cancer Registry Data Elements (continued)

TABLE A1 .
Cancer Registry Data Elements (continued)

TABLE A1 .
Cancer Registry Data Elements (continued)

TABLE A1 .
Cancer Registry Data Elements (continued)

TABLE A1 .
Cancer Registry Data Elements (continued)

TABLE A1 .
Cancer Registry Data Elements (continued) © 2024 by American Society of Clinical Oncology Zuley et al