Development and Application of an Open Tool for Sharing and Analyzing Integrated Clinical and Environmental Exposures Data: Asthma Use Case

Background: The Integrated Clinical and Environmental Exposures Service (ICEES) serves as an open-source, disease-agnostic, regulatory-compliant framework and approach for openly exposing and exploring clinical data that have been integrated at the patient level with a variety of environmental exposures data. ICEES is equipped with tools to support basic statistical exploration of the integrated data in a completely open manner.
Objective: This study aims to further develop and apply ICEES as a novel tool for openly exposing and exploring integrated clinical and environmental data. We focus on an asthma use case.
Methods: We queried the ICEES open application programming interface (OpenAPI) using a functionality that supports chi-square tests between feature variables and a primary outcome measure, with a Bonferroni correction for multiple comparisons (α=.001). We focused on 2 primary outcomes that are indicative of asthma exacerbations: annual emergency department (ED) or inpatient visits for respiratory issues; and annual prescriptions for prednisone.
Results: Of the 157,410 patients within the asthma cohort, 26,332 (16.73%) had 1 or more annual ED or inpatient visits for respiratory issues, and 17,056 (10.84%) had 1 or more annual prescriptions for prednisone. We found that close proximity to a major roadway or highway, exposure to high levels of particulate matter ≤2.5 μm (PM2.5) or ozone, female sex, Caucasian race, low residential density, lack of health insurance, and low household income were significantly associated with asthma exacerbations (P<.001). Asthma exacerbations did not vary by rural versus urban residence. Moreover, the results were largely consistent across outcome measures.
Conclusions: Our results demonstrate that the open-source ICEES can be used to replicate and extend published findings on factors that influence asthma exacerbations. As a disease-agnostic, open-source approach for integrating, exposing, and exploring patient-level clinical and environmental exposures data, we believe that ICEES will have broad adoption by other institutions and application in environmental health and other biomedical fields.


Introduction
Several large-scale initiatives are advancing efforts to reduce barriers surrounding access to patient data maintained in electronic health record (EHR) systems. Relevant initiatives include Columbia Open Health Data [1] and Medical Information Mart for Intensive Care [2]. The common goal is to promote open access to and sharing of patient data for research purposes, while respecting and preserving patient privacy and institutional assurances.
As part of the Biomedical Data Translator program (Translator) [3,4], supported by the National Center for Advancing Translational Sciences, we have developed a disease-agnostic, regulatory-compliant framework and approach for openly exposing and exploring patient data: the Integrated Clinical and Environmental Exposures Service (ICEES) [5]. ICEES was designed to overcome the regulatory, cultural, and technical challenges that hinder efforts to openly share and explore patient data [6,7]. ICEES differs from similar efforts toward open patient data in that the service provides access to clinical data that have been integrated at the patient level with environmental exposures data derived from a variety of public sources. Thus, ICEES allows for patient-level research in environmental health and related fields.
Herein, we describe the further development and application of ICEES to data on a large cohort of patients with a diagnosis of asthma or a related condition. We examine the impact of select airborne pollutant exposures, demographic factors, and socioeconomic exposures on asthma exacerbations, which we define using 2 primary outcome measures: annual emergency department (ED) or inpatient visits for respiratory issues and annual prescriptions for prednisone. We present our findings and compare results for the 2 outcome measures.

Study Approval
All study procedures were approved by the Institutional Review Board at the University of North Carolina at Chapel Hill (protocol #16-2978). Informed consent was not required as the study involved existing biomedical data only and patient contact was not involved.

ICEES
ICEES was designed as a disease-agnostic, regulatory-compliant, open platform. For the work described here, we focused on 157,410 patients with asthma or a related pulmonary condition at UNC Health (all available sites). The specific criteria used to select patients for inclusion in the ICEES asthma cohort were adapted from [8] and included a combination of diagnoses, medications, and laboratory measures. (Details can be found in [5].) Briefly, we captured data on (1) patients with a diagnostic code for "asthma" and prescribed or administered medications that are typically used to treat asthma; (2) patients with a diagnostic code for a respiratory condition other than asthma and prescribed or administered medications that are typically used to treat asthma; (3) patients with a diagnostic code for a pulmonary condition other than asthma but prescribed tests or procedures that are typically used to manage asthma; and (4) patients with a diagnostic code for a respiratory condition other than asthma but with frequent ED visits in which albuterol nebulizer treatments were administered.

ICEES Integrated Feature Tables
"ICEES integrated feature tables" are key to the open design of ICEES. These tables were created using a complex custom software pipeline within a secure environment and under a protocol (#16-2978) approved by the Institutional Review Board at the University of North Carolina at Chapel Hill. For data extraction, Clinical Asset Mapping Program for Health Level 7 Fast Healthcare Interoperability Resource (CAMP FHIR) converted patient data from the PCORnet common data model to FHIR files [9]. FHIR Patient data Integration Tool (FHIR PIT) then ingested the FHIR files and integrated the patient data with multiple sources of environmental exposures data, using patient geocodes as reported in the EHR and dates [10]. The exposures data were derived from public sources and included airborne pollutant exposures data from the United States (US) Environmental Protection Agency Fused Air Quality Surface Using Downscaling repository; major roadway or highway exposures data (a proxy for airborne pollutant exposures) from the US Department of Transportation; and socioeconomic exposures data from the US Census Bureau American Community Survey. (Additional information on the sources of environmental exposures data can be found in [11].) After the data were integrated, the resultant ICEES integrated feature tables were stripped of identifiers per the Safe Harbor method outlined in the Health Insurance Portability and Accountability Act (HIPAA) before being exposed with an open application programming interface (OpenAPI).
ICEES integrated feature tables were created with respect to 1-year "study" periods, that is, calendar years, to provide a reference point for date-based calculations such as age and estimated exposure. Rows contained binned or recoded data on individual patients, with column headers representing data fields for each of the integrated feature variables. Of note, our institution classifies exposure estimates as "secondary protected health information" because the estimates are derived using primary protected health information (PHI; namely, geocodes and dates) to account for the fact that exposure estimates vary across space and time. We addressed this concern by binning all exposure estimates.
The binning strategy that was applied to each feature variable was based on a combination of expert opinion, published literature, and mathematical approaches. Age on day 1 of the 1-year study period was binned using our prior approach [5,12,13]: <5, 5-17, 18-44, 45-64, and 65-89 years (89 years being the oldest permissible age per HIPAA). Sex was treated as male or female as coded in the EHR. Multiple race categories were available in ICEES; we focused on Caucasian and African American, as each of the other categories encompassed ≤1% of the total patients. Rural versus urban residence was examined using the US Census Bureau classifications based on American Community Survey-estimated residential density: rural area (<2500 persons per Census block group); urban cluster (between 2500 and 50,000 persons per Census block group); and urbanized area (>50,000 persons per Census block group). Estimated probability of no health insurance and estimated median household income were binned using the pandas.qcut function, which bins values by frequency (ie, into quantile-based bins).
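The fixed-threshold bins described above for age and residential density can be illustrated with a minimal Python sketch. This is not the ICEES pipeline code; the function names are hypothetical, and the handling of the exact boundary values (eg, treating exactly 50,000 persons as an urban cluster) is an assumption, because the text does not specify whether the thresholds are inclusive.

```python
from bisect import bisect_right

# Lower edges of the 2nd..5th age bins; labels follow the bins listed in the text.
AGE_EDGES = [5, 18, 45, 65]
AGE_LABELS = ["<5", "5-17", "18-44", "45-64", "65-89"]


def bin_age(age_years: int) -> str:
    """Map an age in whole years to one of the five age bins."""
    # Assumption: ages outside 0-89 are rejected rather than capped.
    if not 0 <= age_years <= 89:
        raise ValueError("age must be between 0 and 89 years")
    return AGE_LABELS[bisect_right(AGE_EDGES, age_years)]


def classify_density(persons_per_block_group: float) -> str:
    """Map a residential density estimate to a rural/urban class."""
    if persons_per_block_group < 2500:
        return "rural area"
    # Assumption: both 2500 and 50,000 fall in the urban cluster class.
    if persons_per_block_group <= 50000:
        return "urban cluster"
    return "urbanized area"
```

For example, bin_age(17) falls in the 5-17 bin and classify_density(1200) is classified as a rural area. The frequency-based bins produced by pandas.qcut depend on the data distribution and are therefore not reproduced here.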

ICEES OpenAPI
We accessed the ICEES OpenAPI through the ICEES Swagger OpenAPI interface and by command-line requests. An ICEES user interface was also available. ICEES was designed to support several functionalities for exploring and displaying the data, including chi-square tests, with counts of patients, chi-square statistics, and probabilities returned to users. In this study, we applied an ICEES functionality that allows users to run multiple chi-square comparisons based on available features and a primary outcome measure, with options to include a correction metric for multiple comparisons or collapse contiguous bins. In all cases, missing data were excluded from analysis. We queried the ICEES OpenAPI for data on all patients included in the asthma cohort and focused on outcomes in year 2016, which was the most recent year available with complete exposures data. We ran separate queries for each of the following primary outcome measures: (1) 1 or more annual ED or inpatient visits for respiratory issues; and (2) 1 or more annual prescriptions for prednisone. Specifically, we asked the following natural language question: "Among all patients within the ICEES asthma cohort, what airborne pollutant exposures, demographic features, and socioeconomic exposures differ significantly between patients with 0 versus 1 or more annual ED or inpatient visits for respiratory issues in year 2016?" The corresponding command-line API request was:
curl -X POST "https://icees.renci.org:16340/patient/2016/cohort/COHORT%3A12/associations_to_all_features" -H "accept: text/tabular" -H "Content-Type: application/json" -d "{\"feature\":{\"TotalEDInpatientVisits\":{\"operator\":\"=\",\"value\":0}},\"maximum_p_value\":1}"
A similar query was used to examine the primary outcome of 1 or more annual prescriptions for prednisone.
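For readers who prefer a scripting language to curl, the same request can be assembled in Python with the standard library alone. The sketch below is illustrative only: the helper names build_association_request and send are hypothetical, the host, port, cohort identifier, and feature name are copied from the curl request above, and the endpoint may not be reachable at that address today.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Host and port copied from the curl request in the text.
BASE_URL = "https://icees.renci.org:16340"


def build_association_request(year, cohort_id, feature, operator, value,
                              maximum_p_value=1):
    """Build (but do not send) a POST request for a 1 x N feature association."""
    # quote(..., safe="") percent-encodes the colon, so COHORT:12 -> COHORT%3A12.
    url = (f"{BASE_URL}/patient/{year}/cohort/"
           f"{quote(cohort_id, safe='')}/associations_to_all_features")
    payload = {
        "feature": {feature: {"operator": operator, "value": value}},
        "maximum_p_value": maximum_p_value,
    }
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "text/tabular", "Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Mirrors the curl example: patients with 0 annual ED or inpatient visits in 2016.
request = build_association_request(2016, "COHORT:12",
                                    "TotalEDInpatientVisits", "=", 0)
```

Calling send(request) would submit the query. Building the request separately from sending it makes it possible to inspect the URL and JSON body before any network traffic occurs.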

Statistical Analysis
The exploratory 1 × N feature association functionality available via the ICEES OpenAPI automatically ran a chi-square test of the association between each available feature and our user-defined primary outcome measure, using the significance level and multiple-comparison correction that we specified. We considered the primary outcomes of 1 or more annual ED or inpatient visits for respiratory issues and 1 or more annual prescriptions for prednisone. We focused our analysis on select feature variables that were considered a priori to have a potential impact on asthma exacerbations and were available for patients within the asthma cohort: demographic factors (age, sex, and race); socioeconomic exposures (residential density, health insurance access, and median household income); and airborne pollutant exposures (proximity to major roadway or highway, and exposure to PM2.5 and ozone). We set the significance level at α=.05, which was adjusted by Bonferroni correction to α=.001.
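To make the statistical procedure concrete, the following Python sketch computes a Pearson chi-square test for a single 2 × 2 table (eg, a binary feature by a binary outcome) and applies a Bonferroni adjustment. It is not the ICEES implementation, which also handles features with more than 2 bins. The function names are hypothetical, and the value of 50 comparisons is an assumption chosen only because .05/50 reproduces the adjusted threshold of .001; the text does not state the number of comparisons.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper-tail probability of a chi-square
    # variable equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


# Assumed number of comparisons (not reported in the text).
adjusted_alpha = bonferroni_alpha(0.05, 50)

# Hypothetical counts, for illustration only (not study data).
statistic, p_value = chi_square_2x2(10, 20, 30, 40)
is_significant = p_value < adjusted_alpha
```

An association is declared significant only when its P value falls below the adjusted threshold, which controls the family-wise error rate across all feature comparisons.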
A power calculation was not conducted, as this was an observational, exploratory study focused on existing biomedical data.

Results
We successfully queried the ICEES OpenAPI for outcomes data on year 2016. Of the 157,410 patients who met the criteria used to define the asthma cohort, 26,332 patients (16.73%) had 1 or more annual ED or inpatient visits for respiratory issues, and 17,056 patients (10.84%) had 1 or more annual prescriptions for prednisone. Table 1 provides additional details on the cohort, including demographic and clinical profile and environmental exposures.
We then examined associations between select feature variables and annual ED or inpatient visits for respiratory issues, focusing initially on demographic factors (Figure 1A-C). We found that the percentage of patients with asthma exacerbations was higher among females than males.
We also examined associations between socioeconomic exposures and annual ED or inpatient visits for respiratory issues (Figure 2A-C). We found that the percentage of patients with 1 or more annual ED or inpatient visits for respiratory issues was higher among patients residing in low-density rural areas than among those residing in higher-density urban clusters.
We then examined associations between airborne pollutant exposures and annual ED or inpatient visits for respiratory issues (Figure 3A-C).
Results for the primary outcome of annual prescriptions for prednisone are shown in Figure 1D-F.

Principal Findings
We describe the further development and application of ICEES+ to explore select feature variables associated with asthma exacerbations in a large cohort of patients with asthma or a related condition. We focused on select demographic factors, socioeconomic exposures, and airborne pollutant exposures. We compared results for 2 outcome measures that are indicative of asthma exacerbations: annual ED or inpatient visits for respiratory issues and annual prescriptions for prednisone. We found that female sex, Caucasian race, rural residential density, high probability of no health insurance, low estimated median household income, close residential proximity to a major roadway or highway, and exposure to relatively high levels of PM2.5 or ozone were significantly associated with asthma exacerbations. Moreover, the results were largely consistent across outcome measures, even though rates of annual ED or inpatient visits for respiratory issues were higher than those for annual prednisone prescriptions.

Limitations
Our study has several limitations that should be considered when interpreting the results. Specifically, as an open service that exposes EHR data, ICEES must abide by stringent regulatory and institutional requirements that limit the granularity of data that can be exposed and the statistical capabilities that are supported. For instance, ICEES exposes binned or recoded data, not raw data. In addition, our institution treats exposure estimates as secondary PHI because they are derived from primary PHI (ie, geocodes and dates); as such, we are unable to reveal the estimated values themselves, only the bins, thus preventing a determination of mean exposures and other statistics based on continuous values. Finally, ICEES currently supports only basic bivariate statistical capabilities. However, we are developing approaches to adapt ICEES to support, in a regulatory-compliant manner, more sophisticated multivariate statistical approaches and machine learning algorithms [15,16].

Comparison With Prior Work
We highlight several scientific findings and discuss unexpected findings. First, we observed a higher proportion of asthma exacerbations among females than among males. Asthma and acute exacerbations of asthma are more common in males than females in childhood. However, in adulthood, the effect of sex shifts, with females accounting for the majority of asthma and asthma exacerbations. As the majority of patients in our cohort were adults, this observation is consistent with what has been reported in the literature [17-19]. In addition, the increase in asthma exacerbations among patients with lower median household income and those lacking health insurance reflects established disparities in asthma management, particularly among minorities [20]. However, the higher proportion of asthma exacerbations among Caucasians than among African Americans was unexpected and contradicts both our prior findings [10] and those of other investigators [21]. While the reason for this apparent discrepancy is unclear, several possible explanations exist, including the fact that our institution's racial category of "Caucasian" does not definitively distinguish Hispanic Caucasians from non-Hispanic Caucasians, which may have introduced variability. We are currently exploring approaches that may allow us to clearly distinguish Hispanic and non-Hispanic Caucasians and thus refine our racial and ethnic categorization. Another possible explanation is that our prior study focused on year 2010 [10], whereas this study focused on year 2016, and our institution's demographics and patient catchment area have changed significantly over that period [22].
Second, the relationship between age and asthma exacerbations was U-shaped when based on annual ED or inpatient visits for respiratory issues and linear when based on annual prescriptions for prednisone. We suspect that this difference is due to the heterogeneity of wheezing phenotypes in the younger age range, which can be associated with different long-term prognoses for the development of asthma and variance in the use of oral corticosteroids for disease exacerbation [23-26].
Third, one of the key features of ICEES is that it supports research on the impact of environmental exposures such as airborne pollutants on health and disease. Indeed, we observed that asthma exacerbations increased with increasing exposure to PM2.5 and ozone, as we and others have shown [5,27]. We also found an increase in asthma exacerbations among patients residing in close proximity to a major roadway or highway, as others have found when using roadway exposure as a proxy for airborne pollutant exposure [14,28,29], although the effect in this study was modest. While one might have expected an increase in asthma exacerbations among persons living in densely populated areas, we found the opposite to be true, with increased asthma exacerbations among persons residing in low-density regions classified by the US Census Bureau as rural areas versus higher-density regions classified as urban clusters. We suspect that several factors might explain these findings. For instance, UNC Health's patient catchment area draws heavily from rural regions of North Carolina, with multiple clinics and small hospitals located across the state and many patients relying on the state hospital system for health care services. Indeed, not a single patient in the cohort described in this study resided in a region classified by the US Census Bureau as an urbanized area. This may have introduced bias into the results. In addition, we note that many major roadways and highways run through rural parts of our patient catchment area, and so any presumption that close proximity to a major roadway or highway is more common in urban versus rural regions may not be valid. A related point is that rural exposures carry risks that may differ from urban exposures. For instance, we are expanding ICEES to include data on concentrated animal farming operations and landfills so that we can begin to examine exposures that may uniquely impact persons residing in rural regions.
We also highlight key technical aspects of this study and discuss limitations. First, the data reported herein are openly available via the ICEES OpenAPI, without any regulatory restrictions or login credentials. This allowed us to rapidly execute the queries and analyze the results, thereby accelerating the speed of discovery. Because ICEES is designed to be disease agnostic and is not restricted to patients with asthma and related conditions, we can adapt our approach and the service itself to support any number of use cases and explore environmental influences on virtually any disease. Indeed, we have deployed additional ICEES instances that expose data on patients with drug-induced liver injury and patients with coronavirus infection. In addition, we are adapting ICEES to support a use case on primary ciliary dyskinesia and related rare pulmonary disorders.
Second, by using health care system EHR data, a large and clinically relevant patient sample can be identified. In this study, our sample size was approximately 160,000 patients, thus supporting rigorous open statistical analysis. While the statistical tests available via the ICEES+ OpenAPI are currently limited to bivariate analyses, we are developing approaches to support multivariate analyses such as generalized linear models, random forest trees, and causal inference models, with options to control for potential covariates, account for missing data, and examine only those patients who are active in a given year, meaning that they were seen at 1 or more clinics within UNC Health. One significant challenge is the binning approach that is adopted for variables. For instance, automated binning algorithms typically bin data by value or by frequency. The former supports the study of extreme values, but at the expense of evenly distributed bin sizes; the latter supports an even distribution of observations among cells, but at the expense of overlap in patients with equal exposures between bins and bin cutoff points that may not be scientifically meaningful. We are systematically exploring this issue.
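The trade-off between binning by value and binning by frequency can be seen in a small Python sketch. The two functions below are hypothetical, simplified stand-ins: the frequency-based version assigns bins by rank, which is not the algorithm used by pandas.qcut, and the exposure values are invented for illustration.

```python
def equal_width_bins(values, n_bins):
    """Bin by value: split the observed range into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land on the upper edge, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Bin by frequency: rank the observations and cut the ranks into equal groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


# Invented exposure values: four tied low values and one extreme value.
exposures = [1, 1, 1, 1, 2, 3, 4, 100]
by_value = equal_width_bins(exposures, 4)
by_frequency = equal_frequency_bins(exposures, 4)
```

With these values, binning by value isolates the extreme observation in the top bin but leaves the 2 middle bins empty and places the other 7 observations in the bottom bin. Binning by frequency places 2 observations in each bin but splits the 4 patients with identical exposure values across 2 bins.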

Conclusions
Our results demonstrate that the open-source ICEES can be used to replicate and extend published findings on factors that influence asthma exacerbations. While we are actively researching the limitations of the service and developing ways to improve it, we believe that ICEES will greatly speed and democratize the use of EHR data to support research and discovery. Moreover, to the best of our knowledge, ICEES is the only open source of clinical data that have been integrated at the patient level with multiple sources of public environmental exposures data. While we have described an application use case focused on asthma, ICEES is disease agnostic. We expect the service to advance research in environmental health and related fields and continue to grow as we expand both our user base and the service itself to support new clinical use cases, additional EHR elements (eg, laboratory measures), and new data sources (eg, survey data). Moreover, because ICEES is open source, the model and software code [30,31] can be adopted by other institutions as a novel approach for openly exposing and sharing sensitive data. Indeed, ICEES may have application as an open, privacy-preserving approach to inform decision making by the US Environmental Protection Agency and other federal agencies regarding the patient-level impact of environmental exposures on risk of disease. Finally, we are assessing regulatory-compliant options for applying ICEES as a tool for clinical decision support by identifying patients with asthma (and eventually patients with other chronic diseases) or geographical regions at high risk for poor health outcomes based on their exposures profile and then flagging those patients in their EHR to inform patient care.