The Laboratory-Based Intermountain Validated Exacerbation (LIVE) Score Identifies Chronic Obstructive Pulmonary Disease Patients at High Mortality Risk

Background: Identifying COPD patients at high risk for mortality or healthcare utilization remains a challenge. A robust system for identifying high-risk COPD patients using Electronic Health Record (EHR) data would enable targeted interventions aimed at ensuring guideline compliance and multimorbidity management. The purpose of this study was to empirically derive, validate, and characterize subgroups of COPD patients based on routinely collected clinical data widely available within the EHR. Methods: Cluster analysis was used in 5,006 patients with COPD at Intermountain to identify clusters based on a large collection of clinical variables. Recursive Partitioning (RP) was then used to determine a preferred tree that assigned patients to clusters based on a parsimonious variable subset. The mortality, COPD exacerbations, and comorbidity profile of the identified groups were examined. The findings were validated in an independent Intermountain cohort and in external cohorts from the United States Veterans Affairs (VA) and University of Chicago Medicine systems. Measurements and Main Results: The RP algorithm identified five LIVE Scores based on laboratory values: albumin, creatinine, chloride, potassium, and hemoglobin. The groups were characterized by increasing risk of mortality. The lowest-risk group, LIVE Score 5, had 8% 4-year mortality vs. 56% in the highest-risk group, LIVE Score 1 (p < 0.001). These findings were validated in the VA cohort (n = 83,134), an expanded Intermountain cohort (n = 48,871), and the University of Chicago system (n = 3,236). Higher mortality groups also had higher COPD exacerbation rates and comorbidity rates. Conclusions: In large clinical datasets across different organizations, the LIVE Score utilizes existing laboratory data for COPD patients and may be used to stratify risk for mortality and COPD exacerbations.

Cluster analysis techniques have been used in empirically identifying groups of patients diagnosed under a common disease umbrella in other fields. For example, cluster analysis of patients with severe asthma identified five subgroups of patients with asthma who have unique characteristics and profiles (23). More recently, cluster analysis has been used to identify subgroups of patients with diabetes (24). While prior risk scores in COPD have used PFT and dyspnea scores to identify subgroups of COPD patients (20), those data are not routinely available to be queried in most current Electronic Health Records (EHRs), and thus have limited utility when designing interventions to improve COPD care within a healthcare system.
While PFT data and dyspnea scores are not routinely available to be queried in an EHR, many other clinically collected variables are accessible. For example, in cardiology, increased red cell distribution width (RDW) is associated with increased cardiovascular mortality (25)(26)(27)(28)(29). Although RDW is a marker for disease, rather than a primary driver, the reliance on laboratory values to derive a risk score allows for the identification of high-risk patients in real time during a healthcare encounter (27,30). This ability to identify patients at risk for cardiovascular mortality in real time has facilitated the development of focused interventions with increased resources and coordination of care for high-risk patients. A similar approach has been shown to be effective in improving outcomes for floor patients at risk of developing sepsis (31). Thus, risk scores have been most useful in improving care when individual patient risk can be assessed automatically and an alert surfaced to clinicians for additional care only in those at high risk.
Despite the advances in risk scores, a COPD-related risk stratification score that allows system-wide identification of patients who may benefit from targeted interventions remains elusive. Given the large number of variables routinely collected as part of clinical care, we wanted to determine whether clustering COPD patients would identify subgroups with differential mortality, exacerbation frequency, or comorbidity rates.
(190)(191)(192) at any inpatient, Emergency Department, or ambulatory face-to-face encounter in 2013 or prior were identified (Supplementary Figure 1 and Supplementary Methods).

Outcome Variables
Mortality among derivation patients was assessed based on the known date of death in the Intermountain EHR for in-hospital deaths and was supplemented by Utah death certificate data and Social Security death master file records. Exacerbations requiring hospitalization and comorbidity rates were collected from the EHR.

Clinical Predictor Variables
A complete list of variables is provided in Supplementary Table 1. We included a large number of variables in the dataset, including PFTs. Due to the frequent cardiovascular comorbidities in COPD patients, and the likely contribution of fluid status to respiratory symptoms, we included a number of variables from Transthoracic Echocardiograms (TTE). We attempted to add 6-min walk distance and dyspnea scores, but these were not available in an encoded format in our data system.

Cluster Analysis
Hierarchical cluster analysis of the clinical variables was carried out in the R statistical program. Cluster analysis (23,32) was run with the "cluster" package, using the "daisy" function, which calculates Gower's distance for mixed variables (i.e., continuous and nominal variables). The full array of clinical variables was included in the cluster analysis to determine the optimal clusters for a derivation subset of Intermountain patients with available data for most of the variables. Some variables were available for only a minority of patients, but this approach tolerates missing data and can proceed with these variables included. We visually analyzed the cluster tree (dendrogram) and evaluated 4, 5, 6, 7, 8, and 9 cluster solutions using the "cutree" function in R (33). A seven-cluster solution was identified based on variable break points, cluster sizes, and the initial goal of identifying four to eight clusters (Supplementary Figure 2). This method provided segmentation of the population using clinically similar groupings that were derived independently of study outcome variables.
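As an illustration of how a Gower-type distance accommodates mixed variables and missing values, the following is a minimal Python sketch, not the authors' R code. It follows the usual definition of Gower's coefficient: numeric differences are scaled by the variable's range, nominal variables contribute 0 or 1, and any variable missing in either patient is skipped. The function name, the record layout, and the example values are assumptions made for illustration only.

```python
from typing import List, Optional, Sequence, Union

Value = Optional[Union[float, str]]  # None marks a missing measurement


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   ranges: Sequence[Optional[float]]) -> float:
    """Gower dissimilarity between two patient records.

    ranges[i] is the observed range (max - min) of numeric variable i,
    or None when variable i is nominal. Variables that are missing in
    either record are left out of both the sum and the denominator.
    """
    total = 0.0
    used = 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # skip variables not observed in both patients
        if r is None:
            total += 0.0 if x == y else 1.0  # nominal: simple mismatch
        else:
            total += abs(x - y) / r if r > 0 else 0.0  # numeric: range-scaled
        used += 1
    if used == 0:
        return float("nan")  # no variable observed in both records
    return total / used


# Hypothetical records: (albumin, creatinine, sex); creatinine missing in p2.
p1: List[Value] = [3.0, 1.0, "F"]
p2: List[Value] = [4.0, None, "M"]
print(gower_distance(p1, p2, ranges=[2.0, 3.0, None]))
```

In this example only albumin and sex are compared, so the distance is the mean of 0.5 and 1.0. A matrix of such pairwise distances is what a hierarchical clustering routine would then consume.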

Recursive Partitioning
After the cluster analysis, we used Recursive Partitioning (RP) (21,34,35) to identify a parsimonious subset of variables that best predict the cluster assignments and are more likely to be available for use in other populations. RP is a nonparametric regression approach for modeling relationships among variables, which allows for evaluation of a large number of mixed predictor variables (i.e., continuous, ordinal, and categorical) with missing values, as is often the case with EHR clinical measures. We categorized continuous laboratory variables based on laboratory-determined clinical cutoffs (e.g., low, normal, high), because this created more stable decision trees, then ran RP in the R statistical platform using the "rpart" package.
We attempted alternative statistical methods, such as stepwise regression analysis techniques; however, due to the frequency of missing data in the dataset, too many cases were eliminated in the modeling process. For example, given the large number of variables in the data set, almost no cases had all data elements (pulmonary function test data, medication, labs, healthcare utilization, echocardiograms, etc.).
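The categorization step described above can be sketched as follows. This is a minimal Python illustration rather than the authors' R code. The variable names and the numeric intervals in ILLUSTRATIVE_CUTOFFS are placeholders chosen for the example. They are not the cutoffs used in the study, which were determined by the reporting laboratory.

```python
from typing import Dict, Optional, Tuple

# Placeholder reference intervals (lower, upper), for illustration only.
# Real cutoffs are laboratory specific and are NOT taken from the study.
ILLUSTRATIVE_CUTOFFS: Dict[str, Tuple[float, float]] = {
    "albumin_g_dl": (3.5, 5.0),
    "potassium_mmol_l": (3.5, 5.0),
}


def categorize_lab(value: Optional[float], lower: float,
                   upper: float) -> Optional[str]:
    """Map a continuous result to 'low', 'normal', or 'high'.

    Missing results stay None so that a later step can decide how to
    handle them.
    """
    if value is None:
        return None
    if value < lower:
        return "low"
    if value > upper:
        return "high"
    return "normal"


def categorize_panel(labs: Dict[str, Optional[float]]) -> Dict[str, Optional[str]]:
    """Categorize every variable listed in ILLUSTRATIVE_CUTOFFS."""
    return {
        name: categorize_lab(labs.get(name), lower, upper)
        for name, (lower, upper) in ILLUSTRATIVE_CUTOFFS.items()
    }


print(categorize_panel({"albumin_g_dl": 2.9}))
```

Working with three-level categories rather than raw values means a tree split reads as "albumin low vs. not low", which is one plausible reason such trees are less sensitive to small shifts in the data.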
We evaluated the concordance between our clusters and the RP tree assigned groups for all patients with complete data that allowed RP assignment. Then, we evaluated the concordance between clusters and RP assigned groups for patients with missing data where we imputed normal values. The RP assigned groups were named the LIVE Scores for those patients.
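The two steps in this comparison, imputing missing categories as normal and then measuring agreement, can be sketched in Python as below. The text does not specify the concordance statistic, so simple percent agreement is used here as an assumption. The function names and example labels are likewise hypothetical.

```python
from typing import Dict, Hashable, Optional, Sequence


def impute_normal(labs: Dict[str, Optional[str]],
                  required: Sequence[str]) -> Dict[str, str]:
    """Return the required categorical labs, filling gaps with 'normal'."""
    filled = {}
    for name in required:
        category = labs.get(name)
        filled[name] = category if category is not None else "normal"
    return filled


def percent_agreement(cluster_labels: Sequence[Hashable],
                      tree_labels: Sequence[Hashable]) -> float:
    """Fraction of patients whose tree-assigned group matches their cluster.

    Assumes both sequences use the same label coding and patient order.
    """
    if len(cluster_labels) != len(tree_labels):
        raise ValueError("label sequences must have the same length")
    if not cluster_labels:
        raise ValueError("at least one patient is required")
    matches = sum(c == t for c, t in zip(cluster_labels, tree_labels))
    return matches / len(cluster_labels)


# Hypothetical patient with potassium missing and chloride never measured.
print(impute_normal({"albumin": "low", "potassium": None},
                    ["albumin", "potassium", "chloride"]))
# Hypothetical labels for four patients; three of four agree.
print(percent_agreement([1, 2, 3, 5], [1, 2, 4, 5]))
```

Running the agreement calculation separately on complete-data patients and on imputed patients, as the text describes, shows how much the imputation degrades the assignment.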

Validation
We validated the LIVE Score and our findings internally within the Intermountain Healthcare system in an expanded cohort of 48,871 patients. External validation was done at two independent sites: 83,134 patients in the United States Veterans Affairs (VA) nationwide healthcare system EHR data (VA Informatics and Computing Infrastructure, VINCI) (36) and 3,236 patients in the University of Chicago Medicine system (Supplementary Table 2 and Supplementary Figures 4-6). To validate the COPD clusters, we used the RP tree derived above to empirically assign LIVE Scores based on the limited number of variables needed for the tree. This approach allowed us to validate the tree in external sites based on a much smaller number of variables. Kaplan-Meier survival curves were calculated to evaluate time-to-event results for mortality and exacerbation outcomes.
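For readers unfamiliar with the Kaplan-Meier estimator, the following is a minimal Python sketch of the standard product-limit calculation. At each distinct event time, the running survival probability is multiplied by one minus the number of events divided by the number still at risk. It is not the code used in the study, and the follow-up times in the example are invented.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Product-limit survival estimate.

    times[i]  : follow-up time for patient i
    events[i] : True if the event (e.g., death) occurred at times[i],
                False if the patient was censored at that time
    Returns (time, survival probability) at each distinct event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients leave the risk set
    return curve


# Invented follow-up times in years; False marks a censored patient.
for t, s in kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False]):
    print(t, round(s, 3))
```

Computing one such curve per LIVE Score group and comparing the curves is the kind of analysis summarized in the survival figures. The significance testing between curves is a separate step not shown here.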

Subject Demographics
From the initial 11,048 patients identified with a COPD diagnosis in the Intermountain Healthcare system on or before 2013, the presence of a transthoracic echocardiogram (TTE), not its findings, was the initial most important variable for risk stratification. This observation suggested selection bias and pattern of care: patients who were more likely to come to the hospital often were more likely to get a TTE. Thus, we decided to focus on the higher risk patients (those with a prior TTE) for our cluster analysis.

Cluster Analysis
Cluster analysis of the 5,006 patients with a COPD diagnosis and a TTE by 2013 was performed using all clinical variables. A seven-cluster solution was identified based on variable break points, cluster sizes, and the initial goal of identifying four to eight clusters (Supplementary Figure 2). The seven clusters differed in number of patients, overall mortality, and healthcare utilization data (Supplementary Cluster Descriptions, Supplementary Tables 3, 4). We had encoded PFT data available for only 11% (535) of patients in our cohort, and the vast majority had obstruction (Supplementary Figure 6).

Recursive Partitioning and Tree Diagram
We used Recursive Partitioning to derive an empiric decision tree assigning each patient into a specific LIVE Score (Figure 1).
FIGURE 1 | Decision tree. The empiric decision tree assigning five LIVE Scores of the seven cluster types is shown. Six laboratory variables categorize all patients into one of five LIVE Scores (approximately corresponding to the clusters). LIVE 5 (cluster 2), the "healthiest", is characterized by normal hemoglobin and normal chloride. LIVE 1 and 2 (clusters 1 and 6), the "sickest", are characterized by multiple laboratory abnormalities, most notably hemoglobin, albumin, and potassium. The presence of a history of renal failure (Max Creat is high) distinguishes the higher risk LIVE 3 (cluster 5) from the relatively lower risk LIVE 4 (cluster 3). Max, maximum; Min, minimum; hgb, hemoglobin; creat, creatinine; Cl, chloride; Alb, albumin; K, potassium; nl, normal; "ever"
The decision tree had six nodes: albumin, creatinine, chloride, potassium, and hemoglobin (two variables: the minimum hemoglobin value over all years in the dataset, and the maximum hemoglobin value for the year). Using these six variables, the decision tree assigned each subject to one of five LIVE Scores. The decision tree did not assign two original cluster types (Cluster 4, n = 251, 5% and Cluster 7, n = 79, 1.6%). The agreement between the RP decision tree assigned LIVE Scores and the original Clusters is shown in Supplementary Figures 7-9.
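The two hemoglobin summaries named above (minimum over all years, maximum within the scoring year) can be derived from a patient's dated results as in the following minimal Python sketch. The record layout of (calendar year, value) pairs, the function names, and the example values are assumptions for illustration. The tree's actual split rules are given only in Figure 1 and are not reproduced here.

```python
from typing import List, Optional, Tuple

LabResult = Tuple[int, float]  # (calendar year, measured value)


def min_ever(results: List[LabResult]) -> Optional[float]:
    """Lowest value across all years on record, or None if never measured."""
    values = [value for _, value in results]
    return min(values) if values else None


def max_in_year(results: List[LabResult], year: int) -> Optional[float]:
    """Highest value recorded during the given year, or None if not measured."""
    values = [value for y, value in results if y == year]
    return max(values) if values else None


# Invented hemoglobin history (g/dL) for one patient, scored for 2009.
hgb = [(2007, 11.2), (2008, 9.8), (2009, 12.5), (2009, 13.1)]
print(min_ever(hgb), max_in_year(hgb, 2009))
```

Each summary would then be categorized as low, normal, or high before being passed to the tree.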

Intermountain Validation
We identified all Intermountain Healthcare patients with a billing code for COPD based on an expanded list of COPD diagnosis codes to create a dataset of 48,871 patients alive in 2009 (Supplementary Table 2 and Supplementary Figure 3). Of these, 30,533 patients had laboratory data allowing LIVE Score calculation without imputation. Basic demographics, healthcare utilization, and comorbidity rates are summarized for the 9,221 patients with a TTE in 2009 or prior (Tables 1, 2) and for the 21,312 patients without a TTE (Supplementary Tables 5, 6).
Overall mortality was assessed for each of those cohorts based on the calculated LIVE Score in 2009. The mortality for patients with a TTE was higher than the mortality for patients without a TTE (46 vs. 23%, respectively, p < 0.001), and the LIVE Scores stratified mortality within both cohorts. Figure 2 shows the Kaplan-Meier survival curve for patients with (Figure 2A) and without a prior TTE (Figure 2B). In both cohorts, LIVE Score 5 had the lowest mortality (23 and 15%, respectively, p < 0.001) and LIVE Score 1 had the highest mortality (77 and 57%, respectively, p < 0.001) (Figure 2).
The time to first COPD exacerbation requiring a COPD-related Emergency Department visit and/or hospitalization was also statistically significantly different in both cohorts (Figure 3). Patients with LIVE Score 5 with a prior TTE had the lowest rate of COPD exacerbations (0.20 COPD-related visits/year vs. 0.67 visits/year overall, p < 0.001). Patients with LIVE Scores 1 and 2 had the highest COPD-related healthcare utilization rate (1.57 and 1.46 visits/year, respectively). The difference in COPD exacerbations between LIVE Score 1 and LIVE Score 2 was not significant, but both were statistically significantly higher compared with the overall rate of 0.67 visits/year, p < 0.001 (Table 1 and Figure 3).
The LIVE Scores with higher overall mortality were statistically significantly associated with higher comorbidity rates (Tables 1, 2, and Supplementary Tables 5, 6).

Veterans Affairs National Health System Validation
External validation was performed in a retrospective data-only cohort of 83,134 VA patients with COPD alive in 2009 from all VA hospitals throughout the United States who had a LIVE Score calculation in 2009 (Supplementary Table 7 and Supplementary Figure 4). We performed the analysis on the 6,034 patients who had a TTE in 2009 or prior and examined 7-year mortality and risk of severe COPD exacerbation. We repeated the analysis on the 77,100 patients without a TTE in 2009 or prior. Patients with a prior TTE had a statistically significantly higher overall mortality than those without a prior TTE (Supplementary Figure 12). However, within each cohort the LIVE Score separated patients into statistically significantly different overall mortality rates (Figure 4). LIVE Score 1 patients with a prior TTE had an 81% mortality compared with 23% mortality for LIVE Score 5 patients (p < 0.001, Figure 4A). Similarly, LIVE Score 1 patients without a prior TTE had a 72% mortality compared with 17% mortality for LIVE Score 5 (p < 0.001, Figure 4B). Furthermore, in both cohorts, the LIVE Scores were associated with statistically significantly different rates of COPD exacerbation. The highest rates were in patients with LIVE Scores 1 and 2, where 80 and 84% of patients, respectively, had a COPD exacerbation by 8 years. Although the difference between LIVE Scores 1 and 2 was not significant, both groups had statistically significantly higher rates than the other LIVE Scores, and only 25% of patients in LIVE Score 5 had a COPD exacerbation (p < 0.001, Supplementary Figures 13, 14).

University of Chicago Health System Validation
We repeated the LIVE Score validation in a second retrospective data-only cohort of 3,236 patients from the University of Chicago Medicine system where TTE data were not available (Supplementary Tables 2, 8 and Supplementary Figure 5). The University of Chicago Medicine system is relatively open and comprises a unique urban population. Patient cohort enrollment was normalized such that time zero for patient data was the date that patients first met cohort criteria. Given the relatively small number of patients in the cohort, as well as the Intermountain work showing good predictions when imputing missing variables as normal (Supplementary Figures 9, 10), we elected to impute missing variables as "normal" in this cohort. In this second external cohort with slightly different COPD definitions and a unique patient population, the LIVE Score showed the same pattern of separation of 6-year all-cause mortality (Figure 5). Overall, the separation was between two low-risk LIVE Scores (LIVE Scores 4 and 5) and three high-risk LIVE Scores (LIVE Scores 1, 2, and 3). The difference between the low-risk and high-risk LIVE Scores was statistically significant, but in this small cohort of patients the differences among the individual LIVE Scores did not reach statistical significance (Supplementary Table 9). Similarly, in this small cohort with a large number of imputed variables in a relatively open health system, differences among the LIVE Scores with regard to severe COPD exacerbation were not found to be statistically significant (Supplementary Table 10 and Supplementary Figure 15).

DISCUSSION
Using a large dataset of routinely collected clinical variables from the EHR and narrowing it down to an optimal parsimonious set of common variables, we identified and externally validated a novel Laboratory-based Intermountain Validated Exacerbation (LIVE) Score in patients diagnosed with COPD. The LIVE Score is calculated based on six routinely collected laboratory values, which are reliable across institutions and care settings, are obtained in real time, and do not rely on clinician judgment or billing codes. The LIVE Scores stratify patients with differing overall mortality rates and severe COPD exacerbation rates across different healthcare systems.
The LIVE Score is based on the hemoglobin, potassium, albumin, creatinine, and chloride laboratory values obtained through routine clinical care. Although our analysis does not inform why these specific variables most robustly separated patients with a diagnosis of COPD into different groups, we speculate that they may be markers of comorbidity and disease. For example, patients who had no evidence of renal failure (maximum creatinine is normal) would be at lower risk for complications related to congestive heart failure exacerbations and would be less likely to be hospitalized or to die. Additionally, evidence of anemia (minimum hemoglobin ever low) may be a marker for anemia of chronic disease, which in turn may be related to other morbidities and the patient's overall health. Similar speculations regarding the correlation of mortality and laboratory abnormalities may be made regarding potassium (e.g., diuretic use), albumin (malnutrition, general health), or chloride.
The value of risk stratifying patients based on the LIVE Score lies in the ability to identify high-risk patients across a healthcare system for targeted interventions. Although the bedside physician may recognize that their individual COPD patient is at high risk for mortality and future healthcare utilization, identifying high-risk patients on a system level allows for resource allocation that would better support the patient and their physicians. Indeed, while large gaps between recommended care and actual care in COPD patients remain (37)(38)(39)(40), this type of risk stratification may help improve adherence to guidelines in the high-risk patients who need better support. Thus, the utility of risk stratification is that within a health system identifying high-risk patients may help focus resources around improving access to care and care coordination (41)(42)(43). This approach of risk stratifying patients based on passively collected and calculated risk scores with subsequent intensive clinician attention to the highest risk patients has been shown to be effective in improving heart failure and sepsis outcomes (30,31).
The LIVE Score risk stratifies complex real-world patients who have been diagnosed with COPD and may have a variety of competing comorbidities, which affect their overall mortality and healthcare utilization. These comorbidities are important determinants not only of overall mortality, but also of hospitalizations and healthcare utilization. While healthcare systems have increased their focus on reducing 30-day COPD readmissions, nearly half of the patients readmitted after a COPD related hospitalization are admitted for problems unrelated to their COPD (43). Thus, interventions aimed at improving COPD care must take into account the multimorbidity model of COPD in identifying patients (19,43,44). Indeed, for many patients with COPD, improving care may be achieved more effectively by diagnosing and treating comorbidities rather than focusing on COPD therapy alone (45).
The strength of our study is the empiric, reliable risk stratification of COPD patients using readily available EHR data. The validation using clinical patient data from three different healthcare systems with different definitions of COPD suggests that these groups reflect underlying stable patient groups. This risk stratification strategy may form a basis for identifying COPD patients at high risk of mortality and complications on a system level, thus better targeting interventions. Although our study advances the field by identifying novel laboratory-based LIVE Scores in COPD patients, it has some limitations. First and foremost, unlike research cohorts with prospectively collected PFT data, we cannot be certain that all patients have COPD. This limitation in identifying and categorizing COPD patients reflects the underlying structure of most EHR systems, which do not have PFTs available, and system limitations whereby patients with COPD do not regularly receive PFT testing. Nevertheless, factors beyond PFTs are increasingly recognized as driving outcomes in patients with COPD (4). The lack of diagnostic certainty does not take away from the utility of our LIVE Scores. Our cohorts represent patients in clinical care with diagnostic uncertainty and competing comorbidities, which may cause respiratory symptoms that are evaluated in routine clinical care. Indeed, our risk stratification schema may facilitate more accurate diagnosis of COPD by prioritizing diagnostic accuracy in high-risk patients where additional resources may be focused.

CONCLUSION
In large clinical datasets across different organizations, a LIVE Score that utilizes existing laboratory data for COPD patients may be used to stratify risk for mortality and COPD exacerbations.

IMPACT
Despite advancements in interventions that improve clinical outcomes of COPD patients, gaps between clinical guidelines and care persist. While COPD patients in clinical research studies are well-characterized and managed according to current guidelines, in clinical care those hospitalized with respiratory symptoms may have diagnostic uncertainty and lack guideline recommended care. Identifying the highest-risk groups of COPD patients in order to prioritize enrollment in disease management programs remains a challenge. Here we developed and validated the LIVE Score, a system for population health management to identify COPD patients at high risk for healthcare utilization, morbidity, and mortality through existing data for real-world clinically diagnosed COPD. The LIVE Score could be used to risk stratify COPD patients within a healthcare system in order to prioritize initiatives aimed at improving healthcare delivery for COPD, saving clinician time and reducing health system costs.

AUTHOR CONTRIBUTIONS
DB conceived of the study, designed the data set, performed data analysis and interpreted the data, and wrote the first draft of the manuscript. DB had full access to the data and is the guarantor of the paper, taking responsibility for the integrity of the work as a whole, from inception to published article. DC helped design the study, performed the statistical analysis and validation, and critically revised the manuscript for important intellectual content. SR generated and validated the dataset, helped analyze the data, and critically revised the manuscript for important intellectual content. BH helped analyze the data and critically revised the manuscript for important intellectual content. VP, MC, and KC validated the findings in the University of Chicago data and edited the manuscript. RM assisted with data analysis and interpretation and edited the manuscript. SZ generated the data set and R code, validated the dataset in the VA data set, and edited the manuscript. MA generated the data and analyzed the data for the validation in the VA cohort and critically revised the manuscript for important intellectual content.

FUNDING
This work is supported by the Intermountain Research and Medical Foundation (DB). MA is supported by the Flight Attendant Medical Research Institute. VP was supported by a K23 (HL118151) from the National Heart, Lung, and Blood Institute (NHLBI) and MC is supported by a K08 (HL121080) from the NHLBI and R01 (GM123193). BH is supported by grants from Intermountain Healthcare's Foundry innovation program and the Intermountain Research and Medical Foundation.