A mixed-ensemble model for hospital readmission

https://doi.org/10.1016/j.artmed.2016.08.005

Highlights

  • A mixed-ensemble model for hospital readmission is proposed.

  • The mixed-ensemble model enables controlling the tradeoff between reasoning transparency and predictive accuracy.

  • The mixed-ensemble model increases the classification accuracy for positive readmission instances.

  • An optimization approach for the mixed-ensemble model is proposed.

  • The mixed-ensemble model has been implemented for predicting all-cause hospital readmissions of CHF patients.

Abstract

Objective

A hospital readmission is defined as an admission to a hospital within a certain time frame, typically thirty days, following a previous discharge, either to the same or to a different hospital. Because most patients are not readmitted, the readmission classification problem is highly imbalanced.

Materials and methods

We developed a hospital readmission predictive model, which enables controlling the tradeoff between reasoning transparency and predictive accuracy, by taking into account the unique characteristics of the learned database. A boosted C5.0 tree, as the base classifier, was ensembled with a support vector machine (SVM), as a secondary classifier. The models were induced and validated using anonymized administrative records of 20,321 inpatient admissions, of 4840 Congestive Heart Failure (CHF) patients, at the Veterans Health Administration (VHA) hospitals in Pittsburgh, from fiscal years (FY) 2006 through 2014.

Results

The SVM predictions are characterized by greater sensitivity values (true positive rates) than are the C5.0 predictions, over a wider range of cutoff values on the ROC curve, depending on a predefined confidence threshold for the base C5.0 classifier. The total accuracy of the ensemble ranges from 81% to 85%. Different predictors, including comorbidities, lab values, and vitals, play different roles in the two models.

Conclusions

The mixed-ensemble model enables easy and fast exploratory knowledge discovery of the database, and a control of the classification error for positive readmission instances. Implementation of this ensembling method for predicting all-cause hospital readmissions of CHF patients allows overcoming some of the limitations of the classifiers considered individually, and of other traditional ensembling methods. It also increases the classification accuracy for positive readmission instances, particularly when strong predictors are not available.

Introduction

A hospital readmission is defined as an admission to a hospital within a certain time frame, following an original hospital discharge, either to the same or to a different hospital. The Congestive Heart Failure (CHF) diagnosis includes some of the highest percentages of patients who are readmitted to a hospital within thirty days of discharge [1], [2], [3], [4], and is the leading cause of hospital admissions among patients over the age of 65 years [5]. CHF is also associated with high rates of mortality and morbidity [6]. Several previous papers used logistic regression to estimate the probability of hospital readmissions [7], [8], [9], [10]. Another type of baseline model uses survival analysis (or hazard models) to estimate the time duration between consecutive patient admissions [11], [12]. Although both approaches are useful in identifying readmission risk factors, they are not as useful for dealing with the non-stationary nature of patient readmissions, where the readmission propensity might change over time, depending on different conditions and treatments during prior admissions [13]. Also, most approaches are characterized by limited classification power when a large range of variables is considered.

A variety of reasons could lead to readmissions, such as early discharge of patients, improper discharge planning, and poor care transition [14], [15], [16]. Vinson et al. [4] found that the factors predictive of readmission of CHF patients included a prior history of heart failure, four or more admissions within the preceding eight years, and heart failure precipitated by an acute myocardial infarction or uncontrolled hypertension. Using subjective criteria, they indicated that factors contributing to preventable readmissions included non-compliance with medications or diet, inadequate discharge planning or follow-up, a failed social support system, and failure to seek medical attention promptly when symptoms recurred. Schwartz et al. [17] studied the severity of cardiac illness, cognitive functioning, and functional health of 156 patients within seven to ten days of each patient's discharge from the hospital. They found that 44% of the patients were re-hospitalized during a three-month period. A patient's severity of cardiac illness, functional health status, and caregiver psychosocial and informal support factors influenced hospital readmissions during the three months post hospital discharge. Key predictors of readmissions in their regression analysis were the interaction of the severity of patient cardiac illness with functional status, the interaction of depression with stress, and informal social support. He et al. [18] developed an administrative claim-based algorithm to predict 30-day readmission using standardized billing codes and basic admission characteristics available before discharge. The algorithm works by exploiting high-dimensional information in administrative claims data and performing logistic regression on the selected attributes. Yu et al. [19] experimented with a general framework for hospital-specific and condition-specific readmission risk prediction models, implementing a Support Vector Machine (SVM) as a classification approach and Cox regression as a prognostic approach. Their institution-specific readmission risk prediction framework was shown to be flexible and effective compared with one-size-fits-all models. For a systematic review of statistical models and predictors for CHF patients’ hospital readmissions, see [20], [21]. For a survey of previous research on CHF predictive factors, we refer the reader to [22] and [23].

However, most readmission models do not provide clinically useful, interpretable rules that could explain the reasoning process behind their predictions. They typically only produce a score that describes the chance of readmission, based on the values of the predictors. Providing at least some level of explanation for the reasons behind a prediction may assist healthcare providers in their decisions to administer patient-specific, targeted interventions, in order to reduce patients’ chances of readmission, as well as to facilitate proper resource utilization by the hospital. While actionable insights could be extracted from decision trees, trees may perform less well than other more sophisticated tools [24], [25]. Furthermore, for datasets that are noisy, inconsistent, skewed, and have a significant amount of missing values, tree models may not be flexible enough to produce consistent predictions [24]. Other models that are characterized by greater classification accuracy, such as logistic regression based models [7], [8], [9], [10], may not enable physicians, and clinical staff, to interpret the results in the context of their existing knowledge.

As discussed by Shmueli and Koppius [10], the design of a predictive analytics model involves a trade-off between its predictive power and its explanatory transparency. In this paper, we present a new way to combine two different types of data mining models, to address a challenging clinical predictive task, while supporting the goal of effective communication of the reasoning underlying a prediction. The ensemble model described in this paper integrates a boosted C5.0 tree model, the principal classifier, with a complementary Support Vector Machine (SVM) model as a secondary component. The idea of ensemble predictive modeling is to apply a collection of independent models for predicting a class for a case, rather than basing the prediction on a single model. Ensembles typically produce a consensus prediction, using a majority vote of the members of the ensemble. In such an approach, each model in the ensemble addresses the same original task, but in a different way. The motivation is that a composite model will produce more accurate and reliable decisions than would be obtained from a single model [26]. For example, random forests induce multiple trees, using different subsets of the given input variables, and combine the results by majority vote. Previous empirical work on ensembling has focused primarily on the total predictive error reduction from using multiple models, or on the exploration of novel methods for generating models and combining their predicted classes [27]. Along with simple combiners, there exist more sophisticated methods, such as stacking [28] and arbitration [29]. Ali and Pazzani [27] used an empirical analysis to understand the reduction in generalization error that results from using multiple learning models. By comparing several combination methods, such as uniform voting, Bayesian combination, distribution summation, and likelihood combination, they showed that the amount of observed error reduction is negatively related to the degree of correlation of errors in the individual models [30], [31]. In particular, they found that when the limiting factor is not the noise or difficulty of the data, the multiple-models approach provides an excellent way to achieve a large error reduction.
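To make the consensus idea concrete, the following is a minimal, illustrative sketch of majority-vote ensembling in Python (scikit-learn); the member models and their settings are placeholder choices for illustration, not the models used in this paper.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Three independent members address the same task in different ways;
# a "hard" vote returns the class predicted by the majority of members.
majority_ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),   # interpretable member
        ("svm", SVC(kernel="rbf")),                      # margin-based member
        ("logit", LogisticRegression(max_iter=1000)),    # baseline member
    ],
    voting="hard",
)
# Usage: majority_ensemble.fit(X_train, y_train); majority_ensemble.predict(X_test)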

One situation in which the limiting factor is not noise or difficulty is where the data set includes many irrelevant attributes. As the number of irrelevant attributes increases, the ensemble approach does increasingly better than a single model. However, beyond some point, adding irrelevant attributes reduces the accuracy of the model. Thus, when building a predictive model for a data set that has a large number of variables, the tradeoff between the error correlations among pairs of the included models and the number of variables in those models is what would most likely determine the amount of error reduction achieved through ensembling. On the other hand, the degree to which the error patterns of models are correlated is related to the ability of those models to provide accurate predictions without prior domain knowledge, and to how they deal with noisy labels. Reducing the amount of error is also a function of the unique characteristics of each included model. An additional complication is the degree of class imbalance in the data set. When the number of instances in the minority class is significantly smaller than the number of instances in the majority class, i.e., the data set is imbalanced [32], the error of the default classifier increases, due to a skewed distribution of the data or a lack of information. Common approaches for dealing with imbalanced data involve modifications either to the data distribution or to the classifier itself [24], [25], as illustrated in the sketch below. Using any of those techniques to build an ensemble classifier should also take into account its influence on the predictive power of each included model, in terms of the total error reduction.
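As a concrete illustration of the two common remedies just mentioned, the sketch below shows (1) modifying the data distribution by randomly oversampling the minority class, and (2) modifying the classifier through cost-sensitive class weights; the variable names and the specific weighting scheme are illustrative assumptions, not choices taken from this study.

import numpy as np
from sklearn.svm import SVC

def random_oversample(X, y, minority_label=1, seed=0):
    # Duplicate minority-class rows at random until both classes are the same size.
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    n_extra = max(len(majority) - len(minority), 0)
    extra = rng.choice(minority, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Alternative: leave the data unchanged and penalize errors on the minority
# class more heavily inside the classifier itself.
cost_sensitive_svm = SVC(kernel="rbf", class_weight="balanced")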

In this paper, we suggest combining both approaches. The base classifier, a boosted C5.0 decision tree, is able to handle high-dimensional data, and its representation of the knowledge induced from the data is highly intuitive. By using a boosting algorithm, we obtain an ensemble classifier that exhibits less classification error than would a single C5.0 classifier. C5.0 associates, with each prediction, a confidence level that quantifies its trustworthiness. Based on the dataset characteristics, we define a confidence threshold, such that records for which the C5.0 tree reports insufficient confidence are further analyzed by an SVM model. The SVM is characterized by strong power in evaluating information in the case of non-regularity in the data, and is able to provide good out-of-sample generalization. Based on the considerations above, we suggest an optimization approach for the mixed-ensemble model, which takes into account the unique characteristics of the learned dataset.
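The routing logic described above can be sketched as follows. Since C5.0 is not available in scikit-learn, a boosted decision-tree classifier stands in for the boosted C5.0 base model, and the confidence threshold of 0.8 is a hypothetical value rather than one derived by the optimization approach proposed in this paper.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

class MixedEnsemble:
    def __init__(self, confidence_threshold=0.8):
        self.threshold = confidence_threshold
        self.base = AdaBoostClassifier(n_estimators=5)                # stand-in for the boosted C5.0 (five trees)
        self.secondary = SVC(kernel="rbf", class_weight="balanced")   # secondary classifier

    def fit(self, X, y):
        self.base.fit(X, y)
        self.secondary.fit(X, y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        proba = self.base.predict_proba(X)
        pred = self.base.classes_[np.argmax(proba, axis=1)]
        confidence = proba.max(axis=1)
        # Records the base model is not sufficiently confident about are re-scored by the SVM.
        low_conf = confidence < self.threshold
        if low_conf.any():
            pred[low_conf] = self.secondary.predict(X[low_conf])
        return pred

In practice, the threshold would be tuned on validation data, which is the role played by the optimization approach described later in the paper.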

Section snippets

Data

We were granted access to the anonymized administrative records of 20,321 inpatient admissions, of 4840 patients, to Veterans Health Administration (VHA) hospitals in Pittsburgh, from fiscal years (FY) 2006 through 2014. All patients in that data set had been diagnosed with CHF during this time period, although the admissions were for all causes. Each admission record is considered as a unit of analysis. The number of admissions for each patient ranged between 1 and 32. The time elapsed between

Group characteristics and predictors’ importance

We found that, in both models, historical variables, such as the number of admissions before the current index admission, play an important role. Significant predictors of readmissions risk include: two laboratory values, the readings for albumin and for white blood cell (WBC) count; comorbidities, such as anemia and COPD; and the source of admission. A history of recent prior admissions also predisposes a patient to readmission. Of the records having a C5.0 confidence that is lower than Tc5

Discussion

This study presents a mixed-ensemble model for estimating the probability that a hospitalized patient will be readmitted within 30 days following discharge, either to the same or to a different hospital. The mixed-ensemble model combines (1) a boosted C5.0 model (five trees) as the base ensemble classifier, which enables exploratory knowledge discovery about the learned readmission database; and (2) a support vector machine (SVM) as a secondary classifier, which allows control of the

Conclusions

We developed a new, dynamic mixed-ensemble model for predicting hospital readmission. To the best of our knowledge, our model is the first readmission model that deals with the potential conflict between predictive accuracy and reasoning transparency. Our results indicate that a cautious optimization of the model structure would support an effective communication of the reasoning underlying its prediction, as well as controlled-sensitivity classification of the minority class. Desirable

Contributors

We thank the U.S. Department of Veterans Affairs for providing financial support for this research, through master contract numbers VA244-13-C-0581 and VA240-14-d-0038 with the University of Pittsburgh. This work is an outcome of a continuing partnership between the Katz Graduate School of Business and the Pittsburgh Veterans Engineering Resource Center (VERC). Inpatient admissions data were pulled from the VA corporate data warehouse by Dr. Youxu C. Tjader. The categorization at Step 6 of the

References (51)

  • D.H. Wolpert, Stacked generalization, Neural Netw (1992)

  • P.K. Chan et al., A comparative evaluation of voting and meta-learning on partitioned data, ICML (1995)

  • M. Gal-Or et al., Assessing the predictive accuracy of diversity measures with domain-dependent, asymmetric misclassification costs, Inf Fusion (2005)

  • Y. Sun et al., Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit (2007)

  • K. Lau et al., Leave one support vector out cross validation for fast estimation of generalization errors, Pattern Recognit (2004)

  • I. Aydin et al., A multi-objective artificial immune algorithm for parameter optimization in support vector machine, Appl Soft Comput (2011)

  • B. Zheng et al., Predictive modeling of hospital readmissions using metaheuristics and data mining, Expert Syst Appl (2015)

  • J.H. Min et al., Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters, Expert Syst Appl (2005)

  • T. Islam et al., Hospital readmission among older adults with congestive heart failure, Aust Health Rev (2013)

  • H.M. Krumholz et al., Readmission after hospitalization for congestive heart failure among Medicare beneficiaries, Arch Intern Med (1997)

  • J. Vinson et al., Early readmission of elderly patients with congestive heart failure, J Am Geriatr Soc (1990)

  • K. Muus et al., Effect of post-discharge follow-up care on re-admissions among US veterans with congestive heart failure: a rural-urban comparison, Rural Remote Health (2010)

  • M.D. Silverstein et al., Risk factors for 30-day hospital readmission in patients ≥ 65 years of age, Proc (Baylor University Medical Center) (2008)

  • G. Shmueli et al., Predictive analytics in information systems research, Robert H. Smith School Research Paper No. RHS (2010)

  • I. Bardhan et al., A predictive model for readmission of patients with congestive heart failure: a multi-hospital perspective