Leveraging deep survival models to predict quality of care risk in diverse hospital readmissions

Hospital readmission rates are reportedly high and have caused a huge financial burden on health care systems in many countries. The readmission rate is viewed as an important indicator of health care providers’ quality of care. We examine the use of machine learning-based survival analysis to assess quality of care risk in hospital readmissions. This study applies various survival models to explore the risk of hospital readmissions given patient demographics and their respective hospital discharges extracted from a health care claims dataset. We explore advanced feature representation techniques such as BioBERT and Node2Vec to encode high-dimensional diagnosis code features. To our knowledge, this study is the first to apply deep-learning based survival-analysis models for predicting hospital readmission risk agnostic of specific medical conditions and a fixed window for readmission. We found that modeling the time from discharge date to readmission date as a Weibull distribution, as in the SparseDeepWeiSurv model, yields the best discriminative power and calibration. In addition, embedding representations of the diagnosis codes do not contribute to improvement in model performance. We find that each model’s performance depends on the time point at which it is evaluated. This time dependency of the models’ performance on the health care claims data may necessitate a different choice of model for quality of care issue detection at different points in time. We show the effectiveness of deep-learning based survival-analysis models in estimating the quality of care risk in hospital readmissions.

www.nature.com/scientificreports/

…of-life care planning. In the US, the Centers for Medicare and Medicaid Services (CMS) established penalties for hospitals with high 30-day readmission rates by reducing the payment for readmitted patients 10 . In 2019, under the penalties program of CMS's Hospital Readmission Reduction Program, 82% of hospitals were penalized for having excess readmissions 11 . CMS includes the following six medical conditions to evaluate unplanned readmissions in the program:
• Acute myocardial infarction (AMI).
• Chronic obstructive pulmonary disease (COPD).
• Heart failure (HF).
• Pneumonia.
• Coronary artery bypass graft (CABG) surgery.
• Elective primary total hip arthroplasty and/or total knee arthroplasty (THA/TKA).
Besides the US, the United Kingdom (UK), Denmark, and Germany have also introduced policies, financial or non-financial, to monitor hospital readmission rates 12 .

Objective. Since early hospital readmissions have been established as a measure of the quality of care of medical services, our goal is to understand the risk of hospital readmissions given the information related to patients and their respective hospital discharges in Medicare/Medicaid claims data. Most previous studies have focused on predicting hospital readmission risk for comparisons among hospitals or for facilitating targeted interventions during or after hospital discharge 13 . These studies aim to predict the probability that a patient is readmitted within a specific time frame (usually 30 or 90 days), often using simple rule-based models such as the LACE index 14 or the HOSPITAL score 15 . A literature review by Ref. 16 reveals that 52 out of 76 studies use logistic regression to predict the likelihood of hospital readmission. Other methods explored in prior research include support vector machines 17-21 , decision tree-based techniques 17,22 , Bayesian methods 22 , and ensemble methods (e.g., boosting, bagging, and random forest 4,17,18,20-24 ). The majority of these studies structure hospital readmission risk prediction as a binary classification problem: whether a hospital discharge results in readmission within a certain number of days.
Another line of research learns a distribution of hospital readmission risk over time since an initial hospital discharge. For any time after discharge, these models predict the probability of hospital readmission occurring at or before that time using survival analysis (or time-to-event analysis). The most commonly used survival model is the Cox Proportional Hazards model (Cox PH) 1,18,21,25-27 . A few studies have also implemented the Random Survival Forest model, which decorrelates individual trees in a tree-based ensemble 27,28 . More recently, studies have shown that neural networks can improve the performance of traditional survival analysis models 29-35 . For example, DeepWeiSurv 30,31 uses a multi-task learning neural network on the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset (a UK-Canada project that classifies breast tumors into subcategories) and the Surveillance, Epidemiology, and End Results (SEER) dataset (which provides information on cancer statistics) to show that neural network based survival models outperform traditional survival models such as Cox PH. Similar to DeepWeiSurv, another fully parametric approach is Deep Survival Machines (DSM) 33 . DSM does not require a constant proportional hazards assumption on the underlying survival distribution for time-to-event prediction. In contrast to DeepWeiSurv, which learns the Weibull parameters and mixture coefficients from multi-layer perceptrons following the latent representations, DSM learns these parameters directly from the latent representation.
Significance. To our knowledge, our study is the first to apply neural network-based survival-analysis models to predict hospital readmission risk from health care claims data, agnostic of specific medical conditions and a fixed window for readmission. There are several benefits to taking this approach in the context of quality of care. First, it allows us to identify quality of care issues for patients with any medical condition. This is especially important for claims data, where patient populations are not segregated based on their diagnosis. Second, it gives us the probability of readmission within any time frame, making readmissions after the traditional 30-day or 90-day time frames also eligible for inspection for potential quality of care issues. While 30-day or 90-day time frames may be critical for policy compliance, these arbitrary time frames are not amenable to the diversity of medical conditions and corresponding discharge/readmission times we consider in our study.

Materials and methods
A survival analysis framework is adopted in this study, where a distribution over time to an event from a particular starting point is estimated. In our case, this time-to-event is the time elapsed between a hospital discharge and subsequent readmission for similar medical conditions. In survival analysis, censored data typically needs to be handled. Censoring happens when a study subject is not being monitored or observed at a particular point in time (the censoring time), and the occurrence of an event after that time is unknown. The two primary reasons for a data point to be censored in survival analysis are (1) a subject withdraws from the study, so their information beyond the withdrawal time is unavailable, and (2) after a pre-specified cut-off time a subject is no longer monitored and hence survival data is not collected. In our study, censoring happens primarily due to the latter reason, as we do not consider hospital readmissions after T = 1095 days (3 years). While there may be other possible reasons for censoring, such as a patient changing their health insurance program and no longer being trackable, or a patient expiring at home, such events cannot be observed and hence are not considered in our study.
Problem statement. In this section we formalize how we apply survival analysis to our data:
• A covariate matrix X ∈ R^(N×d) that represents d-dimensional feature vectors of N hospital discharges; x_n = X[n, :] is the feature vector for the n-th discharge in the dataset.
• The time elapsed t_n ∈ R since the n-th discharge to either a readmission or a censoring time.
• A censoring indicator δ_n ∈ {0, 1} that indicates whether a readmission occurs at time t_n after the n-th discharge (δ_n = 1) or the observation is censored at t_n (δ_n = 0).
We also denote by T_n the actual time of readmission following the n-th discharge (T_n ≡ t_n if δ_n = 1).

The goal is to estimate the distribution

f(t | x_n) ∼ Pr(T_n = t | x_n, t_n, δ_n). (1)

Most survival models do not learn f(t) directly. For example, the Cox PH model and its extensions (introduced in the Models section) learn the hazard function

h(t | x_n) = f(t | x_n) / S(t | x_n),

where S(t) is the survival function

S(t | x_n) = Pr(T_n > t | x_n).

The hazard function is the instantaneous rate of occurrence of the event at a particular time point, and we can derive the desired density function from it.

Study data. We conducted this study on 222,175 redacted and anonymized inpatient medical claims from state Medicare programs. The dataset was redacted and anonymized following the Safe Harbor method, Section 164.514(b)(2) of the HIPAA Privacy Rule.
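The relation h(t) = f(t) / S(t) above can be checked numerically. The sketch below (not from the paper; the shape and scale values are arbitrary) does so for a Weibull time-to-readmission distribution, the parametric family used by several of the models in this study:

```python
import math

# Weibull density, survival, and hazard for shape k and scale lam.
def weibull_pdf(t, k, lam):
    return (k / lam) * (t / lam) ** (k - 1) * math.exp(-((t / lam) ** k))

def weibull_survival(t, k, lam):
    return math.exp(-((t / lam) ** k))

def hazard(t, k, lam):
    # h(t) = f(t) / S(t), as in the text.
    return weibull_pdf(t, k, lam) / weibull_survival(t, k, lam)

# The Weibull hazard also has the closed form (k/lam) * (t/lam)**(k-1);
# both routes agree, illustrating that f(t) can be recovered from h(t).
t, k, lam = 30.0, 1.5, 60.0
closed_form = (k / lam) * (t / lam) ** (k - 1)
assert abs(hazard(t, k, lam) - closed_form) < 1e-12
```
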
Data overview. A claim submitted to a Medicare program typically includes the following information: • Claim number a distinct identifier of a claim.
• Diagnosis codes encoded using the International Classification of Diseases (ICD-10) 39 , a standardized system used to encode clinical terms. A claim contains at least a primary diagnosis code, and optionally secondary and tertiary diagnosis codes. An ICD-10 code consists of up to 7 characters that distinctly identify a medical condition. The first three characters represent the general diagnosis, and the other characters represent more specific categories. Examples of a hierarchical break-down for general diagnosis codes I05 and I06 are shown in Table 1.
When considering readmissions, we only analyze the first three characters of the ICD-10 codes, which represent the general category of the diagnoses. This helps us generalize our model by considering related diagnoses for which a patient may be readmitted.
• Procedure codes represented by Current Procedural Terminology (CPT) codes 40 , which encode procedures performed by health care providers.
• Provider ID a unique identifier that represents the health care provider that submitted the claim, as used in the National Provider Identifier (NPI) 41 registry.
• Patient demographics patient sex and age.
• Admittance date and discharge date the dates when a patient is admitted and discharged.
• Billed amount the total amount billed for services rendered by the health care provider.
Data pre-processing and representation. From a dataset of over 8 million claims, we filter for inpatient claims, where a patient is admitted to a hospital, and for claims that have a positive paid amount. We construct two subsets: readmission claims and non-readmission claims. The readmission subset includes initial admissions and the subsequent readmissions of the same patient with the same general diagnosis codes. The non-readmission subset includes admissions without any subsequent readmissions.
To represent the dataset in a way that conforms to the structure of the survival data, we define a time-to-event and censoring indicator for each admission and readmission. Figure 1 illustrates how the data is represented. The time-to-event of each admission is the time (in days) from the discharge date of that admission to the admittance date of the next readmission. For admissions that are not followed by a readmission, the exact time-to-event is unknown, so we use the time from the discharge date to the last recorded date in the data (01/26/2020). These admissions are said to be censored.
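The construction of (time-to-event, censoring indicator) pairs described above can be sketched as follows. The claim dates are hypothetical, not from the study's data; only the last recorded date (01/26/2020) comes from the text:

```python
from datetime import date

# Last recorded date in the data, used as the censoring time (from the text).
LAST_RECORDED = date(2020, 1, 26)

def to_survival_records(chain):
    """chain: list of (admit_date, discharge_date) for one readmission chain,
    ordered in time. Returns (days_to_event, observed) per discharge."""
    records = []
    for i, (_, discharge) in enumerate(chain):
        if i + 1 < len(chain):
            # Followed by a readmission: time from discharge to next admission.
            next_admit = chain[i + 1][0]
            records.append(((next_admit - discharge).days, True))
        else:
            # Last discharge in the chain: censored at the last recorded date.
            records.append(((LAST_RECORDED - discharge).days, False))
    return records

# Hypothetical two-claim chain: readmitted 15 days after the first discharge,
# then censored 366 days after the second discharge.
chain = [(date(2019, 1, 1), date(2019, 1, 5)),
         (date(2019, 1, 20), date(2019, 1, 25))]
assert to_survival_records(chain) == [(15, True), (366, False)]
```

A chain with a single admission (the non-readmission subset) yields one censored record, matching the rule that all claims in that subset are censored.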
Based on patient ID numbers, we split the 222,175 total claims into training, validation, and test sets. Because of the severe class imbalance (readmission cases comprise only 13% of the entire dataset), we downsample the training subset by first sampling one claim per patient in the non-readmitted set and then subsampling from these claims so that the non-readmitted and readmitted datasets are equal in size. Table 2 shows the observed vs. censored ratio in the training, validation, and test datasets. Supplementary Appendix A shows the dimension of each feature and the corresponding summary statistics (for applicable variables) for n = 222,175 claims.
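A minimal sketch of this balancing step (patient and claim IDs are made up; the study does not specify how the one claim per patient is chosen, so this sketch keeps the first and then shuffles):

```python
import random

def balance_training_set(readmitted, non_readmitted, seed=0):
    """readmitted / non_readmitted: lists of (patient_id, claim_id).
    Returns a training pool with equal observed and censored counts."""
    rng = random.Random(seed)
    # Step 1: keep one claim per patient from the non-readmitted pool.
    one_per_patient = {}
    for patient_id, claim_id in non_readmitted:
        one_per_patient.setdefault(patient_id, claim_id)
    # Step 2: subsample so both pools are the same size.
    pool = list(one_per_patient.items())
    rng.shuffle(pool)
    return readmitted + pool[:len(readmitted)]

readmitted = [("p1", "c1"), ("p2", "c2")]
non_readmitted = [("p3", "c3"), ("p3", "c4"), ("p4", "c5"), ("p5", "c6")]
balanced = balance_training_set(readmitted, non_readmitted)
assert len(balanced) == 4  # 2 observed + 2 censored
```
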
Feature engineering. We use the following features in our survival analysis.
• Patient age age bucketed into five categories: 0-17 years, 18-38 years, 39-59 years, 60-80 years, and greater than 80 years. We one-hot encode this feature.
• Patient sex patient sex is binarized into 0 (female) and 1 (male). No non-binary sex is present in the data.
• Specialty code this code represents the specialty 42 of the respective health care provider and is one-hot encoded. An empty code is represented as the 'UNK' (unknown) category.
• Length of hospital stay the difference in days between the discharge date and the admittance date. Claims with zero length of stay, along with claims in the same readmission chain as these claims, are removed.
• Diagnosis code each claim number has at least one and at most three diagnosis codes. The primary code is always present. We only consider the first three characters of the codes, which give the general category of the diagnosis. For each claim number (a data point), we collect all the general diagnosis codes and multi-hot encode them. Codes that appear fewer than 100 times in the entire dataset are mapped into an Other category.

[Figure 1 caption, partially recovered: the last discharge in each chain is censored at the last recorded date (01/26/20); its time-to-event is interpreted as at least the observed number of days, but when (or whether) a readmission follows is unknown. Claims A0001-A0003 are in the readmission subset; B0001 is in the non-readmission subset. All claims in the non-readmission subset are censored, as is the last claim in each readmission chain (e.g., claim A0003); the other claims are observed.]

Besides the multi-hot encoding approach for feature representation, we experimented with word embedding and graph embedding models for representing the diagnosis codes. For word embeddings, we use the embeddings (dimension: 200) obtained from the BioWordVec model 43 trained on the string descriptions of the diagnosis codes. For graph embeddings (dimension: 256), we use the Node2Vec model 44 trained on five million claims from the same dataset used in this study, with the proxy task of link prediction (predicting whether or not a link exists between a pair of nodes).
While other features from the claims data could be viewed as relevant to readmission modeling, we focus, as in previous studies 45,46 , on patient demographics and diagnoses as key indicators in estimating the likelihood of readmission. All the features listed above are concatenated into a vector representing the features of the respective claim.
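The multi-hot encoding of general (3-character) diagnosis codes described above can be sketched as follows; the vocabulary here is made up, and in the study it would contain every general code seen at least 100 times plus the Other bucket:

```python
# Hypothetical vocabulary of general ICD-10 codes plus the catch-all bucket.
VOCAB = ["I05", "I06", "J18", "OTHER"]
INDEX = {code: i for i, code in enumerate(VOCAB)}

def multi_hot(icd10_codes):
    """Map a claim's full ICD-10 codes to a multi-hot vector over VOCAB."""
    vec = [0] * len(VOCAB)
    for code in icd10_codes:
        general = code[:3]  # first three characters = general category
        vec[INDEX.get(general, INDEX["OTHER"])] = 1
    return vec

# Primary, secondary, and tertiary codes; Z999 is out-of-vocabulary -> OTHER.
assert multi_hot(["I050", "I069", "Z999"]) == [1, 1, 0, 1]
```
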

Models.
We conduct a series of experiments to evaluate five survival analysis models (detailed in the next section) on our data.
Cox proportional hazards (Cox PH). Cox PH is one of the most common regression and baseline models in survival analysis 1,21,25-27,47 . Cox PH assumes a linear model of the hazard function:

h(t | x_n) = h_0(t) exp{β · x_n}.

Notice that h_0(t), the baseline hazard, depends on time but does not depend on the covariates x_n. It describes the risk of readmission when x_n = 0, and the exact risk level of each discharge is scaled by the exponential term that depends on x_n (the proportional hazards assumption). For this reason, the value exp{β · x_n} (also known as the hazard ratio) characterizes an individual's relative risk level compared to other individuals. Cox PH can be viewed as a regression model that estimates β to maximize the partial likelihood of the data:

L(β) = ∏_{n: δ_n = 1} exp{β · x_n} / Σ_{m: t_m ≥ t_n} exp{β · x_m}.

C-mix model. Besides Cox PH, we also include the C-mix model in our experiments; it is reported to outperform the other survival models evaluated in Ref. 21 , a study on predictive models for hospital readmissions following vaso-occlusive crisis (VOC). C-mix was originally designed to identify subgroups in the data with varying risk levels. It models the density of time-to-event as a mixture of Weibull distributions. In the case of a two-component mixture, the density is

f(t | x_n) = π_0(x_n; β_0) f_0(t; α_0) + π_1(x_n; β_1) f_1(t; α_1), (7)

where π_0(·) and π_1(·) are the weights of the components, f_0 and f_1 are two Weibull distributions parameterized by the vectors α_0 and α_1, respectively, and π_0(x_n; β_0) + π_1(x_n; β_1) = 1. The two components f_0 and f_1 in the mixture can be viewed as representing two subgroups: the high-risk and the low-risk groups, respectively. Then, π_0(x_n; β_0) can be interpreted as the risk level. The parameters are estimated by minimizing the negative log likelihood of the data.
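The Cox partial likelihood above can be evaluated numerically. A minimal sketch on synthetic data (Breslow-style handling of ties assumed; not the paper's implementation, which uses the lifelines library):

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, times, events):
    """Negative log partial likelihood of Cox PH; minimized when fitting beta."""
    order = np.argsort(-times)  # descending time
    X, times, events = X[order], times[order], events[order]
    log_risk = X @ beta
    # Cumulative log-sum-exp gives log of the risk-set sum {m : t_m >= t_n}.
    risk_cumsum = np.logaddexp.accumulate(log_risk)
    # Sum only over observed events (delta_n = 1).
    return -np.sum((log_risk - risk_cumsum)[events.astype(bool)])

X = np.array([[0.1], [1.2], [0.7]])
times = np.array([5.0, 2.0, 9.0])
events = np.array([1, 1, 0])
nll = neg_log_partial_likelihood(np.array([0.5]), X, times, events)
assert nll > 0
```

With beta = 0 every subject has the same risk, so the partial likelihood reduces to a product over risk-set sizes, a handy sanity check when implementing this from scratch.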
DeepSurv. DeepSurv 29 is an extension of Cox PH that uses a neural network to model the hazard function:

h(t | x_n) = h_0(t) exp{h_θ(x_n)},

where h_θ(·) is a multilayer perceptron (MLP) with weights θ. Notice that this is almost identical to the Cox PH hazard function, except that the relationship with the covariates is modeled by an MLP instead of a linear function. DeepSurv is optimized by minimizing the negative log partial likelihood with a regularization term weighted by λ, where λ is the regularization strength.
Sparse DeepWeiSurv. DeepWeiSurv 30 models the density over time to readmission as a mixture f_W of K Weibull distributions:

f(t | x_n, θ_n) = Σ_{k=1}^{K} α_k(x_n) f_{β_k(x_n), η_k(x_n)}(t), (11)

where α_k is the weight, and β_k and η_k are the shape and scale parameters of the k-th Weibull component in the mixture (note that these parameters depend on the covariates x_n). The goal of DeepWeiSurv is to learn α ∈ R^K, β ∈ R^K, and η ∈ R^K from each x_n. DeepWeiSurv adopts a multi-task learning approach: a common sublayer f_DWS, an MLP, learns a representation of x_n, and two task-specific MLPs then map this representation to the mixture weights and to the Weibull parameters, respectively. SparseDeepWeiSurv 31 extends DeepWeiSurv by incorporating a sparsing layer to learn the number of components in the mixture. The model is optimized by minimizing the negative log likelihood. SparseDeepWeiSurv outperforms DeepWeiSurv across five real-world datasets.
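Equation (11) can be sketched with fixed numbers. In the full model the weights α_k and the Weibull shape β_k and scale η_k are produced by neural networks from x_n; here they are arbitrary constants chosen only to illustrate the mixture density:

```python
import math

def weibull_pdf(t, shape, scale):
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))

def mixture_pdf(t, alphas, shapes, scales):
    """Mixture of Weibull densities: sum_k alpha_k * f_{beta_k, eta_k}(t)."""
    return sum(a * weibull_pdf(t, b, e) for a, b, e in zip(alphas, shapes, scales))

# Two hypothetical components: an early-readmission mode and a late one.
alphas = [0.6, 0.4]  # mixture weights; must sum to 1
density = mixture_pdf(30.0, alphas, shapes=[1.2, 2.0], scales=[45.0, 200.0])
assert density > 0
```

A one-component mixture with weight 1 collapses to a single Weibull density, which is an easy correctness check for an implementation.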
Deep cox mixture (DCM). DCM 34 fuses Cox PH and DeepSurv to obtain a deep learning model that learns a mixture of Cox PH models for individual time-to-event distributions. It assumes there are latent groups within each of which the proportional hazards assumption holds. In each Cox group of the mixture, DCM fits the hazard ratios using deep neural networks and fits the baseline hazard of each mixture component non-parametrically. It is reported to have state-of-the-art performance on time-to-event regression tasks on survival data on mortality (e.g., METABRIC, SEER 30 ).

Evaluation metrics.
As pointed out by Ref. 34 , most studies evaluate survival models using the relative ranking of the risk-level predictions, such as the concordance index (C-index). However, these metrics disregard the absolute values of the probability predictions, while these probabilities are directly used when detecting quality of care issues. Metrics that depend only on the ranking of data points by risk level measure a model's discriminative power, while those factoring in the actual predicted probabilities of readmission measure calibration. Following Ref. 34 , we assess the models on both aspects with the following four metrics. All metrics are time-dependent. In this study, we evaluate the metrics at the time points corresponding to the 25th, 50th, and 75th percentiles of event times in our dataset.

Time-dependent concordance index (discrimination metric).
The C-index measures the proportion of all eligible pairs of observations that are correctly ranked in terms of risk. The time-dependent concordance index restricts these comparisons to instances that occur within a certain time frame.
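A hedged sketch of this time-dependent concordance check on synthetic data (one simple variant; published estimators additionally correct for censoring): among pairs (i, j) where i has an observed event before j and before the horizon t0, count how often the model assigns i the higher predicted risk at t0.

```python
import numpy as np

def time_dependent_c_index(risk_at_t0, times, events, t0):
    """Fraction of comparable pairs ranked correctly by risk at horizon t0."""
    concordant, comparable = 0, 0
    n = len(times)
    for i in range(n):
        # i must have an observed event no later than t0.
        if not events[i] or times[i] > t0:
            continue
        for j in range(n):
            if times[j] > times[i]:  # j survived longer than i
                comparable += 1
                concordant += risk_at_t0[i] > risk_at_t0[j]
    return concordant / comparable

risk = np.array([0.9, 0.2, 0.6])
times = np.array([10.0, 200.0, 50.0])
events = np.array([1, 0, 1])
# All three comparable pairs are ranked correctly here.
assert time_dependent_c_index(risk, times, events, t0=100.0) == 1.0
```
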
Area under the receiver operating characteristic curve (AUC) (discrimination metric). At any point in time t_0, we can retrieve a binary label for any data point that indicates whether the readmission has happened by that time. Using the predicted readmission probability R(t_0 | x_i) to score an example i, we can compute the AUC as in a typical binary classification problem.

Expected calibration error (ECE) (calibration metric). ECE is the average absolute difference between the observed and the predicted readmission rate, given the predicted readmission rate. Let the predicted readmission rate at time t_0 be R(t_0 | x_i) = P(t_i < t_0 | x_i); then

ECE(t_0) = E_r [ | P(t_i < t_0 | R(t_0 | x_i) = r) − r | ].

We can estimate ECE(t_0) by bucketing R(t_0).
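The bucketed estimate of ECE(t_0) can be sketched as follows (synthetic predictions; the number of buckets is a free choice, not specified in the text):

```python
import numpy as np

def ece(predicted, observed, n_buckets=2):
    """Bucket predicted rates R(t0|x_i), compare each bucket's mean prediction
    with its observed readmission fraction, and average with bucket weights."""
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    bucket = np.clip(np.digitize(predicted, edges) - 1, 0, n_buckets - 1)
    err, total = 0.0, len(predicted)
    for b in range(n_buckets):
        mask = bucket == b
        if mask.any():
            err += mask.sum() / total * abs(predicted[mask].mean() - observed[mask].mean())
    return err

predicted = np.array([0.1, 0.2, 0.8, 0.9])
observed = np.array([0.0, 0.0, 1.0, 1.0])  # readmitted by t0?
# Buckets [0, 0.5) and [0.5, 1]: errors |0.15 - 0| and |0.85 - 1|, each weight 0.5.
assert abs(ece(predicted, observed) - 0.15) < 1e-12
```
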
Brier score (dual metric). Brier score computes the mean squared error that quantifies the deviation of the predicted readmission rate within a time frame from the censoring indicator.
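An uncensored sketch of this metric on synthetic data (the study's censored setting would additionally weight terms, e.g. by inverse probability of censoring, which is omitted here):

```python
import numpy as np

def brier_score(pred_readmit_by_t0, happened_by_t0):
    """Mean squared difference between the predicted readmission probability
    by horizon t0 and the binary outcome by t0."""
    return float(np.mean((pred_readmit_by_t0 - happened_by_t0) ** 2))

pred = np.array([0.9, 0.1, 0.5])
outcome = np.array([1.0, 0.0, 1.0])
assert abs(brier_score(pred, outcome) - (0.01 + 0.01 + 0.25) / 3) < 1e-12
```
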
For model tuning and validation, we use a vanilla (non-time-dependent) version of the concordance index, which is traditionally used to evaluate survival analysis models. It is computed over all comparable pairs (i, j) with t_i < t_j and δ_i = 1 as the fraction of pairs for which the model assigns a higher risk to i than to j.

Results. For each of the five models, following the approach in Ref. 34 , we compute the four evaluation metrics at three time-quantiles: the 25th, 50th, and 75th. These quantiles correspond to readmission time frames of 17 days, 49 days, and 123 days, respectively. The metrics are reported in Table 3 for the test dataset. For a short window (e.g., the 25th percentile), the DCM model has the highest discriminative power, with an AUC of 0.822 and a C-index of 0.817, closely matched by SparseDeepWeiSurv, with an AUC of 0.821 and a C-index of 0.815. SparseDeepWeiSurv, on the other hand, is the best calibrated, with the lowest Brier score and ECE. Notably, its ECE of 0.007 is 79% lower than the second-lowest ECE of 0.034, achieved by DCM.
For larger time windows (i.e., the 50th and 75th percentiles), SparseDeepWeiSurv outperforms the other models in both calibration and discrimination, with the best performance across all four metrics. For these larger time windows, Cox PH closely matches SparseDeepWeiSurv on discrimination metrics (e.g., for the 50th percentile window, both Cox PH and SparseDeepWeiSurv have a C-index of 0.827). With respect to ECE, as with the smaller percentile window, the gap between SparseDeepWeiSurv and the next-best model is large (its ECE is 80% and 69% lower than the second-lowest ECE in the 50th and 75th percentile windows, respectively). DCM has lower discriminative power but better calibration compared to Cox PH. The C-mix and DeepSurv models consistently have the lowest performance across almost all metrics and percentiles, with the exception of C-mix's ECE of 0.060 at the 50th percentile, where it achieves the second-best calibration score.
As discussed in the "Feature engineering" section, we also experimented with word and graph embedding based feature representations of the diagnosis codes. The results of using these embeddings are reported in Tables 4 and 5.
Using word embeddings of the diagnosis codes, we observe a minor improvement in the best calibration score, ECE. For example, the best ECE at the 25th percentile when using multi-hot encoded diagnosis codes is achieved by SparseDeepWeiSurv (0.007), and the corresponding figure for word embeddings is 0.004. However, at the 75th percentile, ECE is worse than in the multi-hot encoding experiments (e.g., for SparseDeepWeiSurv, it worsens from 0.040 to 0.043). Using graph embeddings, we observe a decrease in performance across all metrics and models. Overall, we do not see significant improvement in model performance when incorporating advanced embedding techniques for health care diagnosis codes.
Finally, we conduct an experiment with a 30-day readmission time frame. This time frame is commonly used in the existing literature on hospital readmission analysis and makes our findings comparable to other studies. Since word and graph embeddings of diagnosis codes do not improve model performance, we conduct this experiment with multi-hot encoding of diagnosis codes. Results in Table 6 show that the DCM model has the highest discriminative power, with AUC and C-index scores of 0.836 and 0.831, respectively. SparseDeepWeiSurv is the best calibrated model, with the lowest Brier score and ECE of 0.029 and 0.009, respectively.

Table 3. AUC, C-index, Brier score, and ECE computed at the 25th, 50th and 75th percentiles for the five models on the test set. The diagnosis codes are multi-hot encoded. The best performing value for each evaluation metric is highlighted in bold.

Table 5. AUC, C-index, Brier score, and ECE computed at the 25th, 50th and 75th percentiles for the five models on the test set. The diagnosis codes are embedded using graph embedding. The best performing value for each evaluation metric is highlighted in bold.

This time dependency of the models' performance may necessitate a different choice of model for quality of care issue detection at different points in time. For example, suppose we are evaluating a readmission that happens 100 days after its previous discharge; the choice of model used to evaluate this readmission should depend on which model has the best performance at t = 100 days. In our statistical tests for significance, the differences in performance between DCM and SparseDeepWeiSurv are non-significant at the 25th and 50th percentiles on the discriminative metrics. In addition, SparseDeepWeiSurv outperforms DCM and the other methods on calibration metrics. In our task of identifying unusually early readmissions, calibration plays a more important role than discrimination.
The downstream decision is based on the predicted likelihood of readmission at or before the date a patient of interest is readmitted, to determine whether the readmission falls outside the acceptable likelihood threshold. We also note that the simpler Cox PH has strong performance in terms of discriminative power, comparable with the much more complex SparseDeepWeiSurv model. Therefore, Cox PH may be favored in tasks that focus only on the ranking of patients' readmission risk levels.

DeepSurv, which removes the linearity assumption of Cox PH and uses a neural network to model the hazard function, consistently performs worse than Cox PH across discrimination and calibration metrics. Further investigation is needed to understand why it underperforms Cox PH on our data.
There are some promising directions that we would like to explore as next steps. First, while our current approach is based solely on claims data, in future we would like to explore complementary data sources such as electronic health records through which we may be able to enrich our feature set to include lab results, radiology reports, etc. Second, since our models are built to allow for inference on patients with any medical condition (not restricted to one or a small set of medical conditions), we would like to investigate to what extent this relaxation compromises the models' performance in either discriminative power or calibration. It is also important to know which of the four metrics are the most reliable and relevant for this problem of detecting quality of care issues through hospital readmission prediction.

Limitations.
Our study has several limitations. First, while we view quality of care issues through the lens of hospital readmissions (as do several prior studies; see the "Hospital readmission rate as an indicator of quality of care" section), there are studies which have not found a strong link between readmissions and quality of care. For example, in Ref. 14 the authors show that less than one-fifth of urgent readmissions were potentially avoidable based on physician reviews of patient files in a prospective study. In contrast to most prior studies (including Ref. 12 ), we focus on survival modeling to predict the likelihood of all-cause readmissions that may not be urgent (i.e., within 30 days), based on claims data. Second, the claims data we use in our experiments do not have an indicator of mortality, a confounding factor for survival analysis. The right-censoring of our data accounts for both mortality and the absence of readmission within 3 years following the initial admission. Third, our analysis is based on patients enrolled in the Medicare program (who are aged 65 years or over, younger people with disabilities, and people with End Stage Renal Disease), so our findings may not apply to readmission data from other demographics.

Conclusion
In this study, we frame the problem of identifying early readmissions following a discharge as a survival analysis problem, where we estimate the distribution over time to readmission after a discharge conditioned on the discharge's covariates. We evaluate five models on both discriminative power and calibration. We observe that Cox PH and SparseDeepWeiSurv, the top performing models, have comparable discriminative ability, but SparseDeepWeiSurv, which models the time to readmission as a mixture of Weibull distributions, is better calibrated. DeepSurv, which removes the linearity assumption of Cox PH and replaces it with a more complex relationship modeled by a neural network, performs worse than Cox PH. DCM, an extension of DeepSurv, also generally performs worse than DeepSurv on our dataset. We also find that representing the diagnosis codes with advanced embedding methods such as those from Node2Vec and BioBERT does not improve and, in some cases, worsens model performance.

Data availability
The raw datasets analysed during the current study are not publicly available in full due to licensing and contractual restrictions, but synthetic sample datasets are available from the corresponding author on reasonable request. The source dataset was redacted and anonymized by a team specialized in this process, following the Safe Harbor method, Section 164.514(b)(2) of the HIPAA Privacy Rule. Deloitte holds contracts with various Medicare and Medicaid agencies through which it has access to this data. We cannot advise on conditions under which other researchers can access similar datasets. The CMS provides beneficiary-level health information to researchers, which can be requested through https://www.cms.gov/research-statistics-data-and-systems/files-for-order/limiteddatasets. Implementations of the models we experimented with are the following: Cox PH https://lifelines.readthedocs.io/en/latest/index.html, C-mix https://github.com/SimonBussy/C-mix, DeepSurv https://github.com/jaredleekatzman/DeepSurv, DeepWeiSurv https://github.com/survml, DeepCoxMixture https://autonlab.org/auton-survival/.

Table 6. AUC, C-index, Brier score, and ECE computed for 30-day readmission for the five models on the test set. The diagnosis codes are multi-hot encoded. The best performing value for each evaluation metric is highlighted in bold.