Acknowledging the role of patient heterogeneity in hospital outcome reporting: Mortality after acute myocardial infarction in five European countries

Background. Hospital performance, when presented as a comparison of average measurements, overlooks the possibility that hospital outcomes may vary across types of patients. We aim to draw out the relevance of accounting for patient heterogeneity when reporting on hospital performance.
Methods. An observational study on administrative data from virtually all 2009 hospital admissions for Acute Myocardial Infarction (AMI) discharged in Denmark, Portugal, Slovenia, Spain, and Sweden. Hospital performance was proxied using in-hospital risk-adjusted mortality. Multilevel Regression Modelling (MLRM) was used to assess differences in hospital performance, comparing the estimates of random intercept modelling (capturing hospital general contextual effects, GCE) and random slope modelling (capturing hospital contextual effects for patients with and without congestive heart failure, CHF). The weighted Kappa Index (KI) was used to assess the agreement between performance estimates.
Results. We analysed 46,875 admissions for AMI, 6,314 with coexistent CHF, discharged from 107 hospitals. The overall in-hospital mortality rate was 5.2%, ranging from 4% in Sweden to 6.9% in Portugal. The MLRM with random slope outperformed the model with only a random intercept, highlighting a much higher GCE in CHF patients [VPC = 8.34 (CI95% 4.94 to 13.03) and MOR = 1.69 (CI95% 1.62 to 2.21)] than in patients without CHF [VPC = 3.9 (CI95% 2.4 to 5.9) and MOR = 1.42 (CI95% 1.31 to 1.54)]. No agreement was observed between estimates [KI = -0.02 (CI95% -0.08 to 0.04)].
Conclusions. The different GCE in AMI patients with and without CHF, along with the lack of agreement in estimates, suggests that accounting for patient heterogeneity is required to adequately characterize and report on hospital performance.


Introduction
The growing availability and use of administrative data are resulting in a profusion of healthcare performance assessment initiatives worldwide. Either institutionally framed or developed under the umbrella of research projects, the wealth of administrative data offers the opportunity to access larger samples of patients, covering virtually all providers in a health plan, allowing cross-country comparisons and most importantly, enabling the systematic and continuous monitoring of providers' performance. Many institutional-based [1-7] and research-oriented examples [8][9][10][11][12][13][14][15] illustrate this enormous potential. On the other hand, as performance assessment is increasingly deemed to be the basis for different value-based initiatives (e.g. benchmarking strategies, pay for performance schemes, patient choice programs, etc), decision makers are increasingly calling for trustworthy measurements and reliable reporting [16].
In this respect, analytical methods play a critical role. Ordinary (single-level) regression models have been shown to be inappropriate, as they ignore the interdependence of patient outcomes within a hospital (i.e. patient risk within a hospital is more alike than patient risk from a different hospital) [9,12,15-19] and are at risk of the Yule-Simpson paradox [20]. As a result, marginal models (Generalized Estimating Equations, GEE) and multilevel regression modelling (MLRM) have become increasingly popular, although their approach and interpretation clearly differ: while GEE focuses on estimating the population-averaged risk of death, adjusting for heterogeneity across hospitals, MLRM assumes that each hospital has its own underlying risk of an event, and that this risk varies across hospitals (i.e. the probability of an event is conditional on the place where the patient is treated). Accordingly, MLRM has been suggested as a more appropriate approach when hospital-specific interpretations are needed [21].
But most importantly, variations in hospital performance are usually presented as the comparison of adjusted average measures, excluding the possibility that hospital performance may also be conditioned by patient heterogeneity, for example, determining the care responses to specific subgroups of patients [22]. One fundamental feature of MLRM in hospital performance assessment is that MLRM can drop the assumption that the underlying risk for an individual is the same for all hospitals, allowing this risk to vary at hospital level; therefore, the hospital effect also becomes a function of patient heterogeneity [23]. In practical terms, this property, which implies the inclusion of random slopes, allows the development of specific performance measurements for subgroups of patients. Therefore, the observation of better or worse performance will refer not just to the hospital outcome obtained for the regular patient but also to the hospital achievement for specific subgroups of individuals. This well-known property of MLRM has scarcely been exploited in the assessment and reporting of hospital performance.
In this paper, we use MLRM to draw out the relevance of accounting for patient heterogeneity when reporting on hospital performance, using in-hospital mortality in acute myocardial infarction as a case study. Including a random slope for AMI patients with coexistent CHF will show the relevance of accounting for patient heterogeneity in hospital performance reporting.

Design, population and setting
We conducted an observational cross-sectional study using administrative data representing virtually all hospital admissions for AMI in patients aged 40 to 80, treated in 434 hospitals from 5 European countries (Denmark, Portugal, Slovenia, Spain and Sweden) in 2009, totalling 73,812 potentially eligible discharge episodes. Hospitals accounting for fewer than 250 AMI episodes in 2009 (a discretionary threshold) were excluded from the sample in order to reduce structural heterogeneity across hospitals and gain robustness in the estimations (Fig 1). The final sample comprised 107 hospitals, accounting for 46,875 AMI episodes (63.5% of all episodes), of which 5.2% ended in death (2,451 case-fatalities). CHF coexisted in 13.5% of the sample (6,314 AMI episodes).

Endpoints
Our work comprised two consecutive endpoints; firstly, the variation in the hospital effect (i.e. GCE) when including a random slope for CHF patients in the MLRM; and, additionally, the level of agreement in hospital outcomes, contrasting both types of hospital GCE (i.e. under the assumption that the underlying risk for an individual level association is the same for all the hospitals or under the assumption that the underlying risk for CHF patients varies across hospitals).

Variables in the models
As mentioned above, the hospital outcome in this study (i.e. the proxy for performance) was the adjusted in-hospital mortality risk in AMI patients who stayed for up to 30 days after admission; that is, inpatients with admission diagnosis code 410* in countries using ICD-9-CM (Spain and Portugal), and I21* and I22* in countries using ICD-10 (Denmark, Slovenia and Sweden). Admissions due to pregnancy, childbirth or the puerperium were excluded (ICD-9-CM codes 630-677 or ICD-10 codes O00*-O99*) [detailed in S1 Appendix].
The patient-level independent variables were: a) age, categorized as 40-49, 50-59, 60-69 and 70-80, using the youngest group as the reference group; b) sex, using male as the reference category; c) patient comorbidities, computed as an Elixhauser risk score [24,25], obtained from the predicted probability of death for each of the episodes modelled with a single-level logistic regression; and d) the coexistence of congestive heart failure (CHF) in the episode of AMI. Patients with CHF constituted the subgroup of interest, being more fragile than the regular patient and presumably requiring a higher intensity of care. CHF was flagged when ICD-9-CM codes 398.91 or 428*, or ICD-10 code I50*, were found in any secondary diagnosis recorded within the same episode. The definitions and corresponding codes were developed and validated in the context of the ECHO project [26]. At the hospital level, no specific variables were included except the GCE, captured as a random effect. Finally, a dummy variable identifying the country of admission was included, using Sweden as the reference in the comparisons.
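The CHF flag described above can be illustrated with a minimal sketch. It assumes that diagnosis codes arrive as strings, possibly dotted, and that the coding system of each episode is known; the function name `has_chf` and the labels `"ICD-9-CM"` and `"ICD-10"` are hypothetical and do not come from the ECHO code definitions.

```python
def has_chf(secondary_diagnoses, coding_system):
    """Return True if any secondary diagnosis matches the CHF codes named in the text.

    Assumptions (not from the source): codes are strings, dots are optional,
    and coding_system is either "ICD-9-CM" or "ICD-10".
    """
    # Normalise codes so that "428.0", "4280" and " 428.0 " are treated alike.
    codes = [c.replace(".", "").strip().upper() for c in secondary_diagnoses]
    if coding_system == "ICD-9-CM":
        # 398.91 is an exact code; 428* covers every code starting with 428.
        return any(c == "39891" or c.startswith("428") for c in codes)
    if coding_system == "ICD-10":
        # I50* covers every code starting with I50.
        return any(c.startswith("I50") for c in codes)
    raise ValueError(f"unknown coding system: {coding_system}")
```

Only secondary diagnoses are passed in, mirroring the rule that CHF is flagged from secondary diagnoses recorded within the same episode.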

Analyses
After estimating the basal risk of death associated with patient features and country of residence (basal model) through a conventional single-level logistic regression model including age, sex, the Elixhauser risk score, the coexistence of CHF, and the country of "treatment" (see variable definitions above), two MLRM models were built to estimate the hospital-specific risk of death for patients with AMI. For that purpose, we followed the two-stage methodology described elsewhere [15].
The first MLRM specification included a random intercept for the hospital level in a two-level multilevel logistic regression model, so that each hospital got its own intercept (i.e. basal risk of death) (Eq 1).
Where $u_{0j} + \varepsilon_{ij}$ is the random-effect part of the model.
The second MLRM specification, as an extension of the previous one, included a random slope, allowing each hospital to vary its risk slope according to a specific group of patients (in our case, patients with CHF). In practice, we obtain a hospital variance for patients without CHF and a different hospital variance for patients with CHF (see Eq 2).
Where $X_{nj}$ are the $N$ variables characterising the gender and age of patients; $Z_{ij}$ is the probability of death for a patient according to the concurrence of Elixhauser comorbidities, except CHF; $D_j$ are dichotomous variables which identify the countries where hospitals belong; and $u_{0j} + u_{2j} CHF_{ij} + \varepsilon_{ij}$ is the random-effect part of the model.
Estimation of the general contextual effect. The GCE was estimated for both models, the random intercept model and the extended model which adds a random slope. For both models, the derivatives of the hospital variance, the Variance Partition Coefficient (VPC) and the Median Odds Ratio (MOR), were also calculated.
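The two specifications described above (Eq 1 and Eq 2) can be sketched in the latent-response form on which the VPC calculation relies. This is a hedged illustration built only from the symbol definitions in the text: the latent outcome $y^{*}_{ij}$, the coefficient symbols $\beta$ and $\gamma$, the country index $c$ and the distributional statements are notational assumptions, not the authors' exact equations.

```latex
% Random-intercept specification (sketch of Eq 1)
y^{*}_{ij} = \beta_0 + \sum_{n=1}^{N} \beta_n X_{nj} + \beta_Z Z_{ij}
           + \beta_{CHF}\, CHF_{ij} + \sum_{c} \gamma_c D_{cj}
           + u_{0j} + \varepsilon_{ij},
\qquad u_{0j} \sim N(0, \sigma^2_{u0})

% Random-slope specification (sketch of Eq 2)
y^{*}_{ij} = \beta_0 + \sum_{n=1}^{N} \beta_n X_{nj} + \beta_Z Z_{ij}
           + \beta_{CHF}\, CHF_{ij} + \sum_{c} \gamma_c D_{cj}
           + u_{0j} + u_{2j}\, CHF_{ij} + \varepsilon_{ij}

% In both cases the observed outcome is y_{ij} = 1 (death) when y^{*}_{ij} > 0,
% and \varepsilon_{ij} follows a standard logistic distribution (variance \pi^2/3).
```

Under this reading, the only difference between the two models is the term $u_{2j}\, CHF_{ij}$, which lets the hospital effect differ for patients with CHF.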
(i) We calculated the VPC based on the latent response formulation of the model as [21,22,27]:
$$\mathrm{VPC} = \frac{\sigma^2_u}{\sigma^2_u + \pi^2/3}$$
where $\sigma^2_u$ denotes the hospital variance, and $\pi^2/3$ the variance of a standard logistic distribution ($\pi$ = 3.1416).
VPC is reported as a percentage that goes from 0% to 100%. If hospital differences (i.e. variance) were not relevant for understanding the individual differences in the latent propensity of death, the VPC would be 0%. That is, the hospitals would be similar to random samples taken from the whole patient population.
(ii) The median odds ratio (MOR) is an alternative interpretation of the magnitude of hospital variance [28][29][30]. The MOR is defined as the median value of the distribution of odds ratios (OR) obtained when randomly picking two patients with the same covariate values from two hospitals with a different underlying risk of an event of interest, and comparing the one from the hospital with the higher risk to the one from the hospital with the lower risk. In simple terms, the MOR can be interpreted as the median increased odds of reporting the outcome if a patient is treated in another hospital with a higher risk. The MOR is calculated as:
$$\mathrm{MOR} = \exp\left(\sqrt{2\sigma^2_u} \times \Phi^{-1}(0.75)\right) \approx \exp\left(0.95\sqrt{\sigma^2_u}\right)$$
where $\Phi^{-1}(\cdot)$ represents the inverse cumulative standard normal distribution function. In the absence of hospital variation (i.e. $\sigma^2_u = 0$), the MOR is equal to 1. Theoretically, MOR values may extend from 1 to infinity, and the higher the MOR value, the more relevant the hospital effect in terms of patient outcome. The MOR translates the hospital variance estimated on the log-odds scale to the widely used OR scale, making MOR values comparable to the ORs of the individual covariates in the model.
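As a worked illustration of the two derivatives of the hospital variance, the following minimal sketch computes the VPC and the MOR from a variance on the log-odds scale and inverts the VPC formula. The function names are hypothetical, and the example inputs are the point estimates of the VPC reported in this paper (3.9% without CHF, 8.34% with CHF); nothing here reproduces the MLwiN estimation itself.

```python
import math
from statistics import NormalDist

# Variance of the standard logistic distribution, pi^2 / 3.
LOGISTIC_VARIANCE = math.pi ** 2 / 3


def vpc(hospital_variance):
    """Variance partition coefficient (as a proportion), latent response formulation."""
    return hospital_variance / (hospital_variance + LOGISTIC_VARIANCE)


def mor(hospital_variance):
    """Median odds ratio for a hospital variance expressed on the log-odds scale."""
    z75 = NormalDist().inv_cdf(0.75)  # 75th centile of the standard normal
    return math.exp(math.sqrt(2 * hospital_variance) * z75)


def variance_from_vpc(vpc_value):
    """Invert the VPC formula to recover the hospital variance from a proportion."""
    return vpc_value * LOGISTIC_VARIANCE / (1 - vpc_value)


if __name__ == "__main__":
    for label, vpc_percent in [("without CHF", 3.9), ("with CHF", 8.34)]:
        variance = variance_from_vpc(vpc_percent / 100)
        print(f"{label}: variance = {variance:.3f}, MOR = {mor(variance):.2f}")
```

Feeding the reported VPC point estimates through these formulas gives MOR values that agree with the reported 1.42 and 1.69 to within rounding, which shows that the two statistics are re-expressions of the same hospital variance.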
For the estimation of the models, we used the Restricted Iterative Generalized Least Squares (RIGLS) method to obtain the starting values needed to run the Markov Chain Monte Carlo (MCMC) estimation method [31]. The goodness-of-fit of the models was assessed through the Bayesian Deviance Information Criterion (BDIC).
We performed the analyses using MLwiN version 2.35 (Centre for Multilevel Modelling, University of Bristol) run from Stata statistical software, Release 13 (StataCorp LP, College Station, TX) [32].
Concordance in hospital performance. Finally, to assess concordance (i.e. agreement in hospital performance on patients with and without CHF), we compared the residuals from both random parts in the extended model, the intercept [$u_{0j}$] and the slope [$u_{2j}$]. The level of concordance between both residuals was studied using a measurement of agreement between observers for categorical variables. As the number of cases per country varied substantially, a weighted Kappa Index was estimated [33]. This estimator was chosen because performance measurements are commonly reported through funnel plots, in which units of analysis are categorized as: hospitals with residuals statistically above the average (i.e. exhibiting a higher risk of death than expected), hospitals with residuals statistically below the average (i.e. exhibiting a lower risk of death than expected), and hospitals that do not differ statistically from the expected risk of death, irrespective of their actual position above or below (i.e. hospitals within the funnel boundaries). According to this approach, hospitals in the sample were classified into three possible situations: better, neutral or worse than expected, and this categorization was the subject of the concordance measurement. For interpretation purposes, the higher the Kappa Index, the higher the concordance between the two estimated hospital effects, which could suggest that hospitals perform equally in patients without CHF and in patients with CHF. Conversely, low concordance could suggest that hospitals perform differently.
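The concordance step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: hospitals are classified from a residual and its standard error using a 95% interval (the cut-off `z=1.96` is an assumption), and the kappa uses linear disagreement weights over the ordered categories, which is one common weighting scheme; the weighting actually applied in the study may differ. Both function names are hypothetical.

```python
def classify(residual, standard_error, z=1.96):
    """Label a hospital as 'worse', 'better' or 'neutral' relative to the expected risk."""
    lower = residual - z * standard_error
    upper = residual + z * standard_error
    if lower > 0:
        return "worse"    # interval entirely above the average: higher risk of death
    if upper < 0:
        return "better"   # interval entirely below the average: lower risk of death
    return "neutral"      # interval includes the average


def weighted_kappa(labels_a, labels_b, categories=("better", "neutral", "worse")):
    """Cohen's kappa with linear weights for two ordinal classifications of the same units."""
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(labels_a)
    # Joint distribution of the two classifications, as proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Linear disagreement weights: 0 on the diagonal, 1 between the two extreme categories.
    weight = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    disagree_obs = sum(weight[i][j] * observed[i][j] for i in range(k) for j in range(k))
    disagree_exp = sum(weight[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    if disagree_exp == 0:
        return float("nan")  # undefined when both classifications use a single category
    return 1 - disagree_obs / disagree_exp
```

In use, `classify` would be applied once to each hospital's intercept residual and once to its slope residual, and the two resulting label lists passed to `weighted_kappa`; a value near 0 indicates agreement no better than chance.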

Data sources
Hospital admissions from Denmark, Portugal, Slovenia and Spain were extracted from the database consolidated and validated during the ECHO project [30]. In turn, the Swedish Patient Register [34,35] provided the hospital data for Sweden. Both pseudonymised datasets were linked into a single database, which was stored, validated and analysed on a secure server set up on the premises of the Faculty of Medicine at Lund University (Malmö, Sweden), as foreseen in the access policies of the Swedish Register.

Ethics statement
This study, observational in design, used retrospective anonymized, non-identifiable and nontraceable data, and was conducted in accordance with the amended Helsinki Declaration, the International Guidelines for Ethical Review of Epidemiological Studies, and Spanish laws on data protection and patients' rights. The study implies the use of pseudonymised data, using double dissociation (i.e. in the original data source and once the data are stored in the database for analysis) which actually impedes patients' re-identification. The information supplied for the European collaboration presented the same strong characteristics of confidentiality as the other collaborating countries.

Results
The final sample was composed of 46,875 episodes with a primary admission diagnosis of AMI, discharged from 107 hospitals. Overall, 6,314 patients had concomitant CHF. By country, Denmark treated 4,635 of those AMI episodes in 6 hospitals (9.9% of the episodes in the sample); Portugal accounted for 6,217 episodes in 16 hospitals (13.3% of the episodes), while Slovenia contributed 1,898 episodes in 3 hospitals (4.1% of the admissions analysed). In turn, Spain treated 23,043 AMI episodes in 56 hospitals (49.2% of the episodes in the sample), while Sweden dealt with 11,082 of the AMI episodes in 26 hospitals (23.6% of the episodes).
The sample had 38.2% of patients aged 70 to 80, varying across countries, with 33.2% in Denmark and 42.5% in Sweden. Overall, 26.4% of the patients were female, ranging from 23.6% in Spain to 30.6% in Sweden. The average risk score (i.e. predicted probability of death according to the Elixhauser comorbidities) for the whole sample was 5.2, ranging from 4.5 in Denmark to 5.8 in Portugal. Finally, the overall proportion of AMI patients with congestive heart failure was 59%, ranging from 56% in Portugal to 79% in Slovenia (Table 1).
Overall, 5.2 per 100 AMI patients died in hospital (2,451 cases out of 46,875) in the period of study; the crude mortality rate across hospitals ranged from 0.5 to 13.1 per 100 patients at risk, with an interquartile range of 1.63. By country, Sweden, Slovenia and Denmark showed the lowest in-hospital mortality rates (4, 4.2 and 4.8 per 100 patients at risk, respectively), while Portugal showed the highest with 6.91 per 100 patients at risk, followed by Spain with an in-hospital mortality rate of 5.6 per 100 patients at risk (Table 1).
Table 2 shows the estimated adjusted risks of death in the basal model, the basic GCE model and the extended RS model. As observed in the basal model, the AMI risk of death increased with age (as compared to patients younger than 50), with the highest risk amongst the oldest (4.8 times more likely to die), with the presence of comorbidities (2.1 times more likely), and with the coexistence of CHF (2.8 times more likely). As compared to Sweden, patients living in Portugal were at 78% more risk of death, Denmark and Spain showed a 36% increased risk, and Slovenia barely registered a 6% increase. Being female did not increase the risk of death. Patient-level and country-level estimates were similar in both MLRM models (second and third columns in Table 2).
Both MLRM models confirmed the existence of a GCE; thus, beyond individuals' features, we observed an increase in the risk of death associated with the hospital of treatment. Moreover, in the specific case of the extended model with a random slope (the best model according to the BDIC), the GCE was much higher in CHF patients [VPC of 8.34 (CI95% 4.94 to 13.03) and a MOR value of 1.69 (CI95% 1.62 to 2.21)] than in those without CHF [VPC = 3.9 (CI95% 2.4 to 5.9), MOR of 1.42 (CI95% 1.31 to 1.54)].

Is the hospital effect consistent between estimations?
Once the existence of hospital variance and the better goodness-of-fit of the model with random slope had been established, its residuals were compared. Once hospitals were classified in accordance with their level of performance (high, moderate or low), the agreement in the classification of hospital performance was non-existent [weighted Kappa Index value of -0.02 (CI95% -0.08 to 0.04)], suggesting a distinct performance in both groups of patients.

Discussion
Assuming the construct validity of AMI case-fatalities as a measure of hospital quality, this performance assessment exercise, based on 46,875 hospital admissions from five countries, shows that hospital outcomes differ when it comes to specific subgroups of patients, in our case, patients with CHF. Indeed, the greater MOR for the model including a random slope (i.e. assuming an interaction term for patients with CHF) reveals a greater influence of "hospital of treatment" when it comes to the case mortality rates for CHF patients (from MOR 1.42 in patients without CHF to MOR 1.69 with CHF).
Finally, the lack of agreement between the hospital effects on non-CHF AMI patients and on AMI patients with CHF (weighted Kappa Index = -0.02) points to the need for analysing hospital effects on both regular patients and specific subgroups of patients.

Caveats with regard to the lack of concordance
Despite the mathematical robustness of the results in terms of goodness-of-fit of the model and precision, two issues might challenge the lack of concordance between the hospital effect on non-CHF vs. CHF AMI patients.
We could hypothesize, for example, that systemic factors could affect the GCE estimations distinctly if the number of CHF patients per hospital is uneven across the sample (e.g. because of biased coding practices, because more complex patients arrive at certain hospitals, or because of differential expertise in the treatment of more fragile patients between centres). Although we have reduced this potential risk by excluding the smaller hospitals from the sample, if those phenomena are true they could influence the estimations of the hospital contextual effect in the specific subgroup of patients, resulting in a higher risk of death associated with the hospital of treatment in those hospitals with more CHF patients. Fig 3, which plots the prevalence of high-risk patients (x axis) against the estimated risk in terms of $u_{2j}$ (y axis), shows that this is not the case for the hospitals in the sample, ruling out this possibility.
Another point that could affect the hospital contextual effect differently in non-CHF vs. CHF patients is survival bias in those with no concurrent CHF. Indeed, AMI patients with concomitant CHF (most of them STEMI cases) are supposed to be more likely to die within the first 24 hours. In these cases, patients might die in the emergency room. After analysing the survival curves for both groups of patients, the negligible differences observed in the first 24 hours after hospital admission strongly suggest that under-recording is unlikely to have occurred in our sample [S2 Appendix shows survival curves for each country]. However, as patients who died in the emergency room are not part of our dataset, we cannot rule out some under-representation of CHF patients. Whether this could imply any bias in the estimation is unknown.

Implication of the use of random-slope MLRM in hospital performance assessment
In contrast with single level estimations, MLRM takes into account the multilevel structure of the variance existing in the data (e.g. patients nested within hospitals), accounting for the interdependence of patient outcomes within a hospital and allowing a less biased estimation of uncertainty, providing weighted estimations of average hospital risk (i.e. shrunken residuals) and allowing a more reliable assessment of the units under study.
Both MLRM and GEE assume the existence of a GCE when assessing hospital performance. This contextual effect is termed "general" because it reflects the influence of the hospital context as a whole, without specifying any contextual characteristics other than the very boundaries that delimit the hospital [36]. This GCE expresses the joint effect of an array of factors like, for instance, the skills and specialization of the physicians, the available access to adequate technology, and the quality of treatment and care in the hospital. In such a way, the hospital context may condition patient outcomes beyond individual characteristics; that is, the same patient might have a different outcome if he or she is treated in a different hospital. However, while GEE modelling takes for granted that this GCE can be quantified by measuring differences between hospital averages only, in MLRM the GCE is measured by the share of the total patient variance that is between hospital averages; that is, the MLRM does not dislocate the individual patients from the hospital averages, but rather considers that there is a distribution of individual outcomes that can be decomposed into two levels of analysis, the individual and the hospital [9,12,14]. Therefore, "hospital effects" (i.e. GCE) are not properly appraised by studying the differences between hospital averages alone, but by quantifying the share of the total patient heterogeneity (i.e. variance) that exists at the hospital level [37][38][39]. To do this, MLRM estimates the hospital variance and its derivative, the variance partition coefficient, as a measure of the hospital GCE. Thus, when studying a specific quality outcome in patients from different hospitals, the higher the hospital variance, the more relevant the hospital context is to understanding the differences between patient outcomes [12].
More importantly, unlike other methods used to analyse clustered information (i.e. patients nested within hospitals) MLRM considers individual-level associations to be hospital-specific and drops the assumption that individual level associations are the same for all the hospitals. Consequently, in an extended MLRM with RS, hospital variance, and thereby hospital GCE, becomes a function of the patients' heterogeneity. In other words, by including a RS for a specific subgroup of patients, the hospital effect is not just a function of the very boundaries of the hospital but also a function of patients' features of interest (i.e. in our case having CHF). In practice, for a dichotomous variable, we obtain a hospital variance (i.e. a GCE) for patients without CHF and a different hospital variance for patients with CHF. This becomes, beyond considerations of interpretation, the analytical advantage of MLRM as opposed to GEE modelling.
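To make explicit how the hospital variance becomes a function of a dichotomous patient feature, the standard variance function of a random-slope model can be written out. This is a sketch in conventional multilevel notation: the symbols $\sigma^2_{u0}$, $\sigma^2_{u2}$ and the intercept-slope covariance $\sigma_{u02}$ are assumptions introduced here, not symbols used by the authors.

```latex
% Hospital-level variance as a function of the CHF indicator
\operatorname{Var}\!\left(u_{0j} + u_{2j}\, CHF_{ij}\right)
  = \sigma^2_{u0} + 2\,\sigma_{u02}\, CHF_{ij} + \sigma^2_{u2}\, CHF_{ij}^{2}

% Patients without CHF (CHF_{ij} = 0)
\sigma^2_{u \mid CHF=0} = \sigma^2_{u0}

% Patients with CHF (CHF_{ij} = 1)
\sigma^2_{u \mid CHF=1} = \sigma^2_{u0} + 2\,\sigma_{u02} + \sigma^2_{u2}
```

Each of the two variances can then be carried into the VPC and MOR formulas, which is how a separate GCE is obtained for each subgroup of patients.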

Implications for hospital performance reporting
Some authors have already suggested, while acknowledging the risk of using indirect standardization in hospital performance assessment [20,22] or in the context of social epidemiology [23], that not considering patient heterogeneity could lead to an inappropriate assessment of performance. This paper empirically underpins the need for exploring both the hospital effect for patients with or without CHF. Therefore, a clear message is conveyed to those interested in the public reporting of performance measures. Beyond the assumption that performance assessment using administrative data is not a firm diagnostic tool but rather an instrument for screening, reporting mechanisms, more specifically league tables or funnel plots, [9,18] should represent hospital performance according to the results of the MLRM. If the model without a random slope prevails (which is not the case in our example), a single representation for the average patient might be enough; however, if a MLRM with random slope better explains the difference in hospital outcomes, then public reporting should represent hospital effects separately for specific subgroups of patients.
One last important implication for decision-makers is that MLRM provides a measure of effect size (i.e. to what extent the hospital contextual effect is relevant to the differences in health outcomes) through a number of statistics (hospital variance, variance partition coefficient, and MOR) not yielded by the popular indirect standardization methods or by GEE models. This feature makes MLRM findings more actionable than those of other approaches.

Conclusions
The hospital contextual effect in 107 hospitals from five different European countries was different in non-CHF AMI patients and AMI patients with CHF, suggesting that accounting for patient heterogeneity should be a requirement for adequately characterising and reporting hospital performance.
MLRM is flexible enough to allow the joint analysis of both overall effects and patient-specific hospital effects, providing accurate estimations of performance as well as a measure of the actual relevance of the hospital contextual effect.
Supporting information
S1 Appendix. Description of selected codes. Inclusion and exclusion criteria and codes for episode selection. (DOCX)
S2 Appendix. Survival curves testing differential underreporting in CHF patients. (DOCX)