Introduction

Economic evaluation of healthcare interventions is typically carried out using quality-adjusted life years (QALYs) as the outcome measure. The QALY combines length of life and health-related quality of life (HRQoL) in a single metric. In addition to direct empirical comparison of the QALY gains of available treatment options, modeling of cost utility is becoming increasingly common, since modeling based on existing data is more flexible and affordable than tailoring clinical studies to every scenario of potential interest. Such modeling rests on extensive use of preexisting recorded values representing the mean HRQoL loss associated with particular health conditions, so-called health-state utility values (HSUVs). This has created a demand for values for common ailments, which in turn has spurred efforts to estimate catalogs of HSUVs associated with specific diagnoses [1, 2]. Priority setting in health care is becoming an increasingly important field for policy makers as the medical frontier advances ahead of budget constraints [3]. Access to sound estimates of health-state utilities is important to ensure that resources are allocated efficiently when evaluating treatments and interventions.

Modeling is complicated by the fact that patients frequently have more than one health problem, that is, are comorbid. Comorbidity is a ubiquitous and high-impact phenomenon [4], to the extent that three in four Americans above 65 years of age are diagnosed with two or more chronic diseases. In order to accurately estimate the QALY gain of alternative interventions in target populations, analysts need information about the HSUVs of health states characterized by being a combination of medical conditions.

Several efforts to construct catalogs of off-the-shelf HSUVs representing the mean HRQoL of various sub-populations have been undertaken [1, 2, 5]. These studies have in common that they are based on multivariate linear regression modeling of large datasets and thus incorporate fairly accurate information while taking into account a number of factors that influence HRQoL. This is a suitable approach for cost-utility modeling within a specific population where information (socio-demographics, diagnoses, etc.) is abundant. However, this method is not aimed at gaining knowledge about how comorbidities per se may interact with HRQoL. In particular, this paradigm assumes additive effects of having several diagnoses and therefore may be inadequate to inform on the relationship between comorbidity and HRQoL. A study by Sullivan et al. [6] examined the impact of the number of chronic conditions on HRQoL in a similar setting, concluding that the number of chronic conditions is a very important predictor of an individual's HRQoL.

A rather different approach to dealing with comorbidity is represented by attempts at identifying good mathematical models of comorbidity [4]. A mathematical model of comorbidity assumes that the HSUV of a compound health state can be estimated from the HSUVs of the component health states. Different models have been studied and compared, without any clear best fit [7, 8]. Research has mainly focused on combining single-state health state values into joint-state health state values [9–13], because populations with any given combination of three or more distinct diagnoses are too small. Several joint health state predictors have been suggested (e.g., additive, multiplicative, best-of-pair, worst-of-pair), but no general consensus has been reached [7]. The various models (additive, multiplicative, minimum, etc.) lead to diverging predictions. An additive model implies that preferences should decline linearly with an increasing number of diagnoses; the multiplicative model implies diminishing marginal loss of HRQoL as a function of additional diagnoses. The best-of-pair and worst-of-pair models both imply a rapidly flattening HRQoL as diagnoses add up.
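To make the divergence between these candidate predictors concrete, the combination rules can be sketched as a small function. This is purely illustrative: the function name and the assumption of utilities on a 0–1 scale are ours, not a convention from the cited literature.

```python
def joint_hsuv(u1: float, u2: float, method: str = "additive") -> float:
    """Combine two single-state HSUVs (assumed on a 0-1 scale) into a
    joint-state estimate under the candidate models discussed in the text.
    Illustrative sketch only; the literature has not settled on one rule."""
    if method == "additive":        # subtract both disutilities from full health
        return 1.0 - ((1.0 - u1) + (1.0 - u2))
    if method == "multiplicative":  # multiply the component utilities
        return u1 * u2
    if method == "best_of_pair":    # joint state equals the better component
        return max(u1, u2)
    if method == "worst_of_pair":   # joint state equals the worse component
        return min(u1, u2)
    raise ValueError(f"unknown method: {method}")
```

For component values of 0.8 and 0.7, for instance, the additive rule yields 0.5 while the worst-of-pair rule yields 0.7, illustrating how quickly the predictions diverge as diagnoses accumulate.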

Investigating the mathematical relationship between single-state HSUVs and their corresponding joint-state HSUV is likely to be insufficient to uncover a general trend. The purpose of this study is therefore to explore the relationship between mean HRQoL and increasing numbers of diagnoses, without any preconceptions about the preferred functional form.

Methods

Data

We obtained data from the 2001 and 2003 Medical Expenditure Panel Survey (MEPS) [14]. These MEPS datasets contain detailed information on non-institutionalized US respondents’ health and socio-demographics, as well as self-reported HRQoL measured by two multi-attribute utility instruments, the EQ-5D and SF-6D (further details are provided below; the choice of years was based on the availability of contemporaneous EQ-5D and SF-6D data). The MEPS Web site also provides, in an auxiliary medical conditions file, a list of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9-CM) diagnosis codes, which are linked to individuals by an identification variable. For privacy reasons, ICD9-CM diagnoses are provided as truncated, 3-digit codes in the MEPS file. For example, an individual diagnosed with ‘hypertrophy of nasal turbinates’ (ICD9-CM code 478.0) and ‘polyp of vocal cord or larynx’ (ICD9-CM code 478.4) will be coded with two occurrences of the 3-digit ICD9-CM code 478.
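The effect of the truncation can be sketched in a few lines. This is a hypothetical helper, not code supplied with MEPS (this and the later sketches use Python, although the published analyses were carried out in R):

```python
def truncate_icd9(code: str) -> str:
    """Truncate a full ICD9-CM code to the 3-digit form released by MEPS.
    Both 478.0 and 478.4, for example, collapse to the same 3-digit code."""
    return code.split(".")[0][:3]
```

As a consequence, two clinically distinct diagnoses can become indistinguishable in the released file, which is why repeated 3-digit codes are treated as separate diagnoses in the counting described below.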

HRQoL instruments

The EQ-5D is one of the most frequently used instruments to assess HRQoL in health economic evaluation [15], requiring individuals to describe their health state across five dimensions: mobility, self-care, usual activities, pain and discomfort, and anxiety and depression. The 2001 and 2003 MEPS included the three-level version of the EQ-5D, in which each of the five dimensions has response options ‘no problems,’ ‘some problems’ or ‘extreme problems’. EQ-5D utility values were estimated using the preference-based algorithm published by Shaw et al. [16]. SF-6D scores were derived from the 12-Item Short-Form Health Survey (SF-12) [17]. The SF-6D is a multi-attribute utility instrument comprising items for the following six dimensions: physical functioning, role limitations (physical and emotional), bodily pain, vitality, social functioning, and mental health. Seven of the 12 items from the SF-12 are used to derive an SF-6D index score, and the six dimensions have between three and five levels of severity. SF-6D utility values were calculated according to the preference-based algorithm published by Brazier and Roberts [18].

Inclusion criteria

We denote by P k the collection of MEPS individuals who (1) are at least 18 years of age, (2) have valid data for both HRQoL instruments, and (3) have exactly k registered diagnoses in the MEPS medical conditions file. Similarly, for an ICD9-diagnosis, D, the symbol P D denotes the set of individuals who are at least 18 years of age, have valid HRQoL data, and are registered with diagnosis D.

Statistical analysis

The primary aim of the study was to investigate how HRQoL is affected by additional medical diagnoses. To find support for generalizability beyond one specific HRQoL instrument, we analyzed both the EQ-5D and the SF-6D data from the MEPS dataset.

For each respondent in the MEPS 2001–2003 data, we computed an auxiliary variable named ‘Number of (registered ICD9) Diagnoses’ (NoD), which simply counts the number of ICD9 diagnoses assigned to the individual via the MEPS medical conditions file. (V-codes from the MEPS medical conditions file were omitted; this is discussed further in the Discussion section.) Due to the 3-digit truncation of ICD9-codes, some respondents were registered with more than one occurrence of the same ICD9-code. Such instances were counted with multiplicity, since they originate from distinct full-length ICD9-codes in the underlying dataset. Next, the data were stratified according to the NoD variable into P k subgroups. To ensure robust estimates of mean HRQoL for the NoD-defined strata, a pre-defined threshold of 1000 individuals per stratum was required for inclusion in further analyses. For each stratum satisfying this threshold, mean EQ-5D and SF-6D estimates were calculated. To assess the functional relationship between HRQoL and NoD, we next fitted three models to the aggregated data:
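The counting and stratification steps can be sketched as follows. The data structures and function names here are our own assumptions, not the variables of the MEPS files, and the actual analyses were carried out in R:

```python
def number_of_diagnoses(icd9_codes):
    """Count an individual's registered 3-digit ICD9 codes (NoD).
    Repeated 3-digit codes are counted with multiplicity, since they stem
    from distinct full-length codes truncated for privacy; V-codes are
    omitted, as described in the text."""
    return len([c for c in icd9_codes if not c.startswith("V")])

def stratify(records, min_size=1000):
    """Group individuals into strata P_k by their NoD value and retain
    only strata with at least `min_size` members.
    `records` maps a person identifier to that person's ICD9 code list."""
    strata = {}
    for person_id, codes in records.items():
        k = number_of_diagnoses(codes)
        strata.setdefault(k, []).append(person_id)
    return {k: ids for k, ids in strata.items() if len(ids) >= min_size}
```

Mean EQ-5D and SF-6D values would then be computed within each retained stratum before the models below are fitted to those stratum means.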

$${\text{Model A}}:\;{\text{ HRQoL}} = \alpha + \beta \cdot {\text{NoD}}$$
$${\text{Model B}}:\;{\text{HRQoL}} = \alpha + \beta \cdot {\text{NoD}} + \beta_{2} \cdot {\text{NoD}}^{2} \quad {\text{and}}$$
$${\text{Model C}}:\;{\text{HRQoL}} = \alpha \cdot \beta^{\text{NoD}}$$

Model A may support an additive, or linear, relationship. Model B may support a linear, an approximately multiplicative, or an accelerating HRQoL-loss relationship between NoD and HRQoL, depending on the signs, magnitudes, and associated p values of the coefficients β and β 2. Model C, which is equivalent to the model \(\ln \left( {\text{HRQoL}} \right) = \alpha^{\prime } + \beta^{\prime } \cdot {\text{NoD}}\), reflects a multiplicative relationship between HRQoL and NoD. In all three models, the intercept (α) is interpretable as the estimated mean HRQoL of individuals with zero diagnoses. In Models A and B, a fixed decrement β is subtracted for each additional diagnosis; in Model B, a further adjustment of β 2 · NoD² is added to the estimate. In Model C, instead of a fixed decrement from α, the estimate is multiplied by a factor of β for each additional diagnosis; a good fit of this model may therefore be taken as support of an underlying multiplicative relationship.
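A minimal sketch of fitting the three models to the stratum means follows, assuming ordinary weighted least squares and fitting Model C on the log scale via the equivalence noted above. The paper's models were fitted with R's lm function; this Python version is only illustrative.

```python
import numpy as np

def fit_models(nod, hrqol, weights):
    """Weighted least-squares fits of Models A, B, and C to stratum means.
    Returns (alpha, beta) for A, (alpha, beta, beta2) for B, and
    (alpha, beta) for C, where C is fitted as
    ln(HRQoL) = ln(alpha) + ln(beta) * NoD."""
    x = np.asarray(nod, float)
    y = np.asarray(hrqol, float)
    sw = np.sqrt(np.asarray(weights, float))  # weight rows by sqrt(w)

    # Model A: HRQoL = alpha + beta * NoD
    XA = np.column_stack([np.ones_like(x), x])
    a_coef, *_ = np.linalg.lstsq(XA * sw[:, None], y * sw, rcond=None)

    # Model B: adds a quadratic term beta2 * NoD^2
    XB = np.column_stack([np.ones_like(x), x, x**2])
    b_coef, *_ = np.linalg.lstsq(XB * sw[:, None], y * sw, rcond=None)

    # Model C: linear fit on the log scale, then exponentiate
    c_log, *_ = np.linalg.lstsq(XA * sw[:, None], np.log(y) * sw, rcond=None)
    alpha_c, beta_c = np.exp(c_log)
    return a_coef, b_coef, (alpha_c, beta_c)
```

With exactly linear stratum means, Model A recovers the intercept and slope, and the quadratic coefficient of Model B shrinks to zero.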

Because early inspection of plots of the values suggested the models would provide very similar fits, several model selection statistics were computed to explore our research question: regression coefficients, p values for the regression coefficients, adjusted R 2 values, the root-mean-squared error (RMSE), and the leave-1-out root-mean-squared residual (L1O-RMSR) [19, 20], the analogous index of a model's predictive value [21]. The residuals that enter the L1O-RMSR are the distances between the observed mean HRQoL for P k and the value predicted by the model fitted with P k left out (i.e., standard leave-1-out cross-validation). As Model A is a nested specification of Model B, the two can be compared directly using standard analysis of variance methods; a nonsignificant regression coefficient for the quadratic term indicates over-specification. Because Models A and C are not nested, there is no canonical best way of comparing them. As a further aid in interpreting the results, we also calculated the root-mean-squared distance (RMSD) between the fitted values of Models A and C, that is, the RMSE of Model A's fitted values with Model C's fitted values treated as the observations. This statistic is non-standard and should be interpreted with caution, but it is measured on the same scale as the RMSE and therefore has one obvious interpretation: comparing the RMSD with the two models' RMSEs indicates how far the models' fitted values are from each other relative to how far they are from the observed values.
As for the regressions, all means entering the RMSEs and RMSDs were weighted by the strata's relative sizes.
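The two less standard statistics can be sketched as follows, under the weighting scheme just described. Function names and interfaces are our own assumptions; this is not the authors' code.

```python
import numpy as np

def l1o_rmsr(x, y, weights, design):
    """Leave-one-out root-mean-squared residual: refit the model with each
    stratum held out and score the prediction for the held-out stratum.
    `design` maps an x value to the model's design row, e.g.
    lambda v: [1.0, v] for Model A."""
    resid = []
    for i in range(len(x)):
        keep = [j for j in range(len(x)) if j != i]
        X = np.array([design(x[j]) for j in keep], float)
        sw = np.sqrt(np.asarray([weights[j] for j in keep], float))
        coef, *_ = np.linalg.lstsq(X * sw[:, None],
                                   np.asarray([y[j] for j in keep]) * sw,
                                   rcond=None)
        resid.append(y[i] - np.dot(design(x[i]), coef))
    w = np.asarray(weights, float)
    return float(np.sqrt(np.sum(w * np.square(resid)) / np.sum(w)))

def rmsd(fitted_a, fitted_c, weights):
    """Weighted root-mean-squared distance between two models' fitted values."""
    w = np.asarray(weights, float)
    d = np.asarray(fitted_a, float) - np.asarray(fitted_c, float)
    return float(np.sqrt(np.sum(w * d * d) / np.sum(w)))
```

Because the RMSD is computed on the same scale as the RMSE, a RMSD smaller than both models' RMSEs indicates that the two sets of fitted values lie closer to each other than to the data.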

Adjusting for age, sex, and severity

Correlations between demographic factors, such as age or sex, and HRQoL, or between the number of diagnoses and diagnosis severity, may introduce bias unless accounted for in the analysis. An ideal dataset would ensure that P k+1 consists of the individuals from P k after being affected by one more diagnosis. Within our dataset, this assumption does not hold because of inevitable differences in age, sex, severity, and numerous other factors. Accordingly, we performed a number of adjustments (age, sex, and severity) to compensate for these potential sources of bias. A detailed explanation of the adjustment methods is reported in the Appendix. Briefly, for the severity adjustment, a new variable called ‘severity-weighted number of diagnoses’ (SWNoD) was computed for each respondent by summing severity weights rather than counting the (unadjusted) number of diagnoses. Note that the severity weights were calculated separately for EQ-5D and SF-6D, so that, for example, when investigating the relationship between SWNoD and EQ-5D, the severity weights used were computed with respect to the EQ-5D. Furthermore, for each of the HRQoL instruments, we computed severity weights using two sets of criteria. ‘Relaxed’ weights were calculated for all diagnoses for which we had at least one observation of an individual with no other diagnoses. ‘Strict’ weights required at least 10 sole-diagnosis individuals for a weight to be estimated.
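A rough sketch of the severity-weighting idea follows. The normalization chosen here (scaling each diagnosis's HRQoL decrement by the mean decrement so weights center near 1) is a hypothetical stand-in; the paper's actual procedure is the one detailed in its Appendix.

```python
def severity_weights(sole_diagnosis_groups, reference_hrqol, min_n=10):
    """Estimate a severity weight per diagnosis from individuals who have it
    as their sole diagnosis. ASSUMPTION: the weight is the diagnosis's mean
    HRQoL decrement relative to the zero-diagnosis reference, divided by the
    mean decrement across diagnoses. `min_n=10` mimics the 'strict'
    criterion; `min_n=1` mimics the 'relaxed' one."""
    decrements = {}
    for dx, values in sole_diagnosis_groups.items():
        if len(values) >= min_n:
            decrements[dx] = reference_hrqol - sum(values) / len(values)
    mean_dec = sum(decrements.values()) / len(decrements)
    return {dx: d / mean_dec for dx, d in decrements.items()}

def swnod(person_codes, weights):
    """Severity-weighted number of diagnoses: the sum of the person's
    diagnosis weights. Returns None if any diagnosis has no estimable
    weight, mirroring the omission of such individuals in the analysis."""
    if any(dx not in weights for dx in person_codes):
        return None
    return sum(weights[dx] for dx in person_codes)
```

Because the weights are instrument-specific, this computation would be run once with EQ-5D-based decrements and once with SF-6D-based decrements.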

In total, four analyses were carried out for each of the two HRQoL instruments (see Table 1 for an overview). All analyses were carried out in the statistical software R [22]; the models were fitted with the built-in lm function for linear regression modeling.

Table 1 List of analyses carried out to compare linear and multiplicative models

Results

The pooled 2001–2003 material contains a total of 67,771 individuals. A total of 47,178 individuals were 18 years or older, out of which 39,817 (84.4 %) had valid data for both MAUIs (which were administered to 18+ year olds only). The age variable ranged over 18–85 (mean 45.36); 45.5 % were males. The NoD variable ranged over 0–45 (mean 3.28). A total of nine strata P 0,…,P 8 (consisting of patients characterized by having exactly 0,…,8 diagnoses) remained after omitting strata with fewer than a thousand respondents. Table 2 gives descriptive statistics for the strata P 0–P 8: unadjusted means for the strata’s mean HRQoL as measured by EQ-5D and SF-6D, mean age, percentage of males, strata size, and relative share (of the 39,817 with valid HRQoL information).

Table 2 Descriptive statistics for each stratum defined by the number of diagnoses

The age distribution was skewed toward more elderly individuals in the strata representing more diagnoses, with a near-linear relationship between the strata’s mean age and NoD. In general, respondents with more diagnoses also have more severe diagnoses, as evidenced by the perfect correlation (r = 1.000) between NoD and mean SWNoD (Table 2); indeed, the individuals with eight diagnoses have on average almost nine severity-adjusted diagnoses.

Unadjusted analyses

For both HRQoL indices, the parsimonious linear models exhibited R 2 > 0.995, indicating that a linear relationship between NoD and HRQoL explains the average values very well. Summaries of the regression models are presented in Table 3, together with results from the age-, sex-, and severity-adjusted variables.

Table 3 Key statistics for the regression models across the eight analyses described in Table 1

Adjusted analyses

The regression model of EQ-5D as a function of age and sex within the P 0-stratum was significant for both independent variables (p < 0.001) and predicted age–sex reference values as

$$u_{\text{EQ}} \left( {a,s} \right) = 0.9697 - 0.0007 \cdot a + 0.0085 \cdot s$$

For SF-6D, only the sex variable was significant (p < 0.001), and after leaving out the age variable (p > 0.05), the model predicted age–sex reference values as

$$u_{\text{SF}} \left( {a,s} \right) = 0.8487 + 0.0237 \cdot s$$

Of the 555 distinct ICD-9 diagnoses in the MEPS medical conditions file, there were 373 diagnoses for which a severity weight was obtainable from at least one individual (‘relaxed’ definition) and 124 diagnoses that at least 10 individuals had as their sole diagnosis (‘strict’ definition). After omission of individuals with non-weighable diagnoses, for the two different severity-adjustment criteria, the procedure retained 36,599 (91.92 %) for the relaxed inclusion and 25,858 (64.94 %) for the strict inclusion. The adjustments were non-trivial: The fraction of respondents who obtained a (rounded) SWNoD which differed from their originally computed NoD category ranged from 9.23 % (Analysis 8: ‘strict’ SF-6D-based SWNoD) to 44.97 % (Analysis 3: ‘relaxed’ EQ-5D SWNoD).

The computed model selection statistics (see Table 3) indicate a good fit for all models, with very high adjusted R 2 values throughout. The RMSE column shows that all three models fit the observed values well. Furthermore, the quadratic Model B, with its additional parameter, tends to outperform the two other models with respect to this metric. The L1O-RMSR gives a different picture: here Model B under-performs, suggesting over-specification. Model C performs slightly better than Model A according to the L1O-RMSR metric; however, this gap closes after adjusting for age, sex, and severity. In the L1O-RMSR metric, all models improve their fit as adjustments are made, except for analyses 4 and 8, which correspond to the strict inclusion. The RMSD column reports the distance between the predictions of Models A and C. This column shows that the difference between the two models' predictions is smaller than the difference between either model's predictions and the observed values.

For better visualization of the results presented in Table 3, Fig. 1a, b provides a graphical depiction of two sets of models (Models 4A–C and 8A–C). In both cases, the three Models A–C provide similar fits, and the immediate impression is that the parsimonious linear model describes the trend well.

Fig. 1

Illustration of model fit for Models A, B, and C for the fully adjusted analyses (age, sex, and severity) for the EQ-5D (a) and SF-6D (b). With reference to Table 3, a corresponds to analyses 4-A, 4-B, and 4-C; b corresponds to analyses 8-A, 8-B, and 8-C

Discussion

The most striking property of the results reported in Table 3 is the similarity between the three models. Models A, B, and C display very similar fit indices, and the RMSE values suggest that all three models estimate the data well. Before adjustment for age, sex, and severity, Models B and C slightly improve the fit compared to the linear Model A. As expected, with its one extra degree of freedom, the quadratic Model B tends to beat the other two by a few thousandths of a unit; however, looking to the L1O-RMSR column, it appears over-specified. After adjustments are made, Model A outperforms or matches Model C.

Examining the RMSE and L1O-RMSR for Models A and C does not identify either as being superior. If we assume that the adjusted analyses are the most appropriate, the improved fit of Model A suggests a possible underlying true additive relationship. The results also suggest that Models A and C are more similar to each other than to the underlying data, as reflected by the RMSD values being smaller than the two models’ RMSE statistics.

On average, little is gained from adding a quadratic term to a linear model for predicting the HRQoL loss associated with extra diagnoses. This suggests that the general trend, on average, is adequately captured by a linear model. That this is in conflict with many studies from the joint-state literature may be due to the fact that an additive model, working directly with the HRQoL losses associated with a single-state condition, does not properly account for the HRQoL loss present also among those with no diagnoses.

Strengths and limitations

While several studies investigating the impact of having two simultaneously existing diagnoses (the joint-state literature) have been carried out [9–13], the more general and underlying question of how diagnoses impact HRQoL has not been previously addressed. The study undertaken by Sullivan et al. overlaps to some extent with this study because both incorporate respondents with multiple diagnoses. However, whereas Sullivan's model is designed to predict individual HRQoL, given rich information about the individuals' age, sex, diagnoses, and other covariates, our model focuses solely on the independent impact of diagnoses on HRQoL. Put simply, Sullivan focuses on the HRQoL of individuals, with a rich model, while we use a sparse model to focus on the functional relationship between the number of diagnoses and HRQoL.

Previous studies [9, 11–13] have used the clinical classification categories (CCCs) as a crude measure of disease. The CCCs also include V-codes: the ‘supplementary Classification of Factors Influencing Health Status and Contact with Health Services (V01.0–V91.99) is provided to deal with occasions when circumstances other than a disease or injury (Codes 001–999) are recorded as a diagnosis or problem’ [23]. As such, V-codes carry with them information about factors other than morbidity qua morbidity. Using the truncated CCC information, or defining NoD strata, without omitting the V-codes may thus lead to groups with biased HSUV values. Working directly with the ICD-9 diagnoses in this study permits omitting V-codes and is a strength of our analyses.

The interpretation of our results depends on patients with n + 1 diagnoses being comparable to patients with n diagnoses, with the exception of the additional health problem. The major concern is that patients with more diagnoses may be afflicted with problems of different severity than patients with fewer diagnoses. However, the extent to which this is a problem for our analyses transfers directly to all attempts at determining the functional form for the addition of health problems. Adjusting for severity goes some way toward ensuring such comparability. Still, the validity of our findings regarding the relationship between the number of diagnoses and HSUVs depends on the generalizability of the MEPS data with regard to that relationship. Our analyses are made under the assumption that sampling error and missingness are random with respect to the functional relationship under scrutiny.

Since we did not gather the data ourselves, we have limited control over the quality of the data. However, it is unlikely that there are systematic biases in the collection process that would affect HRQoL values as a function of NoD. The data were collected in an outpatient setting, meaning that we cannot necessarily generalize to, e.g., a hospitalized population.

Even though our analyses are carried out on mean values computed over populations with 1000+ members, only nine strata were included. This means that the linear relationship observed may not describe the actual trend for patients with nine or more diagnoses. We do not suggest that the regression models are useful in themselves, only that they help investigate the underlying relationship between morbidity, as measured by diagnoses, and HRQoL.

The observed range of mean HRQoL values in our sample (0.948–0.742 for EQ-5D and 0.862–0.686 for SF-6D) may limit our ability to distinguish between the predictions from the additive and the multiplicative approaches. The problem could be ameliorated by looking specifically at severe diagnoses, but this would come at the cost of substantially reducing the number of available observations. As it is, the observed range of HRQoL values is based on more than 93 % of the population sample, suggesting that we are covering most of the relevant range of disease in the population.

Conclusions

The three model specifications explored in this analysis, the linear (A), the linear with a quadratic term (B), and the multiplicative (C) (see the "Methods" section), yielded virtually identical fits, indicating that a linear model adequately represents the trend on average. Occam's razor suggests that the simplest model should be preferred. On this basis, we recommend discontinuing the search for a general multiplicative model. The study does not support the general notion of declining marginal disutility of health.

The observation that averages over thousands of patients with hundreds of different diagnoses match a linear function of the number of diseases does not imply that there exists a general linear model that can predict the mean HRQoL for a given combination of diagnoses from the HRQoL of the constituent diagnoses; the averages in question collapse a wide distribution of diagnoses that mask each other, exacerbate each other, or behave erratically in combination. The use of any general model, including the additive one, is likely to lead to predictions that deviate substantially from reality in most cases, even if the deviation is unbiased across studies. We recommend using empirical estimates of HRQoL for patient groups with combination health states where this is possible. When such estimates are unattainable, non-empirical estimates should be based on expertise that allows prediction of the manner in which the constituent health problems can be expected to interact.