Journal Pre-proof INTERACTION EFFECTS IN HEALTH STATE VALUATION STUDIES: AN OPTIMAL SCALING APPROACH

OBJECTIVES
To introduce a parsimonious modelling approach that enables the estimation of interaction effects in health state valuation studies.


METHODS
Instead of supplementing a main effects model with interactions between each and every level, a more parsimonious optimal scaling approach is proposed. This approach is based on the mapping of the levels onto domain-specific continuous scales. The attractiveness of health states is determined by the importance-weighted optimal scales (i.e. main effects) and the interactions between these domain-specific scales (i.e. interaction effects). The number of interaction terms depends only on the number of health domains, not the number of levels in the instrument. As a result, our modelling approach uses less than 3% of the number of parameters that are required by a standard MIXL model with a full set of two-way interactions and can be used for health-state valuation studies based on instruments such as the EQ-5D-3L, 5L, and SF-6D.


EMPIRICAL APPLICATION
The proposed model with and without interactions is fitted on an existing dataset of N=3,699 respondents who each completed 16 EQ-5D-3L discrete choice tasks.


RESULTS
Important interactions between the EQ health domains are found, which confirms that the accumulation of health problems within health states has a decreasing marginal effect on health state values. A similar effect is obtained when a so-called N3 term is included in the model specification. However, such a term is found to provide an inferior model fit and becomes insignificant once a full set of two-way interactions is included in the model specification.


CONCLUSIONS
The proposed interaction model is parsimonious, produces estimates that are straight-forward to interpret, and accommodates the estimation of interaction effects in health state valuation studies with more attractive and feasible sample size requirements.


INTRODUCTION
Generic measures of health-related quality of life (HRQoL), such as the EQ-5D and SF-6D, are commonly used in the economic evaluation of health care interventions. [1] These measures include a descriptive system and a corresponding utility value set for use in the estimation of quality adjusted life years (QALYs). [2] Traditionally, standard gamble (SG) and time trade-off (TTO) methods have been used to obtain these value sets, although more recently discrete choice experiments (DCEs) have increased in popularity. [3] Irrespective of the valuation method, however, currently available value sets rely on statistical estimations and are thereby prone to model misspecification.
A well-known form of model misspecification is the estimation of a main-effects model in the presence of interaction effects. The traditional approach to accommodating interaction effects is to extend the main effects specification with interactions based on the number and severity level of the health problems in each health state (e.g. [4][5][6] ), including so-called N terms (e.g. N3 or N5) that provide an extra decrement when at least one of the attributes is presented at its most severe level.
Alternatively, a full set of two-way interactions between each of the levels of the HRQoL instrument can be included in the model specification (e.g. [7][8][9] ). Unfortunately, both approaches are problematic: the first fails to differentiate between the various dimensions of the instrument and introduces undesirable health-state ordering artifacts whereas the inclusion all two-way interactions is only feasible for relatively small instruments, such as the EQ-5D-3L, and produces estimates that are difficult to interpret. Moreover, for larger instruments, such as the EQ-5D-5L or SF-6D, the number of interaction terms becomes unwieldy and the results virtually impossible to interpret -even apart from the required increase in sample size requirements and difficulty of accounting for preference heterogeneity.
In this paper, we therefore propose a new modelling approach to including interactions in health state valuation studies. The proposed approach relies on an optimal scaling method to constructing a parsimonious set of interaction effects, see e.g. [10][11][12][13] . As will be shown, these interaction effects are not only parsimonious but also more intuitive to interpret than a full set of two-way interactions. More specifically, the interactions are included between the health domains instead of between the various levels of each health domain; as a result, the interactions directly capture what happens when more severe health problems in one domain are combined with more severe health problems in another domain. Apart from the more intuitive interpretation, the proposed modelling approach scales well for larger instruments. This means that it can be used for the EQ-5D-5L, SF-6D or even larger instruments if combined with an appropriately powered TTO or DCE experimental design.

Including interactions using an optimal scaling approach
The proposed solution is to avoid interactions between each and every level of all domains and instead to include interactions between the health domains themselves. This can be achieved using an 'optimal scaling' approach, where each attribute level is transformed into a value on a domain-specific continuous scale. [10][11][12][13] The value on this scale then affects the health-state value. Without interactions, this optimal scaling model is an exact re-parametrization of the standard main-effects model (see Figure 1).
More formally, let denote whether level k of health dimension d is present in valuation task t.
The standard dummy-coded health-state decrement specification is then given by in which the beta-parameters measure the decrement of each non-base case level. In the optimal scaling approach, this specification is re-parametrized by first creating domain-specific optimal scales ( , ) that represent the value of the levels of each domain on a continuous scale: and then multiplying the value of the optimal scales with domain-specific beta-parameters ( ): To statistically identify this decomposition, it is necessary to impose a constraint on the optimal scales. One possibility is to impose a 'zero mean/variance of one' constraint, which is, for example, proposed by Van Rosmalen et al. [13] However, in the case of health-state valuation instruments it makes more sense to restrict the length of the optimal scales, which would imply that the domainspecific parameters directly capture the domains' relative importance. As shown in Figure 1 we propose to constrain the optimal scales to have unit length, ranging from 0 (no health problems) to 1 for all health dimensions = 1, . . , . Although equation 4 does not impose monotonically increasing severity levels within each dimension this can easily be included as an additional constraint in the construction of the optimal scales, for example by transforming an unrestricted set of parameters φ , * as follows: ) for = 2, . . , − 1 , Equation 5 implies -parameters that are monotonically increasing and bounded between 0 and 1. It is also the specification that is used in the three empirical applications of the optimal scaling method in this paper -although equations (4) and (5) of course only differ for health domains with 4 or more levels. Figure 2 provides a graphical representation of the optimal scaling model applied to the EQ-5D-3L instrument. As shown, similar to the default dummy-coded utility specification, 10 parameters are required: 5 phi-parameters to position the levels 2 on the optimal scales and 5 beta-parameters to capture each domain's importance.
The optimal scaling decomposition so far only represents a reparameterization of the standard dummy-coded health-state decrements. The main advantage of the optimal scaling approach, however, is that each domain is now represented by a single, continuous scale. This allows for the convenient and parsimonious inclusion of interaction terms in the model specification: in which the -main parameters are complemented by the set of D*(D-1)/2 two-way interactions between the constructed optimal scales. Note that the number of interaction parameters in this specification is considerably smaller than the number of possible level-interactions. Moreover, the number of interaction parameters only depends on the number of health dimensions of the instrument J o u r n a l P r e -p r o o f and not on the total number of levels. Hence the number of interaction parameters does not increase when moving from, for example, the EQ-5D-3L to the EQ-5D-5L instrument, even though the latter would require many more interaction parameters if all levels were to be interacted.

Empirical application #1.
In the first empirical example, the proposed optimal scaling model was applied to the Australian EQ-5D-3L composite time trade-off (cTTO) dataset of Viney et al. [14] This dataset is unique in the sense that it comprises a large selection of health states (i.e. 198 out of 243 possible EQ-5D-3L health states) based on a design that was confirmed to be adequately powered for the recovery of two-way interaction effects based on a simulation study. 45 health states that were considered 'implausible' were excluded from the design, i.e. all health states that combined level 3 on mobility with either level 1 on usual activities or self-care. Each of the 417 participants in the valuation study completed 12 cTTO tasks under the supervision of a trained interviewer and received $60 for their participation.
Following state-of-the-art modelling of cTTO data, 17 participants with a positively sloped relationship between their cTTO valuations and the misery index of the health states were excluded from the analysis and the data of the remaining 400 respondents were analyzed using a heteroskedastic Tobit model with censoring at -1. [15] This model assumes the existence of a true but incompletely observable variable (cTTO*) that underlies the observed cTTO values. The likelihood then includes the probability of the cTTO* value being beyond the censored value for all observations with cTTO=-1. Accordingly, the observed cTTO values for respondent i in task t were censored as with * assumed to be normally distributed with the mean ( ) and standard deviation ( ) reflecting the average health state value and variation among respondents in their valuation of the health state presented in task t of respondent i, respectively.

J o u r n a l P r e -p r o o f
In the standard dummy-coded specification the mean is defined as μ it = 0 + 1 MO2 + 2 MO3 + 3 SC2 + 4 SC3 + 5 UA2 + 6 UA3 + 7 2 + 8 3 + 9 2 + 10 3 , which is subsequently reparametrized as for the optimal scaling model without interactions. The optimal scales MO, SC, UA, PD, and AD were defined conform equation (5) and Figure 2, that is, constrained to 0 for level 1, estimated as φ d for level 2 of domain d, and constrained to 1 for level 3.
Similar to Pickard et al. [16] the standard deviation ( ) of the normal distribution was modelled as a 4 th -order polynomial of the health state values which ensured that the variances of the predicted values can flexibly depend on the health state severity. Finally, the optimal scaling model was also extended with a full set of two-way interactions: Empirical application #2.
In the second empirical example, the optimal scaling model was applied to the EQ-5D-3L DCE dataset of Nicolet et al., [9] which comprises a nationally representative sample of 3,669 Dutch respondents. Each respondent completed 16 pairwise choice tasks based on a Bayesian D-efficient DCE design that contained 400 unique choice tasks split into 25 equally sized design blocks. Similar to the Australian cTTO study, the DCE design was specifically optimized to accommodate the estimation of all main and two-way interaction effects and implausible health states were not excluded from the DCE design. All respondents completed the DCE tasks as part of an online, unattended survey and received a small financial compensation from the survey sample provider.
Given the large sample size and number of questions per respondent, we were able to follow the stateof-the-art modelling approach for DCE data using a mixed logit (MIXL) model specification. [17] This means that all model parameters were assumed to be individual specific to account for heterogeneity in health state valuations across individuals. Conform standard MIXL assumptions, the mixing distribution of the model parameters, and φ, was assumed to be multivariate normal and the observed choices which equal 1 if alternative j was chosen by individual i in choice task t and equals zero otherwise, were assumed categorical distributed with the probability of choosing each choice option defined as Here HSvalue itj denotes respondent i's health state value for choice alternative j in choice task t. was reparametrized as: for the optimal scaling specification without interactions. The optimal scales MO, SC, UA, PD, and AD were again defined conform equation (5), that is, constrained to 0 for level 1, estimated as φ i,d for level 2, and constrained to 1 for level 3.
In the third empirical application the optimal scaling model was applied to the EQ-5D-5L dataset of Jonker et al., [18]  with discount rate r. One of the advantages of this specification is that different types of discounting functions can be used to define the present value (PV) of the health states. Here the most commonly used exponential discounting function was used, which is defined as ( , ) = (− * ). After working out the summation of the PV's combined with an exponential discount function, the NPV is the standard annuity with exponential discount rate r: When the discount rate is equal to zero, all future life years receive the same unit weight and the NPV equals T. When the discount rate is larger than zero, which is typically the case, future life years that are further in the future receive less weight -with the amount of discounting determined by the discount rate.
As with the other empirical applications, the optimal scaling model without interactions was also extended with a full set of two-way interactions: The online supplemental materials provide similar results for the optimal scaling model extended J o u r n a l P r e -p r o o f with an N5 term, which equals 1 if one or more attributes are presented at level 5 and equals zero otherwise; this specification had an inferior model fit compared to the model with the N45 term.

Model estimation
The optimal-scaling models can, similar to the standard dummy-coded heteroskedastic Tobit and MIXL models, be estimated using classical and Bayesian methods. In this paper, Bayesian Markov Chain Monte Carlo (MCMC) methods were used, which involves the selection of prior distributions for the unknown parameters and updating these via the likelihood of the observed data. For the heteroskedastic Tobit models, uninformative normal priors (i.e., with a mean of zero and standard deviation of 10) were assigned to the , , and φ * . For the MIXL models, the same uninformative normal priors were assigned to the population mean parameters ( , ) and an uninformative Wishart prior with an identity scale matrix with degrees of freedom equal to the number of parameters to the inverse covariance matrix and a uniform (0,1) prior to the interest rate.
For the MIXL models, standard Gibbs update steps were used to update the population level mean and covariance parameters, slice update steps to update the discount rate, and a custom-implemented Metropolis-within-Gibbs algorithm with antithetic sampling to update the individual level MIXL and parameters in the cTTO model.

Model comparisons
The statistical fit of the optimal scaling models with and without interactions was compared based on the Watanabe-Akaike information criterion (WAIC). [25] In addition, the impact of the values are rescaled from 0 (for the worst health state) to 1 (for the best health state) to allow for a comparison between the models with and without interactions.   Table 1, almost all interaction parameters are positive and half are significant (i.e. have 90% or 95% CI that do not comprise zero). Only one interaction, the interaction between self-care and usual activities, is negative (and significant). In terms of WAIC, the optimal scaling model with a full set of two-way interactions provides the best model fit, considerably better than the other specifications. Finally, as shown in Table 1, the inclusion of interaction effects increases the value of the worst possible health state from -0.39 in the main-effects model to -0.23 in the twoway interaction model. Tables 2 and 3 provide the MIXL estimates of the optimal scaling models with and without interactions for the latent scale EQ-5D-3L and EQ-5D-5L with duration samples, respectively.

RESULTS
Because the presented models are all on their own latent utility scales, the estimates cannot be compared directly. It is nonetheless possible to compare the estimates in terms of signs, relative magnitudes, and significance (i.e. whether the CI comprise zero or not). As can be seen, a major difference between the two DCEs is the relative importance of the mobility domain. In the EQ-5D-  One of the major advantages of the optimal scaling model is that it produces estimates of main effects and interaction terms that are straight-forward to interpret. This is uniquely different from previous attempts, in which the mixture of positive and negative parameter estimates with varying sizes and statistical significance was difficult, or even impossible, to interpret. [7-9,26] Based on the subset of the most significant interaction effects, previous investigations did find that the inclusion of interactions provides a more accurate description of respondents' preferences. [7][8][9] Moreover, the most significant interactions were "generally positive and therefore in the opposite direction to the main effects", which concurs with our own findings. This suggests that the diminishing marginal effect of accumulated health problems on the health state values is relevant to many valuation studies, and also relevant for the valuation of bolt-on dimensions and/or the application of reduced parameter estimation models. [27] Another advantage of the optimal scaling model is that it is sufficiently parsimonious to be applicable to a wide range of HRQoL instruments. As shown, it can be used for EQ-5D-3L and EQ-5D-5L instruments and for cTTO, latent scale DCE, and stand-alone DCE applications alike. Also, in comparison to the conventional approach of including all two-way interactions between the levels of each of the domains, an enormous reduction in the number of model parameters is achieved. For example, the cTTO heteroskedastic optimal scaling Tobit model with fixed parameters and a full set of two-way interactions requires 54% and 81% fewer parameters for the EQ-5D-3L and EQ-5D-5L, respectively. For the MIXL models with random parameters, the difference is even more substantial with a corresponding reduction of 87% and 98% for the latent scale DCE and 88% and 99% for the DCE with duration model. Consequently, the optimal scaling model allows for the inclusion of interaction effects in health state valuation studies for which it was previously not feasible to include a full set of two-way interactions.
In addition, even without the inclusion of interactions in the model specification the optimal scaling model has several attractive properties. For example, the optimal scaling model conveniently allows the levels within each health dimension to be monotonically constrained (e.g. via a logistic-normal distribution as in equation 5). Of course, the model can also be estimated without imposing a logical ordering of the level decrements (cf. equation 4), but in health state valuation studies it often makes sense to preclude illogical preference reversals. In addition, unlike the standard MIXL model, the J o u r n a l P r e -p r o o f optimal scaling model does not comprise a single variance-covariance matrix for all health-state decrements. Instead, the decomposition of the level decrements allows for separate decisions about whether to allow for respondent-specific parameters in the health-dimension importance (β) and within-optimal scale level severity ( ) parameters. For larger HRQoL instruments this provides additional flexibility to create more parsimonious models than the standard MIXL model. This may not be necessary for instruments such as the EQ-5D-3L and EQ-5D-5L but could be essential for the valuation of substantially larger instruments like the WOOP and EQ-HWB. [28][29] In terms of data quality, the latent scale DCE dataset provided an ideal testing ground for the optimal scaling model. It was based on a Bayesian d-efficient DCE design that was optimized for the measurement of two-way interactions, included N=3,699 respondents, and there were no combinations of health state levels systematically blocked from the DCE design. In contrast, the DCE with duration sample was based on a DCE design that was not explicitly optimized for the identification of two-way interactions, did not comprise a QALY-balanced selection of health states, and was also not optimized for the measurement of non-linear duration. The latter are more recent design improvements that have been found to be particularly important to obtain unbiased DCE with duration results. [19][20] The sample of N=788 respondents still provided adequate statistical power to detect interaction effects, which is likely related to the efficient design optimization using supercomputer facilities. The Australian cTTO study was based on a large selection of health states A more practical limitation of the optimal scaling approach is that the inclusion of the interaction effects requires an experimental design that identifies the main effects and interactions and also requires a larger sample size compared to that of a main-effects only model specification. This is less of a problem for future health state valuation studies, for which the experimental design can be adjusted and sample sizes increased, but implies that the model will not be estimable for many existing studies that are optimized for a main-effects only specification.

CONCLUSION
In conclusion, we have introduced an interaction model that is not only parsimonious but also produces estimates that are straight-forward to interpret and accommodates the estimation of interactions in health state valuation studies for which it was thus far not feasible to include interactions. Substantial evidence of decreasing marginal effects of the accumulation of health problems in health states on health state values was found. Therefore, we recommend future valuation studies to power their data collection for the inclusion of interaction effects and to include interactions based on the proposed optimal scaling approach in the statistical analyses. J o u r n a l P r e -p r o o f J o u r n a l P r e -p r o o f Figure 2. Graphical representation of the optimal scaling model without interactions for the EQ-5D-3L instrument *,** * In the optimal scaling model without interactions, 5 phi-parameters (φ1-5) and 5 beta-parameters (β1-5) are required to construct the optimal scales and to capture the 5 EQ domains' importances, respectively. ** 10 additional phi-parameters would be required to construct the optimal scales for the EQ-5D-5L instrument. For both the EQ-5D-3L and 5L instruments, only 10 interaction parameters (β6-15) are required to include a full set of two-way interactions between the optimal scales.   . Comparison between the predicted values for all EQ-5D health states based on the optimal scaling model without interactions (on the horizontal axis) and a) the default main-effects models (without interactions), b) the optimal scaling model with N3/45 term, and c) the optimal scaling model with full set of two-way interactions (on the vertical axis).
J o u r n a l P r e -p r o o f