Econometric modelling of multiple self-reports of health states: The switch from EQ-5D-3L to EQ-5D-5L in evaluating drug therapies for rheumatoid arthritis

EQ-5D is used in cost-effectiveness studies underlying many important health policy decisions. It comprises a survey instrument describing health states across five domains, and a system of utility values for each state. The original 3-level version of EQ-5D is being replaced with a more sensitive 5-level version but the consequences of this change are uncertain. We develop a multi-equation ordinal response model incorporating a copula specification with normal mixture marginals to analyse joint responses to EQ-5D-3L and EQ-5D-5L in a survey of people with rheumatic disease, and use it to generate mappings between the alternative descriptive systems. We revisit a major cost-effectiveness study of drug therapies for rheumatoid arthritis, mapping the original EQ-5D-3L measure onto a 5L valuation basis. Working within a comprehensive, flexible econometric framework, we find that use of simpler restricted specifications can make very large changes to cost-effectiveness estimates with serious implications for decision-making.

1 Introduction: EQ-5D-3L and EQ-5D-5L

The quality-adjusted life year (QALY) is one of the most widely used health benefit measures in economic evaluations of interventions, services or programmes designed to improve health. The QALY reflects concerns for both quality and length of life and allows health care decision makers to use a consistent approach across a broad range of disease areas, treatments, and patients. QALY estimation is based on patient-reported outcome measures (PROMs), of which EQ-5D is a leading example. EQ-5D is recommended by the English National Institute for Health and Care Excellence (NICE) for its technology appraisals, but it has wider international significance: public bodies in at least ten other countries also recommend EQ-5D as a basis for cost-effectiveness analysis. It is also increasingly used as a measure of performance in wider economic contexts, and as a generic health measure in population surveys (Devlin and Brooks, 2017).

There is continuing debate about the basis of economic appraisal in health policy, with interest in wider outcome measures based on wellbeing or capabilities, income-variation valuations, and the use of weights for different aspects of disease such as burden of disease or rarity (Brazier and Tsuchiya, 2015). Nevertheless, for the foreseeable future, it seems inevitable that cost per QALY will continue to be the main driver of decisions in many public health services around the world.

EQ-5D measures patient outcomes across five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. The original version of EQ-5D, which has been used in a large number of cost-effectiveness evaluations, measures each domain on a scale with three severity levels (no problems, some or moderate problems, extreme problems).
Up to $3^5 = 243$ states of health can be described in this way, and each has been assigned a utility score on the basis of an analysis of preferences over length and quality of life. The maximum number of health states that can be described with the new version is $5^5 = 3125$. Several studies have reported better measurement properties in moving from the EQ-5D-3L to EQ-5D-5L in both specific patient and general population samples (Pickard et al., 2007; Janssen et al., 2013; Scalone et al., 2013; Agborsangaya et al., 2014; Jia et al., 2014). Utility value sets for EQ-5D-5L have been proposed for England, Japan (Ikeda et al., 2015), Canada (Xie et al., 2016), Uruguay (Augustovski et al., 2016), Netherlands (Versteegh et al., 2016) and Korea (Kim et al., 2016) and similar work is underway in many other countries. Many studies now include EQ-5D-5L instead of the standard version. Since these studies will form part of the evidence in future economic evaluations, it is important to assess the likely consequences for economic evaluation decisions of moving across the two different versions of EQ-5D, and to develop a basis for using the very large stock of existing evidence based on the 3L version.
If both variants of the EQ-5D instrument are observed in the same dataset and a utility score is available for each, it is possible to use a conditional statistical model to map directly from the 3L utility score to the 5L score or vice versa. However, that direct approach has three major disadvantages. First, utility scores have highly irregular empirical distributions and the most widely used mapping methods often fit poorly (Hernández-Alava et al., 2012).
Second, use of a single utility score to summarise the 5-dimensional observed response fails to exploit all of the information contained in the observed EQ-5D responses. Third, the direct approach is necessarily specific to the particular scoring system used to construct utility values for the 3L and 5L health descriptions, making it hard to explore sensitivity to variations in the choice of scoring system. The alternative approach known as 'response mapping' (Gray et al., 2006) models the statistical relationship between the 3L and 5L responses and only brings utility scoring in at the final stage. By separating the logically distinct components of health state measurement and utility scoring, response mapping gives (in our view) a more natural way to proceed.
Although statistical mapping is often treated as a routine and arcane statistical task, it can have a critical impact on the outcome of economic decision-making, and the econometric assumptions used for mapping between alternative PROMs need to be examined very carefully. Those assumptions include: the choice of covariates for the mapping model, distributional specification, and independence or dependence of responses across the five domains of EQ-5D. Various statistical specifications appear in the small existing literature. Some authors have assumed conditional independence between the five domains of EQ-5D, estimating a separate model for each domain. Using this approach, van Hout et al. (2012) developed a mapping between EQ-5D-3L and EQ-5D-5L to construct an interim scoring system for EQ-5D-5L derived from the Dolan (1997) scores for EQ-5D-3L. However, independence is an implausible assumption: medical conditions may simultaneously affect multiple aspects of life -for instance severe pain may be accompanied by depression and curtailment of activities. Also, there may be individual-specific styles of questionnaire response which affect responses in all domains -some people tend to look on the bright side, while others do not.
The conventional normality assumption built into the univariate or multivariate ordered probit model is also a strong one, and consistent estimation is not achieved in general if error distributions are non-normal, even if the model is correctly specified in all other respects.
In section 3 of the paper, we develop a multi-equation model that allows for the discrete EQ-5D response scales and uses a flexible mixture-copula specification of the error distributions. Importantly, we do not impose the assumption that responses in the five domains of EQ-5D are statistically independent. In section 4, we apply the model to investigate the consistency of the responses to the two descriptive systems and the implied differences in the utility values. We derive the appropriate mapping technique in section 5 and compare the results from mapping in both directions between the two variants of the EQ-5D instrument.
To explore the implications of modelling strategy for real-world policy decisions, we report an application to cost-effectiveness of treatments for rheumatoid arthritis (RA). We focus on RA partly for its inherent importance -among the 291 medical conditions covered by the 2010 Global Burden of Disease Study (Murray, 2012), RA ranked as the 42nd greatest contributor to global disability, measured in Years Lived with Disability (YLD), ranking immediately after malaria. It is also a rapidly growing problem; between 1990 and 2010, the estimated global burden of RA (adjusted for population growth and ageing) grew 15% in terms of YLD and 44% in terms of disability-adjusted life years (Cross et al., 2014). But data availability is another advantage; we have access to the National Data Bank for Rheumatic Diseases (NDB), which provides a unique RA-specific reference dataset that observes both versions of EQ-5D and also contains detailed clinical outcome measures. This allows us to explore one of the most important features of the mapping process, by varying the information provided by the covariates of the mapping model.
In section 6, we re-visit the important CARDERA cost-effectiveness study (Choy et al., 2008;Wailoo et al., 2014) comparing four drug therapies for RA. We use statistical mapping to convert EQ-5D-3L responses into EQ-5D-5L QALYs, and find a large impact of the choice of statistical assumptions on the evaluation results. Our evidence suggests that the potential to move from EQ-5D-3L to EQ-5D-5L will pose significant methodological questions and may raise questions about some past decisions. We begin in section 2 by describing the NDB data that we use for the EQ-5D-3L and EQ-5D-5L comparison -one of the few datasets available in which both variants of the instrument are carried in the same questionnaire.

The NDB dataset
The NDB is a register of patients with rheumatoid disease, primarily recruited by referral from US and Canadian rheumatologists. Information supplied by participants is validated by direct reference to records held by hospitals and physicians. (A minority of cases come by self-referral, with medical details obtained by NDBRB in the same way.) Full details of the recruitment process are given by Wolfe and Michaud (2011). The EQ-5D responses and other patient-supplied data are collected by various means, primarily postal and web-based questionnaires completed directly by patients. Data collection began in 1998 and continues to the present, in waves administered in January and July of each year. In 2011, there was a switch from the 3L to the 5L version of EQ-5D and both versions were collected in parallel during the January 2011 wave, to allow the effects of the switch to be accommodated in analyses spanning the whole period. Our principal aim is to use data from that wave of the survey to estimate a joint model of the 3L and 5L responses, which can then be used to map from 3L to 5L EQ-5D during the pre-2011 period and from 5L to 3L EQ-5D after January 2011. It then becomes possible to investigate the consistency of the two versions of EQ-5D and assess the impact of mapping between them.

Figure 1 shows the distributions of responses to the 3L and 5L versions of each domain of EQ-5D. There are clear differences between the distributional shapes for different domains: self-care and anxiety/depression have a dominant mode at the first category; the mobility and usual activities domains also have a decreasing profile but with a heavier central section, while the pain/discomfort domain shows a strong mode in the centre of the distribution. This variation in the shape of the component distributions underlines the need to use a suitably flexible model specification to analyse the relationship between variants of EQ-5D.

Utility scores
For each possible combination of EQ-5D responses, there is a utility value which allows overall health-related quality of life to be estimated and compared across individuals and conditions. We use the value sets produced by Dolan (1997) and Devlin et al. (2016) for the 3L and 5L versions of the instrument which, at present, are the standard choices for QALY measurement in England.

Figure 2 shows kernel density estimates of the distributions of utility scores in the NDB data, aggregated across all five domains. The distribution is smoother for the 5L version, particularly towards the top of the range, and this finer structure is a major reason for its adoption in practice. The distribution of utility scores for the 3L version of EQ-5D has two particularly worrying features. First, there are ranges with probability mass at or close to zero, particularly around 0.8-1.0 and 0.3-0.45. Consequently, methods for mapping to and from EQ-5D-3L which implicitly assume a smooth positive density can give very poor results (Hernández-Alava et al., 2012). The second striking feature of the distribution for EQ-5D-3L is the large group of cases with utility values close to zero, implying that a non-negligible proportion of patients with rheumatoid arthritis (RA) are in a state comparable to, or worse than, death. The outcomes of evaluation studies often rest on the ability of a therapy to improve quality of life for patients in very poor health, so the (perhaps implausibly) large frequency of such cases is a potential source of bias in NICE recommendations.

Table 1 shows a similar picture. There is a high correlation between the two variants of EQ-5D, but the 5L version has greater sensitivity, since correlations with demographics and clinical outcomes (in the lower panels of Table 1) are uniformly higher for EQ-5D-5L.
Table 2 shows that there is a systematic difference in the 3L and 5L utility scores, with the old system generating utilities averaging (in the NDB data) only 87% of the utility values given by the new system. This alone could make a significant difference to some evaluation results. It would be inadvisable to address the issue with a simple proportional adjustment, since the ratio of mean scores is not constant but decreases as both general severity and pain increase, so the differences are minor at the top end of EQ-5D and much larger at the bottom. Table 2 gives means classified by levels of general disability (in three groups, scores 0-1, 1-2 and 2-3) and pain (in five groups 0-2, 2-4, 4-6, 6-8 and 8-10), as classified by the Stanford Health Assessment Questionnaire (HAQ). The HAQ is widely used by clinicians to measure treatment outcomes; see Bruce and Fries (2003) for a review.
Mapping from 3L to 5L involves two changes: a shift from the 3L health descriptive system to the 5L system, made using a predictive statistical mapping model; and a shift from the utility tariff developed for EQ-5D-3L to the utility tariff applicable to EQ-5D-5L. These two changes occur jointly, so it is not possible to disentangle fully the effect on cost-effectiveness calculations of mapping from the effect of the change in utility structure. However, within a fixed framework dictated by the given 3L and 5L utility tariffs, it is possible to compare the results produced by alternative specifications of the mapping model. This is our strategy, implemented within a comprehensive and flexible econometric approach.

A correlated copula model with mixture marginals
Our aim is to develop an econometric model of responses to the ten items of the 3L and 5L instruments. The specification is guided by six important considerations, intended to avoid unnecessarily strong restrictions on the data. The model should:
(i) Treat the 3L and 5L responses symmetrically so that it can be used for 3L→5L and 5L→3L mapping in a mutually consistent way.
(ii) Avoid the assumption that the 5L response scale is simply a more detailed categorisation than the 3L scale of the same underlying concept; structural differences between the two responses are permitted if empirically necessary.
(iii) Allow for the effects of covariates (here, age, sex and clinical outcome measures), without assuming that they necessarily influence 3L and 5L responses in the same way.
(iv) Capture the strong association between 3L and 5L responses within each health domain, without necessarily assuming that the strength of the association is the same in all parts of the health distribution; for example, someone who has experienced extreme pain may answer the pain questions in a more focused and coherent way than someone without experience of chronic pain. To achieve this, we use a copula approach (Trivedi and Zimmer, 2005) to specify the bivariate distribution of each 3L, 5L pair of responses.
(v) Be sufficiently flexible to fit the diverse response patterns shown in Figure 1, so we generalise the usual assumption of normally-distributed errors by allowing for a 2-part normal mixture distribution, which can capture a wide range of distributional shapes.
(vi) Allow dependence across the five domains of EQ-5D, reflecting common underlying causes and individual-specific response styles; we achieve this by incorporating a random latent factor influencing responses in all domains.
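Consideration (v) can be illustrated with a minimal sketch of a 2-part normal mixture distribution function. The function names and all parameter values below are hypothetical, chosen only for illustration; in particular, the choice of component means that give the mixture a zero mean is an assumption of this sketch, not a statement of the normalisation used in the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mixture_cdf(e, weight, mu1, sigma1, mu2, sigma2):
    """Distribution function of a 2-part normal mixture.

    With probability `weight` the error is N(mu1, sigma1^2),
    otherwise it is N(mu2, sigma2^2).
    """
    return (weight * normal_cdf((e - mu1) / sigma1)
            + (1.0 - weight) * normal_cdf((e - mu2) / sigma2))


def mixture_mean_var(weight, mu1, sigma1, mu2, sigma2):
    """Mean and variance implied by the mixture parameters."""
    mean = weight * mu1 + (1.0 - weight) * mu2
    second_moment = (weight * (sigma1 ** 2 + mu1 ** 2)
                     + (1.0 - weight) * (sigma2 ** 2 + mu2 ** 2))
    return mean, second_moment - mean ** 2


# Hypothetical parameter values giving a skewed, zero-mean error distribution.
EXAMPLE = dict(weight=0.7, mu1=-0.3, sigma1=0.8, mu2=0.7, sigma2=1.2)
print(mixture_mean_var(**EXAMPLE))
print(mixture_cdf(0.0, **EXAMPLE))
```

Setting `weight` to 1 recovers the ordinary normal case, so the Gaussian-marginal model is nested within the mixture specification.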
In advance of the empirical analysis, there is no way of knowing which of these considerations is most important, so the resulting model is complex. Define $1 \le Y_{3id} \le 3$ and $1 \le Y_{5id} \le 5$ as the reported outcomes for the $d$th domain ($d = 1, \dots, 5$) of the 3L and 5L forms of EQ-5D. The model is a system of ten latent regressions, arranged in the five domain groups, with domain $d$ containing the equations for $Y_{3id}$ and $Y_{5id}$:

$$Y^*_{3id} = X_i \beta_{3d} + U_{3id}, \qquad Y^*_{5id} = X_i \beta_{5d} + U_{5id}$$

where $i$ indexes independently sampled individuals, $X_i$ is a collection of row vectors of covariates, $\beta_{3d}$, $\beta_{5d}$ are corresponding coefficient vectors and $U_{3id}$, $U_{5id}$ are unobserved errors which may be stochastically dependent and non-normal. The latent dependent variables $Y^*_{3id}$, $Y^*_{5id}$ are not observed directly but they have observable ordinal counterparts, $Y_{3id}$, $Y_{5id}$, generated by the following threshold-crossing conditions:

$$Y_{kid} = q \quad \text{if and only if} \quad \Gamma_{kqd} \le Y^*_{kid} < \Gamma_{k(q+1)d}, \qquad q = 1, \dots, Q_k, \quad k = 3, 5$$

where $Q_k = 3$ or 5 is the number of categories of $Y_{kid}$ and the $\Gamma_{kqd}$ are threshold parameters, with $\Gamma_{k1d} = -\infty$ and $\Gamma_{k(Q_k+1)d} = +\infty$.

High-dimensional ordinal-variable applications present major computational problems.
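The threshold-crossing mechanism can be sketched in a few lines. This is only an illustration for a single domain: the cutpoint values, the linear index values and the use of independent standard normal errors are all hypothetical simplifications (the model in the text allows dependent, non-normal errors).

```python
import bisect
import random


def to_category(latent: float, cutpoints: list[float]) -> int:
    """Map a latent value to an ordinal category 1..len(cutpoints)+1.

    `cutpoints` holds the finite interior thresholds in increasing order;
    the outer thresholds are implicitly -infinity and +infinity.
    A latent value equal to a threshold falls in the higher category.
    """
    return bisect.bisect_right(cutpoints, latent) + 1


def simulate_pair(xb3, xb5, cuts3, cuts5, rng):
    """Draw one (3L, 5L) response pair for a single domain.

    Independent N(0, 1) errors are a simplifying assumption of this sketch.
    """
    y3_star = xb3 + rng.gauss(0.0, 1.0)
    y5_star = xb5 + rng.gauss(0.0, 1.0)
    return to_category(y3_star, cuts3), to_category(y5_star, cuts5)


# Hypothetical thresholds: two interior cutpoints for 3L, four for 5L.
CUTS_3L = [-0.5, 1.2]
CUTS_5L = [-0.9, -0.1, 0.7, 1.6]

rng = random.Random(0)
sample = [simulate_pair(0.2, 0.2, CUTS_3L, CUTS_5L, rng) for _ in range(5)]
print(sample)
```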
Currently, there is only a single published model of EQ-5D responses that relaxes independence (Conigliani et al., 2015), using a 5-equation correlated multivariate ordered probit model to predict EQ-5D responses from aggregate SF12 scores. Using that model in our 10-dimensional 3L-5L mapping context would involve estimation of 45 residual covariance parameters, with a likelihood requiring numerical integration over a 10-dimensional rectangle.
Past experience with similar maximum simulated likelihood problems, using best-practice simulation methods like Halton sequences, tells us that likelihood-based tests and fit statistics are not robust enough for model comparisons to be reliable. The conventional ordered probit model also involves normality assumptions that are critical to its consistency property and which we want to relax.
Possible solutions to the dimensionality problem work by imposing structure on the joint distribution of the latent $Y^*_{kid}$. In the copula literature, the most common approach is to build it up from bivariate component distributions, often using vine structures (Bedford and Cooke, 2002; Panagiotelis et al., 2012). However, that is most convincing when there is a natural ordering of the observed variables, particularly temporal sequencing (as in the application by Panagiotelis et al. (2012) to a sequence of four observations on headache spaced through the day). In our case, although the component items of EQ-5D-5L were asked in sequence and then the items of EQ-5D-3L later in the questionnaire, that ordering does not correspond at all to the natural connections between the 3L and 5L items through their shared meaning. For that reason, we adopt a different approach, using five separate bivariate copulas for the five domains of EQ-5D, and connecting the domains via a latent factor $V$ which represents common influences on the respondent's responses. The error $U_{kid}$ is decomposed into the latent factor $V_i$ and a specific error $\varepsilon_{kid}$ correlated within but not between domains:

$$U_{kid} = \psi_{kd} V_i + \varepsilon_{kid}$$

where the $\psi_{kd}$ are a set of ten parameters. We make the standard assumptions that, conditional on $X_i$: $V_i$ is independent of all the $\varepsilon_{kid}$; the $\varepsilon_{kid}$ are all mutually independent, except that $\varepsilon_{3id}$, $\varepsilon_{5id}$ are possibly dependent within any health domain $d$.
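To see how the latent factor generates dependence across domains, the error covariances implied by the decomposition and the independence assumptions can be written out. The symbol $\sigma_V^2$ for the variance of $V_i$ is introduced here only for illustration; the normalisation actually applied to $V_i$ is not restated in this sketch.

```latex
% Covariances implied by U_{kid} = \psi_{kd} V_i + \varepsilon_{kid},
% with V_i independent of every \varepsilon_{kid} and the \varepsilon_{kid}
% independent across domains (all statements conditional on X_i).
\begin{align*}
  \operatorname{Cov}(U_{kid}, U_{k'id'})
    &= \psi_{kd}\,\psi_{k'd'}\,\sigma_V^2
    && \text{for } d \neq d', \\
  \operatorname{Cov}(U_{3id}, U_{5id})
    &= \psi_{3d}\,\psi_{5d}\,\sigma_V^2
       + \operatorname{Cov}(\varepsilon_{3id}, \varepsilon_{5id})
    && \text{within domain } d .
\end{align*}
```

Under these assumptions, setting every $\psi_{kd}$ to zero removes all cross-domain dependence and leaves five separate bivariate models, which is the restricted case of independence across domains.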
We use a copula representation to capture dependence between the 3L and 5L responses for any domain. Suppressing the $i$ subscript, define $F_d(\varepsilon_{3d}, \varepsilon_{5d})$ as the joint distribution function (df) of $\varepsilon_{3d}$ and $\varepsilon_{5d}$ for domain $d$. It is specified as:

$$F_d(\varepsilon_{3d}, \varepsilon_{5d}) = c_d\big(G_{3d}(\varepsilon_{3d}),\, G_{5d}(\varepsilon_{5d});\, \theta_d\big)$$

where $G_{kd}(.)$ is the marginal df of $\varepsilon_{kd}$ and $\theta_d$ is a parameter controlling the dependence between $\varepsilon_{3d}$ and $\varepsilon_{5d}$. The function $c_d(.)$ is known as a copula and, together with the marginals $G_{3d}(.)$, $G_{5d}(.)$, it uniquely characterises the bivariate distribution of $\varepsilon_{3d}$, $\varepsilon_{5d}$. It has the properties $c_d(0, u) = c_d(u, 0) = 0$ and $c_d(1, u) = c_d(u, 1) = u$ for any $0 \le u \le 1$ (Trivedi and Zimmer, 2005). We consider the following candidate forms: Gaussian, Frank, Clayton, Gumbel and Joe. The Gaussian copula is

$$c(u_3, u_5; \theta) = \Phi\big(\Phi^{-1}(u_3),\, \Phi^{-1}(u_5);\, \theta\big)$$

where $\Phi(., .; \theta)$ is the distribution function of the bivariate normal with correlation coefficient $\theta$ and $\Phi^{-1}(.)$ is the inverse of the standard normal df.

The Gaussian and Frank copulas are similar in that both allow for positive or negative dependence, symmetric in both tails, but the Frank form generates dependence weaker in the tails and stronger in the centre of the distribution. The Clayton copula allows only positive dependence, with strong left tail dependence and relatively weak right tail dependence; thus, if two variables are strongly correlated at low values but less so at high values, then the Clayton copula is a good choice. To show the effect of copula choice, Figure 3 shows simulated scatter plots generated using these three copulas. The Gumbel and Joe copulas (not illustrated) display weak left tail dependence and strong right tail dependence, which is stronger for the Joe than the Gumbel copula.
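For concreteness, three of the candidate copulas can be coded directly from their standard textbook formulas. This is a sketch under the assumption that the usual parameterisations apply (Frank with $\theta \neq 0$, Clayton with $\theta > 0$, Gumbel with $\theta \ge 1$); the parameterisation used in the paper's software may differ, and the function names are hypothetical.

```python
import math


def frank(u: float, v: float, theta: float) -> float:
    """Frank copula, theta != 0 (standard parameterisation)."""
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def clayton(u: float, v: float, theta: float) -> float:
    """Clayton copula, theta > 0: strong lower-tail dependence."""
    if u == 0.0 or v == 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def gumbel(u: float, v: float, theta: float) -> float:
    """Gumbel copula, theta >= 1: strong upper-tail dependence."""
    if u == 0.0 or v == 0.0:
        return 0.0
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


# Check the boundary properties c(1, u) = u and c(0, u) = 0 numerically,
# using hypothetical dependence parameters.
for name, copula, theta in [("frank", frank, 5.0),
                            ("clayton", clayton, 2.0),
                            ("gumbel", gumbel, 2.0)]:
    print(name, copula(1.0, 0.3, theta), copula(0.0, 0.3, theta))
```

The Gaussian copula is omitted from the sketch because it needs a bivariate normal distribution function, which is not available in the Python standard library.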
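Conditional on $X$ and $V$, the probability of observing any values $Y_{3d} = q$ and $Y_{5d} = r$ is a rectangle probability of the bivariate df $F_d$. Writing $b_{kqd} = \Gamma_{kqd} - X\beta_{kd} - \psi_{kd}V$ for brevity:

$$\Pr(Y_{3d} = q, Y_{5d} = r \mid X, V) = F_d\big(b_{3(q+1)d}, b_{5(r+1)d}\big) - F_d\big(b_{3qd}, b_{5(r+1)d}\big) - F_d\big(b_{3(q+1)d}, b_{5rd}\big) + F_d\big(b_{3qd}, b_{5rd}\big)$$

Because $V$ is unobserved and the domains are independent conditional on $X$ and $V$, the likelihood contribution of an individual is the product of these probabilities over the five domains, integrated with respect to the distribution of $V$:

$$\Pr(Y_{31}, Y_{51}, \dots, Y_{35}, Y_{55} \mid X) = \int \prod_{d=1}^{5} \Pr(Y_{3d}, Y_{5d} \mid X, V)\, dH(V) \tag{7}$$

where $H(.)$ denotes the df of $V$. We use Gauss-Hermite quadrature with 15 integration points to evaluate the integral in (7) at each observation to give the likelihood function.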
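The quadrature step can be sketched for a single domain as follows. Several simplifications here are assumptions made only for illustration: the latent factor is taken to be standard normal, the marginals are Gaussian rather than mixtures, the Frank copula is used, only one domain is integrated (the full likelihood integrates the product over five domains), and every numerical value in `PARAMS` is hypothetical. The sketch relies on `numpy.polynomial.hermite.hermgauss` for the nodes and weights.

```python
import math

import numpy as np

INF = float("inf")


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def frank(u: float, v: float, theta: float) -> float:
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def cell_prob_given_v(q: int, r: int, v: float, p: dict) -> float:
    """Pr(Y3 = q, Y5 = r | X, V = v) as a rectangle probability."""
    def joint_df(cut3: float, cut5: float) -> float:
        u3 = normal_cdf(cut3 - p["xb3"] - p["psi3"] * v)
        u5 = normal_cdf(cut5 - p["xb5"] - p["psi5"] * v)
        return frank(u3, u5, p["theta"])

    lo3, hi3 = p["cuts3"][q - 1], p["cuts3"][q]
    lo5, hi5 = p["cuts5"][r - 1], p["cuts5"][r]
    return (joint_df(hi3, hi5) - joint_df(lo3, hi5)
            - joint_df(hi3, lo5) + joint_df(lo3, lo5))


def cell_prob(q: int, r: int, p: dict, n_points: int = 15) -> float:
    """Integrate the conditional probability over V ~ N(0, 1) (assumed)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    # Change of variable v = sqrt(2) * x turns the e^{-x^2} weight
    # into the standard normal density, up to the factor 1 / sqrt(pi).
    total = sum(w * cell_prob_given_v(q, r, math.sqrt(2.0) * x, p)
                for x, w in zip(nodes, weights))
    return total / math.sqrt(math.pi)


# Hypothetical parameter values for one domain.
PARAMS = {
    "xb3": 0.2, "xb5": 0.3, "psi3": 0.6, "psi5": 0.5, "theta": 6.0,
    "cuts3": [-INF, -0.5, 1.2, INF],
    "cuts5": [-INF, -0.9, -0.1, 0.7, 1.6, INF],
}

table = [[cell_prob(q, r, PARAMS) for r in range(1, 6)] for q in range(1, 4)]
print(sum(sum(row) for row in table))
```

The 3 × 5 table of cell probabilities should sum to one up to rounding error, which gives a simple check on the rectangle formula and the quadrature weights.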

Modelling results
Our aim is to estimate the joint distribution of the responses to the 3L and 5L variants of the EQ-5D survey instrument, conditional on demographic characteristics (age and gender), and clinical measures of the severity of the underlying rheumatic condition. We use seven covariates: age, gender, the HAQ disability score, the pain scale, and the squares and product of the HAQ and pain scales.
The HAQ is based on patient self-reporting of the degree of difficulty experienced over the previous week in eight categories: dressing and grooming, arising, eating, walking, hygiene, reach, grip, and common daily activities. It is widely used by clinicians to measure health outcomes. It is scored in increments of 0.125 between 0 and 3 (although it is standard to consider it fully continuous), with higher scores representing greater degrees of functional disability. The HAQ instrument also includes separately a patient self-report of pain scored on a Visual Analogue Scale (0-10).

Domain-specific modelling
We start by examining each of the five domains of EQ-5D separately using a bivariate approach (Hernández-Alava and Pudney, 2016), comparing copula variants for each of the five domains, where we retain the standard assumption of Gaussian marginals. There is no single best choice of copula: the Gaussian form fits best for dimensions 1 and 3 (mobility and usual activities), the Frank copula fits best for dimensions 2 and 5 (self-care and anxiety/depression) while the Gumbel copula fits best for the pain/discomfort dimension. This coincides with differences in the empirical distributions of Figure 1 between these three groups of domains. The Frank copula (which allows weaker dependence in the tails than the centre of the distribution) works better than the Gaussian copula when the tails of the response distribution are relatively heavy. The Gumbel copula, which has asymmetric dependence in the tails (stronger dependence at higher values), fits better when there is a central mode and implies different patterns of dependence in the two tails of the distribution. This finding shows that the effect of the move to 5 levels is not simply a uniform re-alignment of the response level. 5

The assumption of normal marginals for the errors $\varepsilon_{kd}$ was acceptable in terms of the Akaike (AIC) and Bayesian (BIC) information criteria for the mobility, self-care and anxiety/depression domains, but there was significant evidence of modest departures from normality for the usual activities and pain/discomfort domains. Table 4 summarises the preferred specifications for those two domains, comparing them with the simpler Gaussian-marginal models. Note that the conclusions about the equality of coefficients are not affected by non-normality.
5 Note that these are formally tests of the hypothesis that the coefficient vectors are equal after each error variance is normalised to unity. Since the extreme points on the 3L and 5L scales are (mostly) given the same verbal labels to act as anchors, the assumption seems reasonable. Also, where differences are statistically significant, the 3L and 5L coefficient vectors are clearly not scalar multiples of each other.

Table notes: best-fitting models in bold type (all models have 15 parameters). Statistical significance: * = 10%, ** = 5%, *** = 1%. § No convergence.
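The information criteria used to compare specifications are simple functions of the maximised log-likelihood. The sketch below uses hypothetical log-likelihood values and a hypothetical sample size purely to show the calculation; the figure of 15 parameters is the count quoted in the table notes.

```python
import math


def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion (smaller is better)."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion (smaller is better)."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Hypothetical maximised log-likelihoods for one domain; NOT results
# from the paper.
LOGLIK = {"gaussian": -4210.4, "frank": -4198.7, "clayton": -4305.1}
N_PARAMS = 15
N_OBS = 5000  # hypothetical sample size

scores = {name: (aic(ll, N_PARAMS), bic(ll, N_PARAMS, N_OBS))
          for name, ll in LOGLIK.items()}
best = min(scores, key=lambda name: scores[name][0])
print(best, scores[best])
```

When every candidate has the same number of parameters, as here, ranking by AIC or BIC coincides with ranking by the log-likelihood itself; the penalty terms only matter when comparing models of different size, such as Gaussian versus mixture marginals.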
(to a lesser degree) usual activities domains. Moreover, the two threshold parameters for the 3L model fall respectively between the bottom two, and top two thresholds in the 5L model ($\Gamma_{52d} < \Gamma_{32d} < \Gamma_{53d}$ and $\Gamma_{54d} < \Gamma_{33d} < \Gamma_{55d}$), which is consistent with the idea of a simple re-alignment of responses. However, for the mobility and pain/discomfort domains, the differences between dfs are sizeable and statistically significant, with the pain/discomfort domain displaying the largest difference. For both mobility and pain/discomfort, one of the 3L threshold parameters lies outside the range covered by the 5L threshold parameters, which is inconsistent with the simple re-alignment hypothesis.

Mapping
The best method of mapping between alternative preference-based measures depends on the nature of the cost-effectiveness study in which the measure is to be used. Suppose, for example, that the study is to be done on the new 5L basis, but the available evidence comes from a clinical trial in which the older EQ-5D-3L scale is measured. The key concept is the mean QALY, which should be constructed as $E\{Q(\upsilon_5(Y_5))\}$, where $E\{.\}$ is the expectation with respect to whatever population is potentially affected by the treatment.
There are two technical issues to be considered in mapping from 3L evidence to 5L-based evaluation. The first is the form of the function $Q(.)$, which maps utilities into QALYs. In most evaluation studies, the QALY calculation $Q(.)$ is a linear function of the utilities, so that $E\{Q(\upsilon_5(Y_5))\} = Q(E\{\upsilon_5(Y_5)\})$. In other words, we can simply predict the utility outcome $\upsilon_5(Y_5)$ and use that prediction in calculating QALYs. If the predictor is an unbiased (or consistent) estimator of $E[\upsilon_5(Y_5)]$, it will give an unbiased (consistent) evaluation of the expected QALY.
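The linearity property can be checked numerically with a common form of QALY calculation, the area under the utility profile computed by the trapezoidal rule. This is a sketch under assumptions: the assessment times and utility values are hypothetical, discounting is ignored, and the trapezoidal rule is used as an example of a linear $Q(.)$ rather than as the specific calculation of any particular study.

```python
def qaly(times_years, utilities):
    """Area under a utility profile by the trapezoidal rule (no discounting)."""
    total = 0.0
    for j in range(len(times_years) - 1):
        width = times_years[j + 1] - times_years[j]
        total += 0.5 * (utilities[j] + utilities[j + 1]) * width
    return total


# Hypothetical assessment times (years) and utility profiles for two patients.
TIMES = [0.0, 0.5, 1.0, 2.0]
PATIENT_A = [0.50, 0.60, 0.70, 0.70]
PATIENT_B = [0.30, 0.30, 0.40, 0.50]

mean_profile = [0.5 * (a + b) for a, b in zip(PATIENT_A, PATIENT_B)]
mean_of_qalys = 0.5 * (qaly(TIMES, PATIENT_A) + qaly(TIMES, PATIENT_B))
qaly_of_means = qaly(TIMES, mean_profile)
print(mean_of_qalys, qaly_of_means)
```

Because the calculation is linear in the utilities, the mean of the individual QALYs equals the QALY computed from mean utilities, which is why only the conditional mean of the mapped utility is needed.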
The second issue is the choice of predictor for $\upsilon_5(Y_5)$. We have argued here that a predictor based on a full model of $\Pr(Y_5 \mid Y_3, X)$ uses more information and is capable of giving better results than the alternative approach to mapping, which attempts to model $E(\upsilon_5(Y_5) \mid \upsilon_3(Y_3), X)$ directly, often using methods like linear regression which are not well suited to the non-standard distributions involved. When using our approach, it is important to realise that the utility scales $\upsilon(.)$ are nonlinear functions of the vector $Y$, so applying $\upsilon_5(.)$ to a single predicted value of $Y_5$ does not, in general, give $E(\upsilon_5(Y_5) \mid Y_3, X)$. We should not map the observed 3L health description $Y_3$ into the 5L descriptive system $Y_5$ and then apply the utility scale $\upsilon_5(.)$. Instead, the appropriate method is to use the model estimated from NDB data to evaluate the probability of each possible configuration of $Y_5$ conditional on $Y_3$, $X$ and use those probabilities as weights to evaluate the conditional expectation of $\upsilon_5$. The conditional df of the valuation $\upsilon_5$ is:

$$\Pr(\upsilon_5(Y_5) \le \Upsilon \mid Y_3, X) = \sum_{Y_5 \in U_\Upsilon} \Pr(Y_5 \mid Y_3, X) \tag{8}$$

where $U_\Upsilon$ is the set $\{Y_5 : \upsilon_5(Y_5) \le \Upsilon\}$ and $\Upsilon$ is any given constant. The mean is:

$$E(\upsilon_5(Y_5) \mid Y_3, X) = \sum_{Y_5 \in S_5} \upsilon_5(Y_5) \Pr(Y_5 \mid Y_3, X) \tag{9}$$

where $S_5$ is the set of 3125 possible values that the vector $Y_5$ might take.

The choice of covariates is critical here. Mapping from $Y_3$ rather than direct observation of $\upsilon_5(Y_5)$ introduces no bias in the calculation of mean QALYs if the conditional mean function $E(\upsilon_5(Y_5) \mid Y_3, X)$ in the population represented by the reference sample used for mapping is identical to $E(\upsilon_5(Y_5) \mid Y_3, X)$ in the population represented by the trial subjects. In general, reference samples and trial samples are drawn in quite different ways, and there is always a possibility that the statistical relationship between $Y_3$ and $Y_5$ could differ substantially between the two populations, leading to mapping bias. The use of covariates can reduce this risk by allowing for factors which might cause the $Y_3$, $Y_5$ association to differ across samples.
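The probability-weighted calculation of the conditional df and mean can be sketched by enumerating all 3125 states. Two things in this sketch are hypothetical and must not be read as the paper's inputs: `toy_tariff` is an invented scoring function (not the Devlin et al. or any other real value set), and the state probabilities are built from per-domain probabilities under independence across domains, a simplification that the joint model in the text deliberately avoids.

```python
from itertools import product


def toy_tariff(state):
    """Hypothetical nonlinear utility score for a 5L state (NOT a real value set)."""
    decrement = sum(0.05 * (level - 1) for level in state)
    if max(state) == 5:  # extra penalty when any domain is at the worst level
        decrement += 0.10
    return 1.0 - decrement


def state_probabilities(domain_probs):
    """Probability of each of the 3125 states.

    `domain_probs[d][level - 1]` stands in for Pr(Y5d = level | Y3, X).
    Multiplying across domains assumes independence (a simplification).
    """
    probs = {}
    for state in product(range(1, 6), repeat=5):
        p = 1.0
        for d, level in enumerate(state):
            p *= domain_probs[d][level - 1]
        probs[state] = p
    return probs


def expected_utility(probs, tariff):
    """Probability-weighted mean utility over all states."""
    return sum(p * tariff(state) for state, p in probs.items())


def utility_cdf(probs, tariff, cutoff):
    """Probability that the utility is at or below `cutoff`."""
    return sum(p for state, p in probs.items() if tariff(state) <= cutoff)


# Hypothetical conditional response probabilities, identical in every domain.
DOMAIN_PROBS = [[0.10, 0.50, 0.30, 0.08, 0.02]] * 5

probs = state_probabilities(DOMAIN_PROBS)
modal_state = (2, 2, 2, 2, 2)  # most likely level in every domain
print(expected_utility(probs, toy_tariff), toy_tariff(modal_state))
print(utility_cdf(probs, toy_tariff, 0.5))
```

The expected utility differs from the score of the single most likely state, which illustrates why the health description should not be mapped first and scored afterwards.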
Thus, even if $E(\upsilon_5(Y_5) \mid Y_3)$ differs between the reference and trial samples, $E(\upsilon_5(Y_5) \mid Y_3, X)$ may not, for a judicious choice of covariates. We explore this in the next section.
Several authors have commented on the loss of variation induced by mapping (Brazier et al., 2010; Longworth and Rowen, 2011; Fayers and Hays, 2014). The sample variance of the mean predictor (9) will always be lower than the variance of the unknown true $\upsilon_5(Y_5)$, because the modelling process can only predict variation in $\upsilon_5(Y_5)$ arising from $Y_3$ and $X$, not the other "unexplained" components of variation. In standard cases where the QALY calculation is linear in utilities, this does not matter, since only the conditional mean of $\upsilon_5(Y_5)$ is required. If the aim were to estimate the variance of $\upsilon_5(Y_5)$, one would not do it by using the variance of the predictor (9); instead, the appropriate method is to calculate directly the variance of the distribution (8), which gives a consistent estimate of $\mathrm{var}(\upsilon_5(Y_5))$ if the mapping model is correctly specified and estimated.
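The loss of variation follows from the law of total variance, a standard result which can be written in the notation of this section as:

```latex
% Law of total variance applied to the mapped utility.
% The first term on the right is the variation that a mean predictor
% cannot reproduce; the second is the variance of the mean predictor (9).
\begin{equation*}
  \operatorname{var}\big(\upsilon_5(Y_5)\big)
    = E\big[\operatorname{var}\big(\upsilon_5(Y_5) \mid Y_3, X\big)\big]
    + \operatorname{var}\big(E\big[\upsilon_5(Y_5) \mid Y_3, X\big]\big)
\end{equation*}
```

Since the first term is non-negative, the variance of the mean predictor can never exceed the variance of the true utility, and the two coincide only if $Y_3$ and $X$ determine $\upsilon_5(Y_5)$ exactly.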
If we evaluate (8) and (9) at each observation $Y_{i3}$, $X_i$, and then average over the sample, the result is a consistent estimator of the distribution of $\upsilon_5(Y_5)$ or its mean $E[\upsilon_5(Y_5)]$.
This can be done empirically for the pre-January 2011 waves of the NDB dataset and in reverse (predicting $Y_3$ conditional on $Y_5$) for the post-January 2011 waves. Figure 6a uses the set of domain-specific bivariate models (assuming independence across domains) to compare the predictive df $n^{-1} \sum_{i=1}^{n} \Pr(\upsilon_5(Y_5) \le \Upsilon \mid Y_{i3}, X_i)$ and the directly-observed empirical df.

There are two striking features of Figures 6 and 7, with important implications for the economic evaluations carried out for public bodies like NICE. First, the predictive and actual distributions of the 5L variant of EQ-5D are similar and much smoother than the corresponding distributions for the 3L variant. This is an encouraging finding: if a decision maker elects to recommend the use of the new 5L instrument and associated scoring, it may be possible to continue to use older 3L-based evidence with appropriate mapping to 5L. Second, there is a large difference between the 3L and 5L distributions of EQ-5D scores, whether directly observed or mapped. Utility scores tend to be systematically higher under the 5L scoring scheme, so the df for EQ-5D-3L lies entirely to the left of the df for EQ-5D-5L. This alone might be enough to change many evaluation results, in the absence of offsetting adjustments to the evaluation methodology.

Table 6 shows average values of directly-measured $\upsilon_3(Y_3)$ and the prediction $E[\upsilon_5(Y_5) \mid Y_3, X]$ for the 2010 wave of NDB, and of the prediction $E[\upsilon_3(Y_3) \mid Y_5, X]$ and directly-measured $\upsilon_5(Y_5)$ for the 2012 wave using the joint model. Results are given for the whole sample and subgroups defined in terms of disease severity and demographic characteristics; sample standard deviations of the measured and predicted utilities are also shown.
As expected, there are higher mean values and smaller standard deviations for the EQ-5D-5L scores (whether predicted or directly observed) than for EQ-5D-3L, resulting from the different scoring of poor health states by the two value sets. Another consequence of this is the much steeper severity gradient for the mean EQ-5D-3L utilities than for EQ-5D-5L.
There is a slight tendency for both the 3L and 5L utilities to decline over time as the health states of those individuals who appear in both waves tend to worsen. However, the means of predicted and directly-observed versions of each measure are remarkably close both overall and in terms of their severity and demographic profiles.
We also see the anticipated smaller standard deviations of the predicted than of the directly-observed utilities, as a consequence of the use of expected value prediction. This is of no importance for the evaluation described in the next section (since the criterion is based on the mean QALY), but it would be a concern for any evaluation that aims to investigate the distributional pattern of QALY gains within each population group. In that case, appropriate measures constructed from the full distribution (8) would need to be used.
Table 6: Means and standard deviations of actual and predicted (joint model) EQ-5D-3L and EQ-5D-5L by severity of condition, age and gender.
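The smaller spread of predicted utilities discussed above is what the law of total variance implies for any conditional-mean predictor. As a sketch for the 3L-to-5L direction, using the conditioning set $(Y_3, X)$ and assuming only that the relevant moments exist:

```latex
\operatorname{Var}\bigl[\upsilon_5(Y_5)\bigr]
  = \operatorname{Var}\bigl(E[\upsilon_5(Y_5) \mid Y_3, X]\bigr)
  + E\bigl(\operatorname{Var}[\upsilon_5(Y_5) \mid Y_3, X]\bigr)
  \;\ge\; \operatorname{Var}\bigl(E[\upsilon_5(Y_5) \mid Y_3, X]\bigr)
```

By the law of iterated expectations the mean is unaffected, $E\bigl(E[\upsilon_5(Y_5) \mid Y_3, X]\bigr) = E[\upsilon_5(Y_5)]$, which is why mean-based criteria can use the prediction directly while distributional analyses cannot.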

The impact on cost-effectiveness analysis
We now use a published cost-effectiveness study to examine the potential consequences of moving from EQ-5D-3L to EQ-5D-5L as a basis for economic evaluation. We first replicate the economic evaluation results in Wailoo et al. (2014), which use EQ-5D-3L data collected as part of a trial. Then we repeat the analysis using EQ-5D-5L obtained using the mapping models developed in this paper. Wailoo et al. (2014)  The key criterion used in cost-effectiveness analysis is the Incremental Cost-Effectiveness Ratio (ICER), defined as the difference in costs between two different treatment strategies, expressed as a ratio to the difference in the QALYs that they achieve. Treatments with ICERs below a certain threshold are usually considered cost-effective. In the UK, NICE guidance on technology appraisal refers to a specific range £20,000-£30,000 (NICE, 2013), but see also Claxton et al. (2015) who argue for a lower threshold.
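In symbols, the definition can be written as follows. The notation is introduced here purely for illustration: $C$ denotes mean discounted cost, $Q$ mean discounted QALYs, and $A$ and $B$ the two strategies being compared.

```latex
\mathrm{ICER}_{A,B} = \frac{C_A - C_B}{Q_A - Q_B}
```

When $A$ is both cheaper and more effective than $B$ ($C_A < C_B$ and $Q_A > Q_B$), $A$ is said to dominate $B$ and the ratio is not normally reported; otherwise the ratio is compared with the threshold.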
Resource use (prescription drugs, hospitalizations, tests, imaging, surgical procedures and community care visits) was directly observed over the two years of the trial and costed using 2011-2012 figures. The mean discounted cost of each treatment strategy is shown in the first row of Table 7, based on the sample of patients with complete data (n=241).
7 Initially dosed at 60mg/day, reducing to 7.5mg/day at 6 weeks and stopped by 34 weeks.
QALY estimates were derived from EQ-5D-3L responses observed at baseline and 6, 12, 18 and 24 months, and the discounted QALY total was estimated as the area under the linear interpolation of the five points. We then repeated the QALY estimation using EQ-5D-5L predicted from the full mixture-copula model presented in section 4.2, conditional on the demographic and clinical covariates and EQ-5D-3L responses observed in the trial. Note that, since this construction is a linear function of the EQ-5D responses Y, our use of $E(Y_5 \mid Y_3, X)$ as a predictor does not introduce bias into the QALY evaluation, as it would for a nonlinear function of Y.
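The area-under-the-curve calculation can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the original evaluation: the function name `discounted_qaly` is hypothetical, the default 3.5% annual rate is an assumption (it is the rate in NICE's reference case, but the passage does not state which rate was applied), discounting each trapezoid at its interval midpoint is also an assumption, and the utility values are invented.

```python
from typing import Sequence


def discounted_qaly(utilities: Sequence[float],
                    times_years: Sequence[float],
                    annual_rate: float = 0.035) -> float:
    """Area under the piecewise-linear utility profile, with discounting.

    Assumption: each trapezoid is discounted at the midpoint of its
    interval; the source does not say how discounting was applied.
    """
    if len(utilities) != len(times_years):
        raise ValueError("utilities and times_years must have equal length")
    total = 0.0
    for k in range(len(utilities) - 1):
        t0, t1 = times_years[k], times_years[k + 1]
        # Trapezoid between two consecutive assessments.
        area = 0.5 * (utilities[k] + utilities[k + 1]) * (t1 - t0)
        midpoint = 0.5 * (t0 + t1)
        total += area / (1.0 + annual_rate) ** midpoint
    return total


# Assessment times from the text: baseline and 6, 12, 18 and 24 months.
TIMES = [0.0, 0.5, 1.0, 1.5, 2.0]

# Hypothetical utility profile for one patient (not data from the trial).
example_utilities = [0.50, 0.60, 0.65, 0.70, 0.70]
example_qaly = discounted_qaly(example_utilities, TIMES)
print(round(example_qaly, 3))
```

Every step is a weighted sum of the five utility values, so the total is linear in them. That is the property the text relies on when it substitutes expected utilities for observed ones.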
The cost-effectiveness results are presented in the first two panels of Table 7. Of the four treatment strategies, triple therapy is the least costly and most effective, thus dominating all other strategies. Among the remaining three treatment strategies, the MTX+CS combination is dominated by MTX plus steroid, being more costly and less effective. Monotherapy is more costly but also more effective than MTX plus steroid, with an ICER of £13,714, which lies comfortably below a conventional cost-effectiveness threshold of £20,000 per QALY.
The effect of mapping is to increase the estimated dominance of the triple therapy over all others and also the dominance of MTX+PNS over MTX+CS. The ICER for monotherapy versus MTX+PNS increases from £13,714 to £17,264, which remains below the conventional threshold. Thus, mapping has increased the magnitude of estimated ICERs, but without changing any of the decisions that would be likely to follow.
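The pairwise logic behind these comparisons (dominance first, then the ICER against a threshold) can be sketched as below. The helper `compare`, its labels, and all cost and QALY figures are hypothetical illustrations and are not taken from Table 7; only the £20,000 default threshold comes from the text.

```python
from typing import Optional, Tuple


def compare(cost_a: float, qaly_a: float,
            cost_b: float, qaly_b: float,
            threshold: float = 20_000.0) -> Tuple[str, Optional[float]]:
    """Classify strategy A against strategy B (hypothetical helper).

    Returns a label and the ICER; the ICER is None when one strategy
    (weakly) dominates or the two are identical in cost and QALYs.
    """
    d_cost = cost_a - cost_b
    d_qaly = qaly_a - qaly_b
    if d_cost == 0 and d_qaly == 0:
        return "no difference", None
    if d_cost <= 0 and d_qaly >= 0:
        return "A dominates B", None
    if d_cost >= 0 and d_qaly <= 0:
        return "B dominates A", None
    # Remaining cases: one strategy is both more costly and more effective,
    # so the ratio is positive and is judged against the threshold.
    icer = d_cost / d_qaly
    better = "A" if d_qaly > 0 else "B"
    other = "B" if better == "A" else "A"
    verdict = "cost-effective" if icer <= threshold else "not cost-effective"
    return f"{better} {verdict} vs {other}", icer


# Hypothetical mean discounted costs (GBP) and QALYs, not from the study.
print(compare(3_000.0, 1.40, 3_700.0, 1.20))  # cheaper and more effective
print(compare(5_200.0, 1.30, 3_700.0, 1.20))  # ICER of about 15,000
print(compare(5_200.0, 1.26, 3_700.0, 1.20))  # ICER of about 25,000
```

The last two calls differ only in the QALY gain (0.10 against 0.06), which is enough to move the ICER across the threshold while costs stay fixed.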
The mapped EQ-5D-5L QALYs are larger (by 15-24%) than the directly-measured EQ-5D-3L QALY estimates; but critically, they also vary less in proportionate terms: the range of QALYs is 20% of the smallest value for EQ-5D-3L but only 12% for mapped EQ-5D-5L. Because the QALY difference forms the ICER denominator, the six ICERs for pairwise comparisons of the therapies increase in magnitude, by more than 100% in some cases. This result is partly due to the significant response differences to the mobility and pain questions, but also to the large negative values built into the Dolan (1997) utility scoring system, which tend to increase the coefficient of variation of 3L scores relative to 5L scores. Thus a substantial part of the increase in ICERs when using mapping is attributable not to mapping per se, but to the different structures of the 3L and 5L scoring systems. This suggests that we can expect to see similar results if we adopt EQ-5D-5L in many other evaluation settings, perhaps warranting a future reassessment of the cost-effectiveness threshold by bodies such as NICE.
Preliminary work by  tends to support this view.
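The mechanism behind the rise in ICERs can be made explicit. As a sketch, with notation introduced here for illustration: costs do not depend on the utility scoring, so the cost difference $\Delta C$ between two strategies is the same under both systems, while the QALY differences $\Delta Q_{3L}$ and $\Delta Q_{5L}$ differ.

```latex
\frac{\mathrm{ICER}_{5L}}{\mathrm{ICER}_{3L}}
  = \frac{\Delta C / \Delta Q_{5L}}{\Delta C / \Delta Q_{3L}}
  = \frac{\Delta Q_{3L}}{\Delta Q_{5L}}
```

The ICER for a given pair therefore more than doubles whenever the 5L QALY difference is less than half the 3L difference, however much higher the level of the 5L QALYs may be.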
We can explore the impact of mapping in the remainder of Table 7  Simplifying the covariate list has the effect of greatly increasing the apparent dominance of the triple therapy over all others, with the ICER relative to monotherapy rising by almost 50% in magnitude. Again, it is unlikely that cost-effectiveness decisions would differ from those made with direct measurement of EQ-5D-3L.
The second simplified version of the mapping model retains the full set of covariates but imposes the restriction of independence across health domains by eliminating the random effect V through the parameter restrictions $\psi_{kd} = 0$, which are strongly rejected by direct tests. Relative to the full mapping model, most ICERs increase in magnitude under the independence restriction and, in the case of monotherapy versus the MTX/steroid combination, the increase takes the ICER beyond the £20,000 threshold, which would bring the cost-effectiveness of monotherapy into question in a comparison between the two. That ICER is almost 50% greater than the estimate derived from direct observation of EQ-5D-3L.

Conclusions
There are three clear conclusions. First, econometric modelling based on a flexible mixture-copula specification has revealed significant differences between the 3L and 5L versions of the EQ-5D descriptive system for health states. These differences are particularly striking for the mobility and pain domains, where the two versions of the instrument give significantly different pictures of the relationship between individual health states and their demographic and clinical determinants. Third, our re-examination of evidence from a trial of combination drug therapies for rheumatoid arthritis shows that switching to the newer 5L version of EQ-5D and using the utility scoring system recently proposed by Devlin et al. (2016) can make a substantial difference to the conclusions from cost-effectiveness studies. This is partly a consequence of the different utility tariffs developed for EQ-5D-3L and EQ-5D-5L, which itself may call for some adjustment to the way that such studies are translated into funding decisions.
But, working within a comprehensive and flexible framework that models 3L and 5L jointly, we have shown that econometric specification can also have a separate large impact. In particular, making the simplifying assumption of independence across health domains, or using a restricted set of covariates that excludes clinical information, may cause large shifts in cost-effectiveness ratios, of up to 50% in our application to rheumatic disease.
Appendix: full parameter estimates