Analysis of SF-6D Health State Utility Scores: Is Beta Regression Appropriate?

Background: Typically, modeling of health-related quality of life data is often troublesome since its distribution is positively or negatively skewed, spikes at zero or one, bounded and heteroscedasticity. Objectives: In the present paper, we aim to investigate whether Bayesian beta regression is appropriate for analyzing the SF-6D health state utility scores and respondent characteristics. Methods: A sample of 126 Lebanese members from the American University of Beirut valued 49 health states defined by the SF-6D using the standard gamble technique. Three different models were fitted for SF-6D via Bayesian Markov chain Monte Carlo (MCMC) simulation methods. These comprised a beta regression, random effects and random effects with covariates. Results from applying the three Bayesian beta regression models were reported and compared based on their predictive ability to previously used linear regression models, using mean prediction error (MPE), root mean squared error (RMSE) and deviance information criterion (DIC). Results: For the three different approaches, the beta regression model was found to perform better than the normal regression model under all criteria used. The beta regression with random effects model performs best, with MPE (0.084), RMSE (0.058) and DIC (−1621). Compared to the traditionally linear regression model, the beta regression provided better predictions of observed values in the entire learning sample and in an out-of-sample validation. Conclusions: Beta regression provides a flexible approach to modeling health state values. It also accounted for the boundedness and heteroscedasticity of the SF-6D index scores. Further research is encouraged.


Introduction
A large number of preference-based measures of health-related quality of life (HRQoL) have been used to measure quality-adjusted life years (QALYs) to be applied in cost-effectiveness analyses (CEA). Such measures comprise the EQ-5D [1], HUI2 and 3 [2,3], AQoL [4], QWB [5] and the SF-6D [6], along with several condition-specific preference-based measures [7,8]. A key practical issue with these measures is that they produce many distinct health states, and so it may not be practical to get a direct valuation for every health state. Toward that end, statistical models are then needed to estimate utilities for all states produced by the measure based on the valuations of a representative sample of health states.
Typically, the distribution of health state utilities obtained from valuation techniques like time tradeoff (TTO) or standard gamble (SG) is positively or negatively skewed, truncated, hierarchal as well as noncontinuous, all of which impose a key challenge for statistical modeling methods. Despite this challenge, earlier statistical methods of such data have used ordinary least-squares regression (OLS) [6,9,10]. However, the error terms in the OLS regression are assumed to be homoscedastic, so such assumption may not be met for bounded variables as the variance of scores

Selection of Respondents
Lebanese adults above 18 years old were recruited at AUB and were stratified via sex (female/male) and profession (faculty/staff/students). Respondents were approached by telephone and/or email to arrange for a one-to-one interview session. Out of the 170 respondents who were initially contacted, a total of 126 people agreed to participate in the study and conducted the interview session (response rate = 74%). Reasons for declining to participate were mainly due to time constraints and disinterest in the study. More on the selection method is given in Kharroubi et al. [24].

Selection of Health States
The sample of health states was chosen using the Orthoplan procedure of SPSS (SPSS Inc., Chicago, IL, USA) that generates 49 out of the 18,000 possible states to fit an additive model. The 49 health states were then distributed over 7 sets with 7 health states in every set. Additionally, all respondents valued the "pits" state. More details are presented in Kharroubi et al. [24].

Interviews
During the interview, the respondent was asked to rank and then value eight SF-6D health states via the McMaster "ping pong" variant of SG, where every respondent is asked to value each of the seven certain health states versus the best state and the "pits" [25]. As for the eighth SG question, the respondent was asked to value the "pits" state. Based on their valuation of the "pits", each respondent was asked to choose between either: (A) the certain prospect of being in the "pits" and the uncertain prospect of best health or immediate death; or (B) the certain prospect of death and the uncertain prospect of best health or the "pits". The chances that the best outcome occurring are varied until the respondent is indifferent between the two prospects. Any negative value was bounded to −1, and it referred to worse than death [26]. The seven earlier health state valuations (SG) were adjusted onto the scale 1 to 0, with 1 being the perfect health state and 0 indicating death [6], via the following formula: where P is the valuation of the "pits". The resulting SGADJ values represent the response variable (y) in the models described below.

Study Sample
As already mentioned, each respondent was asked to value eight SF-6D states, including "pits". This results in 1008 health state valuations across 49 health states. There were two respondents whose valuations did not change between the eight health states, and so they had to be excluded from the analysis, leaving in total 992 (98.4%) SG valuations across 49 health states for the analysis. Every health state was valued 17 or 18 times, with average values ranged from 0.322 (pits state) to 0.890 (state 211,111) and standard deviations ranged from 0.042 (211,111) to 0.265 (614,434). However, the median values were higher, indicating the left skewness of the data. Negative values did not occur, and over 19% of observations were above 0.9. Finally, no states were evaluated at 1, indicating that all respondents were inclined to risk a worse state to get a chance for a better one. More details on health states, values are available in Kharroubi et al. [24].

Model Development
As already mentioned, all studies that modeled SF-6D index scores as a function of respondent characteristics relied on linear regression using OLS [6,9,10]. However, linear regression assumes that error terms are normally distributed with constant variance (homoscedastic), both of which may not be met for bounded variables. Here, we develop alternative regression methods based on beta distribution to analyze the SF-6D health state utility scores and respondent characteristics.
Three different models were fitted for the SF-6D via Bayesian Markov chain Monte Carlo (MCMC) simulation methods. These comprised a beta regression, random effects and random effects with covariates. For every model, the SF-6D index score was used as the dependent variable. The dummy explanatory variables for each level above 1 from each of the six dimensions of the SF-6D were all used as independent variables.

Beta Regression
The beta distribution is a natural choice for modeling continuous variables observed in the standard unit interval. The probability density function of a random variable y following a beta distribution is defined as: where Γ(.) denotes the gamma function [27]. Note that µ (0 < µ < 1) represents the expected value of y. The parameter φ denotes a precision parameter since µ = E(y) and Var(y) = µ (1 − µ)/(1 + φ) and, hence, for fixed µ, 1 + φ would be inversely proportional to Var(y). Hence, if y has the probability density function (1), we write y ∼ beta(µφ, (1 − µ)φ) Now, assume y 1 , . . . , y n to be n independent and identically distributed random variables such that y i ∼ beta(µ i φ, (1 − µ i )φ), i = 1, . . . , n. By definition, the mean µ i = E(y i ) maps the unit interval onto the real line, so the logit link was proposed as a convenient link function. That is, where x T i denotes a vector of covariates for the i-th subject (i.e., the 25 dummy explanatory variables described above), and β denotes a vector of regression parameters.
The beta regression (BR) model outlined above could also involve random effects that account for variation within and between respondents. This type of model recognizes that n observations from m individuals are not treated as n × m observations on different respondents. The logit link function (2) is now: Note that b i indicates the "i-th respondent" random effect for the HRQoL utility value. This allows many respondents to have consistently low or high average HRQoL utility scores. Note also that Var(b i ) = σ 2 1 , i = 1, . . . , n, represents the variability in the average HRQoL among respondents with comparable characteristics.
The BR model could also involve participant characteristics, for example, age, gender, marital status, education level, housing type and income, with the aim of capturing the impact of the participant covariates.
The Bayesian beta regression model is finalized by assigning prior distributions for all unknown parameters. In the case when no specific prior information exists, it would be possible to assign non-informative prior distributions for all these parameters. Typically, multivariate normal prior distributions are assigned for the fixed effects with mean equal to zero and with large variance. A common choice for the prior distributions of the variance of random effects, σ 2 1 , and the precision parameter φ would be an inverse gamma distribution. More specifically, prior distributions were specified as follows: β ∼ N 0, 10 6 , σ 2 1 , φ ∼ IG(0.001, 0.001) Gelman [28] suggests a flat prior for σ, hence the prior for σ 2 1 is proportional to σ −1 . Gelman also suggests that the prior distribution for φ = U 2 , where U ∼ U(0, a) with big a (a = 50 say) is less informative than an inverse gamma prior distribution.

Linear Regression
The linear regression (LR) model is specified as follows: where y i is the SF-6D utility score of respondent i and ε i are random error terms with E(ε i ) = 0 and Var(ε i ) = σ 2 2 for all i. Note here that the error terms are homoscedastic given they have constant variance for any i; however, they could be relaxed to take on non-constant variance (heteroscedasticity). The LR model could also involve random effects in addition to the participant characteristics (for example, age, gender, marital status, education level, housing type and income) to estimate the impact of the participant covariates. Vague prior distributions were defined as follows: β ∼ N 0, 10 6 , σ 2 2 ∼ IG(0.001, 0.001) See Natarajan and Kass [29] for more choices of non-informative prior distributions.

Model Estimation
All models described above were evaluated via Gibbs sampling MCMC simulation methods and were implemented in the WinBUGS software package [30,31]. This environment permits great flexibility in model specification, which allows nonstandard models to be fitted in a straightforward manner. The relevant WinBUGS code is available from the author upon request. Following some initial test runs, it was decided that a burn-in of 5000 runs would be sufficient to reach convergence, and these runs were discarded from the final analysis. Convergence was examined using the Gelman and Rubin diagnostic [32]. Two parallel chains were set from broadly spread initial values, and the ratio of the within-chain to between-chain variance was then monitored and converged to about 1, indicating convergence had been reached. Following this, further 10,000 runs were executed for parameters estimation purposes.

Model Reliability and Validation
The proposed models were compared in terms of their predictive performance via mean prediction error (MPE), root-mean-square error (RMSE) and the Bayesian deviance information criterion (DIC) [30]. The DIC is defined by: where D is the posterior mean deviance and P D is the effective number of parameters that represents model complexity. The model with the lowest DIC is preferred as the best fitting model [31].

All Models
The posterior means and their associated 95% credible intervals (CI) for each of the regression coefficients are reported in Table 1. Notice that the estimates shown in bold are those who had CIs, excluding zero. For the LR model, we see that all regression coefficients had the expected negative sign apart from the second and third levels of vitality as well as the third level of role limitation. This is also the case for the BR model with one addition, level 2 of mental health. However, the CI of these coefficients for both models includes zero, which means that it is a weak inconsistency. We also see that 16 out of the 25 coefficients on the main SF-6D domains have CIs excluding zero in the LR model, in comparison to 15 coefficients for the BR model. The MPE, RMSE and DIC for each model are also reported in Table 1. We notice that the BR model scored better MPE, RMSE and DIC with 0.128, 0.049 and −1069, respectively, in comparison to the LR model (MPE = 0.126, RMSE = 0.053, DIC = −689.1).
Upon including random effects, we see that all regression coefficients were negative across both LR + RE and BR + RE models, except for level 2 of vitality and level 3 of pain. In addition, the 95% CI for these coefficients included zero in both models. Overall, the beta regression model was found to perform better than the normal regression model across all types of models, with/without random effects and with/without covariates. Given including covariates for both LR + RE and BR + RE models did not provide any further improvement in terms of MPE, RMSE and DIC, let alone the 95% CI for covariates coefficients contained zero in both models, we carry on our analysis with model BR + RE as the best performing model in the study. The estimates are compared to the LR + RE model using the different prediction criteria described above.    Table 2 shows the inference for the mean health state valuations of the 49 health states used in the sample. For every state, Table 2 displays the posterior mean and standard deviation from both LR + RE and BR + RE models, along with the actual mean utility. Across the 49 states that were used in the study, the predictive performance of BR + RE is generally better than that of LR + RE, with MPE and RMSE of 0.023 and 0.053, respectively, for the BR + RE model compared to 0.027 and 0.059 for LR + RE model. Additionally, no negative utilities were obtained, so no health state is predicted as being worse than death for both models. Table 2 also shows potential differences between the two models. For example, the predicted mean utility value for the pits state is 0.346 for LR + RE model compared to 0.331 for BR + RE model, whereas the actual mean for this state is 0.322. Further, Table 2 shows the posterior standard deviations of the estimates are mostly bigger for LR + RE model than those for BR + RE model.

BR + RE vs. LR + RE
Another aspect to show the difference between LR + RE and BR + RE models is presented in Figure 1, which displays, respectively the predicted utilities from the LR + RE model (Figure 1a) and from the BR + RE model (Figure 1b) versus the actual utilities of the 49 health states, along with perfection prediction indicated by a 45 • line of unity (solid line). Theoretically, the predicted mean utilities from both models are anticipated to lie roughly on the perfect line; that could be treated as a good agreement. In practice, the plots suggest that the valuations from BR + RE model tend to be closer to the unity line, whereas the valuations from LR + RE model have a larger scatter, so estimates depart largely from the perfect line. This implies that the BR + RE model generates better predictions and much more precise estimates than LR + RE model.
Other spotted performance differences between LR + RE and BR + RE models include the logical monotonicity property that utility should decrease as health decreases in any dimension. Of the total 18,000 SF-6D health states, 10,000 states were selected at random without replacement. Theoretically, each state of the 10,000 has 6-12 states adjacent to it (in a sense, they only differ in one dimension). Then, as a result of selecting one health state at random from these 6-12 states, 10,000 adjacent pairs were obtained. Both models have problems with the non-monotonicity of the predicted mean values for some of these states. However, the extent of non-monotonicity was found to be less for the BR + RE model than the LR + RE one. More specifically, out of these 10,000 adjacent pairs, 15% display non-monotonicity in the LR + RE model versus 10% for the BR + RE model.
A better test of the validity of the model is to investigate its ability to predict the values for states that have not been used in the estimation. Toward that end, we conduct an out-of-sample leave-one-out prediction, where ten health states were randomly selected from the estimated data; then consecutively, valuations relating to every selected health state were left out, and the two models were implemented on the reduced set of 48 states. Table 3 presents the observed means for the 10 left-out states, along with their predicted means and standard errors from the LR + RE and BR + RE models estimated on the reduced data. Results revealed that the BR + RE model predicts the left-out data quite well and better than LR + RE overall, with RMSE values of 0.091 and 0.107, respectively. Figure 2a,b presents the Q-Q plots of standardized predictive errors for the 10 out of sample health states for LR + RE and BR + RE models, respectively. The solid line in every case represents the theoretical normal distribution. Theoretically, the quantiles of these standardized predictive errors are anticipated to lie roughly on the solid line; hence their distributions are identical. The plots suggest that the quantiles from BR + RE model tend to be closer to the theoretical line, whereas in contrast, the quantiles from LR + RE model have a larger scatter, so estimates deviate largely from the theoretical line. Almost identical results were also obtained across five replicates.  Other spotted performance differences between LR + RE and BR + RE models include the logical monotonicity property that utility should decrease as health decreases in any dimension. Of the total 18,000 SF-6D health states, 10,000 states were selected at random without replacement. Theoretically, each state of the 10,000 has 6-12 states adjacent to it (in a sense, they only differ in one dimension). Then, as a result of selecting one health state at random from these 6-12 states, 10,000 adjacent pairs were obtained. Both models have problems with the non-monotonicity of the predicted mean values for some of these states. However, the extent of non-monotonicity was found to be less for the BR +

Discussion
Typically, the distribution of health state values obtained from valuation techniques such as TTO and SG is skewed, truncated, hierarchal and noncontinuous, all of which impose a key challenge for statistical modeling methods. The present study considered the Bayesian framework for beta regression based on three different approaches in order to address the aforementioned characteristics of the HRQoL data. More specifically, this study examined if Bayesian beta regression is appropriate for analyzing the SF-6D health state utility scores and respondent characteristics in a populationbased Lebanese health study. For the three different approaches, the beta regression model was found to perform better than the normal regression model under all criteria used.
Further, the beta regression with random effects model performed best, with MPE (0.084), RMSE (0.058) and DIC (−1621), given including covariates did not provide any further improvement in terms of MPE, RMSE and DIC, let alone the 95% CIs for all covariates coefficients contained zero. It is clear that from the analysis presented here that the predictive ability of the beta regression model is better than the previously used normal regression model overall, as reflected in better RMSE within the entire estimation data and better predictive errors in the out-of-sample validation data.
The use of the SF-6D has become widespread across the globe, reaching the US, UK, Europe and Asia, and is largely available for use in datasets since it is derived from the commonly used SF-36.

Discussion
Typically, the distribution of health state values obtained from valuation techniques such as TTO and SG is skewed, truncated, hierarchal and noncontinuous, all of which impose a key challenge for statistical modeling methods. The present study considered the Bayesian framework for beta regression based on three different approaches in order to address the aforementioned characteristics of the HRQoL data. More specifically, this study examined if Bayesian beta regression is appropriate for analyzing the SF-6D health state utility scores and respondent characteristics in a population-based Lebanese health study. For the three different approaches, the beta regression model was found to perform better than the normal regression model under all criteria used.
Further, the beta regression with random effects model performed best, with MPE (0.084), RMSE (0.058) and DIC (−1621), given including covariates did not provide any further improvement in terms of MPE, RMSE and DIC, let alone the 95% CIs for all covariates coefficients contained zero. It is clear that from the analysis presented here that the predictive ability of the beta regression model is better than the previously used normal regression model overall, as reflected in better RMSE within the entire estimation data and better predictive errors in the out-of-sample validation data.
The use of the SF-6D has become widespread across the globe, reaching the US, UK, Europe and Asia, and is largely available for use in datasets since it is derived from the commonly used SF-36. There is potential to borrow strength from existing countries' valuations to generate better representative utility estimates [33][34][35]. The beta regression model provides a key potential advantage as it allows making use of findings of one country as informative priors to improve those of another, and as such, generated utility estimates of the second country would be much more accurate than collecting and analyzing its data separately. This is a promising approach for other countries with smaller population sizes or low-and middle-income countries (LMIC) where the cost of undertaking large valuation surveys and collecting data via face-to-face interviews with techniques such as SG and/or TTO to develop country-specific value sets could be restrictive. Further research is underway to explore whether using the already existing UK value sets may contribute substantial informative prior to the analysis of the new valuation SF-6D study in Lebanon, and preliminary results sound very promising.
It is to be noted that for the analysis presented here, we have used the SF-6D version 1 (SF-6Dv1). A new revised version of the descriptive system was derived, the SF-6Dv2 [36,37]. This descriptive system still contains the same six dimensions (physical functioning, role functioning, vitality, mental health, social functioning and pain), but these differ from the original SF-6D. The new descriptive system was developed using a recently developed methodology involving a combination of Rasch analysis with classical psychometric analysis. The severity levels of each dimension were selected and worded to ensure there is no ambiguity in the severity levels, and all dimensions are positively worded. Toward this end, it would be interesting to investigate whether or not different results would be obtained when the new instrument (SF-6Dv2) is applied.
The models show that the coefficients associated with pain and mental health are smaller than those for physical functioning or social functioning. These results are different from other studies, including the SF-6Dv1 [6], SF-6Dv2 in the general population [36], SF-6Dv2 in a specific population of patients [38], however, similar to others [39,40]. In a systematic review conducted by Poder and Gandji [41], it is noted that countries populated by native English-speaking persons (UK, USA, Canada and Australia) give more importance to pain than countries outside this cultural world (e.g., Latin and Asian countries). Kharroubi [24,42] argues in this sense too. This motivates the importance of developing a value set for each country as using other countries' utility values carries some risks in not representing the views and preferences of the population in question, given differences in the sociodemographic and health profile and cultures of the countries [43][44][45].
Limitations of this study include the use of a small sample of 126 people, so this may limit the generalizability of the preference values found. The small number of health states valued could impact the accuracy of econometric modeling. Therefore, the generated SF-6D preference-based coefficients from this pilot study should not be regarded as necessarily representative of the general population of Lebanon. An additional limitation is that all the models outlined above used "vague" prior distributions, which may produce little influence on the overall findings. In a fully Bayesian framework, including external information within the prior distribution would be eminently possible. Such information would be desirable to study the impact that future predicted healthcare reforms may have on HRQoL, and hence allow for more pertinent prediction of future utilities. Another limitation is that we did not explore the predictive ability of other complex regression models. Other investigators have explored generalized linear models, TPM and survival-type models, to estimate HRQoL outcomes [46,47]. Although these models had greater predictive ability compared with linear regression in these studies, it is apparent from the literature that the ideal model is heavily dependent on the derivation data. Linear regression models are the most widely used type of predictive models and, despite their theoretical limitations, appear to be robust in practice. All of the above are the subject of further research.
The standard gamble technique was chosen as the main method for eliciting preferences since it has a strong theoretical foundation in expected utility theory, asking respondents to trade changes in health against risk. It is to be noted that gamble theory has been applied extensively in the health literature on issues such as whether parents act as a risk or protective factor for their children's alcohol consumption [48,49], sexually transmitted infections [50] or even fear of missing out [51]. Future research is then recommended to explore whether physical functioning, role limitations, social functioning, pain, mental health, and vitality would act as risk or protector factor in Lebanese society. It would also be interesting to investigate whether this perspective would be different with beta and linear regression models and to know whether the risk factors detected with one regression model are problematic factors when another model is applied [52].
The practical importance of the differences in average health state values between the beta and linear regression models depends on their impact on the relative cost per QALY of different treatments. The impact on cost per QALY relies on the nature of the efficacy (e.g., with it having more of an impact for survival differences), the severity of the patients' state of health (e.g., the relative importance of the differences is going to be higher for patients with lower utility scores), cost and how close the incremental cost per QALY of treatment is to the willingness-to-pay threshold of the reimbursement agency. The consequences of the models presented should be further explored for patient groups [53].
The beta regression Bayesian model presented here provides a flexible approach to modeling health state values for different instruments like EQ-5D and HUI-II, in addition to more specialized, condition-specific measures. It would then be of interest to determine and compare the minimally important difference for the different measures and conditions for various datasets [53,54]. Further, such a model provides important information for the planning of future services and budgets and could also be implemented to inform CEA, in particular, assessing the effectiveness of healthcare spending among the LMIC/Asian economies [55,56].

Conclusions
In conclusion, this present paper considered the Bayesian framework for beta regression based on three different approaches for analyzing the SF-6D health state utility scores and respondent characteristics in a population-based Lebanese health study. For the three different approaches, the beta regression model was found to perform better than the normal regression model under all criteria used. Overall, the beta regression with random effects model performed best and, compared to the traditionally used linear regression model, provided better predictions of observed values in the entire learning sample and in an out-of-sample validation. Such a model will be of significant use to investigators modeling and predicting utility weights for use in economic evaluations. We hope that this work will provide applied researchers with a practical set of tools to appropriately model outcomes in CEA.