Developing a Valuation Function for the Preference-Based Multiple Sclerosis Index: Comparison of Standard Gamble and Rating Scale

Objective The standard gamble (SG) and rating scale (RS) are two approaches that can be employed to elicit health state preferences from patients in order to inform decision making. The objectives of this study were: (i) to contribute evidence towards the similarities and differences in the SG and the RS to reflect patient preferences, and (ii) to develop a multi-attribute utility function (MAUF) (i.e., scoring algorithm) for the PBMSI. Study Design Two samples were recruited for the study. The first sample provided cross-sectional data to generate the preference weights which were then used to develop (D) the MAUFD. The distribution of SG and RS were compared across levels of perceived difficulty. The second sample provided additional data to validate (V) the MAUF, termed MAUFV. Results The mean RS values ranged from 0.39 to 0.65, whereas the mean SG values were much higher ranging from 0.80 to 0.91. Correlations between the two methods were very low ranging from -0.29 to 0.15. Bland-Altman plots revealed the extent of differences in values produced by the two methods. Conclusion In contemplating trade-offs in the selection of a preference-based elicitation approach for a MAUF that could guide clinical decision making, results suggest the RS is preferable in terms of feasibility and validity for MS patients. The PBMSI with patient preferences shows promise as a measure of health-related quality of life for MS.


Background
Multiple Sclerosis (MS) is a progressive, demyelinating disease of the central nervous system (CNS) that affects all aspects of an individual's life. MS produces a range of unpleasant and debilitating symptoms, including fatigue, muscle weakness, loss of memory and concentration, to name a few. These can have a profound impact on daily functioning, relationships, and social and leisure activities.
Health-related quality of life (HRQL) is a multi-dimensional outcome, represented by physical, mental and social well-being. HRQL measures are frequently used to evaluate rehabilitation interventions. This is especially true for chronic conditions like MS, as the management of these diseases is rehabilitative in nature, rather than curative. A recent systematic review in MS [1] identified that HRQL was an important and common outcome in clinical trials of exercise, self-management and cognitive-behavioural therapy.
One approach to assessing HRQL is through the use of health profiles. Examples of generic health profiles include the generic Short Form-36 (SF-36) [2] and disease specific health profiles include the MS Quality of Life-54 (MSQOL-54) [3], which are both scored by sub-scale. A challenge with using health profiles in clinical trials is that if a treatment has a positive effect on physical health but a negative one on mental health, it is impossible to determine whether the intervention resulted in a net improvement or decline in HRQL. [1] Another approach to measuring HRQL is through the use of preference-based measures. Examples of such measures include the Euro-QoL-5 Dimension (EQ-5D) [4] and the Health Utilities Index Mark 3 (HUI3) [5]. These measures not only provide descriptive information on the various dimensions of health, but also provide a value for each one. They have the advantage of leading to a single number (generally from 0 to 1) that balances gains in one domain against losses in another.
A feature of all these measures is that they are generic, and the preference weights are obtained by asking members of the general population to consider the health impact of each item, whether or not they have experienced the effect. Although general population weights are important for economic evaluation, they have little relevance in clinical research and decision making. When making clinical decisions about which treatment is better or worse for a given patient, the patient's perspective on the benefits and risks is important. Patients' preferences for health states have been shown to differ systematically from those obtained from the general population [6], with patients valuing sub-optimal health states higher. Furthermore, the psychometric properties of these generic preference-based measures in MS have recently been reviewed and limitations identified [7].
There are two steps involved in developing a preference-based measure. The first step is to develop a classification system with items and response options. The second step involves asking patients or the general population to indicate their preferences for (or how much they value) each of the items in the classification system, using one or more standard techniques. These preferences are then combined in a scoring algorithm, also known as a multi-attribute utility function (MAUF), to provide a score from 0 (dead) to 1 (perfect health) for any individual who completes the questionnaire.
Two of the most well-known methods of valuing health states are the standard gamble (SG) and the rating scale (RS). [8] The RS typically asks individuals to place a given state on a vertical ruler-like scale (i.e. a feeling thermometer). With the SG, respondents are asked to indicate the risk of dying they would accept with a treatment that can return them to full health. To date, no agreement has been reached on which method should be used in the valuation of health states. Table 1 summarizes the key characteristics associated with each method. Because there are strong conceptual differences between the two methods that could affect patients' capacity to understand and respond appropriately to the task demanded, a head-to-head comparison was thought to be of use in the context of MS and in the context of developing a preference-based measure.
An MS-specific classification system, titled the Preference-Based Multiple Sclerosis Index (PBMSI), was recently developed based on semi-structured interviews with 185 patients with MS [9][10][11]. The PBMSI classification system consists of 5 items with 3 response levels per item, producing 243 (3^5) different health states, or combinations of responses. The aim of this paper is to complete the PBMSI by asking patients with MS to indicate their preferences for the different items in the classification system. Therefore, the objectives of this study were: (i) to contribute evidence towards the similarities and differences in the SG and the RS to reflect patient preferences, where contrasts were on absolute values and level of difficulty, and (ii) to develop a MAUF (scoring algorithm) for the PBMSI.
Fig 1 presents the methodological steps for this study. Two different samples were recruited. The first (development) sample provided cross-sectional data to generate the preference weights for the valuation of health states, which were then used to develop (D) the MAUF D. For the development sample, the distributions of SG and RS values were compared across levels of perceived difficulty in completing the valuation. The next step was to produce the MAUF D based on valuations obtained from both the SG and RS. The second sample provided additional cross-sectional data to validate (V) the MAUF, termed MAUF V. The parameters from the MAUF D were applied to the validation sample to produce the MAUF V, and its distribution was compared across key measures known to reflect the impact of MS.

Selection of Subjects
The development sample for the valuation of health states was recruited through advertising in three venues: MS Society of Canada website; the 2012 Quebec Summit on Multiple Sclerosis; and outpatient MS clinic of the Montreal Neurological Hospital. To participate, individuals had to be diagnosed with MS and be older than 18 years of age. The study was approved by the McGill University Health Center Research Ethics Board and written informed consent was obtained from participants prior to doing the online survey.
The validation sample was subjects with MS who were participating in a clinical trial of exercise (ClinicalTrials.gov; Registration Number: NCT01611987). The protocol for this study has been published. [12] Briefly, participants were recruited from 3 MS clinics in the Montreal area and were aged 19-65, diagnosed after 1994, ambulatory, and able to speak and read

Measures
The main measure for this study was the PBMSI, administered to both the development and validation samples. The two methods used to value the health states from the PBMSI, and to derive MAUF D, were the SG and RS. Measures of global disability, walking capacity and general health perception were used to validate MAUF V.
PBMSI. The PBMSI is a brief self-administered questionnaire consisting of five items: walking, fatigue, mood, concentration, and roles and responsibilities. Each item has three response options, and the recall time frame is 'over the past week'. The classification system produces 243 (3^5) health states.
Selection of health states for valuation. Each patient valued 12 health states: 5 single-attribute level states, 5 corner states, the all-worst state and the all-intermediate state. These states are as follows:
• Single-attribute level states: a given item was described at less than full function (response level 2) while all other items were set at their best level (response level 1).
• Corner states: a given item was described at its worst level (response level 3) while all other items were set at their best level (response level 1).
• All worst was described as the worst level on all items (response level 3), and all intermediate was described as less than full function on all items (response level 2). Patients also assigned
Preferences for the above health states were obtained from patients with MS using an online survey. In the survey, patients were asked to fill out the PBMSI and answer certain socio-demographic and clinical questions. Then they were asked to value selected health states using the SG and RS.
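As an illustration of the state selection described above, the following minimal sketch enumerates the full classification system and the 12 states each patient valued. It assumes a health state is written as a five-digit string in the item order walking, fatigue, mood, concentration, roles (the same notation as state 11111); the function names are hypothetical and not part of the study's materials.

```python
from itertools import product

N_ITEMS = 5          # walking, fatigue, mood, concentration, roles
LEVELS = (1, 2, 3)   # 1 = best level, 3 = worst level


def all_states():
    """Every combination of response levels, as five-digit strings."""
    return ["".join(map(str, s)) for s in product(LEVELS, repeat=N_ITEMS)]


def one_item_at(level):
    """States with exactly one item at `level` and all others at level 1."""
    states = []
    for j in range(N_ITEMS):
        s = [1] * N_ITEMS
        s[j] = level
        states.append("".join(map(str, s)))
    return states


def valued_states():
    """The 12 states valued by each patient (hypothetical helper)."""
    single_attribute = one_item_at(2)   # e.g. 21111
    corner = one_item_at(3)             # e.g. 31111
    all_worst = "3" * N_ITEMS
    all_intermediate = "2" * N_ITEMS
    return single_attribute + corner + [all_worst, all_intermediate]
```

Calling `all_states()` returns the 243 possible states, of which `valued_states()` picks out the 12 that were valued directly; the remaining states are scored through the valuation function.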
Standard gamble. Patients were asked to rate the single-attribute and corner states using the SG. In the SG, patients were presented with a less than perfect health state (i.e. a corner state or single-attribute state), and asked to imagine themselves in that health state for the rest of their life. Then they were asked to imagine that they were given a treatment. If the treatment was successful, they would be restored to full health; if it failed, they would die immediately. Essentially, respondents were asked to indicate the highest risk of death (as a percentage) they would accept with the treatment. However, the questionnaire that elicited these probabilities referred to death as "failure". This is a common procedure in the literature. [13][14][15][16][17] The response options were given in a drop-down menu, as follows: '0% chance of 'failure' (100% chance of 'success'). . .5% chance of 'failure' (95% chance of 'success'). . .etc.' Patients were asked to select only one response option from the list provided. The probability of 'success' that they were willing to accept with the treatment was their SG value (i.e. 100% 'success' is equal to a SG value of 1.0, 95% 'success' is equal to a SG value of 0.95, etc.). The format also allowed for the assessment of states worse than dead if respondents indicated that they would take the treatment even if it had a 0% chance of 'success' (100% chance of 'failure').
Rating scale. Patients were asked to rate each of the single-attribute and corner states on a RS from 0 to 100, where zero was the worst imaginable health state and 100 was the best imaginable health state. Patients were also asked to provide on the RS a value for the state 'dead'. If dead was identified as the worst state and was placed at the 0 end of the scale, then preferences were simply equal to the scale value given to each health state. If dead was not identified as the worst state but was placed at some intermediate point on the scale (d), then preferences were calculated as (x - d)/(1 - d), where x was the rating given to a health state and d was the rating given to dead, both expressed on a 0 to 1 scale.
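The rescaling of RS ratings around the position of 'dead' can be sketched as follows. This is an illustrative sketch, not the study's code: it assumes that ratings on the 0 to 100 scale are first divided by 100 so that the formula (x - d)/(1 - d) applies, and the function name is hypothetical.

```python
def rs_preference(rating, dead_rating=0.0):
    """Rescale a 0-100 rating so that dead = 0 and the top of the scale = 1.

    `rating` is the value given to the health state and `dead_rating` the
    value given to 'dead', both on the 0-100 scale (assumed convention).
    """
    x = rating / 100.0
    d = dead_rating / 100.0
    if d >= 1.0:
        raise ValueError("'dead' cannot be rated at the top of the scale")
    return (x - d) / (1.0 - d)
```

For example, a state rated 60 by a patient who placed dead at 20 receives a preference of 0.5, whereas the same rating with dead at 0 is simply 0.6. A state rated below dead yields a negative value, that is, a state considered worse than dead.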
Difficulty. At the end of the survey patients were asked to rate how difficult it was to answer the PBMSI items, the RS, and the SG. Responses were recorded on a four-point Likert scale (very easy, fairly easy, fairly difficult, and very difficult).
Global disability. Global disability was measured using the Patient-Determined Disease Steps (PDDS), a self-reported outcome of disability in MS. [18] It has nine ordinal levels ranging from 0 (normal) to 8 (bedridden), and PDDS scores can be converted into classifications of mild, moderate, or severe disability. [19] The PDDS is a surrogate measure of the Expanded Disability Status Scale (EDSS) and has been shown to be strongly correlated with the EDSS. [20]
Functional exercise capacity. The 6-Minute Walk Test (6MWT) is a simple performance-based test that measures functional exercise capacity. The reliability of the 6MWT has been assessed in persons with MS. The intra-class correlation coefficient is 0.96 for test-retest reliability and 0.93 for inter-rater reliability. [21]
General health perception. The first question of the RAND-36 measures general health perception and is formulated as, "In general, would you say your health is. . .," with five response options ranging from excellent to poor. [22] General health perception is easy to measure and can provide information on the person's well-being and overall HRQL. Furthermore, it has been shown to be a predictive factor in the progression of disease. [23,24]
EQ-5D. The EQ-5D [25] is a generic preference-based measure of HRQL that consists of two parts. The first part includes 5 separate domains: mobility, self-care, usual activities, pain/discomfort and anxiety/depression. Each domain has 3 levels: no problems, some problems, extreme problems. [25] The MAUF D was compared against the EQ-5D, as it is a commonly used preference-based measure in MS and is recommended by the National Institute for Health and Care Excellence for economic evaluation.

Statistical Methods
For the development sample, the distribution of SG and RS values was obtained for each health state and plotted by quartile; Pearson correlation coefficients were also calculated.
Concordance between the reported levels of difficulty for the SG and RS was presented, and agreement was assessed using unweighted and weighted Kappa. Generalized estimating equations were used to assess the impact that reported level of difficulty had on SG and RS values, accounting for the correlation arising from multiple valuations per person.
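For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's Kappa from a square table of counts (for example, SG difficulty in rows and RS difficulty in columns). It is illustrative only: the weighting scheme used in the study is not stated, so linear agreement weights are an assumption here, and the function name is hypothetical.

```python
def cohen_kappa(table, weighted=False):
    """Cohen's Kappa for a k x k table of counts (k >= 2).

    With weighted=True, linear agreement weights 1 - |i - j| / (k - 1)
    are used (an assumption; quadratic weights are another common choice).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        if weighted:
            return 1.0 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    # Observed and chance-expected (weighted) proportions of agreement
    p_obs = sum(w(i, j) * table[i][j] for i, j in cells) / n
    p_exp = sum(w(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / n ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Unweighted Kappa credits only exact matches on the diagonal, whereas the weighted version gives partial credit to ratings that differ by one category, which is why the two can differ for an ordered four-point difficulty scale.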
Two MAUF (i.e. scoring algorithms) were developed (MAUF D): one based on SG values and the other based on RS values. The methodology used to develop the MAUF D closely followed the procedures described in the manual for the development of the HUI3. [26] The person-mean approach was used to develop the valuation functions. [26] In other words, the functions were estimated from the mean responses of the sample for the single-attribute health states and corner states.
A utility scale runs from 0.0 (dead) to 1.0 (all best/perfect health). Disutility equals one minus utility (disutility = 1 - utility). Thus, the disutility scale ranges from 0.0 for all best/perfect health to 1.0 for dead.
If the sum of the corner-state disutilities is equal to 1.0, then the valuation function is additive. However, if the sum of the corner states is not equal to 1.0, then the valuation function is multiplicative. The multiplicative function, as specified by multi-attribute utility theory (MAUT), was:

$$u' = \frac{1}{c}\left[\prod_{j=1}^{5}\left(1 + c\,c_j\,u_{ij}\right) - 1\right] \qquad (1)$$

where $u'$ is the required disutility of any PBMSI health state on the perfect health = 0.0, dead = 1.0 scale; $j$ indexes the PBMSI items, of which there were 5; $c_j$ is the person-mean disutility for the corner state of item $j$; $u_{ij}$ is the single-attribute level disutility score; and $c$ is the scaling parameter.

The scaling parameter $c$ was calculated by iteratively solving the following equation:

$$1 + c = \prod_{j=1}^{5}\left(1 + c\,c_j\right) \qquad (2)$$

where the right-hand side is the product of all $(1 + c\,c_j)$ from $c_1$ to $c_5$, and $c_j$ is the person-mean disutility for the corner state. The scaling parameter $c$ depends on the sum of the corner disutility states:

$$\text{if } \sum_{j=1}^{5} c_j > 1 \text{ then } -1 < c < 0; \qquad (3a)$$

$$\text{if } \sum_{j=1}^{5} c_j = 1 \text{ then } c = 0, \text{ and the valuation function is additive}; \qquad (3b)$$

$$\text{and if } \sum_{j=1}^{5} c_j < 1 \text{ then } c > 0. \qquad (3c)$$

If the valuation function is additive, $c = 0$ is the only root of Eq 2. If the valuation function is not additive, Eq 2 will have 2 roots: (i) a trivial solution ($c = 0$) and (ii) a non-trivial solution ($c \neq 0$). We searched for the non-trivial solution, and the sum of the corner states indicates where to search for it (i.e. if the sum of corner states is greater than 1, then $-1 < c < 0$; if the sum of corner states is less than 1, then $c > 0$).
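To make the multiplicative function concrete, the following minimal sketch scores one health state from the scaling parameter c, the five corner-state disutilities and the five single-attribute disutilities (rescaled so that level 1 is 0 and level 3 is 1). The function names and the numbers in the usage note are hypothetical; the study's actual parameters are those reported in its tables.

```python
import math


def disutility(c, corner, single):
    """Disutility of a health state: (1/c) * (prod_j(1 + c*c_j*u_j) - 1).

    corner: corner-state disutilities c_j, one per item.
    single: rescaled single-attribute disutilities u_j for the levels the
            respondent selected (0 = best level, 1 = worst level).
    When c == 0 the function reduces to the additive form sum_j c_j*u_j.
    """
    if len(corner) != len(single):
        raise ValueError("corner and single must have the same length")
    if c == 0:
        return sum(cj * uj for cj, uj in zip(corner, single))
    prod = math.prod(1.0 + c * cj * uj for cj, uj in zip(corner, single))
    return (prod - 1.0) / c


def utility(c, corner, single):
    """Utility on the dead = 0.0, perfect health = 1.0 scale."""
    return 1.0 - disutility(c, corner, single)
```

Two sanity checks follow directly from the formula: a respondent at the best level on every item (all u_j = 0) scores a utility of 1.0, and a corner state (one u_j = 1, the rest 0) returns exactly that item's corner disutility c_j.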
Excel Solver was used to iteratively solve for the scaling parameter c. All other analyses were conducted using SAS 9.3.
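The iterative search for the non-trivial root of Eq 2 can also be sketched outside Excel. The example below uses plain bisection, which is a substitute for Excel Solver rather than the procedure used in the study, and it assumes every corner disutility lies strictly between 0 and 1. The equal corner values in the usage note are hypothetical; only their sums (0.85 and 3.65) echo the sums reported for the SG and RS.

```python
import math


def residual(c, corner):
    """prod_j(1 + c*c_j) - (1 + c); zero at a root of Eq 2."""
    return math.prod(1.0 + c * cj for cj in corner) - (1.0 + c)


def solve_scaling_parameter(corner, eps=1e-6, iterations=200):
    """Non-trivial root of Eq 2 by bisection (0.0 in the additive case)."""
    if not all(0.0 < cj < 1.0 for cj in corner):
        raise ValueError("corner disutilities must lie strictly in (0, 1)")
    total = sum(corner)
    if abs(total - 1.0) < 1e-9:
        return 0.0                      # additive function, c = 0
    if total > 1.0:
        lo, hi = -1.0, -eps             # root lies in (-1, 0)
    else:
        lo, hi = eps, 1.0               # root lies above 0
        while residual(hi, corner) <= 0.0:
            hi *= 2.0                   # expand until the sign changes
    f_lo = residual(lo, corner)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = residual(mid, corner)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With five hypothetical corner disutilities of 0.17 each (sum 0.85) the root is positive, and with five of 0.73 each (sum 3.65) it lies just above -1, matching the search regions given by the sum of the corner states. The exact values depend on the individual corner disutilities, so these hypothetical inputs are not expected to reproduce the study's estimates.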
We estimated the sample size for this valuation to yield a 95% confidence interval (95% CI) of ± 0.05 points around the mean value for the SG and RS. A clinically meaningful difference on the SG (as well as the RS) is approximately 0.10 points [27]; half this difference was chosen because it would not be meaningful, and a CI of this width would therefore indicate precision in the estimates of value.
Calculation of the 95% CI requires an estimate of the population standard deviation (SD). To our knowledge, no studies have reported the SD for the SG in people with MS. Therefore, sample size calculations were based on the values obtained for the RS in the MS Life-Impact Study [9,10,28] conducted in a similar population. The SD of the RS value for 'best imaginable health' was 0.08. Based on this information, the number of people required per health state was equal to 10 (calculated by solving $1.96 \times 0.08 / \sqrt{n} = 0.05$ for $n$).
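The sample size calculation above can be reproduced with a few lines of code. This is a minimal sketch with a hypothetical function name; it solves z × SD / sqrt(n) = half-width for n and rounds up to a whole person.

```python
import math


def n_per_state(sd, half_width, z=1.96):
    """Smallest n for which the CI half-width z*sd/sqrt(n) is <= half_width."""
    return math.ceil((z * sd / half_width) ** 2)


# SD = 0.08 and a half-width of 0.05, as stated in the text
n_state = n_per_state(0.08, 0.05)   # (1.96 * 0.08 / 0.05)^2 is about 9.83
n_total = 5 * n_state               # one group of n per corner state
```

Rounding 9.83 up gives 10 people per health state, and multiplying by the 5 corner states gives the total of 50.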
As there were 5 corner states, the required sample size for this study was 50 people.
Agreement between the SG MAUF and RS MAUF for both samples was depicted using scatter plots. For perfect agreement, all data points are expected to be on the diagonal line, the line of equality. For both the development and the validation samples, the Bland-Altman method was used to analyze agreement between the SG MAUF and the RS MAUF. This method plots the difference between the two MAUF (y axis) against the average of the two MAUF (x axis), which represents the latent trait of "utility". The graph shows 95% limits of agreement around the mean difference (mean difference ± 1.96 SD). Perfect agreement between the SG MAUF and the RS MAUF would be indicated by a mean difference equal to 0 and no pattern across the latent trait. A paired t-test was used to contrast values between the SG MAUF and the RS MAUF.
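The quantities plotted in a Bland-Altman analysis can be computed as in the sketch below. It is illustrative only: the function name and the scores in the usage note are hypothetical, and the sample SD (n - 1 denominator) is assumed for the limits of agreement.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Per-person averages and differences, mean difference, 95% limits."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    avgs = [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)                 # mean difference between methods
    sd = stdev(diffs)                  # sample SD of the differences
    return {
        "averages": avgs,              # x axis: proxy for the latent trait
        "differences": diffs,          # y axis
        "bias": bias,
        "lower": bias - 1.96 * sd,     # lower 95% limit of agreement
        "upper": bias + 1.96 * sd,     # upper 95% limit of agreement
    }
```

For instance, with hypothetical SG-based scores of 0.90, 0.80, 0.95 and 0.85 and RS-based scores of 0.50, 0.30, 0.60 and 0.35 for the same four people, the mean difference is 0.4375. Plotting `differences` against `averages` shows whether the gap between methods changes along the latent trait.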
The distribution of items on the PBMSI obtained from the clinical trial validation sample was identified. The known-groups method was used to test the discriminative ability of the SG and RS MAUF V against different measures of disability, namely the PDDS, the 6MWT and the general health perception item of the RAND-36. The MAUF V was also compared against the generic preference-based measure EQ-5D. The linear test for trend was employed to test whether gradients across levels of disability were statistically significant.
Table 2 presents the demographic and clinical characteristics of the two samples, development and validation. These samples were chosen using quite different sampling frames, and hence were expected to differ somewhat. However, the two samples were similar on age (mean ~47 years) and proportion of women (75%-79%). The clinical trial (validation) sample comprised people recruited into an exercise intervention trial and showed lower disability in walking (level 1), lower fatigue, better mood, but more challenges with regular roles and responsibilities. Also shown is the number of people in the most common health states. For example, 2% of the development sample and 8% of the validation sample had the health state 11111, reflecting the best level on all 5 dimensions. Furthermore, approximately 13% of the samples had the health state 22111, reflecting some problems with walking and fatigue, but no problems with mood, concentration, and roles and responsibilities. No statistical comparison between samples was done because it was known from the outset that these two samples did not arise from the same population.
Table 3 presents, for the development sample, the mean RS and SG values for level 2 and level 3 of each item in the PBMSI as well as two multi-attribute health states, all at level 2 and all at level 3. All health states were rated lower using the RS than the SG. The mean RS values ranged from 0.20 to 0.65, whereas the mean SG values ranged from 0.60 to 0.91. Also presented are the correlation coefficients between the RS and SG; weak correlations were observed, ranging from -0.29 to 0.15.
Table 4 presents the percent agreement between the levels of difficulty reported by patients for the SG (rows) and RS (columns). Across all levels of difficulty, 38% (23/61) found both methods to be of equal difficulty (diagonal cells); 50% (30/61) rated the SG at a higher level of difficulty than the RS (cells below the diagonal). Only 5 people rated the RS harder than the SG (cells above the diagonal), but the 6 people rating the SG as "very easy" scored all health states with virtually the same value, 0.95 (data not shown). Chance-corrected agreement was poor using both unweighted Kappa (k = 0.09; 95% CI: 0.08 to 0.25) and weighted Kappa (k = 0.13; 95% CI: -0.08 to 0.34).

Sample
To answer the question as to whether level of difficulty had an impact on health state values, we regressed the 12 health state values on method of valuation (SG, RS) using generalized estimating equations, which accounted for the correlation (non-independence) of the valuations and included the interaction between method and health state. The model was: health state value = method (RS/SG) + item (1-12) + method × item. As the interaction term was non-significant, it was dropped; that is, the difference between RS and SG did not depend on item. For the RS, the estimated effect across all items, compared with the SG, was -0.25. When the model was adjusted for difficulty, the difference was accentuated to -0.32.
Table 5 presents the parameters used to develop the MAUF D based on the RS and SG values obtained in the development sample. The first column presents the mean RS and SG utility values for each response level, where level 1 was the best, level 2 was intermediate, and level 3 was the worst. The first level of each item was 1.0 (perfect health). As expected, there was a drop in utility values from level 1 to level 2 to level 3. For each item, response level 3 was the corner state. The second column presents the disutility values (1 - utility) for each of the item response levels. The third column presents the mean utility values rescaled so that the third response level of each item was 0.0, and the first response level was 1.0. The fourth column is the rescaled mean disutility score, which is equal to 1 minus the rescaled mean utility score (presented in the third column). These are the parameters used to develop the valuation function (MAUF D).
Table 6 presents the MAUF D developed using the SG values presented in Table 4. The sum of the corner states was equal to 0.85, which is less than 1.0; therefore the MAUF D was multiplicative and yielded two solutions for Eq 2. Based on Eq 3c, the non-trivial solution was greater than 0. Using the iterative solution (Eq 2), an exact value for the non-trivial solution c was calculated and found to be equal to 0.4821.
The SG MAUF D for the PBMSI, expressed in disutilities, follows the multiplicative function described in the Statistical Methods with scaling parameter c = 0.4821 (Table 6), where the values of u'1, u'2, u'3, u'4, u'5 (the single-attribute mean disutilities) are selected from Table 7 depending on the individual's responses to the PBMSI items. The calculated disutility on the perfect health = 0.0, dead = 1.0 scale can then be converted into a utility score on a dead = 0.0, perfect health = 1.0 scale: utility = 1 - disutility.
Table 7 presents the MAUF D based on the RS values. The procedure used to develop the RS MAUF D was identical to the process described for the SG MAUF D. Using the RS values, the sum of the corner states was equal to 3.65 and the scaling parameter was calculated to be equal to -0.9987. The full valuation function can be found in Table 7.
Fig 4 presents, for the development sample, a scatter plot to assess agreement between PBMSI scores obtained using the RS MAUF D against scores obtained using the SG MAUF D. As none of the data points were on the line of equality, there was no agreement between the two methods. Scores produced by the SG MAUF D were consistently and considerably higher than scores produced by the RS MAUF D, yielding a strong correlation (0.8), but no agreement.
Fig 5 presents the Bland-Altman plot for the development sample. The average difference between the methods was 0.46 (represented by the middle red line), and 95% of patients had a difference in scores between 0.24 and 0.68. A clinically meaningful difference on the SG or RS is 0.10; therefore the mean difference between the two methods was almost 5 times greater than the clinically meaningful difference. Additionally, there was a distinct pattern to the values such that, at the low end of the latent trait (poor health state), the differences were small; as latent health state improved, the difference between the methods increased. A paired t-test revealed that the difference in scores was statistically significant (p-value <0.0001).
Fig 6 presents, for the validation sample, a scatter plot of the PBMSI scores obtained using the RS MAUF V against scores obtained using the SG MAUF V. Similar to the results obtained for the development sample, there was no agreement between scores produced by the two MAUF V. Fig 7 presents the Bland-Altman plot for the validation sample, which shows that the mean difference between the SG MAUF V and RS MAUF V is 0.44, 4 times greater than the clinically meaningful difference of 0.1 points. A paired t-test between scores indicated that the difference in scores between the SG MAUF V and RS MAUF V was statistically significant (p-value <0.0001).
Table 8 presents, for the validation sample, the ability of the SG and RS MAUF V to discriminate between different clinical subgroups, assessed using the PDDS, 6MWT and the general health perception item of the RAND-36. Both the SG MAUF V and the RS MAUF V were able to differentiate between different levels of disability measured using the PDDS. However, the RS MAUF V had a wider range of values than the SG MAUF V. The EQ-5D valuation function was not able to differentiate between moderate and severe levels of disability. For the 6MWT, both the SG and the RS MAUF V were able to differentiate between different levels of walking capacity; however, the values produced by the RS MAUF V were lower than those produced by the SG MAUF V. The EQ-5D was also able to differentiate between different levels of walking capacity. As for general health perception, the SG MAUF V was able to differentiate between all levels of health perception. However, the RS MAUF V was only able to differentiate between excellent, very good and good health, but not between good and fair health. The EQ-5D also presented with problems discriminating between different levels of health perception, specifically between very good and good.

Discussion
To fill a gap in outcome measurement for interventions targeting HRQL, this study elicited patient preferences for items in a new measure, the PBMSI, using two standard methods, the SG and the RS. A MAUF was developed based on values obtained using each of the methods, and the validity of these scoring algorithms was tested in a separate sample of MS patients (i.e. the validation sample). In contemplating trade-offs in the selection of a preference-based elicitation approach for a MAUF that could guide clinical decision making, results suggest the RS is preferable in terms of feasibility and validity for MS patients.
The SG and the RS produced considerably different results from each other. On a scale from 0 (dead) to 1 (perfect health), values produced by the SG were consistently higher than those produced by the RS. The median values for the corner state items were between 0.4 and 0.6 on the RS, and between 0.90 and 0.95 on the SG. With the SG, 50% of the sample rated having severe walking impairments, severe fatigue, severe impaired concentration and depression close or equivalent to perfect health (1.0). For these same items, none of the respondents gave a value of 1.0 on the RS.
Our results are similar to previous studies that have compared the SG and the RS. Jansen and colleagues [29] compared the two methods in 51 women with breast cancer. They asked patients to value a hypothetical chemotherapy scenario, and reported that values elicited using the SG (mean ~0.9) were consistently higher than the RS (mean ~0.6). Juniper and colleagues [30] compared the SG and RS in 40 patients with asthma. In their study, more than half of the patients (n = 23) rated their current health equal to 1.0 (perfect health) on the SG, even though they represented patients at the more severe end of the spectrum (80% required inhaled steroids). Sullivan and colleagues [31] interviewed 52 patients with diabetes mellitus on various health states describing different levels of disease severity in diabetic peripheral neuropathy. The SG scores were considerably higher than the RS. The highest median preference score for the SG was 0.96 (mild neuropathy) and the lowest was 0.65 (below-knee amputation). On the RS, the highest median score was 0.89 (mild neuropathy) and the lowest was 0.23 (below-knee amputation). In our study, correlations between the SG and the RS were very weak (r ~0.1), thus reinforcing the fact that there were considerable discrepancies in the values elicited by the two methods. These low correlations were similar to what others have reported in cancer (r = 0.18), [29] chronic musculoskeletal pain (r = 0.21) [32] and asthma (r = 0.18). [33]
The SG is a method that assesses the probability an individual would risk death to regain perfect health. As death is a highly undesirable state, patients may be inclined to stop the gambling earlier, thus resulting in an overestimate of the value associated with an impaired health state. [17,34] In the context of MS, the possible risk of dying after treatment is far from realistic, as existing medical treatments are rarely life threatening. Instead, treatment is directed at slowing the progression of disease or disability. As the RS does not involve risk or decision making under uncertainty, values elicited with this method tend to be systematically lower than the SG.
Fifty percent of our sample rated the SG at a higher level of difficulty than the RS. These findings are concordant with previous studies that have compared the SG with the RS. In patients with cancer, Dobrez and Calhoun [35] reported that 17% of their sample did not comprehend the SG method. Similarly, in HIV/AIDS patients, Sakthong and colleagues [36] reported that the SG was more difficult for patients to understand compared to the RS (p = 0.002), and that the completion time for the SG was much longer than the RS (average 5 minutes per health state vs 0.9 minutes per health state).
The SG method may be difficult for patients to comprehend because the concept of probabilities is a challenging one to grasp and far from everyday experience. [37] Lack of comprehension of the method is an important issue in the valuation of health states, as it can compromise the accuracy or reliability of the data collected. [38]
There were several notable features of this study. First, we used an internet-based approach to value health states, rather than the traditional interviewer-based approach. Traditionally the SG requires the use of a trained interviewer with props, where researchers must either go to the participant's home or offer sufficient incentives to bring the participant to the lab, which are both expensive. The advantage of using an online survey is that patients can complete the survey in the convenience of their home, resulting in greater recruitment or participation. Although other studies have used the internet to elicit preferences, [39,40] the validity and reliability of this approach require further study. Second, in the SG, rather than alternating the proportion of success and death in a "ping-pong" manner, we simply asked individuals to indicate the maximum risk of death they were willing to take with the hypothetical treatment. This may have resulted in higher utility values than the former approach. Finally, alternate methods of valuation such as the time trade-off (number of years patients are willing to trade off for perfect health) were not assessed in this study.

Conclusions
This study elicited patient preferences for various items from an MS-specific classification system using two different valuation methods, the SG and RS. We compared these two methods in terms of the values they produced, their difficulty of use and impact on the MAUF. Our findings demonstrated that the SG, compared to the RS, produced higher utility values and was more difficult for patients to understand. Although the RS is considered inappropriate as a basis for obtaining Quality Adjusted Life Years (QALYs) for economic evaluation, in contemplating trade-offs in the selection of a preference-based elicitation approach for a MAUF that could guide clinical decision making, results suggest the RS is preferable in terms of feasibility and validity for MS patients.
Furthermore, in the broader policy arena of allocating resources across multiple health conditions, the standard approach of using generic preference-based measures with general population weights would be difficult to dispute. However, in the context of use here, which would be to evaluate the effect of interventions that are expected to impact widely on the health of individuals with MS, the PBMSI with patient preferences shows promise.