What works better for preference elicitation among older people? Cognitive burden of discrete choice experiment and case 2 best-worst scaling in an online setting

Abstract To appropriately weight the dimensions of quality of life instruments for health economic evaluations, population and patient preferences need to be elicited. Two commonly used elicitation methods for this purpose are discrete choice experiments (DCE) and case 2 best-worst scaling (BWS). These methods differ in terms of their cognitive burden, which is especially relevant when eliciting preferences among older people. Using a randomised experiment with respondents from an online panel, this paper examines the cognitive burden associated with colour-coded and level-overlapped DCE, colour-coded BWS, and 'standard' BWS choice tasks in a complex health state valuation setting. Our sample included 469 individuals aged 65 and above. Based on both revealed and stated cognitive burden, we found that the DCE tasks were less cognitively burdensome than the case 2 BWS tasks. Colour coding case 2 BWS cannot be recommended, as its effect on cognitive burden was less clear and the colour coding led to undesired choice heuristics. Our results have implications for future health state valuations of complex quality of life instruments and serve as an example of how to assess the cognitive burden associated with different types of choice experiments.


Introduction
Developments like ageing populations and rapid advances in medical technology create challenges for the budgets of publicly funded health care systems (de Meijer et al., 2013). Policy makers increasingly have to decide which health care services to include in the basic benefits package, which should only be made available to certain subpopulations, and which should not be funded at all. Health technology assessment (HTA) generates valuable insights to support this decision-making process, using tools like cost-utility analysis. In such analyses, the benefits of health technologies are typically expressed as the incremental amount of health changes they produce. This is calculated based on data from generic, multidimensional quality of life instruments and a weighting algorithm for the levels of their dimensions based on population or patient preferences (Neumann et al., 2016). Given that health and social care, for instance aimed at older persons, may affect more than health-related quality of life alone, broader well-being measures have more recently been developed (Makai et al., 2014). These could facilitate cost-utility analyses with a broader scope in terms of relevant outcomes, but require obtaining preferences for different 'well-being states', ideally anchored on death.
The measurement of population and patient preferences in health care is a rapidly developing field, with a plethora of qualitative and quantitative methods at the disposal of researchers and practitioners (Soekhai et al., 2019). One of the most popular methods over the last decade has been the discrete choice experiment (DCE), which is increasingly used to obtain population and patient preferences in health care (Soekhai et al., 2019). The 'standard' DCE entails asking respondents to choose between two or more alternatives (Ryan et al., 2008) and is widely used for weighting quality of life instruments.1 Another preference elicitation approach that has gained traction in recent years, also in this context, is best-worst scaling (BWS). There are three different forms of BWS: object case, profile case, and multi-profile case. The following focuses on the profile case, also called case 2 BWS, in which individuals have to select a best and a worst option from a list of dimension levels or items (Flynn and Marley, 2014). Case 2 BWS has been applied to value different quality of life instruments before (Cheung et al., 2016), including the ICECAP-O, a well-being measure specifically aimed at older people (Coast et al., 2008).
While both DCE and BWS provide numerical estimates of the relative importance of the different levels and dimensions of the respective quality of life or well-being instrument, previous research directly comparing DCE and BWS has shown that the choice between these approaches is not neutral, as the resulting preference estimates can differ (see e.g. Krucien et al., 2017). According to a recent review comparing DCE and BWS, there is no conclusive evidence yet on which of the methods should be preferred in terms of the validity of the estimates (Whitty and Oliveira Gonçalves, 2018). The two methods assume different choice processes and ultimately answer subtly different questions. Some researchers prefer DCEs because the modelled choice processes have a strong theoretical foundation in random utility theory (Louviere, 2004). Providing choices between multiple alternative profiles can also be considered a more realistic representation of the decision-making process than selecting a best and worst option from a list of items. Another advantage of DCEs in the context of health state valuation is that utilities can more easily be anchored onto the full health (or well-being)-dead scale. On the other hand, some argue that profile case BWS is to be preferred as it is a more efficient way of collecting data than a DCE, since each task entails two choices. Moreover, the cognitive burden of BWS tasks may be lower, since individuals only need to focus on one set of attributes and levels in each choice task, compared to multiple sets in DCEs. Some specifically recommend choosing case 2 BWS if DCE tasks are considered too burdensome (Flynn, 2010; Potoglou et al., 2011). However, Whitty and Oliveira Gonçalves (2018) conclude that there is no clear evidence for an advantage of BWS regarding participant acceptability in terms of feasibility of administration or response efficiency.
Response efficiency, that is, the cognitive burden associated with choice tasks, is important as it influences choice consistency, respondent fatigue, and the use of simplifying choice heuristics (Jonker et al., 2019), which could subsequently affect the validity of the preference estimates.
Due to the ageing of the population, the need for economic evaluations of health and social care services targeted at older people can be expected to increase. This makes accurately measuring and weighting quality of life dimensions in this population very important, and choosing the appropriate methodology to do so all the more relevant. If one decides, as we do here, that an instrument aimed at older people should be weighted using older people's preferences,2 one needs to be aware of an additional aspect: since cognitive abilities vary considerably among older people, the design of choice experiments for this population should be especially mindful of the complexity and subsequent cognitive burden of the choice task format, in order to obtain valid and reliable responses (Milte et al., 2014). Measuring and weighting quality of life or well-being outcomes inaccurately may ultimately lead to sub-optimal policy recommendations for the allocation of resources to health or social care services aimed at older people.
Specific evidence about the cognitive burden of DCE and case 2 BWS for valuing quality of life measures among older people is lacking. Therefore, the main aim of this study was to assess the cognitive burden and incidence of simplifying choice heuristics in DCE and case 2 BWS choice tasks among older people in this context. Another aim was to test the impact of the use of colour coding on the cognitive burden and choice behaviour of case 2 BWS tasks, which has been assessed for DCEs before (Jonker et al., 2019).

Methods
We set up a randomised experiment with three study arms to examine the cognitive burden and choice behaviour attached to three respective choice task formats for valuing a quality of life instrument: a colour coded and level overlapped DCE (5 out of 9 dimensions), a case 2 BWS, and a colour coded case 2 BWS.3 In the applied colour coding, five shades of one colour correspond to the five levels of the attributes of the used instrument, with darker shades representing the least desirable levels. The rationale behind this type of coding in the DCE is that it helps respondents to identify differences between the alternatives and between higher and lower levels, while neither nudging respondents to focus only on the differing attributes (as, for example, exclusively highlighting the non-overlapped levels would) nor introducing strong prejudgements about the severity of the levels (as, for example, a traffic light colour coding would).
We chose an online setting with participants from an online panel for our study, as this administration and sampling mode facilitates reaching a sufficiently large number of respondents for health state valuation studies, which is also why it is by now used in most such studies.
The quality of life measure used in the experiment was the recently developed Well-being of Older People instrument (WOOP) (Hackert et al., 2019). Examining the cognitive burden of a valuation task is especially important in the context of this new instrument for measuring the general/overall quality of life of older people: first, the WOOP consists of nine dimensions with five levels each, which requires complex choice tasks. Second, as preferences should be obtained from an older population, cognitive burden is of special relevance. The profiles shown to respondents in both DCE and BWS tasks corresponded to well-being states, described using the nine dimensions of the WOOP (i.e. physical health, mental health, social life, receive support, acceptance and resilience, feeling useful, independence, making ends meet, living situation).4 In designing the choice tasks and their visual representation, we followed methodological work on the use of colour coding and level overlap in DCEs aimed at reducing task complexity (Jonker et al., 2018, 2019; Maddala et al., 2003). To enable a more direct comparison and to test the impact of colour coding on task complexity in BWS, which has not been studied before, the randomised experiment included a colour coded BWS and a regular BWS. It is important to note that the design was generated to test the cognitive burden and choice behaviour of older people, not to provide model estimates for the different methods. Due to the large descriptive system of the WOOP, the latter would have required the estimation of 36 parameters in the DCE and 45 parameters in the BWS, a blocked design, and a much larger sample size. While a comparison of model estimates would have been interesting, this was not our current research aim.

Survey structure and randomisation
The structure of the experimental survey is shown in Fig. 1. First, respondents were asked to complete the WOOP instrument to become familiar with its dimensions and levels. Afterwards, they were randomised 1:1:1 to the three study arms: colour coded DCE (1), colour coded BWS or BWSc (2), and regular BWS (3). Randomisation was preferred over having the same respondents complete both DCE and BWS tasks, to avoid the different parts of the experiment influencing each other and to stay as close as possible to standard DCE and BWS experiments. Furthermore, two full sets of valuation tasks per respondent were considered too burdensome. Respondents were familiarised with the presentation of well-being states in the subsequent experiment by showing them their own profile in DCE or BWS format, based on the answers they had previously given to the WOOP instrument. The choice task formats were introduced by a simple DCE or BWS task, in which participants had to choose between two types of fruit or select the best and worst type of fruit from a list. The second part of the warm-up comprised a choice task as used in the subsequent experiment, providing further instructions. Subsequently, a block of six choice tasks was administered, followed by two simple break questions on an unrelated topic to interrupt the monotony and reduce respondent fatigue. Then, a second block containing seven tasks concluded the randomised part of the questionnaire, leading to a total of 13 choice tasks per respondent. All respondents subsequently had to fill in three blocks of evaluation questions on a 5-point Likert scale, before providing some sociodemographic information at the end of the survey.

Survey administration and participants
The survey was programmed using Sawtooth software version 9.7.2 (Sequim, WA). We used Prolific.co, a platform for online subject recruitment specifically for research purposes (Palan and Schitter, 2018), to recruit survey participants. Given our aim to assess the cognitive burden of the choice tasks in a sample of older people, being aged 65 or above was used as the inclusion criterion (this is also the target population of the WOOP). Since this age group was underrepresented in the online panel, we had to combine respondents from the two largest country panels of Prolific.co, UK and U.S. residents, to obtain a reasonably sized sample. At the time of data collection, in October 2019, the potential respondent pool contained around 1,000 individuals. Using quota sampling, we aimed for 150 respondents for each of the three study arms. Respondents received a monetary compensation for participating, which was based on the mean completion time and corresponded to an average hourly reward of £7.62. To test the functionality of the survey and whether respondents understood the choice tasks, six think-aloud interviews with UK residents aged 65 and above were conducted (two per study arm) prior to the main data collection. These interviews showed that participants understood and appropriately engaged with the choice tasks (i.e. traded off or considered multiple items).

Experimental design of DCE and BWS
Attributes and levels in the DCE and items of the BWS were based on the dimensions and levels of the WOOP instrument (Appendix A). This created a rather complex DCE setup with nine dimensions of five levels each and a BWS instrument with 45 items. WOOP well-being states were consequently defined by selecting one of the five levels from each of the nine dimensions, for both DCE and BWS. In the DCE, respondents were repeatedly presented with two well-being states and asked to indicate which of the two they preferred. An opt-out option was not included, as this is uncommon in DCEs for health state valuation. In the BWS, a list of nine well-being items corresponding to one well-being state was shown to respondents. Participants then had to select the aspect that they most preferred (best) and the aspect that they least preferred (worst). 'Most' and 'least' is one of the phrasings used for describing a best and a worst choice (Huynh et al., 2017).5 To ensure that the choice tasks had a level of complexity similar to a regular choice experiment, they were created using standard design methodology, as outlined in the subsequent paragraphs. The literature on health-related DCEs specifically targeted at older people was reviewed (in total, 22 papers were studied) to inform the number of choice tasks. In these studies, the number of choice tasks per respondent varied between 6 and 16, with a mean of 9.2. We opted for a number at the upper end of this range (13) to capture fatigue effects (examples of this literature are Arendts et al., 2017; Franco et al., 2016; Milte et al., 2014) and because we anticipated this might be close to the approximate number in the actual valuation study of the WOOP. The 13 choice tasks consisted of 10 DCE choice tasks, two tasks repeating one of them, and one choice task to test for dominance. The ten DCE choice tasks were selected with the help of the Ngene design software (Version 1.2.1).
To accommodate level overlap (five out of the nine dimensions), which has been shown to reduce task complexity by Maddala et al. (2003) and Jonker et al. (2018), Ngene required a dataset including all possible candidate sets, i.e. combinations of two health states with five overlapped levels. To pragmatically reduce this to a feasible number, 5,000 out of the 1,953,125 possible health states were randomly selected and combined in MATLAB (MathWorks). Out of the resulting 25 million possible sets, we excluded those without the specified amount of overlap and randomly selected 1,000 sets out of the remaining 386,030 overlapped sets. Ngene was then used to select 10 choice tasks out of the 1,000 candidate sets by optimising for a conditional logit, main effects model (Appendix C contains the utility function) with 36 parameters corresponding to four of the five levels of each of the nine dimensions of the WOOP instrument. Small priors ranging from 0 to −0.25 were assumed, following the logical ordering of the WOOP levels. Besides the think-aloud interviews, no further pilot testing was conducted.
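For illustration, the candidate-set construction described above can be sketched as follows. This is a hedged reconstruction in Python rather than the study's MATLAB code, and it uses far smaller numbers (200 states instead of 5,000) so that it runs quickly; the overlap filter is the essential step:

```python
import random
from itertools import combinations

random.seed(1)

N_DIM, N_LEV = 9, 5  # WOOP: nine dimensions, five levels each
N_STATES = 200       # illustration only; the study drew 5,000 states
N_OVERLAP = 5        # required number of dimensions sharing the same level

# Randomly draw candidate well-being states (one level per dimension).
states = [tuple(random.randrange(N_LEV) for _ in range(N_DIM))
          for _ in range(N_STATES)]

def overlap(a, b):
    """Number of dimensions on which two states have the same level."""
    return sum(x == y for x, y in zip(a, b))

# Pair up states and keep only pairs with exactly the specified overlap.
candidate_sets = [(a, b) for a, b in combinations(states, 2)
                  if overlap(a, b) == N_OVERLAP]

# A random subset of the overlapped pairs would then be fed to the design software.
subset = random.sample(candidate_sets, min(50, len(candidate_sets)))
```

In the actual study, the resulting 1,000 candidate sets were passed to Ngene, which selected the final 10 tasks for the conditional logit main effects model; that optimisation step is not reproduced here.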
An orthogonal main effects plan generated using Sawtooth software version 9.7.2 (Sequim, WA) was applied to create 1,000 versions of 10 choice tasks for the BWS experiment. Multiple levels from the same WOOP dimension were prohibited from appearing in the same task. Following Flynn et al. (2015), to prevent uninformative sets, we reduced the occurrence of tasks containing only one top or bottom WOOP level by deleting all versions in which this occurred more than three times across the 10 tasks. Out of the remaining 78 versions, one was randomly selected for use in the experiment.
We selected one of the created DCE and BWS choice tasks to appear as the second choice task and repeated it at positions 8 and 13 to test choice consistency, adding two choice tasks to the original 10. To reduce the amount of noise in the answers, we chose tasks that were expected to have a certain degree of utility difference between profiles in the DCE arm or to provide somewhat clear BWS choices (the repeated choice tasks are shown in Appendix B). When this task was repeated the second time, the intensity colour coding of the BWS task was intentionally reversed to mislead respondents, in order to assess their dependence on the colour codes. A dominant DCE choice task and a BWS task that was expected to have a clear best and worst choice were additionally created and added at position 6 to test the attention level of respondents, adding a third and final choice task to the original ten.6 The order of the dimensions (or attributes) was the same for all respondents within each elicitation method and fixed for both DCE and BWS tasks, to further reduce task complexity. The only difference in attribute order between DCE and BWS tasks was that the physical and mental health attributes were positioned in the middle of the BWS tasks, as we anticipated that these would be important dimensions and wanted to avoid respondents making their best and worst choices based merely on the top items without going over the remaining ones. All respondents in study arm 1 received the same 13 DCE tasks; respondents in study arms 2 and 3 received the same 13 BWS tasks.

Visual presentation of choice tasks
The general visual representation of the choice tasks followed current practice, with the exception that intensity colour coding was added to the choice tasks in study arms 1 and 2. Different shades of purple represented the different attribute levels, with darker shades of purple highlighting the worse and lighter shades and light blue expressing the better WOOP attribute levels in both the DCE and the colour coded BWS tasks. In the explanation of the colour coding in the survey, 'better levels' (e.g. very well able to cope, feeling very independent, no problems with physical health) were formulated as 'positive aspects' and 'worse levels' (e.g. barely able to cope, feeling very dependent, severe problems with physical health) as 'negative aspects' (e.g. Fig. 2). This type of colour coding was previously used for DCEs by Jonker et al. (2017, 2018, 2019) and was found to reduce task complexity as well as attribute non-attendance, being especially effective in combination with attribute level overlap. It was also shown that colour coding does not introduce bias in the choices and does not affect the relative importance of attributes (Jonker et al., 2019). The purple colour scheme was specifically designed to accommodate the most prevalent forms of colour blindness. Additionally, shades of purple do not prompt natural or perceived value judgements, as opposed to, for example, traffic light colour coding. Fig. 2 shows an example of the layout of the colour coded (light blue to deep purple) and overlapped (five out of the nine dimensions) DCE choice task. Level descriptions of the WOOP instrument (Appendix A) were shortened for clarity, level labels were highlighted in bold, and attribute descriptions appeared merely as mouseovers on the attribute labels to reduce the amount of text. Fig. 3 shows examples of both colour coded and non-colour coded BWS tasks.
Descriptions of attributes were also included as mouseovers, while the item text contained the full WOOP level descriptions.

Statistical analysis
To assess and compare the cognitive burden and possible choice heuristics associated with the three choice task formats, three types of data were analysed. First, objective measures, including mean choice task completion time, the development of time per task (assessing learning effects), and drop-out rates, were calculated and compared. Second, mean response scores on the three blocks of debriefing questions, covering perceived choice complexity, the number of choice tasks, and the choice strategies used, were obtained. The latter aimed to identify the extent to which respondents engaged in simplifying choice heuristics. This included two statements relating to the number of attributes commonly considered during the choice tasks, also known as attribute non-attendance (Yao et al., 2015), and a statement on deciding that all attributes/dimensions are equally important. Agreeing with this statement implies that respondents merely counted up the attribute level positions instead of trading off attributes in the DCE, or focused mostly on the level positions, irrespective of attribute, in the BWS format.
Third, revealed cognitive burden, in terms of choice consistency and (simplifying) choice behaviour, was assessed based on the actual choices of respondents. This included calculating the proportion of respondents providing the same answers to the twice-repeated choice task. For the BWS arms, a consistent response was defined as providing the same answer for either the best or the worst option, following Krucien et al. (2017). Furthermore, we estimated a lexicographic score, which provides information on trading between attribute levels and dominant choice behaviour. This score was also obtained following an approach applied by Krucien et al. (2017): first, the proportion of choices based on a single attribute was calculated at the individual level. Assuming that respondents exhibit dominant preferences for an attribute at proportions above 90% (DCE) and 50% (BWS), the lexicographic score was obtained by calculating the proportion of respondents with such preferences.
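As an illustration of the lexicographic score, the sketch below assumes (hypothetically) that the proportion of each respondent's choices consistent with deciding on a single attribute has already been computed from the choice data; the respondent values and attribute names are invented for the example:

```python
def lexicographic_score(choice_props, threshold):
    """Share of respondents exhibiting dominant (lexicographic) preferences.

    choice_props: one dict per respondent mapping attribute name to the
    proportion of that respondent's choices consistent with deciding on
    that attribute alone.
    threshold: proportion above which a respondent is classified as dominant
    (0.90 for the DCE, 0.50 for the BWS in the approach described above).
    """
    dominant = [max(props.values()) > threshold for props in choice_props]
    return sum(dominant) / len(dominant)

# Hypothetical example: three respondents, two attributes
props = [
    {"physical health": 0.95, "mental health": 0.40},  # dominant on physical health
    {"physical health": 0.60, "mental health": 0.55},  # trades off between attributes
    {"physical health": 0.30, "mental health": 0.92},  # dominant on mental health
]
score_dce = lexicographic_score(props, threshold=0.90)  # 2 of 3 respondents, ~0.67
```

A lower score indicates more trading between attributes; the different thresholds for DCE and BWS reflect the structural differences between the two task types.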
To test the impact of colour coding on choice behaviour and strategies in the BWS study arms, the shares of responses based on the top and bottom levels of the WOOP dimensions were calculated. Additionally, results from the second repeated choice task, in which the intensity colour coding was reversed, were used to assess respondents' dependence on the colour scheme.
Statistical significance was assessed using Wilcoxon rank-sum tests for the Likert scale data (de Winter and Dodou, 2010) and chi-squared tests or Fisher exact tests for proportions. A significance level of 10% was used throughout the analysis. Stata 15 was used for all calculations.

Results

Sample characteristics, dropouts, and completion time
A total of 477 participants started the experiment and were randomly allocated to the three study arms. No respondent dropped out in study arm 1 (DCE). One of the three dropouts in study arm 2 (BWSc) occurred during the choice tasks and two occurred afterwards. Of the five respondents dropping out in study arm 3 (BWS), four did so while answering the BWS tasks and one at a later stage. Fisher exact tests indicated that the total drop-out rate was significantly lower in study arm 1 than in study arm 3 (0% vs. 3.2%, p-value = 0.029). The difference with study arm 2 was not significant (0% vs. 1.9%, p-value = 0.248).
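For reference, such a drop-out comparison can be approximated with a one-sided Fisher exact test built from the hypergeometric distribution. The sketch below uses only the Python standard library and assumes an even 159/159 split of starters across the two arms (the exact arm sizes are not reported above), which yields a p-value close to the reported 0.029:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    the probability, under the hypergeometric null with fixed margins,
    of observing a or fewer events in the first cell."""
    n = a + b + c + d
    row1 = a + b  # size of the first group
    col1 = a + c  # total number of events (here: dropouts)
    return sum(
        comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
        for x in range(a + 1)
    )

# Drop-outs: 0 of 159 in the DCE arm vs. 5 of 159 in the standard BWS arm
# (the 159/159 split is an assumption, not taken from the text above)
p = fisher_exact_one_sided(0, 159, 5, 154)  # ~0.03
```

Under these assumed counts the one-sided p-value is about 0.03, consistent in magnitude with the reported 0.029.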
The characteristics of the remaining sample, split by study arm, are shown in Table 1. The randomisation led to well-balanced samples regarding most sociodemographic aspects, health status (EQ-5D-5L), and well-being (WOOP). Of the overall sample, 63.7% were younger than 70 years, 34.6% were aged between 70 and 79 years, and 1.7% were aged 80 years and above, with 87 years being the maximum age observed.
The average time it took respondents to complete all 13 choice tasks was 6.0 min (SD 3.1) for the DCE tasks, 7.6 min (SD 4.9) for the colour coded BWS tasks and 7.2 min (SD 4.6) for the standard BWS tasks. T-tests indicated that choice task completion time was significantly lower for the DCE tasks than for the two sets of BWS tasks (p < 0.001 and p = 0.007). Fig. 4 plots the mean and median completion times for each choice task, separated by study arm. Differences were most pronounced in the beginning, with choice task completion times following a downward trend, likely resulting from learning effects. Finding large differences in mean but only moderate differences in median answering times at the beginning indicates that some respondents found it particularly difficult to work with and understand the BWS question format compared to the DCE format. On aggregate, respondents in study arm 1 answered each choice task faster than those in the BWS study arms, except for one choice task. Differences between the two BWS study arms were less pronounced, with the notable exception of choice task 13, in which the intensity colour coding was reversed (i.e. light blue corresponded to the worst level and deep purple to the best).

Self-reported cognitive burden of tasks and number of choice tasks
Mean response scores on the three blocks of debriefing questions and results from significance tests comparing the mean scores across study arms are shown in Table 2. DCE choice tasks appeared to be superior in terms of the clarity of the tasks and whether the tasks were comprehensible from the beginning. Respondents found the presented states easier to imagine in the BWS tasks, which, admittedly, confronted participants with only one well-being state instead of the two in the DCE. Colour coded BWS choice tasks were evaluated as less clear than non-colour coded BWS tasks.
Results from the second block of questions indicated that participants in the DCE study arm found the number of choice tasks easier to manage, were better able to stay concentrated over all choice tasks, and felt they could have answered more tasks, compared to the BWS study arms, with most differences being statistically significant. Colour coding the BWS tasks appeared to have a positive effect on the extent to which respondents felt they could have answered more choice tasks.

Choice strategies and choice behaviour
Most respondents strongly agreed with the statement that they compared all dimensions/items before making their choices, with no significant differences between study arms (Table 2). There were mixed results concerning the use of simplifying choice heuristics or strategies when comparing the DCE and BWS study arms. While DCE participants agreed to a lesser extent that they had decided all dimensions/items were equally important, they also reported to a larger degree having based their decisions on the same one or two well-being dimensions, which implies some level of attribute non-attendance. Table 3 lists the results for the analysis of choice behaviour. The lexicographic score (see section 2.5) was significantly lower for DCE respondents, indicating more trading and less dominant choice behaviour. In the DCE, dominant preferences were observed only for the physical health attribute. In the BWS, such behaviour was also observed for the mental health and making ends meet attributes, with physical health still being the most prevalent.
In the DCE study arm, 4.4% of respondents did not provide the same answer to the repeated choice task when it appeared again for the first time (positions 2 and 8), with the same colour code. When it was repeated as the last choice task, that share was 2.5%. Up to 20% of respondents did not provide either the same best or the same worst answer in the repeated BWS tasks.7 When defining consistency as providing the same answer for both best and worst, this share increased to around 60%. There were no significant differences between the BWS study arms regarding the choice consistency of the first repetition. Almost half of the respondents did not provide a consistent best or worst answer to the repeated BWS choice task in which the intensity colour coding was reversed (position 13). This share was 72.8% when defining consistency in terms of selecting the same best and worst items.
We further calculated, at the individual level, the percentage of best and worst answers based on the top or bottom levels of the WOOP dimensions, and aggregated these by taking the average. The average share was between 60% and 75%, with higher values observed for the colour coded BWS tasks (significant difference for 'best').
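This aggregation can be sketched as follows; the responses are hypothetical and coded with WOOP levels 1 (best) to 5 (worst):

```python
def mean_extreme_shares(responses, top=1, bottom=5):
    """Average individual-level shares of 'best' answers given at the top
    level and 'worst' answers given at the bottom level.

    responses: one list per respondent of (best_level, worst_level) pairs,
    i.e. the WOOP level of the item picked as best and as worst per task.
    """
    best = [sum(b == top for b, _ in tasks) / len(tasks) for tasks in responses]
    worst = [sum(w == bottom for _, w in tasks) / len(tasks) for tasks in responses]
    return sum(best) / len(best), sum(worst) / len(worst)

# Two hypothetical respondents with three tasks each
responses = [
    [(1, 5), (1, 4), (2, 5)],  # best at top level in 2/3 tasks, worst at bottom in 2/3
    [(1, 5), (1, 5), (1, 5)],  # always picks the extreme levels
]
share_best, share_worst = mean_extreme_shares(responses)  # both 5/6, i.e. ~0.83
```

High values of these shares suggest respondents are selecting items largely by level position rather than by dimension, which is the heuristic the colour coding appeared to amplify.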

Discussion
To assess the cognitive burden of different types of choice tasks for valuing well-being states of quality of life measures in older people, a randomised experiment was conducted in an online setting, allocating respondents to either a DCE, a colour coded BWS, or a regular BWS format. Our study contributes to the literature by providing empirical evidence on 1) whether DCE or BWS choice tasks are associated with a lower cognitive burden in the context of health or well-being state valuation in an older population sample, and 2) whether colour coding of BWS tasks affects the cognitive burden and, to a lesser extent, the validity of BWS experiments.
Finding a lower drop-out rate and lower choice task completion times in the DCE study arm compared to the BWS study arms implies that, for older people, DCE choice tasks are less tiring and faster to complete than BWS tasks. Lower completion times were also observed by van Dijk et al. (2016). In terms of self-reported measures, our results indicate that the DCE tasks were also perceived as less cognitively burdensome, and that a higher number of DCE choice tasks was regarded as more acceptable than a higher number of BWS tasks. The former has also been reported in related studies in different contexts (Whitty and Oliveira Gonçalves, 2018). The latter is especially relevant when considering the number of choices per respondent, and hence the required sample size, when selecting the DCE or BWS format. Finding a lower cognitive burden associated with DCE tasks compared to BWS tasks is, in general, at odds with what has been reported before (Netten et al., 2012). The authors of that study also compared the cognitive burden of DCE and BWS tasks for valuing a large descriptive system of a quality of life instrument, but the design of their study was quite different: they used cognitive interviewing, a qualitative approach, in a small sample (N = 30), split the DCE task into two parts to reduce the difficulty of the task, and showed both DCE and BWS tasks to respondents.8 Whether the difference in findings relates to these differences in study design is difficult to say. In terms of (simplifying) choice strategies and choice behaviour, which co-occur with a larger cognitive burden, our results regarding the self-reported behaviour are mixed and less clear cut. We did observe considerably higher choice consistency and lower degrees of dominant choice behaviour among DCE respondents, with their measurement to some degree accommodating the methodological differences. However, these results may relate more to artefacts of the type of choice task and may be unrelated to cognitive burden. As stated by Whitty and Oliveira Gonçalves (2018), the probability of answering a DCE task consistently by pure chance is already 50%. With nine dimensions, this probability is much lower (around 22%) for the BWS task (with consistency defined as providing either the same best or the same worst answer). Nevertheless, the finding that around 60% of BWS respondents did not provide the same best and worst answers when a choice task was repeated for the first and second time is somewhat worrisome on its own. A higher degree of trading and lower degrees of dominant choice behaviour in DCEs have also been reported in the related literature (Krucien et al., 2017; Whitty et al., 2014), with a similar caveat as for analysing choice consistency.
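The chance-level consistency probabilities can be checked by enumeration. Assuming a simple random-choice model, in which the best item is drawn uniformly from the nine items and the worst uniformly from the remaining eight, independently across the two responses, the probability of matching on either the best or the worst item works out to 5/24 ≈ 0.21, close to the roughly 22% quoted above:

```python
from fractions import Fraction
from itertools import product

N = 9  # items per BWS task

# All (best, worst) pairs with best != worst, equally likely under random choice
pairs = [(b, w) for b, w in product(range(N), repeat=2) if b != w]
prob = Fraction(1, len(pairs))

# Probability that two independent random responses share the best OR the worst item
p_consistent = sum(
    prob * prob
    for (b1, w1) in pairs
    for (b2, w2) in pairs
    if b1 == b2 or w1 == w2
)
# p_consistent = 5/24 (about 0.21), versus 0.5 for a binary DCE task
```

The small gap to the 22% figure likely reflects a slightly different random-choice model; either way, chance-level consistency is far lower for BWS than for a binary DCE, which is the point being made above.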
Comparing colour-coded with non-colour-coded BWS, we found similar drop-out rates for the two tasks (1.9% and 3.2%, respectively). In the study by Jonker et al. (2018) (study arms 1 and 2), colour coding of the DCE tasks decreased the drop-out rate from 13.9% to 9.8%. Further results from the same study setup showed that colour coding alone did not lead to differences with respect to the self-reported cognitive debriefing questions (Jonker et al., 2019). Our results for BWS regarding these questions are mixed. While participants of the colour-coded BWS on average agreed to a higher extent that they could have answered more choice tasks, the non-colour-coded BWS choice tasks appeared to have been clearer to respondents. Given the inconclusive evidence on cognitive burden, and the fact that colour coding increased the already strong focus on the top and bottom levels of the quality of life instrument in the BWS tasks, colour-coded BWS cannot be recommended for health or well-being state valuation studies among older people.
The overall implications of our analysis must be interpreted in light of several limitations. First, the rather small sample size did not provide enough statistical power to use several blocks of choice tasks, which would also have allowed us to estimate DCE and BWS models. During the design stage, we aimed for 150 respondents per study arm due to the small overall pool of individuals aged 65 and above on online platforms. While the choice sets were created according to standard design methodology, it could be that one of the two choice sets is more difficult to answer in general, irrespective of choice task format, due to smaller utility differences within the shown profiles. As utility weights for the WOOP are not yet available, it was not possible to account for this in the selection of the choice sets. This risk could have been reduced had multiple blocks been used. A second, related, limitation is that DCE and BWS models could not be estimated, which prevented us from analysing the actual choices people made. Testing for choice consistency or overall noise in the data would have given us an indication of the quality of the responses. However, such a comparison between DCE and BWS responses would have come with additional limitations.
In terms of the generalisability of our results, we need to acknowledge the following: our study was conducted in an online setting, with respondents from an online panel. As certain subpopulations with varying levels of cognitive abilities may self-select into such panels (especially at older ages), the representativeness with respect to the general population aged 65 and above may be limited. However, the purpose of our study was to provide an indication of the cognitive burden of different methods specifically using respondents from online panels, which by now are the most frequently used sampling format for these types of analyses. Therefore, our results should only be generalised to similar online settings. Our sample was likely at the upper end of the spectrum of cognitive abilities of people aged 65 and above (highly educated and rather healthy, see Table 1). It is not certain whether our conclusions would be the same in a sample with average or low levels of cognitive abilities, as we did not measure cognitive abilities directly. However, using years of education as an imperfect proxy for overall cognitive ability, we could not observe an education, and therefore cognitive ability, gradient in our results (i.e. the direction of our results remained stable when splitting our sample into a lower and a higher educated group). To increase the representativeness of the sample in a full-scale valuation study among older people using online panels, it will be necessary to implement further age stratification by setting appropriate age group quotas.
As for the generalisability towards other online panels, the following limitation applies: as per the rules of the online platform, the recruitment of respondents involved a monetary compensation which is rather high compared to standard online panels, and which can be reduced if the researcher is not satisfied with the quality of the responses. While this benefits respondents and their motivation, it led to very low drop-out rates and could also have affected other parts of the analysis. Another caveat is that the applicability of our results to comparisons of DCEs without overlap and colour coding with BWS is limited. However, the use of level overlap in similar DCEs as a strategy to reduce task complexity seems to be increasing (e.g. King et al., 2018; Mulhern et al., 2019). Not a limitation as such, but important to note in terms of cognitive burden: in the DCE setup, it was possible and logical to shorten the level descriptions compared to the full level text in the BWS, as the attributes were already included on the left side of the task (Fig. 3). This may also have contributed to the DCE tasks being perceived as easier to handle.

Conclusions
Overall, we found evidence that level-overlapped and colour-coded DCE choice tasks are less cognitively burdensome than BWS choice tasks in a complex health (or, here, well-being) state valuation exercise among older people in an online setting. This has implications for future valuation studies, especially since the complexity of the measures to be valued seems to increase when moving from health-related to overall quality of life; see, for instance, the WOOP (Appendix A), the current plans of the E-QALY project (https://scharr.dept.shef.ac.uk/e-qaly/), or another ongoing study developing a quality of life measure for older people (Ratcliffe et al., 2019). Cognitive burden should be an important factor in deciding which method to use for valuing such descriptive systems, but statistical and theoretical aspects need to be considered as well. Although our results may not be easily generalisable to other topics of study within or outside health care, or to other study populations, our analysis may at least serve as an example of how to assess the cognitive burden associated with different types of choice experiments.

Funding
Sebastian Himmler receives financial support from a grant from the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme (grant agreement No. 721402). The funding agreement ensured the authors' independence in designing the study, interpreting the data, writing, and publishing the report.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Well-being of Older People (WOOP) instrument
For each section, select the description that is most appropriate for you today.
Physical health. Consider physical conditions or ailments and other physical impairments that affect your daily functioning.

Receive support. Everyone needs help or support sometimes. Consider practical or emotional support, for example from your partner, family, friends, neighbours, volunteers or professionals. This concerns being able to count on support when you need it, as well as the quality of the support.
• I'm very satisfied with the support I get, when needed
• I'm satisfied with the support I get, when needed
• I'm reasonably satisfied with the support I get, when needed
• I'm dissatisfied with the support I get, when needed
• I'm very dissatisfied with the support I get, when needed

Acceptance and resilience. Consider your acceptance of your current circumstances and your ability to adapt to changes to these, whether or not with support of your religion or belief.
• I'm very able to deal with my circumstances and changes to these
• I'm able to deal with my circumstances and changes to these
• I'm reasonably able to deal with my circumstances and changes to these
• I'm not able to deal with my circumstances and changes to these
• I'm not at all able to deal with my circumstances and changes to these

Feeling useful. Consider meaning something to others, your environment or a good cause.
• I feel very useful
• I feel useful
• I feel reasonably useful
• I do not feel useful
• I do not feel at all useful

Independence. Consider being able to make your own choices or doing the activities that you find important.
• I feel very independent
• I feel independent
• I feel reasonably independent
• I feel dependent
• I feel very dependent

Making ends meet. Consider having enough money to meet your daily needs and having no money worries.
• I'm more than able to make ends meet
• I'm able to make ends meet
• I'm reasonably able to make ends meet
• I'm not able to make ends meet
• I'm not at all able to make ends meet

Living situation. Consider living in a house or neighbourhood you like.