Key Points for Decision Makers

The cancer-specific preference-based measure European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Utility-Core 10 dimensions (QLU-C10D) is now a validated instrument for health economic evaluations in the lung cancer population.

Generally, there is a good concordance for the health state utility values derived by the generic measure EQ-5D-3L and the cancer-specific measure EORTC QLU-C10D.

The EORTC QLU-C10D shows improved measurement precision towards the upper and lower end of the scale compared to the generic measure EQ-5D-3L.

When utilising the EORTC Quality of Life Questionnaire (QLQ-C30) for clinical outcomes assessment in lung cancer trials, the QLU-C10D now complements its parent instrument with an algorithm useful for health economic decision making.

1 Introduction

Lung cancer is the second most common malignant tumour worldwide [1] and the most common cause of death from cancer [2]. Non-small-cell lung cancer (NSCLC) represents a massive economic [3] and clinical [4] burden for healthcare systems. It accounts for up to 85% of all new lung cancer diagnoses [4]. The majority of patients with NSCLC are already in an advanced disease stage (IIIB or IV) at diagnosis and in urgent need of treatment [4]. With fewer than 20% of newly diagnosed NSCLC patients alive after 5 years, the overall global 5-year survival rate for patients with NSCLC is very low [5]. Inoperable stage III NSCLC is typically treated with radiotherapy and standard chemotherapeutic agents like cisplatin, while the use of targeted treatments remains under investigation [6]. Patients with activating somatic mutations of the tyrosine kinase domain of epidermal growth factor receptor (EGFR) constitute up to one third of patients with NSCLC [7, 8].

Patients with EGFR mutations tend to have better clinical outcomes when treated with EGFR tyrosine kinase inhibitors (TKI) rather than chemotherapy [9, 10]. EGFR TKIs, including afatinib, have been investigated as therapeutic agents for treating advanced NSCLC [11,12,13]. The EGFR is currently the most established molecular target in NSCLC. Afatinib is an irreversible epidermal growth factor family inhibitor. It improved first-line progression-free survival compared with chemotherapy in two large phase 3 trials in patients with EGFR mutation-positive advanced lung adenocarcinoma, as well as improving overall survival in patients with the EGFR del19 mutation [14,15,16].

Complementing the major challenge of improving survival, the assessment of patient-reported outcomes (PROs) and health-related quality of life (HRQoL) is an important aspect from the clinical [17] and economic perspective [3] in NSCLC patients. PRO data can be used to assess treatment options and are of particular interest if only marginal differences are observed in the overall survival or toxicity profile of different agents. PRO data are therefore valuable in health economic decision making [18]. Considering the significant burden NSCLC poses on both patients and healthcare systems [3], the evaluation of treatments and the assessment of patients' and societal preferences remain essential tasks [19]. The assessment of societal preferences in allocating healthcare resources can be supported by assessing PROs using preference-based measures (PBMs) that allow the estimation of quality-adjusted life years (QALYs) for use in cost-utility analysis (CUA) [20], which is widely used by Health Technology Assessment (HTA) agencies in most industrialised countries.

PBMs are based on a health state classification system in conjunction with a value set consisting of utility decrements used to determine health state utility values [21, 22]. These health state utility values generally range from 0 (dead) to 1 (full or perfect health) [23]. Several generic PBMs such as the EQ-5D-3L [24] and others [21, 25] are used to facilitate health economic evaluations in various disease settings. Generic PBMs assess general and universally applicable health domains and are therefore used to estimate utility values across a variety of medical conditions. The comparability of results across disease groups and patient populations [26] currently makes generic PBMs the primary instruments to facilitate CUA assessments [27]. Complementing the generic PBMs, disease-specific PBMs have been developed to assess health state utility values in specific patient groups [26, 28,29,30]. Disease-specific PBMs are conceptualised to capture the most relevant aspects of health in certain disease and patient populations. Therefore, health state changes relevant to a certain population (e.g. changes in nausea, fatigue, or appetite loss for cancer patients) [22] can be accounted for when deriving health state utility values and performing CUAs [31].
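The relationship between a health state classification, a value set of utility decrements, and QALYs can be illustrated with a minimal sketch. The dimension names, the decrement values, and the function names below are hypothetical and chosen for illustration only; they do not correspond to any published value set.

```python
# Hypothetical utility decrements per dimension and severity level
# (index 0 = no problems). These numbers are invented for illustration
# and are NOT taken from any published value set.
HYPOTHETICAL_DECREMENTS = {
    "physical_functioning": [0.0, 0.05, 0.10, 0.20],
    "pain": [0.0, 0.03, 0.08, 0.15],
    "fatigue": [0.0, 0.02, 0.04, 0.07],
}


def utility(state, decrements=HYPOTHETICAL_DECREMENTS):
    """Return 1 minus the summed decrements for a health state.

    `state` maps each dimension to a severity level (0 = no problems).
    """
    total = sum(decrements[dim][level] for dim, level in state.items())
    return 1.0 - total


def qalys(utilities_and_years):
    """Sum utility x duration (in years) over consecutive periods."""
    return sum(u * years for u, years in utilities_and_years)


full_health = {"physical_functioning": 0, "pain": 0, "fatigue": 0}
impaired = {"physical_functioning": 2, "pain": 1, "fatigue": 3}

print(utility(full_health))
print(round(utility(impaired), 2))
# Six months in the impaired state followed by one year in full health.
print(round(qalys([(utility(impaired), 0.5), (utility(full_health), 1.0)]), 2))
```

In this additive form, a state with no problems on any dimension receives a utility of 1, and each additional problem lowers the utility by the corresponding decrement.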

For the estimation of health utility values in the cancer patient population, the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Utility-Core 10 dimensions (QLU-C10D) was previously developed. For its development, the structure and content of the widely used HRQoL questionnaire EORTC Quality of Life Questionnaire (QLQ-C30) was utilised to identify the most relevant HRQoL domains to cancer patients. The EORTC QLU-C10D is therefore designed to provide a scoring algorithm for EORTC QLQ-C30, allowing the calculation of utility values from the respective PRO data [22].

In view of the different PBMs and their advantages and disadvantages, there is an ongoing discussion regarding which PBM to use in a certain scenario (e.g. palliative care, elderly people) and condition (e.g. cancer, chronic disease) [28, 32,33,34]. Various aspects of generic versus disease-specific PBMs are relevant to this discussion; one of these is psychometric criteria. Validity aspects, such as criterion validity and construct validity, are considered important psychometric criteria for health status measures [35]. Therefore, the aim of this study was to assess the validity of a cancer-specific PBM, the EORTC QLU-C10D, in the NSCLC patient population using data from four LUX-Lung trials [36,37,38,39], and to compare it with the validity of the generic PBM, the EQ-5D-3L. In this article, we focus on three quality criteria for questionnaires assessing health status [35], namely floor and ceiling effects, criterion validity, and construct validity; while these concepts are widely used in psychology, they are similarly important when evaluating instruments that underpin the estimation of QALYs for CUA. Additionally, we investigate the impact of applying different country tariff utility decrements to the PRO data. This provides insight into whether different country tariffs affect the validity parameters of the EORTC QLU-C10D. This article contributes to the body of knowledge concerning the use of the EORTC QLU-C10D as a PBM in different cancer populations and treatment contexts.

2 Methods

2.1 The EORTC QLQ-C30

The EORTC QLQ-C30 version 3.0 [40] consists of 30 questions that form 15 scales, five of which are functioning scales (Physical, Role, Emotional, Social, and Cognitive), nine are symptom scales (Fatigue, Nausea and Vomiting, Pain, Dyspnoea, Insomnia, Appetite Loss, Constipation, Diarrhoea, and Financial Difficulties), and one is a global health status/quality of life scale. Responses are provided on a 4-point Likert scale (“not at all”, “a little”, “moderate”, “very much”) for all questions except for the global health status and quality of life, which are rated from 1 “very poor” to 7 “excellent”.

2.2 EORTC QLU-C10D

The EORTC QLU-C10D [22] comprises two components. The first is a health state classification system based on 13 of the 30 items of the EORTC QLQ-C30 [22] that form ten dimensions (Role Functioning, Sleep, Appetite, Social Functioning, Bowel Problems, Emotional Functioning, Pain, Fatigue, Physical Functioning, and Nausea) with the same response options as the EORTC QLQ-C30 (“not at all”, “a little”, “moderate”, “very much”). The second is a series of country-specific preference-based scoring algorithms that allow the derivation of QLU-C10D health utilities from QLQ-C30 data. These are derived from valuation studies that use a standardised discrete-choice experiment to elicit the health state preferences of the general population of each country [41]. To date, country-specific tariffs have been developed for six European countries and three English-speaking countries in addition to the UK [42,43,44,45,46,47,48,49,50], with further tariffs in development in Europe as well as three Asian countries.

2.3 EQ-5D-3L

The EQ-5D-3L is an established generic PBM [24, 51] that is frequently used in health economic studies. It comprises five items on generic health issues (Mobility, Self-Care, Usual Activities, Pain/Discomfort, Anxiety/Depression) with three response options each (“no problems”, “some problems”, “severe problems/unable to”) as well as a visual analogue scale (VAS) (0–100) assessing self-rated overall health. For its use as a PBM, the EuroQol measurement system typically relies on the descriptive system, not on the VAS. Valuation studies for the EQ-5D-3L rely on time trade-off, VASs, or discrete-choice experiments, in general populations [52], guided by the EuroQol Valuation Technique manual. National tariffs are available for a large range of countries and are available at the EuroQol website [52].

2.4 Data Sources

For this analysis, we used data from four LUX-Lung studies, multicentre randomised controlled trials (RCTs) that assessed the treatment benefit of afatinib in patients with EGFR-mutated adenocarcinoma (LUX-Lung 1 [LL1], LL3, LL5) [36,37,38] or patients with squamous cell carcinoma (LL8) [39]. Only adult patients with pathologically diagnosed NSCLC stage IIIB or IV were included in these trials. Patients were positive for the EGFR mutation or had squamous cell carcinoma. Most patients had been pre-treated with at least one cycle of chemotherapy and/or with erlotinib/gefitinib. The intervention arms all included treatment with afatinib; the control arms were either placebo (LUX-Lung 1), cisplatin/pemetrexed (LUX-Lung 3), investigator's choice of chemotherapy (LUX-Lung 5), or erlotinib for patients with squamous cell carcinoma (LUX-Lung 8). All trials collected HRQoL data using both the EQ-5D-3L and the EORTC QLQ-C30. The EQ-5D-3L and the EORTC QLQ-C30 were administered at the same time point. Baseline data from patients who completed both the EORTC QLU-C10D and EQ-5D-3L measures are included in the current analyses. Further details regarding the original trials can be found elsewhere [36,37,38,39].

2.5 Statistical Analysis

2.5.1 Data Selection

Data from these four LUX-Lung studies were pooled, giving a total sample of 1736 patients. Patients were included in the current analyses if they had complete data for both the EORTC QLU-C10D and the EQ-5D-3L. For the assessment of core validity parameters, the analysis relied on baseline data only. In a subsequent publication, further psychometric criteria of the EORTC QLU-C10D will also be evaluated using the longitudinal data of the LUX-Lung studies. All analyses were performed using the software R [53].

2.5.2 Descriptive Statistics

Sociodemographic characteristics of the pooled sample are descriptively presented as absolute frequencies, means and standard deviations (SDs). EORTC QLU-C10D and EQ-5D-3L scores were calculated in accordance with the respective valuation studies. Here, we used the utility decrements of Australia [46, 54], Canada [47, 55], Italy [43, 56], the Netherlands [44, 57], Poland [43, 58], and the United Kingdom [49, 59].

2.5.3 Criterion Validity

Criterion validity is defined as the association of a health status measure with a gold-standard [60]. It was assessed by correlating the utility scores of the EORTC QLU-C10D with those of the EQ-5D-3L using Pearson correlations. The EQ-5D-3L serves here as a comparator measure as it has been considered a standard in performing health economic evaluations [61]. Criterion validity was considered established by a correlation coefficient of at least 0.7 [35].
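As an illustration of this criterion, the following minimal Python sketch computes a Pearson correlation between two sets of utilities and compares it with the 0.7 threshold. The analyses reported here were performed in R; the function names and the example utilities below are hypothetical and serve only to make the calculation concrete.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def criterion_validity_established(r, threshold=0.7):
    """Apply the 'at least 0.7' rule to a correlation coefficient."""
    return r >= threshold


# Invented utilities for five patients (illustration only).
qlu_c10d = [0.91, 0.62, 0.75, 0.40, 0.83]
eq_5d_3l = [1.00, 0.69, 0.76, 0.52, 0.85]

r = pearson_r(qlu_c10d, eq_5d_3l)
print(round(r, 3), criterion_validity_established(r))
```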

2.5.4 Construct Validity

Construct validity is defined as the extent to which the results of an instrument are consistent with prespecified hypotheses, such as, but not limited to, the relationships with the results of other instruments [62]. It was evaluated by investigating hypothesised high correlations (convergent validity) and low correlations (divergent validity) of theoretically converging domains (Physical Functioning–Mobility, Role Functioning–Usual Activities, Social Functioning–Usual Activities, Emotional Functioning–Anxiety/Depression, Pain–Pain/Discomfort) or diverging domains (all other domain pairs) of the EORTC QLU-C10D and the EQ-5D-3L using Spearman correlations. The hypothesised theoretically converging and diverging pairs of domains can be found in Table 1. At least 75% of all the hypothesised directions of correlations should be observable in order to establish good construct validity [35]. The standard classification of weak (r = 0.30–0.49), moderate (r = 0.50–0.69), and strong (r ≥ 0.70) [63] was used to categorise the correlations between scores. Furthermore, we produced Bland-Altman plots that graphically illustrate the scattering of score differences across the measurement continuum, as well as the limits of agreement (defined as the mean difference ± 1.96 × the SD of the differences) of the two measures.
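The correlation-based part of this procedure can be sketched as follows. This is an illustrative Python sketch rather than the R code used for the analyses; the function names, the label "below weak" for coefficients under 0.30, and the example domain levels are assumptions introduced here. Spearman's rho is computed as a Pearson correlation on average ranks, which handles the many ties that occur in ordinal domain levels.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman correlation as Pearson correlation of average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def classify(r):
    """Weak / moderate / strong cut-offs at 0.30, 0.50, and 0.70."""
    size = abs(r)
    if size >= 0.70:
        return "strong"
    if size >= 0.50:
        return "moderate"
    if size >= 0.30:
        return "weak"
    return "below weak"  # label assumed here; not named in the text


def construct_validity_supported(confirmed_flags, required_share=0.75):
    """True if at least 75% of the prespecified hypotheses were confirmed."""
    return sum(confirmed_flags) / len(confirmed_flags) >= required_share


# Invented ordinal domain levels for eight patients (illustration only).
physical_functioning = [1, 1, 2, 3, 4, 2, 1, 3]
mobility = [1, 1, 2, 2, 3, 1, 1, 2]
rho = spearman_rho(physical_functioning, mobility)
print(round(rho, 3), classify(rho))
```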

Table 1 QLU-C10D and EQ-5D domain and health state description analogies

2.5.5 Floor and Ceiling Effects

Floor and ceiling effects were estimated as the frequencies of the highest/lowest possible score, once for the utilities and once for each of the domains of the EORTC QLU-C10D and the EQ-5D-3L. Floor and ceiling effects are present if > 15% of patients achieve the lowest/highest possible score [35]. The presence of floor or ceiling effects would indicate a reduced reliability of the measures towards the lower/upper end of the scale [35]. When performing valuation studies, floor and ceiling effects potentially limit the correlation and agreement of scores between the two measures, as the measurement sensitivity in the lower/upper ranges of the scale may be limited for one instrument while the other instrument has sensitivity to health state differences at those extremes.
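The 15% rule can be expressed in a few lines. The sketch below is illustrative Python with hypothetical function names and invented example utilities; the lowest possible score of −0.1 is likewise an arbitrary placeholder, since the scale minimum depends on the value set applied.

```python
def has_effect(share, threshold=0.15):
    """A floor or ceiling effect is present if the share exceeds 15%."""
    return share > threshold


def floor_ceiling(scores, lowest, highest, threshold=0.15):
    """Share of respondents at the lowest and highest possible score."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_share = sum(1 for s in scores if s == lowest) / n
    ceiling_share = sum(1 for s in scores if s == highest) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": has_effect(floor_share, threshold),
        "ceiling_effect": has_effect(ceiling_share, threshold),
    }


# Invented utilities for ten patients (illustration only).
utilities = [1.0, 1.0, 0.8, 0.7, 1.0, 0.5, 0.9, 0.6, 0.85, 0.4]
print(floor_ceiling(utilities, lowest=-0.1, highest=1.0))
```

The same function applies to single domains if the raw response levels are passed instead of utilities, with `lowest` and `highest` set to the worst and best response level.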

3 Results

3.1 Sample Characteristics

A total of 1736 patients were included in this analysis (see Table 2). The individual LUX-Lung studies contributed 522 (LL1), 291 (LL3), 194 (LL5), and 729 (LL8) patients to the sample. The median age was 62 years, and 41% of the patients were female. Of the 1736 patients, 1040 were assigned to the experimental arms of the RCTs, and 696 patients were in the control arms (see Table 2).

Table 2 Data selection and sample characteristics

3.2 Criterion Validity

The correlation of utilities between the two PBMs ranged from 0.649 for the Dutch country tariff to 0.718 for the Polish country tariff. For all investigated countries except one (the Netherlands), correlations between utilities exceeded the predefined threshold of r = 0.7, indicating good criterion validity for most country tariffs when calculating EORTC QLU-C10D utilities.

3.3 Construct Validity

Correlations between pairs of domains were clearly higher in corresponding domains than in non-corresponding domains but were somewhat lower than expected for most corresponding domains. A pattern different from the other countries was seen using the Dutch weights, where the expected high correlation between Emotional Functioning and Anxiety/Depression was low (r = 0.147) and the correlation between Pain and Pain/Discomfort was lower than expected and lower than in other countries (r = 0.472). For details, see Table 3 and supplementary Tables S1–S6 in the electronic supplementary material (ESM).

Table 3 Correlations between the EORTC QLU-C10D and EQ-5D-3L utilities and domain scores at baseline

Assessment of overall agreement between utilities depicted in Bland-Altman plots showed a similar pattern across countries with regard to a certain proportional bias (i.e. score differences vary across the measurement continuum). For all countries, with the exception of the Netherlands, the EORTC QLU-C10D resulted in lower mean utility scores than the EQ-5D-3L (Fig. 1). The mean differences (the blue bias line in the plots) lie between −0.08 (95% confidence interval [CI] −0.090 to −0.074) (Canadian tariff) and 0.03 (95% CI 0.0243 to 0.0417) (Dutch tariff), with SDs between 0.17 and 0.18 (see Table 4). Scrutinising the patterns in the Bland-Altman plots (Fig. 1) suggests that high-ranging health states show less variability around the bias line than low-ranging health states. Thus, the discrepancy between the utility values varies over the measurement continuum. The differences tend to become larger as the average of the two PBM scores decreases, and sporadic outliers outside the upper and lower limits of agreement appear at the lower end of the health state average. Histograms displaying the distribution of the mean difference of utility values are provided in supplementary Figure S1 in the ESM.
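The quantities behind such an agreement analysis can be sketched as follows. This illustrative Python sketch uses the standard Bland-Altman definitions (bias as the mean of the paired differences, limits of agreement as bias ± 1.96 × the SD of the differences) and, as an additional assumption introduced here, a simple least-squares slope of the differences on the pairwise averages as a crude indicator of proportional bias. Function names and example values are hypothetical.

```python
import statistics


def bland_altman(qlu, eq):
    """Bias, SD of differences, and 95% limits of agreement (QLU minus EQ)."""
    diffs = [a - b for a, b in zip(qlu, eq)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    return {
        "bias": bias,
        "sd": sd,
        "lower_limit": bias - 1.96 * sd,
        "upper_limit": bias + 1.96 * sd,
    }


def proportional_bias_slope(qlu, eq):
    """Least-squares slope of the differences on the pairwise averages.

    A slope clearly different from zero suggests that the disagreement
    changes along the measurement continuum.
    """
    diffs = [a - b for a, b in zip(qlu, eq)]
    avgs = [(a + b) / 2 for a, b in zip(qlu, eq)]
    mean_avg = statistics.mean(avgs)
    mean_diff = statistics.mean(diffs)
    sxy = sum((m - mean_avg) * (d - mean_diff) for m, d in zip(avgs, diffs))
    sxx = sum((m - mean_avg) ** 2 for m in avgs)
    return sxy / sxx


# Invented utilities for three patients (illustration only).
qlu_c10d = [0.2, 0.5, 0.9]
eq_5d_3l = [0.4, 0.6, 0.9]
print(bland_altman(qlu_c10d, eq_5d_3l))
print(round(proportional_bias_slope(qlu_c10d, eq_5d_3l), 3))
```

In this toy example the differences shrink towards the upper end of the scale, which yields a positive slope.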

Fig. 1

Bland-Altman plots displaying the level of agreement between the EORTC QLU-C10D and the EQ-5D-3L for different country tariffs. The blue line indicates the mean difference of EORTC QLU-C10D and EQ-5D-3L utility scores. Red lines indicate the level of agreement of the utility scores of the two instruments. AUS Australia, CAN Canada, EORTC European Organisation for Research and Treatment of Cancer, IT Italy, NL the Netherlands, PL Poland, QLU-C10D Quality of Life Utility-Core 10 dimensions, UK United Kingdom

Table 4 Agreement between the EORTC QLU-C10D and the EQ-5D-3L

3.4 Floor and Ceiling Effects

For the EORTC QLU-C10D utilities, between 2.94% and 4.67% of patients reached the highest possible utility value, with the Dutch tariff showing the highest proportion (4.67%). Larger ceiling effects were observed for the EQ-5D-3L utilities, at 22.93% across all country tariffs. For both the EORTC QLU-C10D and the EQ-5D-3L, no floor effect was observed, as the proportion of responders reaching the lowest possible utility value remained below 1% across all country tariffs (Table 5).

Table 5 Ceiling and floor effects of the EORTC QLU-C10D and the EQ-5D-3L utilities

Ceiling and floor effects for specific domains were scrutinised using the EORTC QLU-C10D and the EQ-5D-3L raw data (i.e. without a country tariff applied to the data). The highest ceiling effect in a single domain of the EORTC QLU-C10D raw data was 75.98%, for Nausea. Ceiling effects for the other symptom domains ranged from 25.0% for Fatigue to 55.8% for Bowel Problems. Among the functioning domains, Emotional Functioning showed the highest ceiling effect (54.6%) and Physical Functioning the lowest (25.6%). Floor effects were not present for the EORTC QLU-C10D; Physical Functioning (10.0%) showed the highest proportion of responders reaching the lowest possible score. All the other domains remained below 10%. Note that high response levels across all domains indicate a high level of functioning and a low level of symptom burden, respectively, in this patient population [22].

For the EQ-5D-3L, the ceiling effect for Self-Care was the highest of all single domains, at 85.8%. For Mobility, the EQ-5D-3L domain that theoretically corresponds to the EORTC QLU-C10D Physical Functioning, the ceiling effect was 61.12%. For the other domains, ceiling effects were 61.1% for Usual Activities, 58.7% for Anxiety/Depression, and 38.3% for Pain/Discomfort. No floor effect was found for the EQ-5D-3L; Pain/Discomfort (5.5%) showed the highest proportion of responders reaching the lowest possible value. All other domains remained below 5%. For further details see Fig. 2. Floor and ceiling effects on the domain level correspond to the lowest and highest item scores for single-item domains and to the lowest and highest domain scores for multi-item domains (EORTC QLU-C10D: Physical Functioning, Social Functioning, and Bowel Problems), respectively.

Fig. 2

Relative frequencies per response level for the EORTC QLU-C10D and the EQ-5D-3L. AD Anxiety/Depression, AP Appetite, BO Bowel Problems, EF Emotional Functioning, EORTC European Organisation for Research and Treatment of Cancer, FA Fatigue, MO Mobility, NA Nausea, PA Pain, PD Pain/Discomfort, PF Physical Functioning, QLU-C10D Quality of Life Utility-Core 10 dimensions, RF Role Functioning, SC Self-Care, SF Social Functioning, SL Sleep, UA Usual Activities

4 Discussion

The EORTC QLU-C10D is a recently developed preference-based scoring algorithm for the widely used EORTC QLQ-C30, designed to derive utilities that could inform health economic evaluations in cancer patient populations [22, 40]. In this article, we aimed to report on three psychometric properties in a lung cancer population: criterion validity, construct validity, and ceiling and floor effects. These concepts are widely used in psychology but are similarly important when determining the appropriateness of instruments to support HTA. Here, the widely used EQ-5D-3L [24, 51] served as a comparator measure.

Overall, our findings further support that the EORTC QLU-C10D has good criterion and construct validity, confirming previous findings [64,65,66,67,68]. To this end, hypothetically corresponding and diverging domains were defined a priori, and correlations of the utilities and the domain scores between the two measures were estimated. Correlations of the utilities were above 0.7 for all but one utility decrement set. The threshold of 0.7 for the correlation coefficient is suggested to establish criterion validity [35]. For the hypothetically corresponding domains, higher correlations were observed than for the diverging domains. The highest correlation was found for the domain pair Pain–Pain/Discomfort (r = 0.679). All other hypothetically corresponding domains had a correlation higher than 0.5, except for the Social Functioning–Usual Activities domain pair (max. r = 0.376 across all value sets) and for Emotional Functioning–Anxiety/Depression as well as Pain–Pain/Discomfort when using the Dutch utility decrements. None of the hypothetically diverging domain pairs exceeded a correlation coefficient of 0.391.

When scrutinising the Bland-Altman plots, i.e. the agreement between the measures, the systematically lower utility scores (except for the Netherlands) of the EORTC QLU-C10D are noticeable. This is in line with findings reported for utilities of patients with neuroendocrine tumours [69], for patients undergoing esophagectomy [64], and for the QALYs of patients undergoing laryngectomy [70]. In our study, the mean difference between QLU-C10D utilities and EQ-5D-3L utilities ranged from 0.033 for the Netherlands to −0.082 for Canada. As a crude measure to interpret the difference, the minimal clinically important difference (MCID) of the EQ-5D-3L in cancer patients served as a reference. The MCID lies between 0.07 and 0.08 for lung cancer patients using the United Kingdom value sets [71]. Thus, the mean difference of utility scores exceeded the MCID for only one country (namely Canada). This indicates that only for the Canadian tariff is there a mean measurement difference that goes beyond a threshold that is considered “minimally important”. In addition to the mean score difference, the difference between the utilities varied across the measurement continuum of the two PBMs. At the low and mid range of the scales, the difference is larger (both positive and negative differences), whereas these differences appear to become smaller at the upper end of the scale (for patients in good health states). This suggests that a proportional bias may be present when comparing EORTC QLU-C10D and EQ-5D-3L scores, as found for patients with gastric cancer [67]. Lastly, the minimal floor and ceiling effects (the highest ceiling effect was 4.67% using the Dutch utility decrements) contribute to the claim that the EORTC QLU-C10D has good measurement properties, also in the upper and lower continuum of the questionnaire.
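The comparison of the mean differences with the MCID amounts to a simple check of absolute values, sketched below in Python for the two extremes reported above. The function name is hypothetical, and testing against both ends of the 0.07 to 0.08 range is a choice made for this illustration.

```python
def exceeds_mcid(mean_difference, mcid):
    """True if the absolute mean difference is larger than the MCID."""
    return abs(mean_difference) > mcid


# Mean differences (QLU-C10D minus EQ-5D-3L) reported for the two extremes.
mean_differences = {"Netherlands": 0.033, "Canada": -0.082}
MCID_RANGE = (0.07, 0.08)  # reported MCID range for lung cancer patients

for country, diff in mean_differences.items():
    print(country, [exceeds_mcid(diff, mcid) for mcid in MCID_RANGE])
```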

A few findings are worth mentioning further. Firstly, while the convergent validity of hypothetically corresponding domain pairs was not as strong as hypothesised, correlations were mostly higher for convergent domains compared to divergent domains. Secondly, several convergent correlations were smaller than expected. Specifically, we were surprised that the correlation between the EQ-5D Mobility domain and QLU-C10D Physical Functioning domain was moderate rather than high, given that the latter is essentially about mobility (ability to take long and short walks). Perhaps this is because the QLU-C10D taps the higher end of this construct (highest level is “no trouble taking a long walk” versus the EQ-5D’s “no problems in walking about”), while the EQ-5D taps the lower end of this construct (lowest level is “I am confined to bed” versus the QLU-C10D’s “quite a bit or very much trouble taking a short walk outside the house”). Also, the correlations between the Social Functioning and Usual Activities domains across all countries were weak rather than moderate, as we had hypothesised. This might partly be explained by the varying concepts assessed by these domains; where the EORTC QLU-C10D inquires about “interference with social or family life”, the EQ-5D-3L refers to the “performance of usual activities”, which are distinct concepts in the international classification of functioning [72, 73]. Furthermore, data were collected in a hospital-based setting, which might influence the patients' understanding of “normal” (social) activities, especially considering response shift [74]. Additionally, a weak correlation between the Emotional Functioning and Anxiety/Depression domains was observed when using the Italian and Dutch utility weights. Partly, this might again be explained by the differing concepts investigated, where the EORTC QLU-C10D inquires about depression alone, while the EQ-5D-3L assesses anxiety as well as depression. In addition, a strong influence of the country-specific scoring algorithm must be surmised, as the weak correlation is only present for the Italian and Dutch country tariffs. Similar findings were reported for the Italian tariffs in patients with myelodysplastic syndrome [65] and for Dutch cancer patients [75].

It is worth noting that the correlation coefficients and the agreement between the measures vary somewhat when using different country-specific utility decrements. For example, only for the Dutch tariffs does the EORTC QLU-C10D produce higher utility values compared to the EQ-5D-3L, as previously found in the Dutch cancer population. Considering the consistent methodology and the concurrent process when developing the various EORTC QLU-C10D country-specific value sets (valuation studies) [42,43,44,45,46,47,48,49,50], it seems unlikely that the factors “methodology” and “timelines” explain the variability of utility decrements. Consequently, cultural preferences for certain health states or semantic differences in the translated versions of the questionnaires are possible influencing factors in the evaluation of the preference-based health states [76, 77]. Therefore, the observed differences across the various value sets are likely to be genuine.

Taking this into consideration, a further point of discussion arises when performing CUA. The current state of the art is to apply the value sets from the decision maker's country to the obtained data, even when the data are collected in multicentre trials or entirely in another country [78]. Still, our discussion highlights that value sets and data from one and the same country might be the best match when taking cultural and linguistic aspects into consideration. Thus, we would argue that proper justification has to be provided when applying weights from one country to data obtained from another country.

The main limitation of our study is its retrospective nature, i.e. it uses data from studies that were not designed to assess the validity of the QLU-C10D. Although we drew on robust data from international trials, the LUX-Lung studies mainly included patients with advanced (Union internationale contre le cancer (UICC) stage III and IV) lung cancer. Thus, our findings relate to the validity of the EORTC QLU-C10D for this patient population and may not generalise to other cancer types and early-stage lung cancer. Also, the validation relied on the comparison to the EQ-5D-3L only, as alternative generic PBMs, such as the EQ-5D-5L [79], the SF-6D [21], or the Health Utility Index [80], were not available. The 5-level version of the EQ-5D is known to have improved psychometric properties compared to the 3-level version (such as lowered ceiling effects) in general populations [81] and in cancer patients [82] and is now among the most widely used PBMs worldwide [83]. Thus, having the 5-level version of the instrument as a comparator is highly recommended in future studies [68]. This will shed light on what drives differences between the instruments: the measurement range only, or also domain content.

We acknowledge that this paper addresses a narrow range of validity types but note that these are three core measurement properties of health state measures [35]. We maintained this focus in order to also investigate the impact of applying different country tariffs to this set of validity types. Assessing clinical validity, i.e. sensitivity to clinical differences between groups, and predictive validity (responsiveness) for health state changes was beyond the scope of this paper. The clinical validity, such as sensitivity to known groups and predictive validity for change, of the QLU-C10D has been demonstrated previously in a number of contexts and patient populations [64,65,66,67,68], but not yet in lung cancer patients. In a subsequent paper, we will utilise the LUX-Lung data to assess these important aspects of validity in lung cancer. Furthermore, algorithms to map EORTC QLU-C10D scores to the EuroQol measurement system would be useful and should be developed in future studies.

5 Conclusion

The EORTC QLQ-C30, which the EORTC QLU-C10D relies on, was originally developed and validated in a lung cancer population [40]. This fact underlines the importance of the analysis in the lung cancer population, in which the hypothesised good criterion and construct validity of the EORTC QLU-C10D could be demonstrated. Even though the presented results provide evidence of good criterion and construct validity, further investigations must be undertaken to evaluate the clinical validity of the EORTC QLU-C10D in lung cancer and other cancer patient populations. Furthermore, the development of cross-walks or score mappings between the EORTC QLU-C10D and other PBMs [68] is necessary to further aid the comparability of the scores in health economic evaluations.