Scoping review of measures of treatment burden in patients with multimorbidity: advancements and current gaps

Objectives: To identify, assess, and summarize the measures to assess burden of treatment in patients with multimorbidity (BoT-MMs) and their measurement properties. Study Design and Setting: MEDLINE via PubMed was searched from inception until May 2021. Independent reviewers extracted data from studies in which BoT-MMs were developed, validated, or reported as used, including an assessment of their measurement properties (e.g., validity and reliability) using the COnsensus-based Standards for the selection of health Measurement INstruments. Results: Eight BoT-MMs were identified across 72 studies. Most studies were performed in English (68%), in high-income countries (90%), without noting urban-rural settings (90%). No BoT-MMs had both sufficient content validity and internal consistency; some measurement properties were either insufficient or uncertain (e


Introduction
The prevalence of multimorbidity, defined as the occurrence of two or more chronic conditions, is rising throughout the globe, including in low- and middle-income countries (LMICs) [1,2]. Patients with multimorbidity face multidimensional and often disruptive commitments to enact control over their multiple conditions. Each task demands additional time, resources, and effort, thus raising the burden attributed to their treatment [3]. A high burden of treatment (BoT) can lead to lower medication adherence and poorer health outcomes, in addition to feelings of isolation, loss of independence, stigma, and adverse physical effects [4].
BoT is the workload experienced by patients due to the complexity of managing their medical care and its impacts on their functioning and well-being [5]. Its unique applicability relies on its potential to act as a patient-reported outcome of multimorbidity [6] and as a quality indicator of healthcare delivery that considers the patient's perspective, values, needs, and preferences [7]. This provides the basis for discriminating between low- and high-burdened patients [8], the latter being those who will experience the negative outcomes of being overwhelmed, and for measuring the impact of interventions on patients with a high burden.
Multiple measures, in the form of standardized questionnaires, were developed to assess BoT in patients with multimorbidity (BoT-MM) [9-11]; however, the evidence needed to inform the implementation of available BoT-MMs has been insufficiently examined. This evidence includes the characteristics of existing BoT-MMs and their measurement properties, which express the ability of BoT-MMs to truly measure treatment burden (i.e., validity), to produce consistent results (i.e., reliability), and to capture changes over time (i.e., responsiveness). It also includes reviewing the context in which these BoT-MMs were previously developed, which is useful when planning studies in different contexts, for example, when a BoT-MM validated in a well-educated population from a high-income country is intended to be used in a low-income country setting with different education and access to healthcare services. Other reviews have focused on qualitative data [12], on multimorbidity quality of life [13], or on broad treatment burden, but not on patients with multimorbidity [14].
Thus, this review aimed to identify and assess available BoT-MMs, considering their measurement strengths and misuses. Our study has three objectives: 1) to describe the characteristics of the studies and populations in which available BoT-MMs were a) developed, reviewed, or adapted (e.g., in validation studies) or b) reported as used (i.e., in applicative studies); 2) to describe characteristics and measurement properties of available BoT-MMs, assessing the evidence from validation studies using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) [15,16]; and 3) to describe BoT estimates and other measurement products of BoT-MMs from applicative studies.

Methods
A scoping review was conducted under the methodological framework proposed by Arksey and O'Malley [17] and the Preferred Reporting Items for Systematic reviews and Meta-Analyses-Scoping review (ScR) guidelines (see checklist in Table A.1) [18]. The study protocol was published elsewhere [19].

Search strategy
First review: MEDLINE via PubMed was searched from inception until May 2021 (see search formula in Table A.2). No language, year, or publication type restrictions were considered.
Post hoc review: PubMed was again searched to retrieve studies that cited any validation study included during the previous review, as we hypothesized that studies using BoT-MMs would cite validation studies. In addition, hand-searching involved screening the citations of included studies.

Study selection
We included studies in which BoT-MMs were a) developed, validated, or adapted (validation studies) or b) reported as used (applicative studies). We defined multimorbidity as having ≥2 chronic conditions. In addition, we included instruments that were developed in patients with ≥1 chronic condition but were not disease-specific and thus could be applicable to individuals with multimorbidity. Studies with disease-specific questionnaires or nonstandard measures of treatment burden, protocols, systematic reviews, and case reports were excluded.
Records from literature searches were uploaded to Rayyan's online software (https://rayyan.ai/). Two reviewers (D.M-Q. and S.P-L.) independently performed the study selection and data extraction; discrepancies were resolved by consensus between reviewers. First, titles and abstracts were screened; then, articles were full-text reviewed to determine their inclusion. Beforehand, three calibrations were conducted to refine the application of eligibility criteria.

Data extraction
A data charting form in Microsoft Excel was designed by two reviewers (D.M-Q. and S.P-L.) and refined by two professionals with a background in validation studies. Data extracted included the following: (i) Characteristics of studies: general characteristics (title, first author, year of publication), study aim (validation vs. applicative-only studies), language, country (grouped by world regions and by income), place of residence (urban, rural), setting type (community, primary, secondary, and tertiary care, or mixed if more than one setting type), data on assistance during BoT-MM administration (self-reported, assisted, or both), and mode of administration (in person, remote [by post, telephone, and online], or both). (ii) BoT-MM's characteristics: language availability (original and translations), domains, constructs measured (treatment burden, a domain of it, or another construct), number of items, scores by item, range of scores (minimum-maximum), standards for interpreting raw scores (e.g., test norms from population-based reference scores), recall time, and reporting of ceiling and floor effects. Measurement properties included content validity, structural validity, internal consistency, reliability, construct validity, known-group validity, and responsiveness, according to COSMIN standards [16]. Data were collected for the first version and subsequent versions of BoT-MMs. (iii) BoT estimates and study samples: numerical (total and by subscales) or categorical (%) scores, and study sample characteristics including sample size, age, female sex (%), target conditions, education level, and ethnicity.

Assessment of measurement properties and COSMIN standards
Evidence from validation studies was reviewed independently by two reviewers (D.M-Q. and C.A.A-R.) using the COSMIN methodology [15,16], in which measurement properties and their quality of evidence, that is, confidence that methods of each study ensure the accuracy of measurement properties reported, were summarized and rated according to standard criteria (Tables A.3 and A.4). A reflective measurement model, as defined by COSMIN, was assumed for all instruments. Briefly, the methodological quality of single studies was evaluated using the COSMIN Risk of Bias checklist; then, measurement properties per BoT-MM in each single study were collected (single result) and rated (single rating) as sufficient (+), insufficient (−), inconsistent (±), or indeterminate (?) according to COSMIN standards (Tables A.5 and A.6). Measurement properties per BoT-MM were qualitatively summarized (overall result) and rated (overall rating), and their quality of evidence was graded as high, moderate, low, or very low using a modified Grading of Recommendations, Assessment, Development, and Evaluations approach (Table A.7).

Data synthesis and analysis
First, the study selection was summarized with a Preferred Reporting Items for Systematic reviews and Meta-Analyses flowchart. The interrater reliability between reviewers was assessed with Cohen's kappa. The characteristics of included studies were described by study aim (validation vs. applicative). A world map depicting the number of studies by country and a line chart depicting the number of studies per year were generated in R (ggplot2 library) to describe worldwide research and publication trends over time.
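As an illustration of the interrater statistic mentioned above, the following is a minimal sketch of Cohen's kappa for two reviewers making include/exclude decisions. The function name and the example decisions are hypothetical and are not data from this review; the review itself does not state which software was used for this calculation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal counts, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical full-text decisions for six records (illustrative only).
reviewer_1 = ["include", "include", "exclude", "exclude", "include", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "include", "exclude"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which is why it is usually reported alongside the raw percentage.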
Then, characteristics and measurement properties per BoT-MMs were summarized. Each measurement property was summarized with overall results, overall ratings (+, −, ±, or ?), and their quality of evidence (high, moderate, low, very low).
Lastly, summary scores obtained by single applicative studies were organized per BoT-MM. Numerical scores were summarized with mean (standard deviation [SD]) or median (interquartile range or range), as available, and categorical scores were reported according to cutoffs from Table A.8. If only subgroup scores were available, the aggregated mean ± SD (or median) was estimated.
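The text does not specify how subgroup scores were aggregated. One standard option, shown here only as an assumed sketch, combines subgroup sample sizes, means, and SDs into an overall mean and SD by adding within-group and between-group sums of squares. The function name and the example values are hypothetical.

```python
import math


def combine_subgroups(groups):
    """Combine (n, mean, sd) tuples into an overall mean and sample SD."""
    total_n = sum(n for n, _, _ in groups)
    if total_n < 2:
        raise ValueError("at least two observations are needed")
    # Overall mean is the sample-size-weighted mean of subgroup means.
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Within-group sum of squares, recovered from each subgroup's sample SD.
    within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    # Between-group sum of squares around the overall mean.
    between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    return grand_mean, math.sqrt((within + between) / (total_n - 1))


# Hypothetical subgroup summaries (n, mean, SD); not values from the review.
overall_mean, overall_sd = combine_subgroups([(3, 2.0, 1.0), (3, 5.0, 1.0)])
print(overall_mean, round(overall_sd, 3))
```

This reproduces exactly what would be obtained from the pooled individual data when subgroup SDs are sample SDs; it does not apply to medians, which cannot be combined this way.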

Results
A total of 72 studies linked to 8 BoT-MMs were included (Fig. 1). Overall, reviewers' agreement during full-text review was 94.9% (Cohen's kappa 0.86; Table A.9). Full data extraction is available in Appendix B.

Characteristics of included studies (n = 72)
Table 1 summarizes included studies; these were published from 2012 onwards, mostly (73%) between 2018 and 2021, depicting an increasing trend over time (Fig. 2). One-third (33%) were validation studies, and two-thirds (67%) were applicative studies. Most studies were conducted in Europe or North America (75%) using surveys in English (68%). The predominant location of these studies was high-income countries (90%), with few conducted in middle-income countries (10%) and none in low-income countries. Place of residence was often not reported (90%), while the most common setting was primary care (46%). BoT-MMs were administered in person (43%) or remotely (41%), usually self-administered (67%), but in a third of studies (30%), the administration was assisted. Highly educated samples of white ethnicity predominated.

Measurement properties of BoT-MMs (n = 7)
These properties are summarized in Table 3 (see the full item-by-item COSMIN assessment in Appendix C). One BoT-MM (NHATS) was not included in this assessment due to a lack of validation studies.
Four of nine measurement properties (content validity, internal consistency, structural validity, and construct validity) were investigated in all BoT-MMs (n = 7). Content validity was sufficient in four BoT-MMs (MTBQ, PETS, MRB-QoL, LMQ) and insufficient in three BoT-MMs because comprehensibility (TBQ) [9,20] or comprehensiveness (HCTD [34], MULTIPLES [35]) was insufficient or not reported (MRB-QoL) [36]. Comprehensibility refers to whether patients understand the instructions, items, and response options as worded; comprehensiveness refers to whether patients feel that the items cover all key aspects of BoT. Construct validity was sufficient in most BoT-MMs (except the PETS, which had sufficient ratings in only 4 of 12 subscales). It was assessed in cross-sectional studies and was supported either when BoT-MM scores correlated as expected with scores of reference measures (e.g., higher BoT-MM scores were associated with lower quality of life) [10,20,39] or when known subgroups had significantly different scores (e.g., patients with low literacy had higher BoT-MM scores compared to those with adequate literacy) [11].
The remaining five of nine measurement properties were investigated in five (reliability), three (responsiveness), or fewer or none (measurement error, criterion validity, and cross-cultural validity/measurement invariance) of seven BoT-MMs. Reliability was sufficient in three BoT-MMs (TBQ, MTBQ, and LMQ) and most (9 of 12) PETS subscales [32], and insufficient in the MULTIPLES [35] due to overall results below standard criteria. Responsiveness was sufficient in one BoT-MM (MTBQ) [10], unclear in most PETS subscales [30,31], and insufficient in one BoT-MM (HCTD) [34]. It was evaluated in longitudinal studies of 6 to 18 months of follow-up by examining whether changes in treatment burden scores over time (follow-up minus baseline) were in accordance with expected changes in measures of reference variables, for example, quality of life [10,30,31,34].

BoT estimates and study samples (n = 58)
Applicative studies showed that summary scores, shown in separate tables per BoT-MM (Table 4 and Tables A.10-A.15), are available mostly for adults with various conditions, including ≥1 or ≥2 chronic conditions or single specific conditions [21,26,28,33,48,51,53,55-63]. The referred tables also show the range of mean population scores per BoT-MM; for example, mean TBQ scores between studies ranged between 20.9 and 56.2 (scale range: 0-150). There was no equivalence of scores between BoT-MMs.

Discussion
This review summarizes the available evidence of eight BoT-MMs identified across 24 validation studies and 48 applicative-only studies. Our results showed that research was limited in low-resource settings (LMIC, urban-rural disparities), and extant BoT-MMs have common limitations such as several suboptimal or under-investigated measurement properties, insufficient development (absent recall time, presence of floor effects), and unclear rationale for categorizing and interpreting raw scores. We summarize this evidence and identify issues and gaps needing attention for using BoT-MMs in research and practice.
Our study revealed a growing interest in using BoT-MMs. However, most evidence arises from high-income countries, for example, the USA and UK, whereas LMIC settings are underrepresented, including Latin America, Africa, Western Europe, and parts of Asia. Notably, BoT measurement is strongly influenced by the surrounding health system and cultural appropriation of patients in each country, an important reflection given that most BoT-MMs were developed in highly educated populations with access to long-term healthcare, contrary to LMIC settings, which have unguaranteed continuity of care [6], major socioeconomic disparities, limited health literacy, and higher influence of traditional beliefs [64]. This gap also affects BoT-MMs' appropriateness for rural dwellers, who typically experience cultural marginalization and poor access to local health systems, among other challenges [65]. Consequently, existing and newer BoT-MMs will require further developments to serve these populations meaningfully.
Abbreviations: a, Arabic; e, English; c, Chinese; mc, Mandarin Chinese; n, Norwegian; s, Spanish; sl, Slovenian; TBQ, Treatment Burden Questionnaire; MTBQ, Multimorbidity Treatment Burden Questionnaire; PETS, Patient Experience with Treatment and Self-management; HCTD, Health Care Task Difficulty questionnaire; MULTIPLES, Multimorbidity Illness Perceptions Scale; MRB-QoL, Medication-Related Burden Quality of Life; LMQ, Living with Medicines Questionnaire; BoT-MMs, measures to assess burden of treatment in patients with multimorbidity; CC, chronic condition(s); HF, heart failure; MM, multimorbidity; MS, multiple sclerosis; TB, treatment burden.
a Notation: first initial of language-measure-number of items (e.g., e-TBQ-13: TBQ validated in English, with 13 items).
Existing BoT-MMs had some limitations in their development. Recall time was included in only two BoT-MMs. In the absence of recall windows, patients may provide more information, but the likelihood of error is increased too [66]. Recall times (e.g., two-to-four weeks or longer lengths) could be incorporated, ensuring they are not adding noise to the measures. Floor effects were frequently reported among BoT-MMs [9-11,20,22,26-28,35,36,39], which means that substantial proportions of individuals obtain minimum scores; thus, the true extent of their BoT cannot be accurately determined [67]. BoT-MM developers can reduce the floor effect in future instruments by applying tools from item response theory [68]. BoT-MMs were often administered by a third person, but these responses should be self-reported [69]; assisted interviews might skew potential responses but reflect a need given the potential fragility, literacy level, or involvement of caregivers among some multimorbid patients [14]. Both self-reported and assisted approaches could be compared to ensure accurate findings [70].
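To make the notion of a floor effect concrete, the sketch below computes the proportion of respondents at the minimum and maximum possible scores and flags each against a threshold. The 15% default is a commonly cited convention in the measurement literature and is an assumption here, not a criterion stated in this review; the function name and example scores are hypothetical.

```python
def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Share of respondents at the scale extremes, flagged against a threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    # Floor: respondents with the lowest possible score on the instrument.
    floor = sum(score == min_score for score in scores) / n
    # Ceiling: respondents with the highest possible score on the instrument.
    ceiling = sum(score == max_score for score in scores) / n
    return {
        "floor": floor,
        "ceiling": ceiling,
        "floor_effect": floor > threshold,      # assumed 15% convention
        "ceiling_effect": ceiling > threshold,  # assumed 15% convention
    }


# Hypothetical total scores on a 0-150 scale (illustrative only).
example_scores = [0, 0, 0, 10, 20, 150, 30, 40, 50, 60]
print(floor_ceiling_effects(example_scores, min_score=0, max_score=150))
```

When a large share of respondents sits at the minimum, the instrument cannot distinguish among them, which is the interpretability problem described above.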
Measurement properties of extant BoT-MMs were insufficiently investigated, as assessed by COSMIN standards. This matters because information collected using tools without optimal measurement properties can be deemed of uncertain value. Unclear measurement properties included those infrequently assessed (e.g., responsiveness) and those assessed but either not meeting optimal standards (e.g., content validity), being inconsistent between studies (e.g., different factor structure or internal consistency), or having incomplete data to be rated (e.g., incomplete reporting on factor analysis). This has practical implications for users, for example, that most BoT-MMs should not be used for evaluating interventions (e.g., pre-post designs), that two BoT-MMs applied to the same population could reach different conclusions, or that the same BoT-MM applied to different populations might not measure the same outcomes (or might measure them, but not equally well). All these issues could be addressed in future studies that follow the methodology and reporting recommended by COSMIN standards [15,16].
Among measurement properties, both content validity and internal consistency are essential and support provisional recommendation of an instrument according to COSMIN [15,16], but no BoT-MM was optimal in both. For existing BoT-MMs, additional content validity studies could be conducted to improve this property, which involves assessing the comprehensiveness and comprehensibility of items by patients with multimorbidity using cognitive interviews or equivalent qualitative testing [15].
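Internal consistency is most often summarized with Cronbach's alpha. The following is a minimal sketch of that statistic for a respondents-by-items score matrix; the review does not report how individual studies computed it, and the function name and example responses are hypothetical. A value of about 0.70 or higher per unidimensional scale is a commonly cited benchmark, which is stated here as an assumption rather than taken from this text.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` holds one list of item scores per respondent."""
    if len(responses) < 2:
        raise ValueError("at least two respondents are needed")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("every respondent needs the same number (>= 2) of items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of three respondents to a two-item scale (illustrative only).
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 3]]), 3))
```

Alpha is only meaningful for a unidimensional scale, which is why structural validity is usually examined alongside it.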
Responsiveness is a poorly investigated measurement property that informs whether changes seen in within-individual BoT-MM scores after an intervention correspond to true changes in the BoT of patients (smallest detectable changes), which is linked to investigating when these changes are clinically relevant (minimal important changes) [71]. Assessment of responsiveness requires longitudinal data since cross-sectional data cannot predict it [72]. Thus, optimal BoT-MM responsiveness, found in only one BoT-MM (MTBQ), ensures accurate evaluations of an intervention's effectiveness [73], making the monitoring of BoT change before and after interventions at the population level feasible while tracking patient improvement or deterioration during care.
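For readers unfamiliar with the smallest detectable change (SDC), a widely used formulation derives it from the standard error of measurement (SEM): SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM. The sketch below applies these formulas; they are standard in the measurement literature but are not given in this review, and the function names and input values are hypothetical.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM from the score SD and a reliability coefficient (e.g., an ICC)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def smallest_detectable_change(sd, icc, z=1.96):
    """Individual-level SDC: change exceeding measurement error at ~95% confidence."""
    # sqrt(2) accounts for measurement error in both the baseline and follow-up scores.
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, icc)


# Hypothetical inputs: SD of 10 points and test-retest ICC of 0.84 (illustrative only).
print(round(smallest_detectable_change(sd=10.0, icc=0.84), 2))
```

A change smaller than the SDC cannot be distinguished from measurement error, whereas the minimal important change addresses whether a detectable change matters to patients; the two are estimated separately.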
Data from several applicative studies reported population-based summary scores, which could be used as reference norms [71]. However, the interpretability of BoT-MM scores remains challenging since few studies assigned clinical value to patients' scores. For example, one study determined that TBQ scores ≥59 points could be used to detect patients at high BoT [25], but more of these studies are needed to enhance the interpretability of BoT-MMs. Data also showed that BoT-MMs were frequently applied in populations with single conditions rather than multimorbidity, and, in such cases, the validity of findings is unclear where local BoT-MM adaptations were not previously conducted.
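As a small illustration of how such a cutoff turns raw scores into categories, the sketch below applies the TBQ threshold of 59 points mentioned above on the 0-150 scale reported in the Results. The function names, the category labels, and the example scores are hypothetical; other instruments would need their own validated cutoffs.

```python
def classify_tbq_score(score, cutoff=59, scale_min=0, scale_max=150):
    """Label a TBQ total score using the >=59 cutoff cited in the text."""
    if not scale_min <= score <= scale_max:
        raise ValueError("score lies outside the scale range")
    return "high burden" if score >= cutoff else "not high burden"


def proportion_high_burden(scores, cutoff=59):
    """Share of a sample classified as high burden."""
    if not scores:
        raise ValueError("scores must not be empty")
    labels = [classify_tbq_score(score, cutoff) for score in scores]
    return labels.count("high burden") / len(labels)


# Hypothetical TBQ totals for four patients (illustrative only).
print(proportion_high_burden([20, 59, 100, 10]))
```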

Recommendations for future use of BoT-MMs
This review will aid diverse stakeholders, researchers, and clinicians interested in using BoT-MMs. Accordingly, we provide two evidence-based recommendations.
First, when deciding which BoT-MM to use, consider comprehensively evaluating their characteristics (Table 2), their measurement properties (Table 3), and the interpretability of their scores (Table A.8). Also, BoT estimates (Table 4 and Tables A.10-A.15) may help in designing future studies and meta-analyses or in conducting sample size estimations.
Second, consider following standards, for example, the COSMIN guidelines [16], when preparing future validation studies of BoT-MMs. Prioritize generating evidence on characteristics of BoT-MMs that are insufficiently developed, such as handling of floor effects and incorporation of recall times, and on measurement properties identified as either insufficient or under-investigated (Table 3).

Strengths and limitations
This review summarizes the evidence of BoT-MMs derived from the critical review of over 70 primary studies. Nonetheless, some limitations are discussed. First, the COSMIN evaluation of content validity is rated and graded based on information available from development and content validity studies, and it also requires subjective judgment from reviewers. Underreporting of methods or findings among those studies may decrease their ratings, although having conducted validity procedures but not reporting them is unexpected [16]. Second, although the COSMIN methodology has extensive criteria for conducting a comprehensive assessment of instruments, it lacks specific standards for rating the interpretability of scores; nonetheless, appropriate interpretability aspects are described as suggested by COSMIN [16], for example, describing BoT estimates and available data for interpretation of numerical and categorical scores. Third, the search was limited to one database (MEDLINE). While other databases can be searched to supplement our findings, MEDLINE was deemed to provide comprehensive coverage of biomedical repositories of studies related to treatment burden and BoT-MMs.

Conclusion
This review summarizes the evidence and gaps in the development, validation, and application of eight BoT-MMs, showing that the evidence needed to inform the application of these instruments in patients with multimorbidity remains insufficiently developed, especially in terms of the adequacy of their development, their measurement properties, the interpretability of their scores, and their appropriateness for use in low-resource settings.

Fig. 1. Study flowchart. *Excluded because the patient-reported outcome measure of treatment burden was i) disease-specific (n = 5), ii) nonstandardized (n = 2), or iii) not applicable (n = 4). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 2. Mapping of research on measures of treatment burden in patients with multimorbidity. The number of studies is shown (A) by country and (B) over time. TBQ, Treatment Burden Questionnaire; MTBQ, Multimorbidity Treatment Burden Questionnaire; PETS, Patient Experience with Treatment and Self-management; HCTD, Health Care Task Difficulty Questionnaire; MULTIPLES, Multimorbidity Illness Perceptions Scale; NHATS, National Health and Aging Trends Study; MRB-QoL, Medication-Related Burden Quality of Life; LMQ, Living with Medicines Questionnaire.
-Quispe et al. / Journal of Clinical Epidemiology 159 (2023) 92-105

Table 1. Summary of included studies (n = 72)
a Totals may not add up to 72 due to missing data.

Table 2. Overview of measures of treatment burden in patients with multimorbidity (BoT-MMs) (n = 8)

Table 4. Summary scores in studies using the Treatment Burden Questionnaire (TBQ)
a Median (quartile 1 - quartile 3).
b Of the instrument used.