Standard error of measurement and smallest detectable change of the Sarcopenia Quality of Life (SarQoL) questionnaire: An analysis of subjects from 9 validation studies

Objectives: The Sarcopenia Quality of Life (SarQoL) questionnaire, a sarcopenia-specific patient-reported outcome measure, evaluates quality of life with 55 items. It produces 7 domain scores and 1 overall quality of life score, all between 0 and 100 points. This study aims to contribute to the interpretation of the SarQoL scores by calculating the standard error of measurement (SEM) and smallest detectable change (SDC) in a sample of subjects from 9 studies.
Methods: Subjects from 9 studies (conducted in Belgium, Brazil, Czech Republic, England, Greece, Lithuania, Poland and Spain) were included. The SEM, a measure of the error in the scores that is not due to true changes, was calculated by dividing the standard deviation of the difference between test and retest scores (SDdiff) by √2. The SDC, defined as change beyond measurement error, was calculated by multiplying SDdiff by 1.96. Bland-Altman plots were assessed for the presence of systematic errors.
Results: A total of 278 sarcopenic subjects, aged 77.67 ± 7.64 years and 61.5% women, were included. The SEM for the overall SarQoL score ranged from 0.18 to 4.20 points for the individual studies, and was 2.65 points when all subjects were analyzed together. The SDC for the overall score ranged from 0.49 to 11.65 points for the individual studies, and was 7.35 points for all subjects. The Bland-Altman plots revealed no systematic errors in the questionnaire.
Conclusion: This study shows that, for individual subjects, a change in overall quality of life of at least 7.35 points (on a scale from 0 to 100) would have to be observed to confirm that a true change, beyond measurement error, has occurred. It also demonstrated that the SarQoL questionnaire is a precise instrument, with the observed scores within less than 3 points of the theoretical “true score”.


Introduction
Sarcopenia, often described as the age-related loss of muscle mass and strength, and defined by the European Working Group on Sarcopenia in Older People (EWGSOP2) as "a progressive and generalised skeletal muscle disorder that is associated with increased likelihood of adverse outcomes including falls, fractures, physical disability and mortality", has been the subject of increased scientific attention as its prevalence and consequences have become better known [1]. Sarcopenia is confirmed to be present when a patient is diagnosed with low muscle strength and low muscle mass. When low physical performance is also established, that person is diagnosed with severe sarcopenia [1]. A systematic review conducted in 2014, which estimated the prevalence of sarcopenia diagnosed with the EWGSOP algorithm in older community-dwelling adults, found a range of 1 to 29% (up to 30% in women), while a recent meta-analysis, which included 35 articles and a total of 58404 healthy subjects aged 60 years and older, found an overall prevalence of sarcopenia of 10% (95% CI: 8-12%) in men and 10% (95% CI: 8-13%) in women diagnosed with the EWGSOP, the International Working Group on Sarcopenia (IWGS) or the Asian Working Group for Sarcopenia (AWGS) definitions [2,3]. It should be mentioned that the prevalence of sarcopenia varies greatly depending on the definition used, as demonstrated by Beaudart et al., who applied 6 different diagnostic criteria for sarcopenia to a single cohort of subjects and found a prevalence ranging from 4.39% to 32.8% [4].
Projections about the future prevalence of sarcopenia (as diagnosed by the EWGSOP-criteria) in the European Union (EU28) predict a rise from 10.9 million people in 2016 to 18.7 million in 2045 on the low end and from 19.7 million to 32.3 million people on the high end [5]. Sarcopenia is a major public health problem and its impact will continue to grow, which should incite policy makers to act.
The available evidence concerning the impact and association of sarcopenia with several health outcomes has been steadily growing during the last decade. A systematic review and meta-analysis published in 2017 provided a comprehensive summary of what is currently known on the subject. This review included 17 prospective studies in which sarcopenia was diagnosed according to the EWGSOP guidelines. The authors found a higher risk for mortality (OR = 3.596; 95% CI = 2.96-4.37) and functional decline (OR = 3.03; 95% CI = 1.80-5.12) as well as a higher rate of falls and a higher incidence of hospitalization. The evidence on the incidence of fractures and the length of hospital stay was inconclusive [6].
The subject of quality of life in sarcopenia has mostly been examined using generic questionnaires such as the Short-Form 36-Item (SF-36) and the EuroQoL 5-Dimension (EQ-5D) [7]. Recently, a new instrument, the Sarcopenia Quality of Life (SarQoL) questionnaire, has become available. It is specifically designed to measure quality of life in sarcopenic, community-dwelling individuals aged 65 years or older and was developed in 2013-2015 by Beaudart et al. [8]. It has, to date, been translated into more than 20 languages [8].
The psychometric properties of the SarQoL questionnaire have been evaluated and published for 6 language-versions: the original questionnaire in French, and the English, Dutch, Polish, Romanian and Greek translations [9][10][11][12][13][14]. These examined the discriminative power, internal consistency, construct validity, test-retest reliability and the presence of floor or ceiling effects. These 6 studies found that the questionnaire can discriminate between sarcopenic and non-sarcopenic participants, with the former having significantly lower scores for the 7 domains and the overall score, and that the questionnaire possesses good internal consistency (Cronbach's alpha of 0.87, 0.88, 0.95, 0.92, 0.88 and 0.96). These studies also confirmed the construct validity of the SarQoL questionnaire with the help of hypotheses on correlations between the questionnaire and the SF-36 and EQ-5D, and demonstrated that the SarQoL questionnaire has an excellent test-retest reliability (intraclass correlation coefficient/ICC = 0.91, 0.95, 0.99, 0.98 and 0.96) [9][10][11][12][13][14]. Lastly, floor and ceiling effects were absent from all 6 published validation studies [9][10][11][12][13][14]. These results provide convincing evidence for the validity and reliability of the SarQoL questionnaire for the evaluation of quality of life in sarcopenic, community-dwelling older people.
However, the standard error of measurement (SEM) and the smallest detectable change (SDC) of the SarQoL questionnaire have not yet been calculated. These parameters supply important information on the reliability of the instrument by indicating the range in which the theoretical "true" score lies. They also supply context when interpreting data from longitudinal measurements by indicating by how much the score needs to change before one can be reasonably certain that a true change has occurred. Clinicians and researchers could use the values for SEM and SDC as a yardstick in the interpretation of the SarQoL scores, whether obtained in clinical practice or as part of a research project. The results of this study should prove particularly valuable in the interpretation of data from interventional clinical trials, and will hopefully expedite the adoption of this patient-reported outcome measure (PROM) in clinical trials [15].
The primary objective of this study is to determine the SEM and SDC of the SarQoL questionnaire in a sample of subjects from 9 international validation studies. The secondary objectives are to examine the measurement error of the questionnaire with the help of a Bland-Altman analysis, and to update the results previously obtained for the test-retest reliability of the SarQoL questionnaire in the complete sample.

Material and methods
This study combined data from 9 cohorts in 8 different countries that were established to test the psychometric properties of the SarQoL questionnaire after translation into the local language. The team behind the SarQoL questionnaire have made a concerted effort to widen the reach of the questionnaire by having it translated into a multitude of languages. To accomplish this, they have partnered with researchers from a host of countries and language groups, who were able and willing to undertake a translation of the questionnaire. The local teams responsible for the translations were also encouraged to carry out a validation study of the translation they produced, if feasible. A considerable number of them undertook this effort, although not all validations have been published. The researchers from 9 validation studies that had the necessary data for the current analysis were contacted and agreed to share their data. All the included studies obtained approval from their local ethics committees, and written informed consent from their participants.

Population
Subjects were included in the 9 validation studies if they were 60 years of age or older and community-dwelling. For this analysis, we included all subjects who were diagnosed as being sarcopenic, who completed the SarQoL questionnaire twice and reported that their health had been stable in the interval between the two administrations.

The SarQoL questionnaire
The analyses in this article center around the test-retest data for the SarQoL questionnaire collected by the 9 included studies. The SarQoL questionnaire is a patient-reported outcome measure (PROM) designed specifically for use with sarcopenic, community-dwelling subjects 65 years of age or older. The questionnaire consists of 55 items distributed over 22 questions, with the items categorized into 7 domains of health-related quality of life (HRQoL). These domains are: "Physical and Mental Health" (D1), "Locomotion" (D2), "Body Composition" (D3), "Functionality" (D4), "Activities of Daily Living" (D5), "Leisure activities" (D6), and "Fears" (D7). Apart from the domain scores, an Overall score for quality of life is also calculated. All scores are situated on a scale from 0 to 100, with 0 being the worst possible quality of life, and 100 the best possible. The questionnaire is self-administered and takes about 10 minutes to complete [9]. More information on the SarQoL questionnaire and the different language-specific versions can be found on www.sarqol.org.

Test-retest reliability
The test-retest reliability of a questionnaire quantifies the extent to which a questionnaire produces the same scores during repeated measurements, provided that the participants' health remains stable. It is measured by the intraclass correlation coefficient (ICC) under a 2-way mixed model with absolute agreement specified, and its associated 95% confidence interval. A questionnaire is considered reliable if the obtained ICC values are greater than 0.70 [16].
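As an illustration of the coefficient described above, the following is a minimal sketch of an absolute-agreement ICC computed from a two-way ANOVA decomposition. It is not the SPSS routine used in this study. It assumes the single-measures form (often written ICC(A,1)), which the text does not specify, and the function name `icc_absolute_agreement` and the example scores are hypothetical.

```python
def icc_absolute_agreement(scores):
    """Single-measures, absolute-agreement ICC for an n x k table.

    `scores` is a list of rows, one row per subject, with one column per
    administration (k = 2 for a test-retest design).
    Assumption: single-measures form; the study does not state which
    form was reported.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, administrations, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement penalises systematic shifts between
    # administrations through the ms_cols term in the denominator.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: a constant 10-point shift between administrations.
shifted = [[50, 60], [60, 70], [70, 80]]
print(round(icc_absolute_agreement(shifted), 3))
```

Because absolute agreement is specified, a constant shift between the two administrations lowers the coefficient even when subjects keep exactly the same rank order, which is why this form suits test-retest data.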

Standard error of measurement
The standard error of measurement has been defined as "the determination of the amount of variation or spread in the measurement errors for a test" [17]. The SEM is considered to be a parameter for the amount of measurement error present in an instrument, and is consequently an indicator of the reliability of said instrument. Much like the interpretation of the standard deviation around the mean value, the SEM can be used to provide a range around the observed value within which the theoretical "true" value lies. The interval between plus and minus 1 SEM has a probability of 68% of containing the true value. For ± 2 SEM the probability becomes 95%, and for ± 3 SEM it becomes 99%.

Smallest detectable change
The smallest detectable change is defined as the change in the instrument's score beyond measurement error [18]. This means that the SDC provides a value for the minimum change that needs to be observed in order to be confident that the observed change is real and not, potentially, a product of measurement error in the instrument. The SDC can be calculated for individual subjects (SDCind) as well as for comparisons of mean scores between groups (SDCgroup) [18]. Both provide utility: the SDCind can be used in clinical practice or to label individual subjects in a study sample as either changed or unchanged. The SDCgroup provides an aid to the interpretation of mean scores of groups. This can lend greater credibility to the results of interventional trials that use the SarQoL questionnaire, and that want to know whether quality of life has changed in the intervention and control group as a whole.

Bland-Altman analysis
The Bland-Altman plot provides a visual representation of the presence of systematic errors in an instrument. The Bland-Altman plot is based around three variables: the mean systematic difference between test and retest scores ($\bar{d}$), and the upper and lower limit of agreement, which span 95% of observations, assuming that the values for the difference between test and retest scores are distributed normally [18,19]. These variables are integrated into a scatter plot where the difference between test and retest values is put on the Y-axis and the average of the test and retest values is put on the X-axis.

Statistical analysis
Data were analyzed using IBM SPSS Statistics, version 24.0.0.0 for Windows (Armonk, NY: IBM Corp). The distribution of the variables was determined by examining the histogram, the quantile-quantile plot, the Shapiro-Wilk test and the difference between mean and median. Variables that are normally distributed are reported as mean ± standard deviation and non-normal variables as median (25th percentile-75th percentile). Nominal variables are reported as absolute (n) and relative frequencies (%).
Differences between groups with regard to clinical characteristics were examined with one-way ANOVA for continuous variables and the chi-squared test for nominal variables.
The SEM was calculated by first creating a variable for the difference between the scores obtained during the first and the second administration (Difference = test score - retest score). Next, we calculated the standard deviation of Difference in our sample (SDdifference) and divided the obtained value by the square root of 2 ($\mathrm{SEM} = \mathrm{SD}_{\mathrm{difference}} / \sqrt{2}$) [18,20].
The SDCind was calculated with the formula $\mathrm{SDC}_{\mathrm{ind}} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$, and the SDCgroup was calculated by dividing the SDCind by the square root of the number of subjects in the sample ($\mathrm{SDC}_{\mathrm{group}} = \mathrm{SDC}_{\mathrm{ind}} / \sqrt{n}$) [18]. The ICC was calculated with a 2-way mixed model and absolute agreement specified. The mean difference score ($\bar{d}$) was computed as the mean of the differences between test and retest scores for all subjects [Mean(test score - retest score)]. The 95% limits of agreement were calculated with the formula $\bar{d} \pm (1.96 \times \mathrm{SD}_{\mathrm{difference}})$ [18,21]. Bland-Altman plots were created in SPSS following the instructions given in IBM tech-note n˚19420 [22]. Results were considered significant at p ≤ 0.05.
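The formulas in this section can be sketched in a few lines of Python. This is an illustrative re-implementation, not the SPSS procedure used in the study. The function name `measurement_error_summary` and the example scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import math
import statistics


def measurement_error_summary(test, retest):
    """SEM, SDC and Bland-Altman limits from paired test-retest scores."""
    diffs = [t - r for t, r in zip(test, retest)]  # test score - retest score
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Assumption: sample standard deviation (n - 1 denominator).
    sd_diff = statistics.stdev(diffs)

    sem = sd_diff / math.sqrt(2)
    sdc_ind = 1.96 * math.sqrt(2) * sem  # equivalent to 1.96 * sd_diff
    sdc_group = sdc_ind / math.sqrt(n)
    limits = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

    return {
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "sem": sem,
        "sdc_ind": sdc_ind,
        "sdc_group": sdc_group,
        "limits_of_agreement": limits,
    }


# Hypothetical scores for four subjects (not study data).
summary = measurement_error_summary([50, 60, 70, 80], [48, 61, 73, 78])
print(summary)
```

Since the SDCind is the SEM multiplied by 1.96 × √2, it equals 1.96 × SDdifference, which is also the half-width of the 95% limits of agreement around the mean difference.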

Characteristics of included studies
Information on the diagnosis of sarcopenia and the characteristics of the test-retest administration are given in Table 1.

Clinical characteristics
The 278 participants included in the analysis had a mean age of 77.67 ± 7.64 years, ranging from 60 to 98 years old. The majority of subjects were women, namely 171 participants or 61.5% of the complete sample. The participants had a mean body mass index of 25.57 ± 4.40 kg/m², spanning the whole gamut from underweight to morbidly obese with a minimum value of 17.42 kg/m² and a maximum value of 46.10 kg/m². In terms of prescription drug use, the subjects took on average 4.78 ± 2.71 drugs (range: 0-13), linked to the number of comorbidities which was 3.59 ± 2.01 (range: 0-11). Clinical characteristics are reported in Table 2.
As expected, one-way ANOVA and chi-squared tests revealed that the 9 studies differed significantly in terms of clinical characteristics. The results from these post-hoc analyses can be found in S1-S5 Tables.
The test-retest reliability of the SarQoL questionnaire in the complete sample resulted in an ICC of 0.969 (95% CI = 0.961-0.975) for the Overall score. Of the individual domains, 4 obtained an ICC higher than 0.9, namely domains 1, 2, 4 and 5, and all obtained ICCs higher than 0.7. The detailed results for the test-retest reliability can be found in Table 3.

Standard error of measurement
The SEM for the Overall score of the SarQoL questionnaire in the complete sample is 2.65 points. This means that one can be 68% confident (± 1 SEM) that the 'true' score of a subject can be found between -2.65 and +2.65 points from the observed score, and 95% confident (± 2 SEM) that the 'true' score is situated between -5.3 and +5.3 points of the observed score. The SEM for the different domains of the SarQoL questionnaire in the complete sample varied between 3.71 for domain 4 and 9.22 points for domain 6. The SEM-values for the complete sample can be found in Table 3, while the SEM-values for the individual included studies are available in Table 4.
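The interpretation above amounts to simple arithmetic around an observed score. The following sketch uses the reported SEM of 2.65 points; the function name `true_score_band` and the observed score of 60 points are hypothetical and chosen only for illustration.

```python
SEM_OVERALL = 2.65  # SEM of the Overall score in the complete sample


def true_score_band(observed, sem=SEM_OVERALL, k=1):
    """Return the (lower, upper) band of observed score +/- k SEM.

    k = 1 corresponds to roughly 68% confidence and k = 2 to roughly 95%,
    assuming normally distributed measurement error.
    """
    return (observed - k * sem, observed + k * sem)


# Hypothetical observed Overall score of 60 points.
low_68, high_68 = true_score_band(60.0, k=1)
low_95, high_95 = true_score_band(60.0, k=2)
print(f"68% band: {low_68:.2f} to {high_68:.2f}")
print(f"95% band: {low_95:.2f} to {high_95:.2f}")
```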

Smallest detectable change
The SDCind for the Overall score of the SarQoL questionnaire in the complete sample is 7.35 points. This means that the Overall quality of life score of an individual would have to change by at least 7.35 points (on a scale of 0 to 100) before the observed change can be considered to be a true change in the quality of life of a subject, and not potentially a result of measurement error. The SDCind for the 7 domains of the SarQoL questionnaire goes from a minimum value of 10.27 points for domain 4 to a maximum value of 25.56 points for domain 6. The SDCgroup for the Overall score in the complete sample is 0.44 points. The SDC-values for the complete sample can be found in Table 3. The SDC-values for the individual included studies are available in Table 4.
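One way to apply the individual-level threshold when labelling subjects as changed or unchanged is sketched below. The function name `classify_change`, the labels and the example scores are hypothetical. The direction labels rely only on the fact that higher SarQoL scores indicate better quality of life.

```python
SDC_IND_OVERALL = 7.35  # individual-level SDC of the Overall score


def classify_change(baseline, follow_up, sdc_ind=SDC_IND_OVERALL):
    """Label an individual's change in Overall score against the SDC.

    A change of at least `sdc_ind` points counts as beyond measurement
    error; smaller changes are labelled "unchanged".
    """
    delta = follow_up - baseline
    if abs(delta) < sdc_ind:
        return "unchanged"
    # Higher scores mean better quality of life.
    return "improved" if delta > 0 else "deteriorated"


# Hypothetical baseline and follow-up scores.
print(classify_change(55.0, 63.0))
print(classify_change(55.0, 60.0))
print(classify_change(70.0, 60.0))
```

The reported values are mutually consistent with the formulas given in the statistical analysis: 1.96 × √2 × 2.65 ≈ 7.35 points for the individual-level SDC, and 7.35 / √278 ≈ 0.44 points for the group-level SDC.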

Bland-Altman analysis
The mean difference score in the complete sample for the Overall score of the SarQoL questionnaire is 0.18 points (95% CI = -0.26; 0.63), which shows that there is no systematic bias between the two administrations of the questionnaire because the confidence interval contains zero. The mean difference scores in the complete sample for the 7 domains are not significant (95% CI contains zero) for domains 2, 3, 4, 5, 6 and 7, once again indicating the absence of systematic bias. One domain in the complete sample does have a small but significant mean difference score, namely domain 1 [0.86 points (0.04; 1.68)], indicating the presence of a very slight systematic error. The full results of the Bland-Altman analysis are detailed in Table 3. A Bland-Altman plot for the Overall score in the complete sample is provided as Fig 1.

Discussion
In this study, values were obtained for the standard error of measurement and the smallest detectable change of the SarQoL questionnaire in a sample of 278 sarcopenic subjects hailing from 8 different countries and 9 different language-groups. The measurement error inherent to the questionnaire was found to be 2.65 points, and the minimum change needed to be confident that a real change in overall quality of life has occurred for an individual patient was 7.35 points. Systematic bias was further investigated with the method of Bland & Altman, and showed that there is no systematic bias for almost all domains (with domain 1 as the exception) and the overall score of the SarQoL questionnaire.
The SEM for the Overall score of the SarQoL questionnaire of 2.65 points represents 2.65% of the possible range of the Overall score (0-100) and 3.81% of the observed range of the SarQoL scores in the complete sample (min = 24.74; max = 94.22; range = 69.48).
This value for the standard error of measurement compares favorably with SEMs for the SF-36, the most frequently used quality of life questionnaire in sarcopenic populations. Hart found a SEM of 4 points for the Physical Component Summary (PCS-range: 0-100 points) and the Mental Component Summary (MCS-range: 0-100 points) of the SF-36 in a population of 68 subjects with a variety of orthopedic impairments [25] and Palmer calculated a SEM of 3.09 points for the PCS and 5.57 points for the MCS in a population of 233 subjects with joint hypermobility [26]. Other studies looked at the SEM for the 8 domains of the SF-36 (all range between 0-100 points), and found SEMs between 8.82 and 34.52 points in 106 women undergoing surgery for breast cancer [27], between 13.2 and 44.7 points in 92 subjects with neck pain [28], between 6.82 and 11.22 points for 628 subjects undergoing foot or ankle surgery [29], and between 11 and 32 points for 515 subjects undergoing orthopedic surgery [30]. While these have been calculated in populations that differ from ours, they show a trend for higher standard errors of measurement compared to the SarQoL questionnaire.
The SDC of the Overall score (7.35 points) of the SarQoL questionnaire is similar to the SDC found for the PCS and MCS of the SF-36. Palmer obtained SDCs of 8.56 points for the PCS and 15.44 points for the MCS, while Hart found SDCs of 9 points both for the PCS and MCS [25,26].
The detailed breakdown of the SEM and SDC values obtained for the individual studies included in the analysis demonstrates that the SEM and SDC depend on the population in which they are calculated. There is considerable variability between the studies, but not within the studies (i.e. studies with lower or higher SEM and SDC values are so for all the domains and the Overall score, and do not report low values for one domain and high for another). On the lower end are the studies carried out in Lithuania, Poland and Spain, in the middle those carried out in Belgium (Dutch), Brazil and the Czech Republic, and on the higher end those carried out in Greece, England and Belgium (French). We were unable to formulate convincing hypotheses, based on the clinical or study characteristics, that could begin to explain why certain studies reported lower or higher values for SEM and SDC. It is likely that the observed variation is just the manifestation of the fact that the SEM and SDC are specific to the population in which they have been measured.
The Bland-Altman analysis, detailed in Table 3 and visually represented for the Overall score in Fig 1, shows that a very small systematic bias exists in only one domain. It is unlikely that this systematic bias is clinically relevant because of its small confidence interval and the fact that the lower end of the interval is extremely close to zero (95% CI = 0.04; 1.68). These results mean that clinicians and researchers can have confidence when administering the questionnaire that the results will not be distorted by systematic bias.
The analysis of the test-retest reliability in the complete sample confirmed the results from previous validation studies. The significantly larger sample in the combined analysis means that the confidence intervals found are much narrower than has been obtained previously. These results should inspire confidence that the SarQoL questionnaire is a reliable instrument.

Strengths and limitations
The main strength of this study is the fact that we were able to assemble a relatively large and heterogeneous sample (n = 278) of sarcopenic participants. This has the important advantage that the values calculated for the SEM and SDC are not dependent on a particular population, and could thus be more confidently used as a benchmark in future studies. The studies included in the analysis used different diagnostic criteria and instruments to establish sarcopenia. This is an advantage in this particular situation because the SEM and SDC values found in this study are not specific to a single definition of sarcopenia, but should be valid for different diagnostic criteria for sarcopenia, measured with different instruments. By combining multiple samples that differ with regards to clinical characteristics, we were able to find a middle ground and values for the SEM and SDC that are not highly specific to a single population. The sample size, which would be very difficult to gather in a single study, increased the accuracy of the standard deviation of the difference between test and retest score. Given that this parameter is key in the calculation of the SEM and SDC, the accuracy of these two parameters was enhanced by the large sample size. Because the SarQoL questionnaire has undergone validation in multiple languages, we were able to use test-retest data to calculate the SEM and the SDC, which is the preferred method because it takes into account biological variation, change of mood or concentration and other circumstances [18]. Since the data on which this study was based incorporates these elements and their subsequent influence on the SarQoL score, they have greater credibility than if other methods for calculating the SEM and SDC were to have been used.
There are, however, also limitations to this study. Although the researchers who carried out the individual translation and validation studies received the same guidance on the preferred design and conduct of these studies, local circumstances sometimes led them to deviate with regards to measurement of sarcopenia components (muscle mass, muscle strength and physical performance). Therefore, the methods for establishing the presence of sarcopenia are not standardized. This could, however, also be regarded as an opportunity in that we have a mix of subjects in the combined sample that represent a spectrum of methods and instruments. Secondly, because of the original purpose of the included studies, only the SarQoL questionnaire was administered twice, to calculate the test-retest reliability. It would have been preferable to compare the SEM and SDC of the SarQoL questionnaire to values for the SF-36 and the EQ-5D measured in the same populations. But, since this data does not exist, we feel that a comparison to data from the literature was the second-best option and does provide a valid frame of reference.

Conclusion
The current study, which analyzed a sample of 278 subjects from 9 validation studies, obtained a standard error of measurement of 2.65 points and a smallest detectable change of 7.35 points for the Overall score of the SarQoL questionnaire. These values can be applied in future longitudinal research to evaluate the veracity of measured changes.
Supporting information S1