Item response theory-based measure of global disability in multiple sclerosis derived from the Performance Scales and related items

Background The eight Performance Scales and three assimilated scales (PS) used in North American Research Committee on Multiple Sclerosis (NARCOMS) registry surveys cover a broad range of neurologic domains commonly affected by multiple sclerosis (mobility, hand function, vision, fatigue, cognition, bladder/bowel, sensory, spasticity, pain, depression, and tremor/coordination). Each scale consists of a single 6-to-7-point Likert item with response categories ranging from “normal” to “total disability”. Relatively little is known about the performance of the summary index of disability derived from these scales (the Performance Scales Sum or PSS). In this study, we demonstrate the value of a combination of classical and modern methods recently proposed by the Patient-Reported Outcome Measurement Information System (PROMIS) network to evaluate the psychometric properties of the PSS and derive an improved measure of global disability from the PS. Methods The study sample included 7,851 adults with MS who completed a NARCOMS intake questionnaire between 2003 and 2011. Factor analysis, bifactor modeling, and item response theory (IRT) analysis were used to evaluate the dimension(s) of disability underlying the PS; calibrate the 11 scales; and generate three alternative summary scores of global disability corresponding to different model assumptions and practical priorities. The construct validity of the three scores was compared by examining the magnitude of their associations with participants’ background characteristics, including unemployment. Results We derived structurally valid measures of global disability from the PS through the proposed methodology that were superior to the PSS. The measure most applicable to clinical practice gives similar weight to physical and mental disability. Overall reliability of the new measure is acceptable for individual comparisons (0.87).
Higher scores of global disability were significantly associated with older age at assessment, longer disease duration, male gender, Native-American ethnicity, not receiving disease modifying therapy, unemployment, and higher scores on the Patient Determined Disease Steps (PDDS). Conclusion Promising, interpretable and easily-obtainable IRT scores of global disability were generated from the PS by using a sequence of traditional and modern psychometric methods based on PROMIS recommendations. Our analyses shed new light on the construct of global disability in MS. Electronic supplementary material The online version of this article (doi:10.1186/s12883-014-0192-1) contains supplementary material, which is available to authorized users.


Background
There is an acute need for a reliable and valid quantitative outcome measure of "global disability" in multiple sclerosis (MS) from the patient's perspective. The North American Research Committee on Multiple Sclerosis (NARCOMS) registry, a volunteer registry that represents approximately 10% of the U.S. MS population, affords a unique opportunity to develop and validate such a measure. Since 1998, the NARCOMS registry has employed the Performance Scales (PS) to assess perceived disability in adults living with MS [1]. Single-item PS were originally developed for eight domains of function (mobility, hand function, vision, fatigue, cognition, bladder/bowel, sensory, and spasticity) [1]. To increase content validity [2,3], three more measures were added in 2001 to assess disability associated with pain [4], depression [5], and tremor/coordination [6]. Responses are recorded on a 6-point ordinal scale (0 normal, 1 minimal, 2 mild, 3 moderate, 4 severe, and 5 total disability) except for the mobility PS, which is scored from 0 to 6.
An important unresolved question about the PS is whether a single sum score, such as the PSS-8, adequately reflects underlying global disability or whether two or more summary scores are needed to validly capture information on the disability domains assessed by the PS. For instance, a recent factor analysis of 7 of the original 8 PS (vision scale excluded, PSS-7) suggested that a better representation of a patient's disability might be obtained with two separate scores: one combining the mobility, spasticity and bladder/bowel PS, and the other combining the hand function, fatigue, sensory, and cognition PS [11].
We and others have described how categorical factor analysis and bifactor analysis could help uncover the fine structure of disability in MS [13,14]. In this article, we apply these methods and related techniques put forward by the Patient-Reported Outcome Measurement Information System (PROMIS) network [15] to the evaluation of the measurement structure of the 11 PS in a large cross-sectional sample of NARCOMS registrants. We explain how results from these analyses informed the Item Response Theory (IRT) calibration of the PS on a single scale of self-assessed global disability. Finally, we compare the construct validity of three summary scores of global disability derived using assumptions and calculation methods of varying practicality and accuracy. To appeal to a wide readership, methodological details are presented in Additional file 2.

Study sample
Study data included NARCOMS recruitment surveys collected in 2003-2011. Analyses were restricted to participants who completed the pain, depression and tremor PS, which were not consistently included in each intake survey, and to patients who indicated whether or not they had a confirmed diagnosis of MS. Disability items consisted of the 11 PS and the Patient Determined Disease Steps (PDDS), a patient-assessed single-item measure of perceived disability that correlates as high as 0.78 with the Expanded Disability Status Scale (EDSS) [16]. Other variables available included calendar year of survey completion, gender, race/ethnicity, age at first symptoms, disease duration, employment status, whether the respondent was on disease modifying therapy (DMT) at enrollment, and year of MS diagnosis.
The total sample was randomly split into a development sample (exploratory analyses) and a validation sample (main analyses).
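As an illustration of the random split described above, the following minimal Python sketch partitions a list of respondent identifiers into a development set and a validation set. The 50/50 proportion, the seed, and the function name `split_sample` are assumptions made for the example; the text does not state the split ratio that was actually used.

```python
import random

def split_sample(ids, development_fraction=0.5, seed=2014):
    """Randomly partition respondent ids into development and validation sets.

    The fraction and seed are illustrative assumptions, not values from the study.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = int(len(shuffled) * development_fraction)
    return shuffled[:cut], shuffled[cut:]

# 7,851 is the sample size reported in the text; the ids are placeholders.
development, validation = split_sample(range(7851))
print(len(development), len(validation))
```

Fixing the seed makes the partition reproducible, so that exploratory and confirmatory analyses can be rerun on exactly the same subsets.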
The NARCOMS Registry is approved by the Institutional Review Board of the University of Alabama at Birmingham.

Preliminary analyses
After performing traditional descriptive statistics for the 11 PS, we conducted exploratory factor analysis (EFA) in the development sample to obtain initial information as to whether the PS should be aggregated into one or more than one disability measures (Additional file 2) [17,18]. Then, based on EFA results and the literature [19], we used confirmatory factor analysis (CFA) to test the fit of the most promising models to the data of the validation sample (Additional file 2).

Item calibration and measurement
Although IRT and CFA models belong to the same family of latent variable models, IRT models provide more detailed information about the functioning of each item. IRT methods also present several advantages for rigorous scale development and score interpretation [20].
We performed iterative IRT analysis to accomplish the following: examine whether respondents reliably distinguished between adjacent PS categories (Additional file 2); calibrate the PS according to the assumptions of two closely-related, and similarly plausible, CFA models; and generate corresponding IRT-based scores of disability (Additional file 2) [21]. To facilitate interpretation, all IRT scores were transformed to have mean 50 and SD 15 so that >99% of scores in the NARCOMS sample would range between 5 and 95.
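The rescaling described above is a linear transformation. The sketch below, in Python, standardizes a set of made-up latent scores against their own mean and SD and maps them to a mean of 50 and an SD of 15. It also checks that, if scores were normally distributed, the 5-to-95 range (three SDs either side of the mean) would contain more than 99% of them. The input values and the use of the population SD are assumptions for illustration only.

```python
import statistics

def to_scaled_scores(theta, mean=50.0, sd=15.0):
    """Linearly rescale IRT scores to the reporting metric (mean 50, SD 15)."""
    m = statistics.fmean(theta)
    s = statistics.pstdev(theta)   # population SD; an assumption of this sketch
    return [mean + sd * (t - m) / s for t in theta]

theta = [-1.2, -0.4, 0.0, 0.3, 1.3]   # made-up latent scores
scaled = to_scaled_scores(theta)

# Share of a normal(50, 15) distribution that falls between 5 and 95.
reference = statistics.NormalDist(mu=50, sigma=15)
coverage = reference.cdf(95) - reference.cdf(5)
print([round(s, 1) for s in scaled], round(coverage, 4))
```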

Construct validity
Construct validity was assessed using known-group comparisons, that is, analysis was performed to compare the means of IRT scores generated in the previous step across the categories of key patient characteristics including PDDS score, age, gender, race/ethnicity, disease duration, and year of assessment. We also assessed the associations of IRT score estimates with unemployment, controlling for other potential predictors of unemployment. We used Mplus 6.1 for general psychometric analyses, IRTPRO 2.1 for IRT calibration and IRT scale score estimation, and Stata 12.1 for the other analyses.

Results
Of the 12,563 persons who filled out a NARCOMS intake questionnaire between 2003 and 2011, 7,851 registrants with a self-reported diagnosis of MS completed all 11 PS. Nearly 80% of participants were women; 93% were white, and 53% completed their intake questionnaire in 2007 or later. Mean age was 46 years (SD, 11.1); mean age at diagnosis 39 years (SD, 10), and mean disease duration 15 years (SD, 11.3). Two-thirds of respondents were on disease-modifying therapy; 51% were unemployed.
Unless stated otherwise, all the results below were obtained from the validation sample after revision of PS response options as described in Additional file 2 (i.e., after a first round of analysis indicated that response options should be reduced from 7 to 6 for the mobility PS and from 6 to 5 for all the other PS except the fatigue PS). PSS-11 scores (i.e., traditional raw summed scores) ranged from 0 to 43 out of a revised maximum total of 46.
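The revised maximum of 46 follows directly from the revised numbers of response options. The short Python check below assumes that each scale is scored from 0 up to its number of options minus one; the scale labels are shorthand introduced for the example.

```python
# Number of response options per scale after the revision described in the text:
# 6 for mobility, 6 for fatigue (unchanged), and 5 for each of the other nine PS.
options = {"mobility": 6, "fatigue": 6}
other_scales = ["hand", "vision", "cognition", "bladder_bowel", "sensory",
                "spasticity", "pain", "depression", "tremor"]
options.update({name: 5 for name in other_scales})

# Assumption: categories are scored 0, 1, ..., (number of options - 1).
max_total = sum(n - 1 for n in options.values())
print(len(options), max_total)   # 11 scales, maximum raw PSS-11 score of 46
```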

Preliminary analyses
EFA suggested that one or two factors (i.e., underlying dimensions of disability) might satisfactorily explain covariations among PS (Additional file 2) [22]. As a follow-up, we fitted three CFA models to the data (Additional file 2): (1) a unidimensional model, where the 11 PS represented a single construct of global disability; (2) a two-dimensional model composed of two correlated factors that we loosely referred to as "physical disability" (mobility, hand, bladder/bowel, spasticity and tremor PS) and "mental disability" (cognition, fatigue, pain, sensory, depression, vision PS); and (3) a hybrid bifactor model [18,23], where the variability common to all 11 PS was captured by a general factor of global disability, and residual fractions of PS variability not accounted for by the general factor were captured by two auxiliary factors of "physical" and "mental" disability. In this latter model, a strong general factor and weak auxiliary factors would suggest that the structure of the data is "almost" unidimensional, and therefore that Model 1 might be preferred over Model 2 (we refer the readers to our article [14] for a general discussion of Models 1-3).
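The three competing structures differ only in which factors each PS is allowed to load on. The Python sketch below writes these loading patterns out explicitly, using the item groupings given in the text. The bifactor pattern reflects the specification reported for the final model, in which the sensory PS contributes to the general factor only. The function and label names are illustrative and do not correspond to Mplus syntax.

```python
# Item-to-domain assignments as described in the text.
PHYSICAL = ["mobility", "hand", "bladder_bowel", "spasticity", "tremor"]
MENTAL = ["cognition", "fatigue", "pain", "sensory", "depression", "vision"]
ITEMS = PHYSICAL + MENTAL

def loading_pattern(model):
    """Return {item: list of factors the item is allowed to load on}."""
    pattern = {}
    for item in ITEMS:
        domain = "physical" if item in PHYSICAL else "mental"
        if model == "unidimensional":
            pattern[item] = ["global"]
        elif model == "two_factor":
            pattern[item] = [domain]
        elif model == "bifactor":
            # In the reported bifactor model the sensory PS loads on the
            # general factor only.
            pattern[item] = ["global"] if item == "sensory" else ["global", domain]
        else:
            raise ValueError(f"unknown model: {model}")
    return pattern

for model in ("unidimensional", "two_factor", "bifactor"):
    print(model, loading_pattern(model))
```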
The unidimensional model had a mediocre fit, but the PS-factor correlations were all moderate to large (mean, 0.65; range, 0.50-0.77; Figure 1).
The fit of the two-dimensional model was only marginally better than that of the unidimensional model. Individual PS correlated 0.52-to-0.80 with their respective factor (means, 0.69 for the physical disability factor and 0.67 for the mental disability factor). Correlation between the two factors was high (0.83). Misfit was primarily due to the sensory PS substantially contributing to both the physical factor and the mental factor. The bifactor CFA model was specified so that the sensory PS contributed only to the general factor (i.e., to what the physical and mental PS measured in common; Figure 2). The fit of this model was excellent. Correlations between the PS and the factor of global disability were very similar to their counterparts in the unidimensional model. The largest differences in factor-PS correlations were observed for the mobility PS (correlation of 0.55 in the bifactor model vs. 0.66 in the unidimensional model) and the cognition PS (correlation of 0.56 in the bifactor model vs. 0.65 in the unidimensional model). Both differences matched the accepted standard of ≤0.15 for a small difference [24]. This suggested that the mobility and cognition PS would be only slightly overrepresented in scores of global disability obtained from the parsimonious, unidimensional, IRT model compared to scores of global disability obtained from the more complex bifactor IRT model. Furthermore, the variance of PS sum scores was decomposed into a large fraction explained by the factor of global disability (79%), a small fraction explained by the two auxiliary factors (11%), and a small fraction of residual error (10%) [25]. Expressed differently, 87.8% of reliable variance in the sum score represented global disability as opposed to domain-specific disability.
This result was in line with the finding that only two PS had salient correlations with the factor of residual physical disability (mobility, 0.81 and tremor/coordination, 0.33) and three with the factor of residual mental disability (cognition, 0.56; depression, 0.39; and vision, 0.32). Since the auxiliary factors of a bifactor model are considered to be minor and ill-defined if they include fewer than three items with item-factor correlations ≥ 0.40-0.50 [26], we concluded that scores of residual physical and mental disability might not be estimated with sufficient accuracy to be of practical importance in less than very large studies.
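The 87.8% figure quoted above can be reproduced from the reported variance decomposition. In the Python sketch below, the reliable variance is taken to be everything except residual error, and the share attributable to global disability is the general-factor fraction divided by that reliable variance. The variable names are introduced for the example.

```python
# Reported decomposition of the variance of PS sum scores.
general = 0.79     # explained by the factor of global disability
auxiliary = 0.11   # explained by the two auxiliary factors
error = 0.10       # residual error

reliable = general + auxiliary        # variance not due to residual error
share_general = general / reliable    # fraction of reliable variance that is "global"

print(f"reliable variance: {reliable:.0%}")
print(f"global share of reliable variance: {share_general:.1%}")
```

Under this reading, about 90% of sum-score variance is reliable, and roughly seven-eighths of that reliable part reflects the general factor rather than the domain-specific factors.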

IRT calibration and measurement
We fitted both a unidimensional and a bifactor IRT model to the data. The bifactor IRT model replicated the measurement structure of the bifactor CFA model.
The fit of both models was acceptable (Additional file 2), but neither supported the validity of a raw summed score such as the PSS-11 [27]. IRT models indicated in particular that the level of disability corresponding to a given PSS-11 summed score varied as a function of the pattern of responses to PS items. This is illustrated in Figure 3 which describes the relations between IRT scale of global disability from the unidimensional model, raw PSS-11 scores, and standing of PS categories on the IRT scale. The figure, for instance, indicates that a minimum level of fatigue disability contributed less to global disability on the IRT scale than a minimum level of mobility disability. The figure also shows that the distance between two consecutive raw PSS-11 scores varied along the IRT scale continuum. In this situation IRT modeling offered two options. The simplest, but more approximate option was to directly convert raw PSS-11 scores into IRT summed scores by aligning the former on the more linear IRT scale. The more rigorous, but less practical option was to estimate IRT pattern scores that account for the fact that combinations of PS responses corresponding to distinct true levels of disability may yield the same raw summed score [28,29]. For each PSS raw summed score, one IRT summed score would be generated versus several IRT pattern scores. IRT summed scores would maintain the simplicity of PSS-11 scores, but at the cost of some loss of accuracy. Pattern scores would be more accurate, but too cumbersome to be calculated without a computer application.
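The distinction between pattern scores and summed scores can be made concrete with a toy example. The Python sketch below uses a graded response model with three hypothetical items (the discrimination and threshold values are invented, not the calibrated PS parameters). It computes expected a posteriori (EAP) scores for individual response patterns, and EAP scores for each raw summed score using a Lord-Wingersky-type recursion. The standard normal prior, the quadrature grid, and all function names are assumptions of this sketch rather than a description of the IRTPRO implementation.

```python
import math
from itertools import product

# Hypothetical graded-response-model parameters for three 3-category items:
# (discrimination a, ordered thresholds). These are NOT the paper's estimates.
ITEMS = [(2.0, [-0.5, 1.0]), (1.0, [0.0, 1.5]), (1.5, [-1.0, 0.5])]

# Quadrature grid with an (unnormalized) standard normal prior.
GRID = [-4 + 0.1 * i for i in range(81)]
PRIOR = [math.exp(-0.5 * t * t) for t in GRID]

def category_probs(a, thresholds, theta):
    """Graded response model: P(X = k | theta) for k = 0..len(thresholds)."""
    cum = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# PROBS[i][q][k] = P(item i scored k | GRID[q])
PROBS = [[category_probs(a, b, t) for t in GRID] for a, b in ITEMS]

def eap(likelihood):
    """Posterior mean of theta for a likelihood evaluated on GRID."""
    post = [l * p for l, p in zip(likelihood, PRIOR)]
    return sum(t * w for t, w in zip(GRID, post)) / sum(post)

def pattern_eap(pattern):
    """EAP score for one complete response pattern."""
    like = [math.prod(PROBS[i][q][x] for i, x in enumerate(pattern))
            for q in range(len(GRID))]
    return eap(like)

def summed_score_eaps():
    """EAP score for each raw summed score (Lord-Wingersky-type recursion)."""
    like = [[1.0] * len(GRID)]           # like[s][q] = P(sum = s | GRID[q])
    for item in PROBS:
        n_cat = len(item[0])
        new = [[0.0] * len(GRID) for _ in range(len(like) + n_cat - 1)]
        for s, row in enumerate(like):
            for k in range(n_cat):
                for q in range(len(GRID)):
                    new[s + k][q] += row[q] * item[q][k]
        like = new
    return [eap(row) for row in like]

summed = summed_score_eaps()
for pattern in product(range(3), repeat=3):
    if sum(pattern) == 2:                # all patterns sharing a raw score of 2
        print(pattern, round(pattern_eap(pattern), 2), "summed:", round(summed[2], 2))
```

In this toy setting, patterns with the same raw score receive different pattern scores, while the summed score assigns them a single value that is a weighted average of those pattern scores. That is the trade-off between simplicity and accuracy described above.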
To examine the trade-offs among the most promising alternatives, we calculated IRT summed scores and IRT pattern scores of global disability from the unidimensional model, and IRT pattern scores of global disability, residual physical disability, and residual mental disability from the bifactor model. Unless stated otherwise, in what follows "summed scores" and "pattern scores" will refer to IRT scores generated from the unidimensional model, and "bifactor scores" to IRT scores generated from the bifactor model.
Figure 1 Unidimensional CFA model of self-assessed neurological disability in NARCOMS registrants. Note: "Disability" represents a latent factor, i.e., a not directly observable continuous variable whose scale is inferred from the variability and correlations among PS. "PS-factor correlations" are estimates of the correlations between PS and factor scores. "Residual variances" represent the fractions of PS score variability that are not explained by the factor.
A graphical comparison of PSS-11 summed score levels to corresponding IRT summed score levels provided further evidence of the shortcomings of the PSS-11 summed scale: low raw PSS-11 scores (0-to-15) were shown to underestimate corresponding IRT summed scores, while high raw PSS-11 scores (21-to-46) overestimated them (Figure 4). Therefore, in Additional file 3, we provide a conversion table that appropriately translates raw PSS-11 scores into IRT summed scores of global disability [28,29].
In contrast, differences between IRT summed and IRT pattern scores of global disability were generally small (mean, −0.07; SD, 2.1) and so were differences between pattern scores and bifactor scores of global disability (mean, 0.0; SD, 2.3). In both cases differences were near zero in the 20-to-80 IRT-score range. These results suggested that the IRT summed score approximation was unlikely to lead to clinically-relevant measurement bias.
Overall reliability of the IRT summed scores of global disability was 0.87.
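To give a sense of what a reliability of 0.87 implies on the scaled metric (SD 15), the Python sketch below applies the classical test theory formula for the standard error of measurement, SEM = SD × sqrt(1 − reliability). This is an approximation introduced here for illustration. It treats precision as constant across the scale, whereas IRT-based precision generally varies with the level of disability, and it is not a calculation reported in the study.

```python
import math

reliability = 0.87   # overall reliability reported for the IRT summed scores
sd_scaled = 15.0     # SD of the scaled scores in the reference sample

# Classical test theory approximation (an assumption of this sketch).
sem = sd_scaled * math.sqrt(1 - reliability)
half_width_95 = 1.96 * sem   # half-width of an approximate 95% interval

print(round(sem, 2), round(half_width_95, 1))
```

Under these assumptions, an individual's scaled score would carry an uncertainty of roughly 5 points (one SEM), which is one way to read the statement that the measure is acceptable for individual comparisons.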

Construct validity
We observed positive and statistically significant associations between mean IRT scores of global disability and PDDS scores (P < 0.001; Figure 5A). Increases in IRT scores of global disability were sharper over the lower portion of the PDDS scale (0-to-2) than over its higher portion (3-to-8). Differences between summed-, pattern-, and bifactor score estimates of global disability were generally minimal (standardized differences <0.15 except for PDDS 7). Mean bifactor scores of residual physical disability increased significantly and nearly linearly with increasing PDDS scores (P < 0.001; Figure 5B). In contrast, mean bifactor scores of residual mental disability increased significantly over PDDS scores 0-to-2 and then decreased (Figure 5C). The patterns in Figure 5 are consistent with the higher portion of the PDDS scale being biased toward physical disability. Alternatively, these patterns may also indicate that patients experiencing high levels of "mental" disability were less likely to enroll in NARCOMS.
Figure 2 Bifactor CFA model of self-assessed neurological disability in NARCOMS registrants. Note: "Global" represents the general factor of global disability; "Physical" and "Mental" represent the auxiliary factors of "physical" and "mental" disability. Correlations among the three factors are all forced to be zero. Thus, the physical and mental factors each explain a fraction of the variability in PS scores left unexplained by the general factor. Comparisons of "Residual variances" in Figure 2 and Figure 1 provide information about the fraction of variability in PS scores that the two auxiliary factors explain above and beyond the general factor.
In bivariable analysis, higher IRT scores of global disability were significantly and consistently associated with longer disease duration, older age at assessment, male gender, Native American ethnicity, not receiving DMT, and in-  (Table 1). Differences according to the type of score estimate (summed, pattern, bifactor) were small compared to the widths of the confidence intervals for the mean estimates. The patterns of associations between personal characteristics and bifactor scores of residual physical disability and of residual mental disability each closely paralleled that between personal characteristics and IRT scores of global disability. This latter result reinforced the hypothesis that the 11 PS were all indicators of one broad underlying construct of global disability.
The person-PSS map shown in Figure 6 relates the distributions of IRT summed scores of global disability among employed and unemployed respondents to the estimated standing of raw PSS-11 scores on the IRT scale of global disability. After adjustment for patient characteristics, prevalence of unemployment among respondents independently increased with increasing scores on all three disability measures estimated from the bifactor model (Table 2). Similarly, we found dose-response relationships between prevalence of unemployment and both summed and pattern scores of global disability. Pattern scores of global disability were more strongly associated with unemployment than bifactor scores of global disability, presumably because pattern scores of global disability are a form of weighted average of the three bifactor scores of disability (i.e., global, residual physical, and residual mental), and these three scores were all independently associated with unemployment.
After controlling for disability scores, prevalence ratio estimates for the personal characteristic variables were remarkably similar across the three regression models, once again suggesting that, despite their imperfections, scores of global disability from the unidimensional model captured most of the disability variance explained by the bifactor model.

Discussion
This study supports the notion that information on self-assessed disability elicited by the PS is adequately captured by a single score of global disability. However, several limitations of the standard PSS-11 score were uncovered in this role in terms of excessive number of response categories for 10 items (Additional file 2), underestimation of global disability in the lower part of the scale and overestimation of global disability in the upper part (Figure 4), residual dependency among items from the physical and mental domains of disability (Figure 2), and equal weighting of items (Figure 3). In comparison, the IRT summed score derived from the PSS-11 was shown to be a superior summary measure of patient-assessed global disability in MS with construct validity greater than that of the PSS-11 and similar to that of the harder-to-calculate IRT pattern score.
Table footnotes: Summed stands for EAP Summed Score; only one summary EAP Summed Score was generated for each raw PSS score. *P < 0.05; **P < 0.01; ***P < 0.001. Note. The scores of global disability, residual physical disability, and residual mental disability are all reported as scaled scores (S) with mean 50 and SD 15, but they are not on the same metric. All the scores of global disability are directly comparable, but not the scores of residual physical disability and residual mental disability. For instance, because the SD of the score of global disability is much larger than that of the scores of physical and mental disability, a score of global disability of S = 70 represents much more disability than a score of residual physical, or mental, disability of S = 70.
In this study, we developed a table (Additional file 3) to easily obtain a patient's IRT summed score of global disability from their raw PSS-11 score. We showed that this IRT estimate has a reliability of measurement (0.87) that is appropriate for both comparisons between groups and individual-level monitoring. To facilitate intuitive interpretation of where a patient stands relative to the distribution of scores in the NARCOMS reference sample, we transformed raw IRT scores into scaled scores with mean 50 and SD 15 in this sample. Finally, we created a diagram that provides graphical information about how each PS contributes to respondents' perception of their level of overall disability (Figure 3).
We encountered several challenges in data analysis which directed us toward solutions that were less than perfect from a pure measurement perspective. Some experts stress the importance of focusing efforts on well-defined, narrow and strictly unidimensional constructs in order to meet Rasch requirements for fundamental measurement [31,32]. These experts would probably point out that the current PS do not fully cover the breadth of the dimensions of physical disability and, especially, mental disability; they would likely recommend writing and testing new PS in order to create two clearly distinct, and better-defined, unidimensional measures of physical and mental disability. Instead, we adopted the position of those experts who emphasize clinical appropriateness at the cost of small measurement bias [33][34][35][36]: i.e., we relied on what the bifactor model enables and gave priority to incorporating in a single summary measure the domains of disability most commonly affected by MS. Analyses suggested that two underlying dimensions of disability could be identified in the data, but that these dimensions were ill-defined and highly correlated. Little reliable information on what we loosely called "physical" and "mental" disability was left in the PS after having extracted information on global disability (this information was provided by the minority of patients affected to markedly different degrees by physical and mental disability). Scores of residual physical disability and residual mental disability were significantly associated with the same patient characteristics. Furthermore, for both scores of residual disability the pattern of associations with patient characteristics was similar to that observed for scores of global disability. With one exception, only the strength of some associations differed slightly depending on the disability score examined.
This exception pertained to high PDDS scores, which were positively correlated with high scores of global disability and high scores of residual physical disability, but not with high scores of mental disability. This finding is likely to reflect, at least in part, the bias of the upper portion of the PDDS scale toward mobility disability and physical disability in general.
Table footnotes: Bayesian expected a posteriori (EAP) estimates of disability (i.e., "IRT pattern scores"). e EAP summed-score estimates of disability (i.e., "IRT summed scores").