Evaluation of the Structural Validity of the Work Instability Scale Using the Rasch Model

Objective To use Rasch analysis to examine the measurement properties of the 23-item version of the Work Instability Scale (WIS-23) in a sample of worker compensation claimants with upper extremity disorders. Design Secondary data analysis on the data retrieved from a cross-sectional study. Setting Tertiary care hospital. Participants Patients (N=392) attending a specialty clinic for workers with upper limb injuries at a tertiary hospital were prospectively enrolled. Interventions Not applicable. Main Outcome Measures WIS-23. Results The study sample contained 392 participants between the ages of 19 and 73 years (mean, 47.0±10.5y). There were 148 (37.8%) women, 182 (46.4%) men, and 62 (15.8%) participants for whom sex identification was unavailable. The initial WIS data analysis showed significant misfit from the Rasch model (item-trait interaction: χ2=293.52; P<.0001). Item removal and splitting were performed to improve the model fit, resulting in a 20-item scale that met all assumptions (χ2=160.42; P=.008), including unidimensionality, local independence of items, and the absence of differential item function based on age, sex of respondents, employment type, and affected upper extremity area across all tested factors. Conclusion With the application of Rasch analysis, we refined the WIS-23 to produce a 20-item WIS for work-related upper extremity disorders (WIS-WREUD). The 20-item WIS-WREUD demonstrated excellent item and person fit, unidimensionality, acceptable person separation index, and local independency. The WIS-20 may provide better measurement properties, although longitudinal psychometric evaluations are needed.

The assessment of health-related at-work limitations is essential for clinical researchers to evaluate the effect of occupational injuries. [1][2][3] Previous research suggests that productivity loss at work contributes to a considerable amount of indirect economic costs. 4 Upper extremity injuries are commonly considered as the major source of work disability. 5,6 Some clients have complicated career trajectories that include career changes and job adjustments that may relate to a mismatch between person's functional capabilities and the demands of the job. This is defined as work instability (WI). [7][8][9] WI can be related to a variety of factors, including decreased capability related to aging or disease or increased capability due to treatment effects. Work instability can also occur when the demands of the job change.
The WI Scale (WIS-23) was originally developed as a WI classification system for individuals with rheumatoid arthritis (RA) to support timely and appropriate management, including vocational rehabilitation, psychosocial support, and clinical intervention for job retention. 7,10 It consists of 23 dichotomous items (yes or no) describing specific experiences or situations that provide indications of WI. Examples of individual items include: "I have pain or stiffness all the time at work." The total score is a sum between 0 and 23, with higher scores indicating greater WI. Three levels are used to classify the total score into low WI (<10), indicating low risk of work disability (WD); moderate WI (10)(11)(12)(13)(14)(15)(16)(17), corresponding to medium level of WD; and high WI (>17), indicating those individuals having high risk of WD. 7 Workplace modification would be recommended for individuals (4 of 5) with a moderate level of WI and is critical for those (19 of 20) who have a high level of WI. 7 Previous studies have affirmed acceptable reliability and construct validity under classical test theory in populations with RA and osteoarthritis (OA). 7,11 In addition, a qualitative interview captured key themes of WIS-23 covering job flexibility, good working relationships, and symptom control. 11 Several formats of the WIS have been developed based on different diagnoses and populations, including upper extremity disorders, 12 manual workers, 13 brain injuries, 14 and OA. 10 A large study involving 2092 participants concluded that individuals with work-related injuries had a higher risk of absence from work, mobility-related functional problems, disability, and impaired functioning related to anxiety or depression. 15 Workers who experience musculoskeletal pain across various regions will seek diverse resources of health care, in comparison to those with an explicit health condition. 13 This suggests the need for a universal version of WIS for work-related upper extremity disorders (WIS-WRUED).
Rasch analysis, is an alternative strategy for the evaluation of structural validity. Rasch analysis enables examination of the assumption of unidimensionality and structure of rating scales to convert the ordinal scale of individual items into interval scaling. 16,17 To compute a total score, the response options should demonstrate interval level scaling, also known as the scaling assumption. 17 Where outcome measures are not developed using Rasch modeling, they can be retrospectively evaluated for fit to the Rasch model, which often results in modifications to the questionnaire to obtain fit. Previous studies evaluated the WIS-23 using Rasch analysis and found a significant deviation from the Rasch model fit. 10,12 Therefore, the purpose of this study was to apply Rasch analysis to (1) examine to what extent the rating scale of WIS-23 fits to the Rasch model by inspecting the test fit statistic and ordering of item thresholds; (2) examine the differential item functioning based on age, sex of respondents, employment type, and affected upper extremity area and exploring solutions by altering the rating scale; and (3) test the construct validity of all subscales of the Work Limitations Questionnaire by examining the unidimensionality and local dependency.

Methods
The study dataset was prospectively collected from a specialty clinic for workers with upper limb injuries in a tertiary hospital in Canada. The package of patient-reported outcome measures including WIS-23 was sent to patients shortly before their initial clinic assessment and completed immediately prior to attending clinic or at the initial clinic visit. Research ethics board approval was obtained for the clinical database, and patients provided informed written consent to have data used in research.
Rasch analysis included tests of unidimensionality fit of residual, ordering of item thresholds, person separation index (PSI), differential item functioning (DIF), and local independence of items. The analysis was performed using RUMM 2030 professional suite software. a The significance level was set at 0.05, with Bonferroni correction applied when multiple comparisons were made. To facilitate stable analyses, sample sizes of 250 participants were required. 18,19 Rasch analysis Test of fit The test of fit quantifies to what extent items from the outcome measure meet the expectations of the Rasch model. Fit statistics can be checked at both overall and individual item levels. For the overall fit, the P value from a chi-square test of item-trait interaction should be nonsignificant according to the critical value. 17,20 The itemeperson interaction statistics were then transformed to approximate a z score following a standardized normal distribution. We expected a mean of approximately 0 and a standard deviation of 1 to characterize the normal distribution. 17 At the individual level, a fit residual localized within AE2.5 logits represented an adequate fit for the model. 17

Threshold
The threshold refers to the point between 2 response categories at which either response is equally probable. Disordered thresholds mean that the respondents fail to meaningfully discriminate between response options or that options are potentially confusing. Threshold maps and categorical probability curves were used for visual inspection of this phenomenon. Where needed, we attempted to resolve this problem by collapsing adjacent categories and reversing the order of response options. 17 Targeting Scale-to-sample targeting reflects the extent to which the items can measure the whole range of an individual's ability level. The person item threshold distribution displays the relative difficulty (item locations) and relative ability (person location) on the same ruler of logits. The better the ranges match each other, the greater the potential for precise person measurement. Poor targeting often results in floor or ceiling effects, indicating that patients at the extremes cannot be differentiated from each other nor can change be measured in terms of lower (floor) or higher (ceiling) future scores. 20,21 A scale is considered well targeted if the difference between person and item means would be less than 1 logit unit. 10 DIF DIF occurs when individual groups of patients within the study sample (men vs women), respond differently to an item given the equal level of characteristic being measured. For instance, men and women with equal level of work disability may respond systematically different to an item measuring completeness of job demands. 22 Two types of DIF can be identified. Uniform DIF is where the group shows a consistent systematic difference in their responses to an item. The standard approach of splitting the item for individual groups can be used to address such issue. Nonuniform DIF results from random differences (eg, responses to individual item vary across levels of the ability for subgroups). Currently, there is no solution for nonuniform DIF, except item removal. 23 DIF was examined on age, sex of respondents, employment type, and affected upper extremity area using both critical statistical values using a Bonferroni adjusted P value of .0007 (.05/23 * 3) and visual inspection by item characteristic curve (ICC). 17 The visual inspection was facilitated by plotting the item characteristic curve along with the person trait for given person factors. Under an ideal situation, there is no difference in ICCs for different groups, indicating that participants with identical level of ability have equal probabilities of affirming a given item.

Dimensionality
The basic assumption of the Rasch model, unidimensionality, was checked through the principal component analysis (PCA) under item response theory. After the rescoring of individual response options and any resultant item reduction, the PCA was be revisited as confirmation of the unidimensionality. 24 We set the number of significant t tests at 5% of the total comparisons as the indicator of multidimensionality.

Local independence
Residual pattern refers to the standardized person-item differences between the observed data and expected response generated by the model for every person's response to individual item. The Rasch model requires no residual pattern existing in the data, which is named as local independence. Such residual pattern was examined by PCA. A residual correlation between any 2 items greater than 0.2 above the average correlation would appear to indicate the violation of local independence, as local dependency (LD). 25,26 Potential reasons for the appearance of LD are response dependency and multidimensionality. 27 Item deletion and the creation of testlets to bundle the dependent items was used as a potential solution to address the LD. 25,28 Reliability The PSI indicates the precision of the estimate for each person and was used to evaluate the internal consistency under the Rasch model. The acceptable value was set as 0.7 to establish that the scale is reliable to distinguish between at least 2 groups. [29][30][31] In addition to the PSI, the traditional Cronbach's alpha was provided by the RUMM 2030 program; 0.8 was adopted as the satisfactory value. 32

Study participants
The study sample contained 392 participants with full responses. The age of participants ranged from 19 to 73 years old (mean AE SD, 47.0AE10.5y). There were 148 (37.8%) women, 182 (46.4%) men, and 62 (15.8%) participants missing sex information. Within the study sample, 56.4% of the total subjects engaged in labor work, 16.8% engaged in office work, and 7.9% engaged in a job that included both labor and office work. Due to the nature of Rasch analysis, continuous descriptive variables were transferred to categorical data. The continuous age variable was then recoded into 2 groups according to the median value. Specifically, code 1 represented the group between 19 and 49 years old, and code 2 represented those between 50 and 73 years old. A full summary of participant demographic and clinical characteristics is listed in table 1.

Test of fit
The rating scale model was selected for the current analysis due to the unified dichotomous response options over all 23 times. 27 The initial evaluation of the overall questionnaire with 23 items demonstrated poor overall fit to the Rasch model in the substantial deviation in the standardized item fit residual statistic (mean AE SD, e0.27AE1.9). The significant chi-square test (c 2 Z293.52; dfZ138; P<.001) for item-trait interaction revealed misfit from Rasch model. Table 2 lists the overall summary of fit statistics. To locate problematic items that may cause misfit issues, we checked the individual item fit statistics. Items 1 and 23 were misfitting as the fit residual values were greater than 2.5. Individual item descriptions with Rasch solutions are shown in table 3.

Threshold
The inspection of the threshold map of WIS-23 suggested that no disordered issues were present, as expected given the dichotomous response option. Figure 1 shows the threshold map of the original WIS.
Targeting Figure 2 shows the targeting of the WIS-23. The mean of person logits is equal to 0.24 (<1 logit), indicating that the scale item difficulty matches the abilities of the sample. In addition, the floor and ceiling effects were not detected because only 6% (<10 %) of the participants were on either end of the spectrum.

DIF
The personal factors considered as potential sources of DIF (bias) were sex, age (19-49y and 50-73y), affected sides (left, right, bilateral), injured body area (wrist and hand, elbow and forearm, shoulder and arm, entire upper extremity), and 3 job categories (labor, office, and mixed). Uniform DIF due to sex bias was detected on item 11, as the curve representing expected value of item response for women was higher than the curve for men across most person locations (person ability). Figure 3 shows the uniform-DIF of item 11. Similar uniform DIFs were identified on items 18 and 22 for injured body area due to selection bias. Item 19 was removed due to the nonuniform DIF identified in different job categories. Table 3 lists the individual item description with Rasch solutions.

Dimensionality
The WIS-23 met the assumption of unidimensionality as 4.95% (<5%) of the independent t tests were found to be significant at the 5% level (see table 2 for the overall summary of fit statistics).

Local independence
None of the item-to-item correlations exceeded the cutoff value, and the assumption of local independence was met in the WIS-23 (Supplemental Table S1, available online only at http://www.archives-pmr.org/).

Reliability
The PSI value was equal to 0.85 for the WIS-23. The Cronbach's alpha was calculated as 0.87 without missing data points.  4) A mean value of 0.14 logits (<1 logit) revealed good targeting of the revised WIS. Flooring and ceiling effects were absent in the 20-item version ( fig 5). The revised version was in accordance with the assumptions of dimensionality and local independency. Both PSI and reliability statistics remained the same after deleting 3 items (see table 2). Therefore, we conducted a logit transformation of the 20-item WIS summed scores to provide interval-level scaling. The conversions are presented in table 4.

Discussion
Our Rasch analysis of WIS-23 supports a 20-item version and provides a transformation of interval level scaling in injured workers with upper extremity conditions. This  12 in which they recommended a 17-item WIS specifically for the upper extremity. They excluded items 1 and 23 among others, which were also recommended to be removed by the current Rasch analysis. Our initial analysis indicated that there were no disordered thresholds. This was an expected finding as dichotomous response options are not susceptible to disordered thresholds, as reported in a previous Rasch analysis study. 12 PSI values of >0.70 and >0.85 are considered acceptable for group and individual use, respectively. The PSI value for the 20-item WIS was 0.85, indicating that the WIS can discriminate at both levels and show similar discrimination as the WIS-23 (PSIZ0.83). 12  We had to delete 2 misfitting items, including items 1 ("I can get my job done, I'm just a lot slower") and 23 ("I get good and bad days at work"), as they had fit residual greater than AE2.5. These items also misfit in a previous Rasch analysis, 12 which increases our confidence that there may be problems with these items that warrant their removal. Item 1 is a double-barreled question, as an answer of "no" could mean that the respondent cannot do their job tasks (eg, modified work) or that they can and are not slower (eg, can do normally but have pain). Doublebarreled questions are insufficient in terms of content validity. Thus, our Rasch findings that question this item reinforce the need for careful content validity before items are included in a measure and provide justification for removal of the item. Item 23 was designed to indicate the fluctuation of symptoms during day-to-day work but not necessarily for respondents in the recovery phase. 12 These findings are similar to previous Rasch analyses. Because a single piece of evidence is too preliminary to justify largescale adoption of a revised measure, this level of conceptual and independent verification is needed. We suggest that these combined concerns justify the removal of these 2 items.
Prior studies have suggested removal of items 8, 13, 20, and 22 to improve the Rasch model fit. However, in the current study, we found no significant deviation from the model fit and retained these items. Because work instability is a complex phenomenon and the questionnaire is  already quickly answered due to its yes/no response options, we believe that an overly shortened item may not be sufficiently inclusive of the problems individuals experience. Therefore, where items do function well, we would prefer to retain them. Some differences between Rasch analyses in different studies can be expected due to differences in the health conditions, occupations, and demographics of different study samples. For example, item 22 ("I'm finding any pressure on my hands is a problem") performed well in our sample in which 25.1% of the injured workers had claims related to the hand and wrist, but may not have been as relevant to patients in a previous study that only included patients with shoulder injuries. 12 As we see different Rasch analyses being performed on the same instrument, we have come to appreciate that variances in results are to be expected. Therefore, researchers wanting to optimize measurement in their sample should choose a Rasch solution that is most similar to their population. Permanent changes to measures should be considered where there are consistent findings that items do not perform well.
The third item eliminated from the 23-item version to create our 20-item WIS was item 19 ("I get very stiff at work") since it exhibited nonuniform DIF based on job classification. Usually, nonuniform DIF occurs when the ICC indicates that the participants from different groups with identical ability levels have different probabilities of correctly responding to an item. In the current case, we classified jobs as labor, office work, and mixed. The onset of stiffness could vary between individuals depending on what type of activities they perform. For example, an office worker who works in front of a computer can develop spinal stiffness due to sustained postures, whereas a person with a manual job might have knee and hand stiffness that arises from overuse of muscles and joints. These could be quite different in nature and severity. Because the question is not very specific in terms of the source of stiffness, different jobs could have a different type of response or bias. However, unless this is a stable finding across studies, we are less certain of the recommendation for permanent removal.
Our study identified uniform DIF for item 11 ("I have great difficulty opening some of the doors at work") for men and women. This did not result in removal of the item because we recognize that differences in response patterns could be due to the biological variations in strength, muscle fibers, etc 33 because the percentage of strength required for men and women to open a standard door would be different. Uniform DIF for areas of injury was also observed in items 18 ("I'm getting up earlier because of my shoulder/ elbow problem") and 22 ("I'm finding any pressure on my hands is a problem") in the current study. A similar trend was observed in a previous study by Tang et al, 12 including uniform DIF for items 18 and 22 between rheumatoid  The transformation of the summed score to an intervallevel score in our revised version enables the analysis of within-and between-subject differences such as analysis of variance since interval level scaling is an assumption for mathematical manipulations. Ordinal measures performed by many self-report instruments cannot guarantee that the intervals between different response options are equivalent and, therefore, it would be inappropriate to perform mathematical operations on these scores. We recognize that this is routinely done. However, it violates the assumptions of statistical tests and is a potential source of error.

Study limitations
A limitation of Rasch analysis is that it can be used to identify problematic items but cannot identify why an item does not perform well. We know that many currently used outcome measures do not have publications that clearly articulate the steps taken to ensure content validity of the items. Issues such as lack of clarity of items, doublebarreled questions, and lack of relevance are surprisingly common in currently used outcome measures. Therefore, it is possible that Rasch analysis will indicate the need to delete an item that could be rehabilitated. Some differences exist across different Rasch solutions and only those items that consistently demonstrate measurement problems can be confidently removed 34 to avoid multiple conflicting versions of outcome measures. However, because shortened measures that remove nonfunctioning items can lessen burden and improve measurement, they should be adopted where sufficient evidence exists.

Implications
Further research should include cognitive interviewing and qualitative methods to further examine the content validity and clarity of the WIS items to complement statistical approaches and comparison of different Rasch solutions in the same sample to further examine proposed alternate forms, including complementary statistical analyses such as factor analyses and classic psychometric properties (eg, responsiveness). A comparative approach across samples and versions is needed to provide robust evidence on the "best" version of the WIS. At present, researchers should use the version of the WIS that provides the best measurement properties for their study sample.

Conclusions
In conclusion, we found through Rasch analysis that 3 problematic items of the WIS-23 could be removed to perform a well-functioning interval level scaled WIS-20 in injured workers with upper extremity musculoskeletal conditions. The 20-item WIS-WREUD demonstrated excellent item and person fit, unidimensionality, acceptable person separation index, and local independency.