Calibration of the PROMIS Physical Function Item Bank in Dutch Patients with Rheumatoid Arthritis

Objective To calibrate the Dutch-Flemish version of the PROMIS physical function (PF) item bank in patients with rheumatoid arthritis (RA) and to evaluate cross-cultural measurement equivalence with US general population and RA data. Methods Data were collected from RA patients enrolled in the Dutch DREAM registry. An incomplete longitudinal anchored design was used where patients completed all 121 items of the item bank over the course of three waves of data collection. Item responses were fit to a generalized partial credit model adapted for longitudinal data and the item parameters were examined for differential item functioning (DIF) across country, age, and sex. Results In total, 690 patients participated in the study at time point 1 (T2, N = 489; T3, N = 311). The item bank could be successfully fitted to a generalized partial credit model, with the number of misfitting items falling within acceptable limits. Seven items demonstrated DIF for sex, while 5 items showed DIF for age in the Dutch RA sample. Twenty-five (20%) items were flagged for cross-cultural DIF compared to the US general population. However, the impact of observed DIF on total physical function estimates was negligible. Discussion The results of this study showed that the PROMIS PF item bank adequately fit a unidimensional IRT model which provides support for applications that require invariant estimates of physical function, such as computer adaptive testing and targeted short forms. More studies are needed to further investigate the cross-cultural applicability of the US-based PROMIS calibration and standardized metric.


Introduction
Rheumatoid arthritis (RA) is one of the most prevalent rheumatic diseases, characterized by pain and swelling of the joints which may lead to significant disability. Patient-reported physical function is a core outcome domain in RA research [1,2]. Physical function is typically assessed using standard, fixed-length questionnaires. Although often extensively validated, key limitations of these traditional questionnaires remain their static nature and limited measurement range and measurement precision, frequently leading to ceiling and floor effects and limited sensitivity to change [3][4][5][6][7][8]. Recent studies have suggested that these shortcomings may be overcome by item response theory (IRT) based item banking [9,10]. IRT calibrated item banks can serve as a platform for tailored assessment of patient-reported outcomes, through developing targeted short forms or computerized adaptive tests (CATs). Both methods of assessment ensure that patients respond to questions that are more relevant to their specific level of disability and that only a minimal number of questions need to be answered, while retaining or surpassing the measurement precision of fixed-length instruments.
The Patient-Reported Outcomes Measurement Information System (PROMIS) initiative has developed and calibrated item banks for assessing several important domains of health status, including physical function, across a wide variety of chronic diseases and conditions and the general population in the US [11]. Using data from the general population and several clinical samples in the US, all items in the item banks are calibrated on a common, standardized metric. Potentially, the PROMIS physical function (PF) item bank could also lead to improved assessment of physical function in clinical or comparative studies in RA. Indeed, recent studies have already shown a 20-item PROMIS PF short form to be more precise and more responsive to change than traditional questionnaires in RA [12]. Recently, the PROMIS PF item bank has been translated and culturally adapted for use among Dutch and Flemish populations. Pretesting of the translated items revealed that the items were understood by patients as intended and culturally appropriate for use in Dutch populations with arthritis [13,14]. Before an item bank can be used in a new population, however, it should be demonstrated that data collected from that population can be fit to an appropriate IRT model. If this is the case, a latent metric specific for this population can be created that allows invariant estimates of the item parameters and physical function levels to be obtained (e.g., item parameters that are independent of the physical function level of the respondents used to calibrate the item bank) [15]. As a result, physical function estimates on a common scale may be obtained from any number and combination of items in the item bank and applications such as CATs and targeted short forms become possible. 
A second question that needs to be addressed is whether the relationship between observed physical function scores and the physical function trait measured by the item bank is equivalent to this relationship for the original population. If this is the case, this would provide evidence that the model parameters can be expressed on a common scale [16]. In case of the PROMIS PF item bank, this would mean that data from the specific population can be scored using the US-based PROMIS calibration and standardized metric, making scores directly comparable between populations.
The aims of the current study were to calibrate the Dutch-Flemish PROMIS PF item bank in a prospective cohort of Dutch patients with RA and to evaluate its measurement equivalence with data from the total PROMIS wave 1 calibration sample in the US and a smaller subset of US RA patients.

Patients
Data for this study were collected within the Dutch Rheumatoid Arthritis Monitoring (DREAM) registry. The DREAM registry is an observational multicenter cohort study that monitors the course of unselected RA patients in the Netherlands. Both patient-reported and clinical outcomes are collected and monitored using a web-based data acquisition and storage system. Patient-reported outcomes, including the Health Assessment Questionnaire disability index (HAQ-DI) and the SF-36 health survey, are completed preceding every visit to the outpatient clinic. Between September 2012 and September 2013, all participating patients from three DREAM hospitals were informed about the study and invited to participate upon logging on to their patient portals preceding their visit to the clinic.

Data collection designs
Dutch DREAM data. To optimize data quality and minimize patient burden, an incomplete longitudinal design was used for calibrating the Dutch-Flemish item bank in the Dutch RA patients, in which different subsets of items (booklets) were administered to different patients. The booklets were linked using common items, making it possible to place all items on a single scale [17]. Since previous research has found that a larger number of common items within booklets improves the stability of IRT models estimated from incomplete calibration designs [18], the item responses on the HAQ-DI and the SF-36 physical functioning scale (PF-10), the two most widely used measures of physical function in RA, were added to the calibration design. A graphical overview of the calibration design is presented in Figure 1.
Upon consenting to participate, patients were allocated randomly to one of six booklets. Besides the HAQ-DI and PF-10, each booklet contained two sets of approximately 20 of the 121 PROMIS PF items and each of the six sets featured in two booklets in such a way that half of the items in each booklet overlapped with the previous booklet and half with the next. On successive participations, patients were allocated to booklet N+2 (for N = 1,2,3,4) or N−4 (for N = 5,6), where N is the booklet that was administered at the preceding participation, so that patients completed the full item bank after three participations. The sample sizes of the six groups were approximately equal so that all items received an approximately equal number of responses.
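The rotation described above can be checked with a short sketch. It assumes that booklet b holds item sets b and b+1 (wrapping from 6 back to 1), which is one layout consistent with the stated overlap, and that the rotation rule is N + 2 for booklets 1 to 4 and N − 4 for booklets 5 and 6. The function names are illustrative and do not come from the study.

```python
N_BOOKLETS = 6  # six booklets and six item sets, as described in the text


def item_sets_in_booklet(booklet: int) -> tuple[int, int]:
    """Item sets in a booklet, ASSUMING booklet b holds sets b and b+1 (wrapping)."""
    return booklet, booklet % N_BOOKLETS + 1


def next_booklet(booklet: int) -> int:
    """Rotation rule: N+2 for N = 1..4, N-4 for N = 5, 6."""
    return booklet + 2 if booklet <= 4 else booklet - 4


def schedule(first_booklet: int, waves: int = 3) -> list[int]:
    """Booklets a patient receives over successive participations."""
    booklets = [first_booklet]
    for _ in range(waves - 1):
        booklets.append(next_booklet(booklets[-1]))
    return booklets


if __name__ == "__main__":
    for start in range(1, N_BOOKLETS + 1):
        booklets = schedule(start)
        covered = sorted({s for b in booklets for s in item_sets_in_booklet(b)})
        print(start, booklets, covered)
```

Under these assumptions every starting booklet covers all six item sets after three participations, and adjacent booklets share exactly one item set.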
From historical log data of physical function items in the DREAM registry, it was estimated that the majority of patients would need no more than 10 minutes to complete each booklet of approximately 40 items. An effort was made to balance the relative difficulty of the items in each booklet by ordering the items according to their peak statistical information on the latent IRT metric according to the US PROMIS wave 1 calibration results. As 20% of the PROMIS PF items have a different stem (i.e., 'does your health now limit you…' rather than 'are you able to…') and associated set of response options, each booklet contained a proportional number of these items.
US PROMIS wave 1 data. PROMIS wave 1 data for 14 candidate item pools, including three pools of physical function items, were collected between July 2006 and March 2007 from over 21,000 participants selected from both the US general population and specific clinical populations [19]. The data collection design of the wave 1 data consisted of both so-called 'full bank administrations', where participants were administered two sets of 56 items from only one or two item pools, and 'block administrations' where participants completed 14 blocks of seven items from all item pools. To avoid complicating the calibration design and analyses, we chose to model only the available full bank data from the general population sample and the block data available from the clinical sample of RA patients.
The full bank arm of the data collection design for physical function in the general population consisted of two booklets that were completed by two independent samples of 942 and 995 respondents, respectively. The booklets were complementary in that each PF item featured in only one booklet and together the booklets contained all 121 final items of the PROMIS PF item bank. Besides the PROMIS PF items, respondents completed the HAQ-DI or PF-10 or both. The HAQ-DI and PF-10 data were included in the calibrations in order to obtain a linked structure so that the US item parameters could be placed on a common latent scale, despite the lack of overlapping PROMIS PF items between the two booklets. Additionally, two clinical samples of 273 and 280 RA patients completed a booklet with a selection of seven items from each of the three PF item pools. Twenty-four of these 42 administered items were calibrated in the final US PROMIS PF item bank (13 and 11 items from each booklet, respectively).

Measures
PROMIS physical function (PF) item bank. The PROMIS PF item bank measures self-reported, current capability to carry out activities that require physical actions, ranging from self-care (activities of daily living) to more complex activities that require a combination of skills, often within a social context. The final calibrated item bank contains 121 questions assessing the functioning of the upper extremities (dexterity), lower extremities (walking or mobility), and central regions (neck, back), as well as instrumental activities of daily living, such as running errands [19]. Each item is scored on a 5-point rating scale, with higher scores indicating better functioning. The Dutch-Flemish translation of the item bank was developed according to the universal PROMIS translation approach (http://www.nihpromis.org/measures/translations), which included extensive forward-back translation procedures, expert reviews, and cognitive debriefing interviews among Dutch and Flemish participants [20].
Health Assessment Questionnaire disability index (HAQ-DI). The HAQ-DI contains 20 items measuring physical disabilities over the past week in eight categories of daily living: dressing and grooming, rising, eating, walking, hygiene, reach, grip, and activities [21]. Each item is scored on a 4-point rating scale from 0 (without any difficulty) to 3 (unable to do). Disability scores were calculated according to the alternative scoring rule, which does not account for the use of aids and help from others [22]. Category scores are averaged to produce a total score between 0 and 3, with higher values indicating more disability. The Dutch consensus version of the HAQ-DI was used in the DREAM data collection.
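As an illustration of the scoring described above, the following sketch computes a HAQ-DI total without adjusting for aids or help. It assumes that a category score is the highest item score within that category, which is the usual HAQ-DI convention but is not spelled out in the text, and the item-to-category index layout is purely illustrative.

```python
# ASSUMED layout: which of the 20 item positions belong to each of the
# eight categories. The indices are illustrative, not the official order.
HAQ_CATEGORIES = {
    "dressing and grooming": [0, 1],
    "rising": [2, 3],
    "eating": [4, 5, 6],
    "walking": [7, 8],
    "hygiene": [9, 10, 11],
    "reach": [12, 13],
    "grip": [14, 15, 16],
    "activities": [17, 18, 19],
}


def haq_di_alternative(item_scores):
    """HAQ-DI total (0-3) from 20 item scores, ignoring aids and help.

    ASSUMPTION: category score = highest item score in the category.
    """
    if len(item_scores) != 20:
        raise ValueError("expected 20 item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("item scores must be 0, 1, 2 or 3")
    category_scores = [
        max(item_scores[i] for i in indices)
        for indices in HAQ_CATEGORIES.values()
    ]
    # Average the eight category scores to obtain the total score.
    return sum(category_scores) / len(category_scores)
```

For example, a patient who is unable to do a single activity in one category and reports no difficulty elsewhere would score 3/8 = 0.375 under these assumptions.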

SF-36 Health Survey physical functioning scale (PF-10). The PF-10 is one of the eight scales of the SF-36 Health Survey and consists of 10 items measuring perceived current limitations in a variety of physical activities on a 3-point response scale from 1 (yes, limited a lot) to 3 (no, not limited at all). Scores of the PF-10 items are summed and linearly transformed to range between 0 and 100, with higher scores indicating better physical functioning [23]. The Dutch version of the SF-36v2 was used in the DREAM study [24].
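The linear transformation of the PF-10 can be written out as a short sketch. With 10 items scored 1 to 3, the raw sum ranges from 10 to 30, so subtracting the minimum and dividing by the range of 20 yields a 0-100 score. The function name is illustrative, and handling of missing items is deliberately left out.

```python
def pf10_score(item_scores):
    """Transform ten PF-10 item scores (each 1-3) to the 0-100 scale."""
    if len(item_scores) != 10:
        raise ValueError("expected 10 item scores")
    if any(score not in (1, 2, 3) for score in item_scores):
        raise ValueError("item scores must be 1, 2 or 3")
    raw = sum(item_scores)  # lowest possible sum is 10, highest is 30
    return (raw - 10) / 20 * 100
```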
Additional patient-reported and clinical measures. The Dutch DREAM registry additionally collected patient-reported general health, disease activity, fatigue, and pain in the past week on 0-100 visual analog scales (VASs), with higher scores indicating worse status. Clinical data were collected during visits to the outpatient clinic, including a 28-tender joint count, 28-swollen joint count, and erythrocyte sedimentation rate. Together with the VAS general health, these measures were combined into a single index of clinical disease activity (DAS28) [25].
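For readers unfamiliar with the index, the sketch below shows how the four components can be combined. The text only states that the measures were combined into the DAS28; the weights used here are those of the commonly published ESR-based DAS28 formula and are an assumption about the exact variant used in the registry.

```python
import math


def das28_esr(tender_joints, swollen_joints, esr, general_health_vas):
    """DAS28 from 28-joint counts, ESR (mm/h) and VAS general health (0-100).

    ASSUMPTION: the standard ESR-based DAS28 weights apply.
    """
    if not (0 <= tender_joints <= 28 and 0 <= swollen_joints <= 28):
        raise ValueError("joint counts must lie between 0 and 28")
    if esr <= 0:
        raise ValueError("ESR must be positive to take its logarithm")
    return (
        0.56 * math.sqrt(tender_joints)
        + 0.28 * math.sqrt(swollen_joints)
        + 0.70 * math.log(esr)
        + 0.014 * general_health_vas
    )
```

Under this formula, values below 2.6 are conventionally taken to indicate remission.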

Statistical analysis
All IRT analyses were performed with the MIRT software package [26]. The marginal maximum likelihood estimation procedure was utilized to estimate the model parameters and the latent physical function levels of patients were estimated using the expected a posteriori (EAP) method throughout all analyses. Latent physical function scores are expressed on a scale with a mean of 0 and SD of 1. A multidimensional generalization of the two-parameter generalized partial credit model (GPCM), suitable for the analysis of longitudinal, polytomous data [27], was used to model the Dutch data. In this model, the item parameters pertain to time point specific latent dimensions and the dependency between item responses at different time points is modeled by the correlation between the dimensions. The model allows patients' levels of physical function to change over time but item parameters are constrained to be equal across time points. To evaluate whether the item parameters were stable over time, the presence of longitudinal differential item functioning (DIF) was evaluated using regression analysis as proposed by Te Marvelde & Glas [28]. To this end, unidimensional GPCM estimates of the Dutch PROMIS data were obtained for each time point separately. The resulting threshold parameters were regressed on the threshold parameters emanating from one of the other two models in a series of univariate regression models. Individual items were considered to display statistically significant longitudinal DIF in case an item's 99% confidence interval did not intersect the regression line [28].
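To make the model concrete, the following sketch gives the category probabilities of a unidimensional GPCM for a single item and a simple grid-based EAP estimate under a standard normal prior. It is a minimal illustration, not the routine implemented in the MIRT package: the longitudinal model additionally links time-point-specific dimensions through their correlations, categories are coded 0..m here, and all names and the quadrature grid are assumptions.

```python
import numpy as np


def gpcm_probs(theta, a, thresholds):
    """GPCM category probabilities for one item (categories 0..m).

    a: discrimination parameter; thresholds: the m step parameters.
    """
    steps = a * (theta - np.asarray(thresholds, dtype=float))
    # The logit of category k is the sum of the first k steps (0 for k = 0).
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    logits -= logits.max()  # guard against overflow in exp
    probs = np.exp(logits)
    return probs / probs.sum()


def eap_estimate(responses, items, n_nodes=61):
    """EAP estimate of theta on an evenly spaced grid, N(0, 1) prior.

    responses: observed category per item (0..m).
    items: list of (a, thresholds) tuples, one per administered item.
    """
    nodes = np.linspace(-4.0, 4.0, n_nodes)
    log_post = -0.5 * nodes**2  # log of the normal prior, up to a constant
    for x, (a, thresholds) in zip(responses, items):
        log_post += np.log([gpcm_probs(t, a, thresholds)[x] for t in nodes])
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    return float(np.sum(weights * nodes))
```

Because the posterior is evaluated only for the items a respondent actually answered, the same function applies to any subset of items, which is what makes incomplete designs and short forms workable under the model.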
Fit of the longitudinal IRT model was assessed using Lagrange multiplier (LM) statistics, which evaluate whether observed item scores correspond to those expected by the item characteristic function [29]. To evaluate the magnitude of model violation of significant LM tests, effect size statistics (ES) were also obtained. These effect sizes are differences between average observed and expected scores across 3 total-score level groups. To compute these effect sizes, the patients were divided into 3 groups of approximately equal size obtaining low, intermediate, and high scores. The observed and expected scores were divided by the maximum attainable item score, such that a difference of, say, 0.10 indicated that the observed average score was 10% different from its expectation under the model. Items were considered to lack fit in case P values were < 0.05 and ES statistics were > 0.10 [30]. We first evaluated fit within time points by estimating the unidimensional GPCM 3 times, once for each time point. Subsequently, fit of the total multidimensional model, with item parameters constrained to be equal across time points and which includes the covariance matrix between time points, was evaluated. The Dutch data were evaluated for DIF across age (median split at 58 years) and sex. To this end the baseline model was extended by partitioning the booklets further according to age or sex and DIF was evaluated across two marginal distributions of physical function of males vs. females and younger vs. older patients, respectively. DIF across the marginal distributions was evaluated with an LM test for DIF [31].
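The effect size described above can be sketched as follows. The code splits respondents into three total-score groups, compares the average observed and model-expected item scores within each group, and scales by the maximum attainable item score. Averaging the absolute group differences into one number is an assumption made for illustration, as is reading the flagging rule as P < 0.05 combined with ES > 0.10; the LM statistic itself is not reproduced here.

```python
import numpy as np


def item_effect_size(observed, expected, total_scores, max_score, n_groups=3):
    """Mean absolute observed-minus-expected difference across score groups.

    observed / expected: per-respondent item score and model expectation.
    total_scores: per-respondent total score used to form the groups.
    ASSUMPTION: group differences are combined by a simple average.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    order = np.argsort(np.asarray(total_scores), kind="stable")
    groups = np.array_split(order, n_groups)  # roughly equal-sized groups
    diffs = [
        abs(observed[g].mean() - expected[g].mean()) / max_score
        for g in groups
    ]
    return float(np.mean(diffs))


def lacks_fit(p_value, effect_size, alpha=0.05, es_cutoff=0.10):
    """Flag an item only when the test is significant AND the ES is large."""
    return p_value < alpha and effect_size > es_cutoff
```

Requiring both conditions keeps statistically significant but practically trivial deviations from being flagged, which matters when many item-level tests are run.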
Cross-cultural equivalence with the original US data was investigated first using the wave 1 general population data. The analysis was subsequently repeated on the independent subset of 25 items administered to the US RA patients [19]. US item parameters were obtained from a unidimensional GPCM and analysis of cross-cultural DIF was again performed with the regression analysis method outlined above [28]. To examine the impact of any observed DIF, US and Dutch baseline data were jointly modeled in a unidimensional GPCM with country-specific item parameters for those items flagged for cross-cultural DIF. The resulting EAP estimates were compared to those emanating from a model without country-specific item parameters. In both models, the mean was set to zero for US respondents (SD = 1). The agreement between the resulting latent EAP estimates was evaluated by calculating intraclass correlation coefficients (ICCs, model A,1) and the limits of agreement according to the Bland-Altman method [32]. Two independent data sets of US RA patients were available. The first sample (Stanford sample) contained 14 items administered to 273 patients and the second sample (Polimetrix sample) contained 10 items administered to 280 patients. To evaluate RA-related DIF, the baseline model of Dutch RA patients was extended to incorporate these data. DIF was subsequently evaluated across three marginal distributions (Dutch, Stanford, and Polimetrix) using the LM test approach outlined above.
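The two agreement statistics mentioned above can be sketched for a pair of EAP estimate vectors, one from each model. The ICC follows the usual two-way, absolute-agreement, single-measure definition (model A,1), and the limits of agreement use the conventional mean difference plus or minus 1.96 standard deviations. Function and variable names are illustrative.

```python
import numpy as np


def icc_a1(x, y):
    """Two-way absolute-agreement, single-measure ICC for two score vectors."""
    data = np.column_stack([x, y]).astype(float)
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()  # between persons
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()  # between models
    ss_error = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return ms_rows - ms_error, ms_rows + (k - 1) * ms_error + k / n * (ms_cols - ms_error)


def icc_value(x, y):
    """Convenience wrapper returning the ICC(A,1) as a single float."""
    numerator, denominator = icc_a1(x, y)
    return float(numerator / denominator)


def limits_of_agreement(x, y):
    """Bland-Altman 95% limits of agreement for paired estimates."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)
    return float(mean_diff - 1.96 * sd_diff), float(mean_diff + 1.96 * sd_diff)
```

Unlike a Pearson correlation, the absolute-agreement ICC is lowered by a systematic shift between the two sets of estimates, so a high value indicates that the estimates are interchangeable and not merely similarly ordered.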

Participant characteristics
Baseline data of 690 Dutch RA patients were available for analysis (Table 1). Of these, 489 and 311 patients completed booklets at T2 and T3, respectively. Average time between participations was 6.0 months (SD = 2.5) for T1 to T2 and 4. months (SD = 1.8) for T2 to T3. On average, Dutch patients had relatively low disease activity and high levels of physical function at baseline. Whereas the US general population and the combined RA samples had a balanced sex distribution, 64% of the Dutch RA patients were female, reflecting the greater prevalence of RA among women. The average level of physical function of US general population respondents was higher than that of Dutch RA patients according to the HAQ-DI and the average age of US general population respondents was lower.
Table 2 presents an overview of the LM tests and the average observed and average expected item scores across three total score level groups for the PROMIS PF items administered in the odd booklets at T1 (see Figure 1). Results were similar for the even booklets and the other time points. The items are organized according to the point on the latent scale where they provide their optimum information, as an indication of the relative difficulty of the activities they refer to. As expected, the 'easier' items referred to simple activities of daily living, such as eating or getting up from a chair, while items involving increasingly higher levels of cardiopulmonary function were clustered around the higher end of the latent metric. For most items, average observed scores were quite high considering the 1-5 rating scale of the PROMIS items, reflecting the relatively high level of physical function of the sample. Item scores expected by the IRT model tended to be close to the observed item scores across total score groups, leading to an acceptable average ES of 0.01 for time point 1.

Evaluation of the longitudinal IRT model in the Dutch data
The number of items exhibiting lack of fit was very low for all three time points. For T1, T2 and T3 respectively, only 14 (3%), 12 (3%) and 5 (1%) items demonstrated misfit according to the LM test. Moreover, ESs exceeded 0.10 only for two items, both at T3 (PFA9, ES = 0.10 and PFA15, ES = 0.11). The item parameters were stable over time, with all correlations between threshold parameters at different time points exceeding 0.90 and all of the 99% confidence intervals intersecting the regression line in the three univariate regression analyses.
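The regression check on threshold parameters can be illustrated with a rough sketch. It fits a straight line relating the threshold estimates from two calibrations and flags a threshold when its 99% confidence interval (estimate plus or minus 2.576 standard errors) does not contain the fitted line. This is a simplified reading of the procedure for illustration only; the exact formulation in the cited method [28], including how thresholds are aggregated to items, may differ, and all names are hypothetical.

```python
import numpy as np

Z_99 = 2.576  # two-sided 99% normal quantile


def flag_regression_dif(reference_thresholds, focal_thresholds, focal_se):
    """Flag thresholds whose 99% CI misses the fitted regression line.

    reference_thresholds: estimates from one calibration (predictor).
    focal_thresholds: estimates from the other calibration (outcome).
    focal_se: standard errors of the focal estimates.
    """
    x = np.asarray(reference_thresholds, dtype=float)
    y = np.asarray(focal_thresholds, dtype=float)
    se = np.asarray(focal_se, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # ordinary least squares line
    predicted = intercept + slope * x
    flagged = np.abs(y - predicted) > Z_99 * se
    return flagged, float(slope), float(intercept)
```

Regressing one set of estimates on the other, instead of comparing them directly, absorbs differences in the origin and unit of the two latent scales, so only deviations from the overall linear relation count as DIF.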
In the subsequent evaluation of the longitudinal (multidimensional) model, 6.3% of item level fit statistics showed lack of fit to the model, which corresponds approximately to the level of significant item tests expected based on chance. None of the items showed lack of fit in both or, in case of the HAQ-DI and PF-10 items, all booklets that it was included in, nor did any item show misfit across time points. The multidimensional IRT model provides estimates of the correlation of PF over the three different time points. The correlation between latent PF levels across the three time points ranged from 0.73 between T1 and T3 to 0.87 between T1 and T2, indicating that physical function levels were quite stable over time. The overall conclusion was that model fit was acceptable.

DIF across age and sex
Seven items demonstrated DIF for sex, while five items showed DIF for age in the Dutch RA sample at baseline (Table 3). For all items flagged for sex DIF, men reported slightly higher scores than expected by the IRT model, whereas women reported lower scores than expected, indicating that the activities were easier for male RA patients. Likewise, all items flagged for age DIF, except item PFA53 ('Are you able to run errands and shop?') were more easily endorsed by younger rather than older patients.

Equivalence with PROMIS wave 1 data
To evaluate measurement equivalence, US item parameters were obtained and compared with the Dutch item parameters using the regression analysis approach. Twenty-five items showed at least some level of uniform DIF in the regression analysis. For 11 of these items, Dutch patients were more likely to endorse lower response options according to the item response curves, indicating that these activities were relatively more difficult for them compared to the US general population. All these items involved the use of the hand or arms (see Table 2). Twelve items were more difficult for US respondents, of which five involved climbing stairs. Consequently, all items referring to climbing stairs were more precise at lower levels of overall physical function in the Dutch RA patients, whereas items involving dexterity tended to have better measurement precision at higher levels of function, as illustrated by two typical item information curves in Figure 2.
In the analysis of cross-cultural DIF in Dutch and US RA patients, the mean was set to zero for the Polimetrix sample and the latent means of the Dutch and Stanford sample were respectively −0.07 and 0.09, indicating that physical function

Impact of cross-cultural DIF
In the joint calibration of the Dutch RA data and the US general population data, with country-specific item parameters for the 25 DIF items, the mean of the latent physical function scores was set to 0 (SD = 1) for the US sample and the mean for Dutch RA patients was −1.18 (SD = 1.21), illustrating the considerably lower level of physical function of the Dutch RA patients. This estimate was very close to that observed in the original model without country-specific item parameters (M = −1.01, SD = 1.08), suggesting that the observed item DIF had little influence on the average total estimate obtained from all administered items. Moreover, agreement between total estimates was high (ICC = 0.99) and the limits of agreement were narrow, ranging from −0.23 to 0.25 in the Dutch data and from −0.20 to 0.18 in the US data.

Discussion
This study presents the preliminary calibration and cross-cultural evaluation of the Dutch-Flemish translation of the PROMIS physical function (PF) item bank for Dutch patients with RA. The findings of the study indicate that the PROMIS PF item bank is a promising tool for applications such as CAT and tailored short forms in RA patients. However, some concerns remain regarding its cross-cultural measurement equivalence. Using the US-based standardized PROMIS calibration and metric requires further study.
The first principal finding of the current study was that the item bank could be successfully calibrated in a sample of Dutch patients with RA using an appropriate IRT model. To our knowledge, this is the first study to actually demonstrate that the full PROMIS PF item bank can be fitted to an appropriate IRT model in an RA sample. Therefore the current study provides support for the validity of applications of the item bank that require invariant estimates of the item and person parameters, such as CAT or short forms using a metric specific to Dutch patients with RA.
As a general rule, the stability of item parameters increases with more data. In that sense, the item parameters obtained in the current study should be considered preliminary and data that will be collected in future studies with the item bank in Dutch RA patients can be used to update the calibrations. Several ongoing studies in the Netherlands are evaluating the item bank in other patient groups. Future studies should evaluate the equivalence of the resulting item parameters across conditions to evaluate whether a common Dutch metric can be created.
The second principal finding of the study was that 25 of the PROMIS items (20%) showed substantial cross-cultural uniform DIF. The relatively high number of DIF items was not unexpected given that many items assess similar content (e.g., climbing stairs). Moreover, similar percentages of items with cross-cultural DIF are generally identified in scales with fewer items [30,33]. Interestingly, all the PROMIS physical function item bank items that involve climbing stairs were more difficult for the US general population sample, compared to Dutch RA patients. This replicates findings in an earlier study we performed on the cross-cultural equivalence of HAQ-II in US and Dutch RA patients [30]. One speculative explanation for this repeated finding could be that Europeans are more accustomed to climbing stairs, since stairs are more prevalent in Europe, both in domestic and communal settings. However, the US and Dutch samples might also have differed on key variables that might explain the observed DIF. For example, body mass index has been linked to stair climbing in previous studies [34]. It would be interesting for future studies to evaluate the presence of body mass index related DIF in the PROMIS physical function items. By contrast, most items that were found to be more difficult for Dutch RA patients refer to activities involving the hands or the arms. This was not a surprising result, considering that disability of particularly the hands is a well-known clinical feature in RA. In fact, we had anticipated finding more DIF items between RA and the general population sample for items measuring dexterity. However, it should be noted that the DREAM registry includes patients upon diagnosis with very early RA and these patients are treated aggressively. This is reflected in the average level of disease activity being below the commonly used DAS28 remission criterion of 2.6 and the low levels of disability observed, compared with international benchmarks in RA [35,36]. Therefore, typical manifestations of RA-related disability may have been absent for many patients in the current study. Moreover, all items with collapsed response options involved measuring disability of the hands and these items showed severe distributional problems, even in the Dutch RA data with very few patients endorsing the lower response options. These two factors limit the sensitivity of the analyses with respect to RA-related DIF, and therefore studies in RA populations with more pronounced disease are desirable.
Table 3. Differential item functioning (DIF) across age and sex in the Dutch RA sample.
The results of the DIF analyses suggest that the Dutch RA data are not strictly equivalent to US general population data at the level of individual items, which was also observed in a previous study evaluating a Spanish language version of the item bank [37]. A limitation of the study design is that it cannot be definitively concluded whether observed differences in response probabilities conditional on overall level of function occurred because of disease characteristics or cross-cultural differences, since not all items were administered to US RA patients and no general population Dutch data are yet available. However, previous studies have generally shown European versions of physical function instruments to be equivalent to US versions in arthritis populations [30,33], while substantial DIF has been observed across rheumatic conditions in one previous study [38]. It also seems unlikely that observed DIF occurred as a result of translation errors, given the rigorous approach in translating and that all items refer to everyday activities that are very common in both the US and the Netherlands. For these reasons, more studies are needed before firm conclusions regarding the measurement equivalence can be made. If such studies consistently identify certain items to exhibit DIF, their item parameters can still be expressed on a common metric by assigning group-specific item parameters to biased items. This allows cross-cultural comparison even in the presence of significantly biased items and physical function levels to be expressed on the PROMIS standardized metric if this is desired. In the meantime, we recommend that those interested in expressing physical function levels of Dutch RA patients on the PROMIS standardized metric select only items that were not flagged for DIF in the current study.
In the analysis of impact of DIF on total EAP estimates of physical function, we observed that biased items appeared to have a negligible influence on total physical function estimates from all items that were administered to patients at baseline. It should be stressed, though, that patients were administered between 48 and 72 items, which is likely to be greater than the number of items that will be administered in practical applications of the item bank. In a recent validation study of a PROMIS PF CAT only four items were administered on average to obtain physical function estimates [39]. The impact of DIF on physical function estimates is likely to be greater in such situations, provided that the item characteristics of biased items make them likely to be selected in such an application. Future studies should further evaluate the impact on physical function estimates in situations where fewer items are administered.
In the current study we used different methods to identify DIF. Whenever possible, DIF was evaluated using LM statistics. An advantage of this method is that violations of model assumptions can be investigated within a framework that directly pertains to the observed scores. As a result, the magnitude and direction of DIF can also be directly inferred from a weighted difference between average observed and average expected scores. In the regression analysis the direction of DIF had to be inferred indirectly by inspecting the response curves and item information functions visually. A limitation of the DIF analysis is therefore that no qualifications regarding the magnitude of DIF could be given in the current study of equivalence with the PROMIS wave 1 general population data. The reason we resorted to the regression analysis in the analysis of cross-cultural equivalence was that the US general population data suffered from severe ceiling effects, with the majority of respondents endorsing the higher response options. Consequently, insufficient variability was present within total score level groups for the LM test to produce interpretable results. For this reason also, no indication of model fit could be given for the US data. The longitudinal DIF analysis could not be performed with the LM test since the test compares scores on individual items between two groups, but in the longitudinal design, each item was presented to each patient only once.
In summary, the results of this study show that the PROMIS physical function item bank could be fitted to an IRT model that assumes physical function to be a unidimensional trait. However, a substantial number of its items showed statistically significant DIF compared to the US general population wave 1 data. Although the impact of observed DIF on physical function estimates was minimal in this study, more studies are needed to evaluate the validity of the PROMIS standardized metric in RA patients in the Netherlands.