A Multidimensional Computerized Adaptive Short-Form Quality of Life Questionnaire Developed and Validated for Multiple Sclerosis

Supplemental Digital Content is available in the text


INTRODUCTION
Health-related quality of life (QoL) measurements are increasingly considered important for evaluating disease progression, treatment options, and the management of care provided to patients with chronic diseases. 1,2 Self-reported questionnaires are traditionally used to measure QoL, but they are often considered too lengthy by patients and professionals. 3 The time and resources necessary for the completion of questionnaires are constraints on professionals whose main role is providing patient care. 4 Additionally, questionnaires should be as brief as possible because of the difficulties of fatigue and concentration in some clinical populations (e.g., patients with multiple sclerosis [MS], schizophrenia). Providing shorter questionnaires in QoL measures may be useful for clinical practice. 5 Short-form instruments are usually fixed-length (i.e., the same items are proposed to all patients) and adapted from a long-form instrument by reducing the number of questions based on classical and item response theories (IRTs). However, these fixed-length short-form instruments have drawbacks (e.g., the reduction of questions brings a risk of losing important information that can result in a decline of measurement precision). 6,7 Additionally, because some items are not tailored to patients, the precision of the QoL measure is not maximized, and patients may feel a lack of interest in the QoL measure and stop completing the questionnaire.
Interestingly, methods based on IRT models, currently used in the development of unidimensional item banks and computerized adaptive testing (CAT), can be adapted to overcome the problems faced by the development of fixed-length short-form questionnaires. 8,9 Indeed, CAT allows for the administration of only the items that will offer the most relevance for a given individual, reducing the length of the questionnaire and the completion time in addition to maintaining the test's precision. [10][11][12] Additionally, multidimensional CAT (MCAT) based on multidimensional IRT (MIRT) has been recently applied to measure health problems in various chronic diseases (e.g., symptomatology, fatigue, physical, and emotional functioning). [13][14][15][16][17][18] Because of the multidimensional nature of QoL, this method seems relevant in developing a valid and reliable adaptive short-form QoL questionnaire. 14 Currently, MCATs applied to shorten fixed-length available QoL questionnaires are scarce. 14,19 The aim of this study was to develop a multidimensional computerized adaptive short-form questionnaire (MCAT) from a fixed-length available QoL questionnaire for patients with a chronic disease marked by the difficulties of fatigue and concentration, MS. Our study focused on the multiple sclerosis international quality of life questionnaire (MusiQoL), which is a widely used QoL questionnaire in MS. 20 Compared to other MS questionnaires, this instrument has 3 important characteristics: specifically reflecting the perspective of patients with MS on the impact of the disease on their daily life; anchored in an explicit conceptual approach; 21 and developed and available in multiple languages and psychometrically validated to appropriate standards.

Questionnaire
The MusiQoL questionnaire is an MS-specific, self-administered, and multidimensional QoL instrument. 20 It comprises 31 items describing 9 dimensions. Each dimension is named according to its constitutive items as follows: activities of daily living (ADL, 8 items), psychological well-being (PWB, 4 items), symptoms (SYMP, 4 items), relationships with friends (RFR, 3 items), relationships with family (RFA, 3 items), relationships with healthcare system (RHCS, i.e., satisfaction with healthcare; 3 items), sentimental and sexual life (SSL, 2 items), coping (COP, 2 items), and rejection (REJ, 2 items). Each item is scored on a 6-point Likert scale, in which a score of 1 represents never/not at all, 2 represents rarely/a little, 3 represents sometimes/somewhat, 4 represents often/a lot, 5 represents always/very much, and 6 represents not applicable. For each individual, the score on each dimension is obtained by computing the mean of the item scores for that dimension. All dimension scores are linearly transformed to a 0 to 100 scale. A global index score is computed as the mean of the dimension scores. Higher scores indicate a higher level of QoL.
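The scoring rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the official MusiQoL scoring algorithm: it assumes that "not applicable" answers (6) are treated as missing, that the 1 to 5 mean is rescaled as (mean - 1) / 4 x 100, and it ignores any reverse-scoring of negatively worded items, which the text does not describe. The function names and example answers are hypothetical.

```python
from statistics import mean

NOT_APPLICABLE = 6  # assumption: treated as missing (not stated in the text)

def dimension_score(item_scores):
    """Mean of the valid 1-5 item scores, rescaled linearly to 0-100."""
    valid = [s for s in item_scores if s is not None and s != NOT_APPLICABLE]
    if not valid:
        return None  # no usable answer for this dimension
    return (mean(valid) - 1) / 4 * 100

def global_index(dimension_scores):
    """Mean of the available dimension scores."""
    valid = [d for d in dimension_scores.values() if d is not None]
    return mean(valid) if valid else None

# Hypothetical answers for two of the nine dimensions.
answers = {"PWB": [4, 5, 3, 4], "COP": [2, 6]}
scores = {name: dimension_score(items) for name, items in answers.items()}
print(scores, global_index(scores))
```

Returning `None` for a dimension with no valid answer keeps missing dimensions out of the global index instead of counting them as zero.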

Study Design and Setting
Data from an international, multicenter, and cross-sectional MusiQoL validation study were used. 20 Patients were recruited between January 2004 and February 2005 at neurological departments in 15 countries: Argentina, Canada, France, Germany, Greece, Israel, Italy, Lebanon, Norway, Russia, South Africa, Spain, Turkey, UK, and USA. This study was performed in accordance with the Declaration of Helsinki and all applicable regulatory authority requirements and national laws (Institutional Review Boards or Independent Ethics Committees in accordance with the local requirements of each of the 15 countries). Written informed consent from patients was obtained before any study procedures were performed.

Population
The inclusion criteria were a diagnosis of MS according to the McDonald criteria, 22 being treated as an in- or outpatient at a hospital, being over 18 years of age, providing informed consent to participate, and being a native speaker of the local language. The main exclusion criteria were a neurologic diagnosis other than MS, dementia, ongoing severe relapse, an inability to complete the questionnaire unassisted, and withdrawal of consent.

Data Collection
In addition to the MusiQoL questionnaire, the following data were collected: (1) Socio-demographic information: age (years); gender (male, female); educational level (less than 12 years, greater than 12 years); marital status (single, not single); and employment status (active, unemployed).

MCAT Procedure and Analyses
This procedure was divided into 3 phases: MIRT analysis; MCAT simulations with analyses of accuracy and precision; and clinical validity of the MusiQoL-MCAT.

Multidimensional Item Response Theory Analysis
Percentages of missing values were computed for each item. In accordance with the steps taken previously to validate the MusiQoL, 20 a between-items MIRT model was calibrated. We tested 2 flexible IRT models that allow for the consideration of items with various numbers of categories and various difficulty thresholds: the multidimensional graded response model (MRGM) 26 and the multidimensional generalized partial credit model. 27 The MRGM was retained because it yielded a better fit than the multidimensional generalized partial credit model according to the Akaike information criterion and the Bayesian information criterion. We also compared 2 versions of the IRT model, one fitted to the data with missing values and one fitted to imputed data. For the model with imputed data, we used multiple data imputation because we considered the data missing not at random, following previous works on QoL. [28][29][30] The model with missing data was retained because it yielded a better fit in terms of the Akaike information criterion (145,922 vs 153,334) and the Bayesian information criterion (146,974 vs 154,359).
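As a reminder of how the two criteria work, the sketch below uses the usual definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood; lower values indicate a better trade-off between fit and complexity. The helper functions are illustrative, and only the four criterion values are taken from the text.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Criterion values reported in the text for the two candidate models.
reported = {
    "missing data": {"AIC": 145922, "BIC": 146974},
    "imputed data": {"AIC": 153334, "BIC": 154359},
}
best_by_aic = min(reported, key=lambda m: reported[m]["AIC"])
best_by_bic = min(reported, key=lambda m: reported[m]["BIC"])
print(best_by_aic, best_by_bic)
```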
Item parameters were thus estimated using the MRGM with unconditional maximum likelihood (ML) estimation, as implemented in the R package mirt. 31 We used the Metropolis-Hastings Robbins-Monro 32 method as an estimation algorithm because it provides better precision than a classical expectation-maximization algorithm approach 33 in the presence of more than 3 factors.
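The MRGM consists of 2 multidimensional sequential 2-parameter logistic models and, in its standard form, is defined as follows:

$$P(x_{ij} = k \mid \theta_i) = \frac{1}{1 + \exp\left[-\left(\sum_{d} a_{jd}\,\theta_{id} - b_{jk}\right)\right]} - \frac{1}{1 + \exp\left[-\left(\sum_{d} a_{jd}\,\theta_{id} - b_{j,k+1}\right)\right]}$$

where i is the ith individual, j the jth item, $x_{ij}$ the ordinal response taking the value $k \in \{1, \ldots, K\}$, $a_{jd}$ the item discrimination parameter according to dimension d, $\theta_{id}$ the individual parameter according to dimension d, and $b_{jk}$ the kth item difficulty threshold parameter. By convention, the first term equals 1 for k = 1 and the second term equals 0 for k = K.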
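To make the structure of a graded response model concrete, the following sketch computes the category probabilities of a single item as differences between successive cumulative 2-parameter logistic curves. It assumes the logit form a·θ - b_k with K - 1 increasing thresholds; this parameterization, the function name, and the example parameters are illustrative assumptions and are not taken from the mirt implementation.

```python
import math

def grm_category_probs(theta, a, b):
    """Category probabilities for one item under a graded response model.

    theta: latent trait values, one per dimension
    a:     discrimination parameters, one per dimension
    b:     increasing thresholds (K - 1 values for K categories)
    """
    z = sum(a_d * t_d for a_d, t_d in zip(a, theta))
    # Cumulative probabilities P(x >= k), with fixed end points 1 and 0.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-(z - b_k))) for b_k in b] + [0.0]
    # Each category probability is the difference of two adjacent curves.
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

# Hypothetical 5-category item loading on a single dimension.
probs = grm_category_probs(theta=[0.5], a=[1.2], b=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

In a between-items model, the discrimination vector of each item has a single nonzero entry, so the sum over dimensions reduces to one term.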
Bayesian maximum a posteriori (MAP) estimates 8 of person-specific parameters (i.e., latent trait estimates) were computed using the MRGM parameters and the 31 item responses, providing IRT dimension scores for each patient. In IRT, item information is a function of the item parameters (i.e., the discrimination and difficulty threshold parameters). An item with more information is more discriminant and provides a lower error of measurement. The test information is the sum of all item information. The contribution of each item to the total test information (also called the amount of test information) was calculated.
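The link between item parameters, item information, and test information can be illustrated with a unidimensional sketch. It uses the standard Fisher information of a graded response item, the sum over categories of (dP_k/dθ)² / P_k, with the logit form a·θ - b_k; the two items and their parameters are hypothetical and only serve to show that a more discriminating item contributes a larger share of the test information.

```python
import math

def item_information(theta, a, b):
    """Fisher information of one graded-response item at trait level theta."""
    # Cumulative probabilities P(x >= k), with fixed end points 1 and 0.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-(a * theta - b_k))) for b_k in b] + [0.0]
    info = 0.0
    for k in range(len(b) + 1):
        p_k = cum[k] - cum[k + 1]
        # Derivative of the category probability with respect to theta.
        dp_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        info += dp_k ** 2 / p_k
    return info

# Hypothetical items: (discrimination, thresholds).
items = {"item_A": (2.0, [-1.0, 0.0, 1.0]), "item_B": (0.8, [-1.0, 0.0, 1.0])}
infos = {name: item_information(0.0, a, b) for name, (a, b) in items.items()}
test_information = sum(infos.values())  # test information = sum over items
shares = {name: value / test_information for name, value in infos.items()}
print(shares)
```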
The unidimensionality of each dimension was assessed using a Rasch analysis. The goodness-of-fit statistics (inlier-sensitive fit, ranging between 0.7 and 1.3) ensured that all items of the scale measured the same concept. 34 Differential item functioning (DIF) analyses were performed to compare item behavior among countries and to determine whether all items behaved the same way. 35 The DIF indicates whether an item performs and measures differently for 1 subgroup of a population compared with another.

MCAT Simulations With Analyses of Accuracy and Precision
We used a post-hoc (real-data) simulation approach (i.e., complete response patterns to the 31 items of the MusiQoL were used to simulate the conditions of an MCAT assessment). The algorithm of the MCAT was based on Mulder and van der Linden's work for Kullback-Leibler Information Item Selection. 36 Initially, the person-specific parameter estimate was set to the IRT dimension population mean scores. As the starting item, we used the item with the highest amount of test information. Item selection depended on responses to earlier items in the questionnaire taken from the empirical data. At each step of item selection, the Bayesian MAP procedure estimated the latent trait level that maximized the posterior distribution based on the current likelihood of the data and the assumed prior distribution. As a stopping criterion, we examined 4 initial simulations based on a fixed number of items (5, 10, 15, and 20).
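The logic of a post-hoc simulation can be sketched in a deliberately simplified form. The example below is not the procedure used in the study: it is unidimensional, it selects items by maximum Fisher information rather than by Kullback-Leibler information, and it finds the MAP estimate by a grid search under a standard normal prior. The item bank, the response pattern, and all function names are hypothetical.

```python
import math

def cumulative(theta, a, b):
    """P(x >= k) for a graded-response item, with end points 1 and 0."""
    return [1.0] + [1.0 / (1.0 + math.exp(-(a * theta - b_k))) for b_k in b] + [0.0]

def category_probs(theta, a, b):
    cum = cumulative(theta, a, b)
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

def item_information(theta, a, b):
    cum = cumulative(theta, a, b)
    return sum(
        (a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))) ** 2
        / (cum[k] - cum[k + 1])
        for k in range(len(b) + 1)
    )

GRID = [g / 10 for g in range(-40, 41)]  # candidate trait values from -4 to 4

def map_estimate(bank, answered):
    """Grid-search MAP estimate under a standard normal prior."""
    def log_posterior(theta):
        log_prior = -0.5 * theta ** 2
        log_lik = sum(math.log(category_probs(theta, *bank[j])[x])
                      for j, x in answered.items())
        return log_prior + log_lik
    return max(GRID, key=log_posterior)

def simulate_cat(bank, full_responses, n_items):
    """Replay one complete response pattern as a fixed-length adaptive test."""
    theta, answered = 0.0, {}  # start at the assumed population mean
    for _ in range(n_items):
        remaining = [j for j in range(len(bank)) if j not in answered]
        nxt = max(remaining, key=lambda j: item_information(theta, *bank[j]))
        answered[nxt] = full_responses[nxt]  # answer read from the stored data
        theta = map_estimate(bank, answered)
    return theta, list(answered)

# Hypothetical bank: (discrimination, thresholds); responses coded 0..4.
bank = [(1.8, [-1.5, -0.5, 0.5, 1.5]), (0.6, [-1.0, 0.0, 1.0, 2.0]),
        (1.2, [-2.0, -1.0, 0.0, 1.0]), (2.2, [-1.0, -0.3, 0.4, 1.2])]
responses = [3, 2, 4, 3]
theta_hat, order = simulate_cat(bank, responses, n_items=3)
print(theta_hat, order)
```

Because the answers are looked up in an already completed questionnaire, the simulation shows which items an adaptive test would have administered and what score it would have produced, without collecting new data.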
For each simulation, MCAT dimension scores were calculated, and accuracy and precision were then assessed. Accuracy was assessed using the level of correlation between the MCAT and the IRT dimension scores based on the full set of items (levels of correlation >0.9 were expected for each dimension). Precision was assessed using 2 indicators: the standard error of measurement (SEM) and the root mean square error (RMSE). The SEMs of the MCAT dimension scores are considered indicators of reliability. According to Harvill's work, 37 there is a direct relationship between the reliability of a dimension and the SEM; lower reliability estimates provide higher SEM estimates. An acceptable range was defined as <0.55 to ensure a satisfactory reliability level (reliability >0.70). The RMSE shows how precise the MCAT dimension scores are relative to the IRT scores from the full item set. The RMSE is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\theta}_i - \theta_i\right)^2}$$

where $\theta_i$ is the IRT score from the full item set of the ith individual, $\hat{\theta}_i$ is the MCAT score, and N is the number of individuals; smaller values of RMSE represent better measurement precision. RMSE values lower than or equal to 0.3 indicate excellent measurement precision. 38 According to the accuracy/precision of the first 4 simulations, other simulations were tested to determine the best MCAT version. The final version of the MusiQoL-MCAT was selected considering the lowest number of items matching with the most satisfactory level of accuracy and precision. The item exposure (i.e., the number of times each item was exposed during the CAT procedure) was described for this version.
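The two precision indicators can be illustrated with a short sketch. The RMSE function follows the definition given above. The SEM function uses the classical test theory relation SEM = SD x sqrt(1 - reliability), and the default standard deviation of 1 is an assumption corresponding to a standardized latent trait scale; under that assumption, a reliability of 0.70 corresponds to the 0.55 threshold quoted in the text. The function names are hypothetical.

```python
import math

def rmse(full_scores, cat_scores):
    """Root mean square error between full-test and adaptive scores."""
    n = len(full_scores)
    return math.sqrt(sum((c - f) ** 2 for f, c in zip(full_scores, cat_scores)) / n)

def sem_from_reliability(reliability, sd=1.0):
    """Standard error of measurement implied by a reliability coefficient."""
    return sd * math.sqrt(1.0 - reliability)

print(round(sem_from_reliability(0.70), 2))  # 0.55
```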

Validity of the MusiQoL-MCAT
To assess the validity of the selected MusiQoL-MCAT, we explored both convergent and discriminant validity. To explore the convergent validity, Pearson correlation coefficients were used to investigate the relationships between the dimensions of the MusiQoL-MCAT and the dimensions of the generic QoL questionnaire (i.e., SF-36). In accordance with the assumptions from the initial validation of the MusiQoL, 20 we hypothesized that the MusiQoL-MCAT scores would be more correlated with scores of dimensions exploring similar aspects from the SF-36 than with those exploring dissimilar aspects. The discriminant validity was determined by exploring the relationships between the MusiQoL-MCAT scores and socio-demographic (i.e., age, gender, educational level, marital status, and employment status) and clinical (i.e., EDSS score and MS subtypes) features using t-tests, ANOVAs, and Pearson correlations. To control the familywise error rates caused by the large number of correlations, we performed multivariate permutation tests. 39,40 Several hypotheses were formulated in accordance with previous studies: the MusiQoL-MCAT should differ according to sociodemographic characteristics (i.e., with younger age, higher educational level, and being in a couple associated with higher QoL); should be negatively correlated with the severity of the disease (i.e., EDSS); and should be lower in patients with the SP form of MS.
All the statistical analyses were performed using R version 2.15.2.

Multidimensional Item Response Theory Analysis
Percentages of missing data, estimated item parameters, information, and inlier-sensitive fit are presented in Table 1, and the IRT score distribution for each dimension is presented in Figure 1. Item 17 from the RFR dimension ("have you felt understood by your friends?") provided the greatest amount of information, and item 16 from the SYMP dimension ("have you experienced unpleasant feelings: i.e., hot, cold?") provided the least amount of information. No substantial DIF between countries was evidenced for any dimension, supporting the use of this MCAT in international studies.

Analyses of Accuracy and Precision
Real-data simulations were performed on 922 patients with complete response patterns to the 31 items of the MusiQoL. Accuracy and precision indicators of each simulation are presented in Table 2.
The number of dimensions with satisfactory accuracy (i.e., correlation >0.9) increased when simulations included a higher number of items (from 3 of the 9 dimensions for the 5-item simulation to 8 of the 9 dimensions for the 20-item simulation). The relationships with healthcare system dimension remained unsatisfactory regardless of the number of items in the simulation.
As with accuracy, the 2 indicators of precision were better when simulations included a higher number of items. The number of dimensions with satisfactory SEM varied from 3 of the 9 dimensions for the 5-item simulation to 8 of the 9 dimensions for the 20-item simulation, and the number with satisfactory RMSE varied from 2 of the 9 dimensions for the 5-item simulation to 8 of the 9 dimensions for the 20-item simulation. The same dimension (i.e., relationships with the healthcare system) remained unsatisfactory regardless of the number of items in the simulation. As accuracy and precision of the 15- and 20-item simulations were the most satisfactory, 4 supplementary simulations were tested from 16 to 19 items. The 16-item version of the MusiQoL-MCAT was defined as the most satisfactory MCAT simulation because the level of accuracy and precision did not substantially change after 16 items.
Item exposure (i.e., the utilization frequency of an item) of the 16-item version of the MusiQoL-MCAT procedure is presented in Figure 2. Three items from the SYMP and RHCS dimensions were never administered (items 15, 16, and 25), whereas 8 were administered more than 9 times out of 10 (items 1 and 2 from the ADL dimension, item 10 from the PWB dimension, item 14 from the SYMP dimension, items 17 and 19 from the RFR dimension, item 27 from the SSL dimension, and item 28 from the COP dimension).

Validity
Convergent and discriminant validity results were assessed for the 16-item version of the MusiQoL-MCAT and are shown in Table 3. Our findings were consistent with our assumptions. Age was negatively correlated with ADL, SYMP, SSL, and REJ dimension scores. RFR dimension scores were significantly higher in women. Individuals with higher educational levels had significantly better scores, except for the SYMP, RFA, and SSL dimensions. Among single individuals, significantly lower scores were observed on the RFA, RHCS, and SSL dimensions. Unemployed people had significantly lower scores on 5 dimensions (ADL, PWB, SYMP, COP, and REJ) compared to active individuals. Disease duration was negatively correlated with ADL and REJ scores. Significant differences were observed for ADL, RHCS, and REJ dimension scores between the 4 MS subtypes, with the highest scores found in individuals with CIS and the lowest scores found in those with SP.

DISCUSSION
To our knowledge, this study is one of the first investigations to propose a multidimensional computerized adaptive short-form questionnaire from a fixed-length available QoL questionnaire.
First, we demonstrated that the MusiQoL-MCAT had satisfactory precision and accuracy properties. All the MusiQoL-MCAT dimensions had levels of correlation higher than 0.9 with the IRT dimension scores based on the full set of items, SEM lower than 0.55 and RMSE lower than 0.3, except for 1 dimension (i.e., RHCS). However, the RHCS dimension has previously shown unsatisfactory performance, especially in the initial validation procedure. 20 Despite this drawback, the experts and developers of the MusiQoL decided to maintain this dimension due to its specific content concerning the healthcare environment. Additionally, the external validity of the MusiQoL-MCAT was consistent with the external validity of the fixed-length MusiQoL. 20 The MusiQoL-MCAT scores were moderately correlated with the EDSS. These results confirmed that clinical assessments may not adequately reflect patients' perceptions and the impact of their symptoms and that the MusiQoL-MCAT adds important complementary information to traditional clinical measures. The lowest MusiQoL-MCAT scores were reported by patients with the SP form of MS, confirming that it is the most clinically aggressive and severe form of the disease. In this work, few significant differences were reported according to gender, which is consistent with other studies. 41 Higher educational level or being in a couple was associated with higher QoL levels, as previously reported in similar cross-sectional studies. 42 Older age was significantly associated with worse scores in physical dimensions such as ADL and SYMP, consistent with previous findings. 43 As expected, the MusiQoL-MCAT scores were correlated with the scores of similar dimensions from the SF-36: the ADL dimension of the MusiQoL-MCAT with the physical-like dimensions of the SF-36, and the "mental/psychological-like" dimensions of the MusiQoL-MCAT with the "mental/psychological-like" dimensions of the SF-36.
Abbreviations: ADL = activities of daily living, COP = coping, PWB = psychological well-being, R = correlation coefficient with the IRT dimension score (all P values < 0.05), RMSE = root mean square error (smaller values representing better precision), REJ = rejection, RFA = relationships with family, RFR = relationships with friends, RHCS = relationships with healthcare system, Score = MCAT mean score, SEM = standard error of measurement (acceptable range from 0.32 to 0.55), SSL = sentimental and sexual life, SYMP = symptoms.
From a methodological perspective, 4 key issues need to be discussed: the IRT model used; the calculation of the trait estimate after an individual gives a response; the item selection; and the stopping rule. Concerning the first point, 2 types of MIRT models could have been considered: between-items and within-items models. 44 In our study, we used a between-items model (i.e., each item loading on 1 dimension only) in accordance with the steps taken previously to validate the MusiQoL. 20 A within-item multidimensional model (i.e., 1 item loading on several dimensions) could have also been considered, but the goal of this study was not to reexamine the dimensionality of the MusiQoL. Future work should explore this option and determine whether a within-item multidimensional model better fits the data, and whether it can improve the precision and accuracy properties of the MusiQoL-MCAT, especially for the relationships with healthcare system dimension. Second, 2 main approaches are available for ability estimation: ML estimation and Bayesian estimation, including MAP and expected a posteriori (EAP). In our study, we used the Bayesian MAP method to estimate the latent trait level for the initial estimation of IRT scores, for updating the scores during the CAT procedure, and for the final estimation of CAT scores. Although this option might be debatable, Yao 45 has shown that MAP yields better precision than ML and performs similarly to or better than EAP. Moreover, according to Chalmers' findings, 31 using EAP scores for models with more than 3 factors is generally not recommended, as it results in slower estimation and less precision. Therefore, MAP scores should be used instead of EAP scores for higher dimensional models, 31 such as the MusiQoL structure. Third, the choice of the first item and following items is of great importance and depends on the approach taken previously (i.e., ML or Bayesian approach).
In the Bayesian approach, it is recommended to select items with the highest information. 46 For example, Petersen et al 14 compared 2 CAT procedures, the first using the most informative item as the starting item and the second using a less informative item, and reported that administering the least or a moderately informative item first results in a longer test and a less precise measurement. Additionally, the information criterion used for item selection can also be discussed. The Kullback-Leibler information item selection seemed to be the best way to select the items in our CAT procedures. Indeed, in a recent study, Yao 47 compared the Kullback-Leibler method with 4 other methods. In many ways, the Kullback-Leibler method outperformed the other methods, producing the smallest test length, which was an important argument for clinical use of the MusiQoL. Moreover, the Kullback-Leibler information item selection is preferable to the Fisher selection, especially if the number of items used is small, as in our study. 48,49 Fourth, we chose as a stopping criterion a fixed-length rule that was compatible with clinical practice rather than a variable-length rule, which would make the questionnaire too long because of the unsatisfactory properties of the relationships with healthcare system dimension.
The MCAT simulation results indicated that 3 items were never administered (items 15, 16, and 25 from the SYMP and RHCS dimensions). These 3 items were the least discriminating items and provided the least amount of test information. This finding may not be surprising because the RHCS and SYMP dimensions appear to be more influenced by a medical perspective and are further from the patient's point of view than other MusiQoL dimensions. However, other items from these 2 dimensions (i.e., items 13, 14, 23, and 24) were administered, confirming the satisfactory distribution of item exposure rates for each MusiQoL dimension. For this reason, we did not apply strategies for controlling item exposure in the MCAT. 45,50 Last, this study also provides a broader reflection on the development strategy of new QoL measures. Fixed-length self-reported questionnaires are classically used to measure QoL in MS and other chronic diseases. CAT has proven to be efficient compared to these classical questionnaire measurements, offering increased precision and avoidance of noninformative questions. As a consequence, important groundwork has been the development of unidimensional item banks containing a large number of items covering the entire range of a latent trait (e.g., fatigue, pain). 51,52 The construction of a QoL item bank is an important step toward proposing QoL CAT. However, a QoL item bank requires substantial resources and time because several issues remain unresolved: Is it possible to associate several QoL questionnaires based on various theoretical and conceptual backgrounds in the same bank? Can we associate generic and specific questionnaires? Should we associate questionnaires developed from the perspective of the patient and the experts? Additionally, the multidimensional nature of QoL requires that all of the unidimensional attributes of QoL be developed and calibrated before a multidimensional measure can be developed.
All of these issues need to be resolved and therefore delay the development of a large QoL item bank and, thus, a multidimensional QoL CAT based on such a bank. Pending the completion of this major work, and although the number of items is relatively small in QoL questionnaires compared with item banks, the development of MCAT from available QoL questionnaires can be an attractive option based on financial and time resources.

Strengths and Limitations
A limitation in our study is that we used the entire sample only for the MIRT model calibration; MCAT simulations were performed using only the complete response patterns. To overcome this issue, it should be possible to use a well-known data imputation method, such as multiple imputation, and use the imputed dataset for both MIRT model calibration and MCAT simulations. However, using multiple imputation on our dataset for MIRT calibration resulted in a deterioration of the model fit. This result encouraged us to use the raw dataset in this study, given that the sample was large enough to obtain robust results.
Even with the large overall sample size in this study, the representativeness of our sample should be discussed. Compared with the most important longitudinal studies that parallel the present study, our patients were younger or older (others had mean ages of 42, 53 44, 54 and 34 years), 55 had less severe baseline disability statuses (mean EDSS scores of 4.1 53 and 5.1 54 were seen in other studies), and had a sex-ratio of 3:1 (4:1, 53 2:1, 54 and 2.5:1 55 were found in other studies). Future research with different sample characteristics could improve the generalizability and applicability of the MusiQoL-MCAT.
The responsiveness or sensitivity to change was not tested in our study. This property, defined as the ability to detect a meaningful change, is a core psychometric property of measurement instruments. 56 This property is of major interest for the follow-up of patients with MS in clinical practice and for psychosocial research. 57,58 This property should thus be confirmed on the MusiQoL-MCAT in future longitudinal studies.
Despite these limitations, our work has several strengths that should be recognized (e.g., a large sample size and psychometric properties assessed in accordance with international guidelines for developing questionnaires). 14,59 Moreover, it should be noted that these requirements are not systematically met for more "objective" outcome measures used by clinicians and decision makers. 60 Requirements that are too high-level may cause more harm than good, especially by preventing the use and diffusion of current QoL measures. In this sense, this new multidimensional computerized adaptive short-form questionnaire has satisfactory properties and can be considered an interesting option for promoting both the use and usefulness of measuring QoL in MS clinical practice.

CONCLUSION
The MusiQoL-MCAT presents satisfactory properties and can individually tailor QoL assessment to each patient, making QoL assessment less burdensome to patients with multiple sclerosis and better adapted for use in clinical practice. As the construction of QoL item banks requires substantial resources and time, the development of MCAT from available QoL questionnaires using relevant methodology can be an attractive option based on financial and time resources.