Electronic Quality of Life Assessment Using Computer-Adaptive Testing

Background: Quality of life (QoL) questionnaires are desirable for clinical practice but can be time-consuming to administer and interpret, making their widespread adoption difficult.

Objective: Our aim was to assess the performance of the World Health Organization Quality of Life (WHOQOL)-100 questionnaire as four item banks to facilitate adaptive testing using simulated computer adaptive tests (CATs) for physical, psychological, social, and environmental QoL.

Methods: We used data from the UK WHOQOL-100 questionnaire (N=320) to calibrate item banks using item response theory, which included psychometric assessments of differential item functioning, local dependency, unidimensionality, and reliability. We simulated CATs to assess the number of items administered before prespecified levels of reliability were met.

Results: The item banks (40 items) all displayed good model fit (P>.01) and were unidimensional (fewer than 5% of t tests significant), reliable (Person Separation Index>.70), and free from differential item functioning (no significant analysis of variance interaction) or local dependency (residual correlations <+.20). When matched for reliability, the item banks were between 45% and 75% shorter than paper-based WHOQOL measures. Across the four domains, a high standard of reliability (alpha>.90) could be attained with a median of 9 items.

Conclusions: Using CAT, simulated assessments were as reliable as paper-based forms of the WHOQOL with a fraction of the number of items. These properties suggest that the item banks are suitable for computerized adaptive assessment and have the potential for international development using existing alternative-language versions of the WHOQOL items.


Rasch analysis
The Rasch model is closely related to parametric item response theory models. It is considered to be the 'practical realisation' (9 p.237) of Luce and Tukey's 10 additive conjoint measurement, allowing the social sciences, which must deal with latent traits and other phenomena that are not directly observable, to confirm the construction of fundamental measurement for latent phenomena 11. Therefore, when scale data are successfully fitted to the Rasch model, they can be said to be capable of linear, unidimensional measurement. We selected the Rasch model because it is sample-distribution independent, so maintaining specific objectivity (allowing comparisons to be made between individuals independently of the instruments used) 45. In our experience, fitting scale data to the Rasch model has produced efficient (fewer items with greater reliability) and precise (reduced measurement error) paper-based questionnaire measures in diverse areas of the health sciences 46-49. The Rasch model has previously been used to successfully develop item banks for fatigue 50, depression 51 and pain 52.
Rasch analysis follows an iterative process of scale evaluation, modification and re-assessment. The cardinal criterion is scale fit to the Rasch model, indicated by a non-significant chi-square interaction between the model and the data (chi-square probability > 0.01). If the scale data do not fit the Rasch model, it is necessary to establish the reasons for the misfit. Indicators that may identify the reasons for misfit include category threshold ordering, item fit residuals, differential item functioning, local dependency and dimensionality. A brief explanation of each indicator is given below; further information is available elsewhere 12.
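For reference, the dichotomous form of the Rasch model expresses the probability that person $v$ endorses item $i$ as a function only of the person's trait level $\theta_v$ and the item's difficulty $b_i$; the polytomous items analysed here follow the same logic with additional category threshold parameters:

```latex
P(X_{vi} = 1 \mid \theta_v, b_i) = \frac{\exp(\theta_v - b_i)}{1 + \exp(\theta_v - b_i)}
```

Because the person and item parameters enter only through their difference, person comparisons do not depend on which items are administered, which is the specific objectivity property noted above.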

Category Threshold Analysis
Scales with polytomous response modalities have several ordered response categories (e.g. a Likert scale) which are typically scored sequentially, with a higher score indicating a higher level of the latent phenomenon being measured. A category can become disordered when it is not modal, i.e., respondents do not endorse it frequently enough 13. Categories that are disordered may be collapsed with an adjacent category and rescored in order to preserve the correct ordering. Disordered category thresholds prevent the calculation of interval-level estimates from the item banks and have a negative impact on overall model fit.
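As a minimal sketch (not the authors' software) of the rescoring step, collapsing a disordered middle category of a hypothetical 5-point item into its neighbour can be expressed as an explicit old-to-new score mapping:

```python
# Illustrative sketch: rescore a polytomous item whose categories 2 and 3
# are disordered by collapsing them into one, turning a 0-4 item into a
# 0-3 item. The item and its responses are hypothetical.

def collapse_categories(responses, mapping):
    """Rescore raw category codes via an explicit old->new mapping."""
    return [mapping[r] for r in responses]

# Categories 2 and 3 both map to the new category 2.
mapping = {0: 0, 1: 1, 2: 2, 3: 2, 4: 3}
raw = [0, 2, 3, 4, 1, 3]
print(collapse_categories(raw, mapping))  # [0, 2, 2, 3, 1, 2]
```

After rescoring, the thresholds are re-estimated and checked again for ordering.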

Item fit residual
Item fit residuals are analysed to ascertain whether individual items fit the Rasch model or whether they over- (high negative fit residual) or under- (high positive fit residual) discriminate. Items that underdiscriminate are considered to have a weak relationship with the underlying construct, while items with a high negative fit residual are likely to correlate too strongly with the underlying construct, indicating possible redundancy or dependency with other items. For the current analysis, items with a fit residual beyond ±2.5 were removed from the scale.
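The screening rule itself is simple; a sketch under the assumption that fit residuals have already been estimated (item names and values below are hypothetical):

```python
# Illustrative sketch: flag items whose fit residual falls outside the
# +/-2.5 criterion used in this analysis. Residual values are hypothetical.

FIT_CRITERION = 2.5

def misfitting_items(fit_residuals):
    """Return item names whose |fit residual| exceeds the criterion."""
    return [item for item, r in fit_residuals.items()
            if abs(r) > FIT_CRITERION]

residuals = {"pain": 0.8, "energy": -3.1, "sleep": 2.7, "mobility": -1.2}
print(misfitting_items(residuals))  # ['energy', 'sleep']
```

Here "energy" (overdiscriminating, negative) and "sleep" (underdiscriminating, positive) would both be candidates for removal.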

Differential item functioning
Differential item functioning (DIF) occurs when different demographic groups in a sample respond in a systematically different way to an item. Two types of DIF can occur: uniform DIF, where a certain group responds differently across the entire range of the underlying phenomenon; and non-uniform DIF, where a group responds differently to an item only at certain levels of the underlying phenomenon. In the current study, we analysed DIF by gender, age group and broad medical status (well/sick) using analysis of variance (ANOVA).
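As a simplified sketch of the uniform-DIF check, a one-way ANOVA F statistic can be computed on an item's standardized residuals grouped by gender (the full analysis also crosses group with class intervals of the trait to test the non-uniform interaction term; the residuals below are hypothetical):

```python
# Illustrative sketch: one-way ANOVA F statistic on an item's person
# residuals split by a demographic group, as a check for uniform DIF.
# A large, significant F suggests the item functions differently by group.

def anova_f(groups):
    """F statistic for a one-way ANOVA across lists of residuals."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

male = [0.4, -0.1, 0.6, 0.2]      # hypothetical residuals
female = [-0.5, -0.9, -0.2, -0.6]
print(round(anova_f([male, female]), 2))  # 15.78 - a large group effect
```

In practice the F statistic would be referred to an F distribution with (k-1, n-k) degrees of freedom to obtain a significance level.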

Local dependency
Local dependency is assessed statistically by correlating item residuals; items whose residuals correlate above +0.2 are considered to be locally dependent 12, and one of the items should either be removed from the scale or 'collapsed' with its pair into a testlet (a bundle of common items) 14. The best strategy for dealing with local dependency (deletion or testlet) is determined by the capacity of future test administrators (whether they be computers or humans) to take account of locally dependent items in their administration protocol. As many CAT simulation and administration programs do not yet have the functionality to account for local dependency in this manner, we decided to remove locally dependent items rather than collapse them into testlets. For the current analysis, where pairs of items were locally dependent, the item with the greater (positive or negative) fit residual was removed.
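The detection step can be sketched as correlating every pair of item residual vectors and flagging those above the +0.2 criterion (item names and residuals below are hypothetical, not items from the WHOQOL banks):

```python
# Illustrative sketch: flag locally dependent item pairs whose residual
# correlation exceeds +0.2. Residual vectors (one value per person) and
# item names are hypothetical.

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def dependent_pairs(residuals_by_item, threshold=0.2):
    """Return item pairs whose residual correlation exceeds the threshold."""
    names = list(residuals_by_item)
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if pearson(residuals_by_item[names[i]],
                   residuals_by_item[names[j]]) > threshold
    ]

resid = {
    "sleep": [0.5, -0.4, 0.1, -0.2],
    "rest":  [0.6, -0.5, 0.2, -0.3],
    "pain":  [-0.2, 0.3, -0.1, 0.5],
}
print(dependent_pairs(resid))  # [('sleep', 'rest')]
```

For each flagged pair, the member with the greater absolute fit residual would then be dropped from the bank.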

Dimensionality
A fundamental assumption of item banks for clinical purposes is that the items within each bank all measure the same single underlying phenomenon (e.g. psychological QoL). This is known as the assumption of unidimensionality, and it is assessed using a formal test of the difference between component loadings on the first residual factor within the scale 15. A principal components analysis is conducted and items are divided into two groups: those that load positively and those that load negatively on the first residual factor. Each set of items is then used to derive an independent person estimate for each participant. An independent samples t test is then conducted for each participant to assess whether the two estimates differ significantly. As the scale is expected to be unidimensional, the hypothesis is that there will be minimal difference between the two estimates. The acceptable criterion for unidimensionality is that fewer than 5% of the t tests return a significant result (or that the 95% confidence interval for the proportion of significant tests overlaps 5%) 16.
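The final counting step can be sketched as follows, assuming the two sets of person estimates and their standard errors have already been obtained from the positively and negatively loading item subsets (all values below are hypothetical; the residual PCA and Rasch estimation are not shown):

```python
# Illustrative sketch of the unidimensionality t-test summary: for each
# person, compare the trait estimates from the two item subsets with a
# t statistic and report the proportion of significant differences.

def proportion_significant(est_pos, se_pos, est_neg, se_neg, crit=1.96):
    """Share of persons whose two subset estimates differ significantly."""
    n_sig = 0
    for e1, s1, e2, s2 in zip(est_pos, se_pos, est_neg, se_neg):
        t = (e1 - e2) / (s1 ** 2 + s2 ** 2) ** 0.5
        n_sig += abs(t) > crit
    return n_sig / len(est_pos)

# Hypothetical estimates/standard errors for four persons.
est_pos = [0.5, -1.2, 0.8, 2.1]
se_pos  = [0.4, 0.5, 0.4, 0.6]
est_neg = [0.4, -1.0, 2.0, 1.9]
se_neg  = [0.4, 0.5, 0.4, 0.6]
print(proportion_significant(est_pos, se_pos, est_neg, se_neg))  # 0.25
```

A proportion of 0.25 would exceed the 5% criterion, so these hypothetical data would fail the unidimensionality test; in the item banks reported here, fewer than 5% of tests were significant.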