Introduction

Assessment of change in health-related quality of life (HRQL) is important for determining the clinical effectiveness of treatment, as well as for monitoring well-being of individual patients over time. However, comparison of HRQL-scores across time may be invalidated by the occurrence of “response shift”. Response shift refers to a change in respondents’ frames of reference that hinders a meaningful comparison of questionnaire-scores across time. Three different types of response shift are distinguished: recalibration, reprioritization and reconceptualization [38].

Several methodological approaches have been developed for the detection of response shift in HRQL outcomes [37], among which are statistical approaches such as structural equation modeling (SEM) [33]. Advantages of the SEM approach are that it allows for the operationalization of all three types of response shift and that possible response shift effects can be taken into account to assess “true” change. Within the SEM framework, the observed scores (e.g., questionnaire scales) are modeled to be reflective of an underlying unobserved latent variable or common factor (e.g., HRQL). The means and covariances of the observed variables (y) are then given by:

$${\text{Mean}}(y) = {\varvec{\upmu}} = {\varvec{\uptau}} + {\varvec{\Lambda}}\,{\varvec{\upkappa}},$$
(1)

and

$${\text{Cov}}\left( {y,y^{\prime } } \right) = {\varvec{\Sigma}} = {\varvec{\Lambda}}\,{\varvec{\Phi}}\,{\varvec{\Lambda}}^{\prime } + {\varvec{\Theta}},$$
(2)

where τ is a vector of intercepts, Λ is a matrix of common factor loadings, κ is a vector of common factor means, Φ is a matrix containing the variances and covariances of the common factors, \({\varvec{\Lambda}}^{{\prime }}\) denotes the transpose of Λ, and Θ is a matrix containing the variances and covariances of the residual factors. When SEM is applied to longitudinal data, response shift can be operationalized using SEM parameter estimates, where changes in the pattern of factor loadings (i.e., the pattern of Λ indicates which of the factor loadings are free to be estimated) are indicative of reconceptualization, changes in the values of factor loadings are indicative of reprioritization, and changes in intercepts and residual variances are indicative of uniform and nonuniform recalibration, respectively, (see [33] for more details).

The SEM method is especially suited to detect response shift and assess true change in continuous data. The objective of the present paper is twofold. First, we will explain how to analyze discrete data, e.g., ordinal item responses, using the SEM approach. We will show that the model of Eqs. (1) and (2) can still be used, but that the SEM approach needs to be extended to include a modeling stage in which the observed discrete ordinal variables are modeled to be reflective of underlying continuous variables (Stage 1). Stage 1 yields estimates of means and variances and covariances that can be used for the detection of response shift and assessment of true change in Stage 2. Second, we will apply the proposed SEM approach to the discrete ordinal item responses of the SF-36 questionnaire [40] that were obtained from 485 cancer patients, before and after start of antineoplastic treatment.

SEM approach for discrete data

One of the underlying assumptions of SEM with maximum likelihood (ML) estimation is that the scores of the observed variables follow a multivariate normal distribution. In the case of discrete variables, this assumption is not met, as the responses are limited to a small number of values (e.g., two, three or four response categories). To enable analysis of discrete data, we need to assume that the observed ordinal variables are representations of continuous underlying variables, where lower categories of the observed ordinal variable are related to lower scores on the continuous underlying variable, and vice versa. The model of continuous underlying variables (y*) yields estimates of means \(({\varvec{\upmu}}_{{y^{*} }} )\) and variances and covariances \(({\varvec{\Sigma}}_{{y^{*} }} ),\) which can be used in subsequent SEM analyses. SEM with discrete data has been explained elsewhere (e.g., [10, 18, 19, 2427, 32]). Table 1 gives an overview of the SEM approach for discrete data that is used in the present paper, including short descriptions of each step of the approach, the statistical procedures, and the item- and scale characteristics that are required to perform the associated statistical analyses. The steps in Stage 1 and Stage 2 of the SEM approach are similar, but in Stage 1 we operate under the assumption of multivariate normality and investigate the relation of observed scores with single underlying variables, and in Stage 2 we operate under the common factor model and investigate the relation with underlying common factors. Figure 1 shows the Stage 1 and Stage 2 models for an example of five observed discrete ordinal variables measured at two occasions.

Table 1 Stage 1 and Stage 2 of the SEM approach for discrete data
Fig. 1
figure 1

The models of Stage 1 and Stage 2 of the SEM approach for discrete ordinal data. The pentagons at the bottom represent observed discrete ordinal variables x 1x 5, the circles with \(y_{1}^{*}\)\(y_{5}^{*}\) represent the corresponding underlying continuous variables. The same \(y^{*}\) feature in Stage 2 (top of the figure), as the reflective indicator variables (the circles reflect the fact that they are not directly observed). Each y* is associated with a residual factor \(\varepsilon\). The residual factors represent everything that is specific to the corresponding y*. Residual factors of the same variable are correlated across measurement occasion. The circles at the top are the underlying common factors (ξ) at each measurement occasion and represent everything that \(y_{1}^{*}\)\(y_{5}^{*}\) have in common (e.g., health-related quality of life). In Stage 1, each observed discrete variable x is modeled to be reflective of a single underlying continuous variable y*. Assuming a bivariate normal distribution for each pair of y* variables, we can estimate the means \(({\varvec{\upmu}}_{{y^{*} }} )\) and variances and covariances \(({\varvec{\Sigma}}_{{y^{*} }} )\) on the basis of observed frequencies in the two-dimensional frequency tables of each pair of x variables. In Stage 2, the means and variances and covariances of y* are modeled using a common factor model with common factors ξ. Across occasion differences in estimates of measurement parameters are indicative of response shift. Specifically, in Stage 1 we investigate invariance of thresholds, and in Stage 2 we investigate invariance of intercepts, factor loadings, and residual variances (see also Table 1)

Stage 1: Observed discrete ordinal scores x are representations of underlying, continuous scores y*

Suppose we have an ordinal variable x with categories labeled 1, 2, and 3. The relations between the observed categories of the ordinal variable and the underlying continuous variable (y*) are defined using thresholds (δ), where:

$$\begin{aligned} & x = 1\quad {\text{if}}\,y^{*} < \delta_{1} , \\ & x = 2\quad {\text{if}}\,\delta_{1} < y^{*} < \delta_{2} , \\ & x = 3\quad {\text{if}}\,y^{*} > \delta_{2} . \\ \end{aligned}$$
(3)

In general, with m categories:

$$x = i\quad {\text{if}}\,\delta_{i - 1} < y^{*} < \delta_{i} ,$$
(4)

where

$$\delta_{0} \to - \infty ,$$

and

$$\delta_{m} \to + \infty .$$

The number of thresholds is thus equal to the number of response categories minus one. When we assume the underlying variable to follow a standard normal distribution (i.e., with a mean of zero and variance of one), then the threshold δ i defines an area under the curve left from the threshold that is equal to the proportion of observed responses in category i or lower (see Fig. 2).

Fig. 2
figure 2

The estimation of thresholds (δ): observed discrete scores x are representations of underlying continuous scores y*. There are 20, 45 and 35 % observed responses in categories 1, 2 and 3, respectively. The first threshold is located where the area under the curve to the left of the threshold is 20 % (δ 1 = −0.842). The second threshold is located where the area under the curve to the left of the threshold is 65 % (δ 1 = 0.385)

The correlations between the underlying variables can be estimated by assuming bivariate standard normal distributions. With two ordinal variables x 1 and x 2, the sample observations can be represented by a contingency table that contains the number of responses (n ij ) of category i on variable x 1 and category j on variable x 2. When we assume bivariate normality, we can estimate thresholds and correlations that yield expected frequencies that are as close as possible to the observed frequencies (see [21] for more details). When both variables have more than two response categories, the correlation is called a “polychoric” correlation; when both variables have only two response categories, it is called a “tetrachoric” correlation. These correlations indicate what the Pearson correlation would have been if these variables had been measured on a continuous scale.

Step 1: Testing the underlying bivariate normality

Polychoric correlations are estimated under the assumption of bivariate normality of the underlying continuous variables. The tenability of this assumption can be evaluated by comparing the expected proportions under bivariate normality to the observed sample proportions (see Table 1 for details on evaluation of model fit). When the hypothesis of bivariate normality holds for all pairs of variables, the assumption of multivariate normality is also supported. If the hypothesis of bivariate normality does not hold, then this indicates that the assumption of multivariate normality is not tenable. A possible solution for this problem is to eliminate the offending variable(s).

Step 2: Testing invariance of thresholds across measurement occasions

When the same variables are measured repeatedly (i.e., in longitudinal assessment), the imposition of invariant thresholds across measurement occasions is required for a common scale (see Supplement 1.1 for more details). The tenability of this restriction can be tested for each pair of variables by comparing the model with equality constraints on the thresholds to the Step 1 model without equality constraints on the thresholds (see Table 1). When the difference in model fit is significant, the hypothesis of equal thresholds across measurements must be rejected.

Step 3: Investigating possible non-invariance of thresholds

When the assumption of invariant thresholds across measurement occasions does not hold, this can be taken as an indication of recalibration response shift. Differences in thresholds of the same variable across measurement occasions indicate that the association between the scores of the underlying variable and the observed response category of that variable has changed; the underlying variables are not measured on the same scale. Occurrence of recalibration response shift in Stage 1 can be taken into account by allowing threshold parameters to be freely estimated across measurement occasions.

We introduce the term recalibration response shift in Stage 1, but want to emphasize that it is different from recalibration response shift in Stage 2. In Stage 1, differences between thresholds are detected given the model of bivariate normality of single underlying variables, and thus recalibration response shift is defined relative to the scale of the underlying variable. In Stage 2, differences between intercepts are detected given the common factor model and thus recalibration response shift is defined relative to the scale of the common factor (e.g., HRQL), and thus relative to the other variables measuring the same common factor.

To further investigate recalibration response shift, the tenability of equality restrictions on thresholds across measurement occasions can be evaluated for each threshold separately (see Table 1). This could give an indication as to whether the changes in the association between the scores of the underlying variable and the observed response categories can be attributed to a specific part of the measurement scale (e.g., non-invariance of the first threshold parameter would indicate that there is a shift in the meaning of the response scale’s values at the lower end of the measurement scale).

Step 4: Assessment of true change

To assess true change in the underlying variables, we can compare estimated means of the model from Step 2 across measurement occasions (see [21], for more details on the estimation of means of the underlying variables under equal thresholds). As invariant thresholds are required to enable a valid comparison of means of the underlying variables, true change can only be assessed for those variables for which the hypothesis of equal thresholds across measurements holds. True change estimates can be compared to observed change (i.e., the mean differences of the observed discrete variables). Table 1 provides information on the calculation of effect size indices of change. Effect size values of 0.2, 0.5, and 0.8 are considered “small,” “medium,” and “large” [12].

In other procedures for discrete data analyses, the tenability of bivariate normality and invariance of thresholds is usually assumed but not evaluated. By using the proposed four steps, we want to show that the underlying assumptions of the model of Stage 1 can be tested (i.e., Steps 1 and 2) that testing these assumptions can have important consequences (i.e., selection of items in Step 1), and may provide interesting information with regard to possible violations of these assumptions (i.e., recalibration response shift in Step 3), which will lead to a more valid interpretation of change (i.e., Step 4).

Stage 2: Continuous scores y* are explained by a common factor model

\({\varvec{\Sigma}}_{{y^{*} }}\) and \({\varvec{\upmu}}_{{y^{*} }}\) can be used in subsequent SEM analyses in the same way as for continuous variables, using the four steps as proposed by Oort [33]. However, the ML estimation method cannot be used with discrete data. One of the alternative estimation methods that can be used to yield unbiased parameter estimates and standard errors, and appropriate goodness-of-fit measures is the “weighted least squares” (WLS; [5]) method (see Supplement 1.2 for more details). When there are only two observed variables (e.g., a scale that consists of only two items), or when the observed variables are dichotomous (i.e., when analyzing a matrix of tetrachoric correlations), the SEM approach requires additional adaptations that are explained in Supplements 1.3 and 1.4, respectively.

Step 1: Testing the measurement model

The measurement model is a multidimensional model that includes multiple measurement occasions, but without any across occasion constraints (see Fig. 1 for an example of the measurement model with two measurement occasions). To achieve identification of all model parameters, scales and origins of the common factors can be established by fixing the factor means at zero and the factor variances at one. To test whether the measurement model holds, goodness-of-fit can be assessed using the WLS Chi-square test statistic (see Table 1).

Step 2: Testing the invariance of measurement parameters across measurement occasions

In Step 2, a model of no response shift is fitted to the data, where all measurement parameters associated with response shift are constrained to be equal across measurements. To achieve identification of model parameters, only first occasion common factor means and variances are fixed; factor means and variances at successive occasions are then identified due to invariance constraints on intercepts and factor loadings. To test for the presence of response shift, the no response shift model can be compared to the measurement model (see Table 1). If the invariance restrictions of the no response shift model lead to a significant deterioration in model fit, this indicates the presence of response shift.

Step 3: Investigating possible response shift effects

In case of response shift, a step-by-step modification of the no response shift model can be used to arrive at the response shift model in which all apparent response shifts are taken into account. Response shift is operationalized as across measurement occasion differences between the pattern of common factor loadings (reconceptualization), values of common factor loadings (reprioritization), differences between intercepts (uniform recalibration), and between residual variances (nonuniform recalibration). The identification of possible response shift effects can be guided by inspection of significant modification indices [20], correlation residuals (>0.10), or by an iterative approach where each constrained parameter associated with response shift is set free to be estimated one at a time, and the freely estimated parameter that leads to the largest improvement in fit is included in the model (see Table 1 for details on model fit evaluation).

Step 4: Assessment of true change

The parameter estimates of the final model, the response shift model in which all response shifts have been taken into account, can be used for the assessment of true change in the common factors (see Table 1).

In addition, evaluation of response shifts and true change for each individual variable can be done using the decomposition of change as proposed by Oort [33]. The change that is modeled using the common factor model is decomposed into change due to differences in intercepts (i.e., recalibration), change due to differences in factor loadings (i.e., reconceptualization and reprioritization), and change due to difference in the common factor means (i.e., true change). Table 1 provides information on the calculation of effect size indices of change.

Application

Patients

A total of 485 cancer patients undergoing active antineoplastic treatment were recruited in a cancer treatment center in Amsterdam. All patients were starting a new course of chemotherapy or radiotherapy. HRQL was assessed before the start of treatment, approximately 4 weeks after start of treatment, and approximately 4 months after start of treatment (see [1] for more details on data collection). For this study, we will only use the data obtained at baseline (pre-test) and immediate follow-up (post-test at 4 weeks). Attrition rate between the baseline and immediate follow-up period was 7.8 % (N = 38).

Measures

HRQL was assessed with the Dutch language version [1] of the SF-36 health survey [40]. The items of the SF-36 health survey can be clustered into eight subscales: mental health (MH; five items; six response categories), general physical health (GH; five items; five response categories), physical functioning (PF; ten items; three response categories), role limitations due to physical health (RP; four items; two response categories), bodily pain (BP; two items; five and six response categories, respectively), social functioning (SF; two items; five response categories), role limitations due to emotional health (RE; three items; two response categories), and vitality (VT; four items; six response categories). The eight subscales can be grouped into two summary measures: MH (i.e., MH, SF, RE and VT) and physical health (i.e., GH, PF, RP and BP). In addition, there is one item on Health Comparison (HC; one item; five response categories). Item response categories were coded such that higher scores indicate better functioning or better health. Missing item responses (0–1.6 %) were replaced by the nearest integer after expectation–maximization [12]. Imputation was only considered for data of patients who had <8 missing item responses to warrant reliability of imputation results. The total study sample therefore consists of 437 patients. Table 2 contains an overview of background variables and clinical variables of the selected study sample and the group of patients that was excluded due to attrition or due to too many missing values. There were no significant differences between the two groups with regard to age, gender, education, marital status, primary tumor site (breast, colorectal, lung or other), treatment modality (chemotherapy, radiotherapy, or combination therapy), and stage of disease (local or loco-regional vs. metastatic). The selected patients showed a significantly higher Karnofsky performance [22] and relatively fewer progressive tumors as compared to the excluded patients.

Table 2 Background and clinical variables of the selected study sample (N = 437) and the group of patients that was excluded due to attrition or due to too many missing values (N = 49)

Procedure

The SEM approach for discrete data was applied to all items of the SF-36. In order to reduce model complexity and facilitate interpretation of results, analyses were done for each subscale of the SF-36 separately. The information provided in the SF-36 manual about the clustering of items and published results of principal components analyses of the SF-36 [40] were used to establish the measurement model of each subscale. Response shift was operationalized as across occasion differences between the values of common factor loadings (reprioritization), and differences between intercepts (uniform recalibration). An iterative procedure was used to investigate possible response shift effects, where the across occasion constraints on the parameters associated with response shift were freed one at a time. The freely estimated parameters that were associated with the largest improvement in model fit were included in the model. Reconceptualization response shift was investigated by checking the significance of factor loading parameters (i.e., an item with an insignificant factor loading is not indicative of the common factor). Reconceptualization response shift due to other factors (e.g., other subscales, demographic or clinical variables) was not investigated. The investigation of differences between residual variances (nonuniform recalibration) is straightforward and does not require adaptations to the response shift detection procedure. As the residual factors do not affect assessment of true change, the residual variances are not considered in the present article. Statistical analyses were performed using the PRELIS (Stage 1) and LISREL (Stage 2) programs [20]. Syntax files for reported analyses are available in appendix A of Electronic Supplementary Material (Stage 1) and appendix B of Electronic Supplementary Material (Stage 2). Appendix C of Electronic Supplementary Material provides syntaxes that were used to calculate approximate fit indices (RMSEA and ECVI) with associated confidence intervals, Chi-square difference tests (CHISQdiff), and ECVI difference tests (ECVIdiff). The data are available upon request from the authors.

Results

Frequency distributions for the items of the SF-36 that were used for analyses can be found in Table 3. Results of statistical analyses from Steps 1–3 of Stage 1 and Stage 2 are presented in Tables 4 and 5, respectively. Estimates of change from Step 4 of both stages are displayed in Table 6. We report results for each subscale of the SF-36 separately. Results of the subscale MH are reported in detail, so that results of other subscales can be reported more concise.

Table 3 Frequency distributions of the items of the SF-36 at baseline and follow-up that were used for statistical analyses (N = 437)
Table 4 Hypothesis tests and parameter estimates of Steps 1–3 from Stage 1
Table 5 Goodness of overall model fit and difference in model fit of the models in Stage 2
Table 6 Assessment of change in the items of the SF-36: results from Step 4 of Stage 1 and Stage 2, expressed as effect sizes (standardized differences)

Mental health (MH): Stage 1

Results of Step indicated that the hypothesis of underlying bivariate normal distribution was tenable for all item pairs. In Step 2, equality constraints on thresholds across measurements lead to a significant deterioration in fit for Item 26 (“Have you felt calm and peaceful?”) (see Table 4). As it is not possible to impose equality restrictions on individual threshold parameters in PRELIS, we could not evaluate whether the non-invariance of thresholds could be attributed to specific thresholds. To evaluate the differences in thresholds of Item 26, we compared the freely estimated threshold at both measurement occasions. Inspection of threshold estimates showed that three out of five thresholds were lower at the second measurement occasion as compared to the first measurement occasion (see Table 4). This indicates recalibration response shift, where it was relatively easy for patients to score high on feeling calm and peaceful after treatment, compared to before treatment. All thresholds for Item 26 were set free to be estimated at both measurement occasions and the item was excluded from further response shift detection analyses in Stage 2. For all other items of MH, means and variances and covariances of the underlying variables were estimated under the restriction of equal thresholds across occasions.

In Step 4, inspection of the estimated mean differences of the underlying variables as compared to the observed mean differences showed that true change in Items 24 and 30 was significant and somewhat larger than the observed change; there was an improvement in the scores of Item 24 and a deterioration in the scores of Item 30 (see Table 6). True change in Item 25 was smaller than the observed change and not significant, and both observed and true change of Items 28 was not significant. There was no significant observed change in Item 26. True change of Item 26 is not given as it cannot be interpreted because the underlying variables have a different scale of measurement.

Mental health (MH): Stage 2

The estimated means, variances and covariances of the underlying continuous variables from Step 3 in Stage 1 were used for subsequent analyses in Stage 2. In Step 1, the measurement model yielded reasonable approximate fit (Model 1a, Table 4) and included a residual covariance between Item 26 (“Have you felt calm and peaceful?”) and Item 30 (“Have you been a happy person?”). This indicates that these items have something more in common than is captured by the common factor MH.

In Step 2, invariance restrictions on intercepts and factor loadings were imposed for all items except Item 26. The no response shift model yielded a significant deterioration in model fit as compared to the measurement model, according to both the Chi-square difference test and the ECVI difference test (see Table 5), indicating the presence of response shift.

In Step 3, three response shift effects were detected. Recalibration response shift of Item 24 (“Have you been a nervous person?”) was detected [CHISQdiff(1) = 54.8, p < .001], where the intercept was higher at follow-up than at baseline. Because items were scored such that higher scores indicate better health, the difference in intercepts indicates that it became relatively difficult to score high on nervousness after antineoplastic treatment, compared to the other items of MH. In addition, reprioritization response shift of the same item was detected [CHISQdiff(1) = 28.7, p < .001], where the value of the factor loading was higher at follow-up than at baseline. This indicates that the item became more indicative of MH after treatment. Recalibration response shift of Item 30 (“Have you been a happy person?”) was detected [CHISQdiff(1) = 11.8, p < .001], where the intercept was higher at baseline than at follow-up. This indicates that it became relatively difficult to score high on happiness after treatment, as compared to the other items of MH.

The response shift model, in which all apparent response shifts are taken into account, showed reasonable approximate fit according to the RMSEA, and equivalent model fit as compared to the measurement model (see Table 6). Results of Step 4 indicated that patients showed a significant improvement of MH (change = 0.06, p < .001; d = 0.08). Before taking into account response shift effects, the change was in the same direction and also significant (change = 0.05, p < .001; d = 0.08).

Estimates of decomposition of change are presented in Table 6. In general, modeled change in Stage 2 was similar to true change estimates from Stage 1. The estimated true change in Stage 2 showed small improvements in all items, although they were non-significant. Recalibration response shifts in Items 24 and 30 caused the observed improvement (d = 0.30) and deterioration (d = −0.16), respectively. Results of decomposition of change for Item 26 are not reported because interpretation is hindered due to the difference in measurement scales of the item across occasions.

General physical health (GH): Stage 1

The hypothesis of underlying bivariate normal distribution and the equality restrictions on thresholds across measurements were tenable for all pairs of items (see Table 4). In general, true change in the underlying variables was similar to that of observed change, although only the deterioration in true change of Item 33 was significant (see Table 6).

General physical health (GH): Stage 2

The measurement model of GH showed reasonable approximate fit (model 2a, Table 5). The no response shift model did not yield a significant deterioration in model fit, indicating that there was no evidence for response shift effects (see Table 5). Overall, patients showed a significant deterioration of GH (change = −0.10, p < .001; d = −0.19) and also in the items of GH, but only the deterioration in Item 36 was significant (d = −0.11; see Table 6).

Physical functioning (PF): Stage 1

The hypotheses of underlying bivariate normal distributions were tenable for all item pairs. Equality of thresholds across measurement occasions could not be evaluated, as items with three categories do not provide enough information to test the difference in LR test statistic (see also Table 1). Estimated true change was largely similar to observed change, with significant deterioration in Items 3, 6, 7, and 10. A notable difference occurred for the true change estimate of Item 12, which showed a significant improvement (d = 0.51) that was not found for observed change.

Physical functioning (PF): Stage 2

The measurement model of PF was modified to include residual covariances between Item 4 (“moderate activities”) and Item 5 (“lifting or carrying groceries”), and between Item 6 (“climbing several flights of stairs”) and Item 7 (“climbing one flight of stairs”). The measurement model that included these residual covariances showed reasonable approximate fit, and the close fit hypothesis could not be rejected (model 3a, Table 5).

The no response shift model fitted worse than the model without across measurement constraints (see Table 5), indicating the presence of response shift. Recalibration response shift of Item 12 (“bathing or dressing yourself”) was detected [CHISQdiff(1) = 173.7, p < .001], where the intercept was higher at follow-up than at baseline. Thus, patients scored higher on Item 12 after treatment, relative to the other items of PF. Because higher scores on Item 12 are indicative of fewer limitations, it became relatively difficult to endorse limitations on this item after antineoplastic treatment. In addition, reprioritization response shift of Item 12 (“bathing or dressing yourself”) and Item 4 (“moderate activities”) was detected [CHISQdiff(1) = 146.2, p < .001; CHISQdiff(1) = 14.0, p < .001], where the factor loadings of both items were higher at follow-up as compared to baseline, indicating that both items became more indicative of PF after treatment.

The response shift model yielded reasonable approximate fit according to the RMSEA, and equivalent approximate model fit as compared to the measurement model (see Table 5). Patients showed no significant change in PF (change = −0.05, p = .13), but before taking into account response shift effects the change was in the opposite direction and significant (change = 0.02, p = .041). Therefore, not taking into account response shift effects would have overestimated changes in PF.

Inspection of change estimates for individual items showed (non-significant) deterioration in all items. However, for Item 12 there was a significant improvement due to recalibration response shift (d = 0.51).

Role limitations due to physical health (RP): Stage 1

As RP consists of dichotomous items, the hypothesis of bivariate normality and equality of thresholds across measurement occasions could not be evaluated (see Table 1). Inspection of true change estimates revealed a significant improvement of Item 13 and a significant deterioration of Item 16 (see Table 6).

Role limitations due to physical health (RP): Stage 2

The measurement model of RP showed close approximate fit (model 4a, Table 5). To enable the investigation of response shift with dichotomous items, the no response shift model requires some adaptations (i.e., additional scaling parameters; see Supplement 1.4 for more details). As a result, only recalibration response shift can be investigated with dichotomous items, and the presence of recalibration response shift is evaluated based on overall goodness-of-fit of the no response shift model. The overall model fit of the no response shift model of RP was not good (model 4b, Table 5), indicating the presence of response shift. Recalibration response shift of Item 13 (“Did you cut down on amount of time you spent on work or other activities?”) was detected [CHISQdiff(1) = 21.2, p < .001], where the intercept was higher at follow-up than at baseline. Patients scored higher on Item 13 after treatment, relative to the other items of RP. Because higher scores on Item 13 are indicative of fewer limitations, it became relatively difficult to endorse limitations on this item after antineoplastic treatment. The response shift model that included this recalibration response shift showed an improvement in overall model fit as compared to the no response shift model, and reasonable approximate fit according to the RMSEA (see Table 5).

Inspection of common factor means showed no significant change of RP (change = −0.07, p = .15; d = −0.07). Taking into account recalibration response shift did not affect the interpretation of change. Inspection of change estimates for individual items showed (non-significant) deterioration for all items and that the improvement in Item 13 was explained by recalibration (see Table 6).

Bodily pain (BP): Stage 1

The hypotheses of underlying bivariate normal distributions were tenable for all pairs of items. The equality restrictions on thresholds across measurements showed a significant deterioration in fit for Item 21 according to the Chi-square difference test (p = .02, see Table 4), but the ECVI difference test showed no significant deterioration in approximate fit (ECVIdiff = 0.009, 90 % CI −0.005 to 0.040). Inspection of true change estimates showed a (non-significant) deterioration in Item 21, whereas Item 22 showed a significant improvement (see Table 6).

Bodily pain (BP): Stage 2

To achieve identification of the measurement model of the two-item BP subscale, we applied the constraint of zero residual covariances as this restriction yielded best model fit (see Supplement 1.3 for more details). The measurement model showed exact fit, but comparison with the no response shift model showed evidence of response shift (see Table 5). Investigation of response shift effects showed that the model could be improved by freeing the restrictions on the intercepts, indicating recalibration response shift. We chose to free the intercept of Item 21 “level of pain,” where it became relatively difficult to score high on this item after treatment as compared to the item “interference of pain.” The response shift model showed equivalent approximate fit as compared to the measurement model. Inspection of common factor means showed a small but non-significant improvement of BP (change = 0.18, p = .09; d = 0.19). Before taking into account response shift, the improvement in BP was slightly smaller, but significant (change = 0.13, p < .001; d = 0.14).

Inspection of change estimates for the two individual items showed that the difference in behavior of both items was explained by recalibration of Item 21 (d = −0.23), whereas the modeled change showed significant improvement for Item 22 (d = 0.16) but no significant change for Item 21 (see Table 6).

Social functioning (SF): Stage 1

The hypotheses of underlying bivariate normal distributions and the equality restrictions on thresholds across measurements were tenable for both items. Estimates of true change showed no significant differences (see Table 6).

Social functioning (SF): Stage 2

To achieve identification of the two-item measurement model of SF we applied the constraint of equal factor loadings for both items at each measurement occasion, as this restriction yielded best model fit (see Supplement 1.3 for more details). Both the measurement model and the no response shift model of SF showed exact fit (models 6a and 6b, Table 5), and there was no evidence for response shift. Inspection of common factor means showed a small but significant deterioration of SF (change = −0.05, p < .001; d = −0.05), although the change estimates for individual items were not significant (see Table 6).

Role limitations due to emotional health (RE): Stage 1

Because the subscale RE consists of dichotomous items, the hypothesis of bivariate normality and equality of thresholds across measurement occasions could not be evaluated. Both observed and true change showed improvements for all items, although only the estimated true change for Item 17 was significant (see Table 6).

Role limitations due to emotional health (RE): Stage 2

Both the measurement model and the no response shift model of RE yielded reasonable approximate fit (model 5a and model 5b, Table 5). Therefore, there was no evidence of (recalibration) response shift (see Supplement 1.4). Inspection of common factor means showed no significant change of RE (change = 0.09, p = .09; d = 0.10), but Item 17 showed a significant improvement (see Table 6).

Vitality (VT): Stage 1

The hypotheses of underlying bivariate normal distributions and the equality restrictions on thresholds across measurements were tenable for all item pairs. The estimated true change was similar to that of observed change, although true change estimates were slightly larger. All items showed a significant deterioration (see Table 6).

Vitality (VT): Stage 2

The measurement model included a residual covariance between Item 29 (“Did you feel worn out?”) and Item 31 (“Did you feel tired?”), and showed exact fit (model 6a, Table 5). The no response shift model also yielded exact fit, and equivalent model fit as compared to the measurement model, indicating no evidence of response shift (see Table 5). Overall, patients showed a significant deterioration of VT (change = −0.27, p < .001; d = −0.34) and also a significant deterioration in all individual items (see Table 6).

Health comparison (HC): Stage 1

The subscale HC consists of only one item, so we can only conduct Stage 1 analyses. Evaluation of bivariate normality showed that this hypothesis was tenable, and although the restriction of equality of thresholds across measurement occasions showed a significant deterioration according to the Chi-square difference test (p = 0.03, see Table 3), there was no significant deterioration in approximate model fit (ECVIdiff = 0.007, 90 % CI −0.004 to 0.035). There was a significant improvement across measurement occasions for both observed and true change (see Table 4).

Discussion

In this paper we explained how the SEM approach for detection of response shift and assessment of true change can be applied to discrete data by assuming underlying continuous variables with bivariate normal distributions (Stage 1), and how the resulting estimates can be used in a common factor model (Stage 2). The proposed SEM approach thus enables the detection of response shift and assessment of true change in discrete ordinal responses.

Substantive interpretation of results

We applied the proposed SEM approach to all items of the SF-36. In our sample of cancer patients, we found that the model of underlying bivariate normal distributed continuous variables was tenable for all items (Stage 1). We detected recalibration response shift in the item “Have you felt calm and peaceful?” of the MH subscale, where it was relatively easy for patients to score high on feeling calm and peaceful after treatment, as compared to before treatment. We assessed change in the underlying variables and found that estimated true change was mostly similar to observed change, although estimated true change was somewhat larger in general. When change of the observed variables would be assessed as if they have interval scales (i.e., without taking into account their discrete properties), there would be ten items that showed significant change. Whereas true change estimates showed significant change in 18 items. Moreover, only for one item the results of true change no longer showed a significant difference between measurement occasions. Taken together, these results indicate that the model of Stage 1 can be used to provide an informative assessment of change. Furthermore, the estimates of the model can be used to enable detection of response shift and assessment of true change in Stage 2.

In Stage 2, we used a common factor model to detect response shift and assess true change in each subscale of the SF-36 separately. Results showed that patients’ MH improved, while their physical health, VT and SF deteriorated. No change was found for PF, role limitations due to physical health, role limitations due to emotional problems and BP. In general, when asked to compare their current health state to their health state the previous year, patients indicated that their health had improved.

Response shift effects were detected in individual items of the subscales MH, PF, role limitations due to physical health and BP. For the MH subscale, recalibration and reprioritization response shift was detected in the item “nervousness”, where it became relatively difficult to score high on nervousness after antineoplastic treatment as compared to the other items, and nervousness became more important to the measurement of MH. An explanation for this result could be that the start of treatment causes patients to experience less nervousness relative to the other indicators of MH. In addition, it might be that the decreased nervousness becomes especially relevant for patients’ mental state. Recalibration response shift was also detected in the item about “happiness”, where it became relatively difficult to score high on happiness after antineoplastic treatment. Thus, it seems that even though patients’ MH improved over time, this improvement was not found to the same degree for patients’ happiness as compared to the other indicators of MH. Not taking into account response shift effects would have led to an underestimation of change in MH.

For the PF subscale recalibration and reprioritization response shift was detected for the item “bathing or dressing oneself”, where it became relatively difficult to endorse limitations with bathing and/or dressing oneself after antineoplastic treatment as compared to the other items, and the item became more important for the measurement of PF. In addition, the item “moderate activities” also became more important for the measurement of PF after treatment. Therefore, it seems that being able to execute these moderate and personal activities becomes more important for patients’ PF after treatment as compared to the other items. In addition, even though patients’ PF did not change, limitations with regard to bathing or dressing oneself showed an improvement across time. Not taking into account response shift effects would have led to an overestimation of change of PF.

For the subscale role limitations due to physical health recalibration response shift of the item “time for work and other activities” was detected, where it became relatively difficult to endorse limitations on this item after antineoplastic treatment. Thus, even though patients’ overall role limitations due to physical health did not change, it seems that patients experienced decreased limitations with regard to time for work and other activities. A possible explanation for this result could be that patients get used to changes with regard to the allocation of available time, or adapt to the possible limitations due to their physical condition.

Finally, for BP recalibration response shift was detected. As this scale consist of only two items, detection of response shift indicates that the two items of this subscale behave differently. In our example, patients indicated to experience relatively fewer limitations due to their experienced pain as compared to the level of experienced pain. A possible explanation for this result could be that patients get used to or adapt to the experienced limitations due to their physical condition.

Compared to the selected study sample, the group of patients that was excluded due to attrition or due to too many missing values showed lower Karnofsky performance and more progressive tumors. Therefore, it should be noted that the results of our study may not be generalizable to the full population.

Taken together, these results provide information about the behavior of individual items within each subscale of the SF-36. Specifically, the results give insight as to what extent changes at the item level can be attributed to changes at subscale level (e.g., MH or PF), and which items show response shift. To our knowledge, this is the first time that response shift has been investigated in all individual items of the SF-36 questionnaire—one of the most widely used measurement instruments in the literature of HRQL. Although item-level data have been considered in previous research of the SF-36 [2, 17], response shift was only investigated in the items of a single subscale [17], or response shift in all items was tested globally instead of in individual items [2]. Therefore, the application of the SEM approach for discrete data to the items of the SF-36 in the present paper provides a substantive contribution to the literature on response shift phenomena.

Limitations of the proposed SEM approach

In the application of the SEM approach for discrete data, the question arises when to treat item responses as discrete ordinal responses and when to treat them as continuous responses. Item response scales are usually discrete as they only have limited number of response categories. However, when the number of response categories is larger (e.g., seven or more), discrete ordinal responses can be considered to sufficiently approximate continuous interval scales, so that statistical analyses for interval variables may be appropriate [15]. The treatment of discrete item responses should therefore be based on both substantive considerations (e.g., can the underlying measurement scale be considered continuous?) and statistical considerations (e.g., does the distribution of scores of the observed variables approximate a normal distribution? Are the chosen statistical techniques appropriate?). In the present paper, we applied the SEM approach for discrete data to ordinal item responses with different numbers of response categories (i.e., two, three, five and six). In our example, we considered the measurement scale of all items to be discrete. By definition, univariate normality does not hold for discrete variables. However, the proposed SEM approach has the flexibility to include not only variables with different numbers of response categories, but also variables with different measurement scales (e.g., the PRELIS program can be used to calculate the appropriate correlations between the variables) and could even be applied to non-ordinal binary data.

In Stage 1, we test the assumption of underlying bivariate normality and derive estimates of polychoric correlations, variances and means of the underlying variables under equal thresholds across measurement occasions. Stage 1 also provides information on the detection of response shift, in addition to the usual detection of response shifts in Stage 2. Recalibration response shift in Stage 1 can be interpreted as scale recalibration relative to the scale of the underlying continuum, whereas recalibration response shift in Stage 2 can be interpreted as scale recalibration relative to the scale of the common factor (and thus relative to the other variables measuring the same common factor).

It should be noted that under some circumstances it is not be possible to detect recalibration response shift. First, invariance of thresholds can only be evaluated when the number of response categories is larger than three, for variables with fewer response categories invariance of thresholds is assumed to hold. Second, non-invariance of thresholds might not be detected if the thresholds differ by an additive constant (this would be captured by mean differences in the underlying variables) or a multiplicative constant (this would be captured by differences in the standard deviations of the underlying variables). Similarly, non-invariance of intercepts in Stage 2 might not be detected if the intercepts differ by an additive constant (this would be captured by mean differences in the common factors) or a multiplicative constant (this would be captured by differences in the standard deviations of the common factors).

Although it might be possible to investigate whether differences in thresholds can be attributed to specific threshold parameters, this was not applied in the present paper because it is not possible to impose equality restrictions on individual thresholds in the PRELIS program that was used for statistical analyses. It might be of substantive interest to further investigate non-invariance of specific thresholds, but it does not resolve the fact that the scales of the underlying variables are different. It might also be of substantive interested to test more restrictive hypotheses about thresholds, such as the hypothesis of equally spaced thresholds (e.g., the difference between different answer categories in terms of the underlying variables are equal).

Although the performance of the common factor model and the estimation of polychoric correlations are reasonably robust against moderate violations of normality (e.g., [3, 9, 13]), similar studies on the performance of the common factor model under violations of invariant thresholds are needed. Millsap and Yun-Tein [23] investigated the impact of non-invariant thresholds in a multigroup context and concluded that when invariant thresholds are erroneously imposed, group differences in thresholds may be mistaken for group differences in residual variances. It would be interesting to perform a simulation study with the proposed methods for response shift detection and investigate the impact of (violations of) threshold invariance, number of response categories, number of variables in the common factor model, size of the bias, sample size, missing data, etc. Such a simulation study would be helpful to further substantiate the appropriateness of the proposed SEM approach for discrete data under different circumstances.

The SEM approach for discrete data was applied to the individual items of each subscale of the SF-36 separately. A limitation of this approach is that it does not allow for detection of reconceptualization response shift due to other factors, such as other subscales or demographic or clinical variables. However, the proposed approach can be extended to enable the detection of reconceptualization response shift due to these factors. For example, it would be interesting to investigate response shift in all the items of the SF-36 simultaneously by using one common factor model that includes all eight multi-item subscales, and the one-item scale of health comparison. However, it should be noted that such highly complex models require much larger samples in order for the proposed methods to work appropriately. As an alternative strategy one might conduct pairwise analyses of subscales, to reduce the model complexity while still enabling the investigation of reconceptualization response shift due to another subscale. A similar approach could also be used to investigate the effects of possible explanatory or confounding variables (e.g., gender, age, type of disease, or treatment modality). In the present paper, we chose to investigate all subscales separately to enable the explanation of the proposed methods for various situations (i.e., different number of response categories and different number of items per scale) and facilitate the analyses and interpretation of results (i.e., more parsimonious models). Further extensions of the proposed methods that include more measurement occasions, other subscales, or explanatory variables, would be an interesting topic for future research.

SEM with discrete data can be done using standard statistical computer programs [18, 28, 31, 34]. However, differences exist between programs in how they handle the analyses of discrete data. For example, the underlying assumptions of Stage 1 (i.e., bivariate normality and equal thresholds) are usually not tested but assumed to hold. Moreover, not all computer programs make an explicit distinction between the estimation of polychoric correlations and the fitting of structural equation models to the polychoric correlations. Some programs might test invariance of thresholds as an alternative to the invariance of intercepts (e.g., see [23]), and as a consequence test conceptually different hypotheses (i.e., differences in thresholds are conceptually different from differences in intercepts). In addition, different programs may use different (default) corrections for the resulting Chi-square values, and different options for evaluation of overall goodness-of-fit and differences in model fit may lead to different results. For the present paper, analyses were performed using the PRELIS (Stage 1) and LISREL (Stage 2) programs [20]. With PRELIS it is possible to evaluate the Stage 1 model for discrete data. In Stage 2, the WLS Chi-square value was used to evaluate model fit, as it provides a valid test statistic under non-normality and has the convenient property that it can also be used for the calculation of approximate fit indices and for the comparison of nested models. However, when sample sizes are small or models are large, the performance of the WLS test statistic might not be stable and one might consider alternative adjustment to the Chi-square statistics (see also Supplement 1.2). One should be aware that there are notable differences between different computer programs in handling discrete data and that the choice of computer program may also influence the ease with which one can apply the required analyses.

Besides SEM techniques, there are other statistical techniques for the detection of response shift available, such as ordinal logistic regression, the contingency tables methods, and probit regression. Methods relying on item response theory (IRT) analysis are probably the most popular method for the analysis of discrete ordinal data. Factor analysis methods are not conceptually different from IRT methods, but the former are usually applied to continuous data. The relationship between IRT and factor analysis has been described by [26]. Takane and De Leeuw [39] showed that WLS estimation with polychoric correlations in factor analysis is equivalent to fitting the normal ogive model with marginal ML estimation in IRT. However, advantages of SEM are that the models can be easily extended to multidimensional models (e.g., longitudinal models, or models that include multiple subscales) and that the hypothesized dimensional structure of the model can be tested.

Conclusion

Investigation of response shift and assessment of change at the individual item level can give insight into which items of a subscale contribute to changes at the subscale level, or which items behave differently from the other items. Analyses of items therefore provide different information than analyses of subscales. For example, it could be that there is no change (or no occurrence of response shift) at the subscale level, while there are changes at the level of individual items (or possible response shift effects) that cancel each other out. In addition, item-level analyses enable the identification of items that are most important to changes at the level of the subscale. Although the proposed SEM approach for discrete data needs further scrutiny using simulation studies, it leads to a better understanding of the response shift phenomena and enhances interpretation of change in the area of HRQL.