Testing for response shift in treatment evaluation of change in self‐reported psychopathology amongst secondary psychiatric care outpatients

Abstract

Objectives: If patients change their perspective due to treatment, this may alter the way they conceptualize, prioritize, or calibrate questionnaire items. These psychological changes, also called "response shifts," may pose a threat to the measurement of therapeutic change in patients. Therefore, it is important to test for the occurrence of response shift in patients across their treatment.

Methods: This study focused on self‐reported psychological distress/psychopathology in a naturalistic sample of 206 psychiatric outpatients. Longitudinal measurement invariance tests were computed across treatment in order to detect response shifts.

Results: Compared with before treatment, post‐treatment psychopathology scores showed an increase in model fit and factor loadings, suggesting that symptoms became more coherently interrelated within their psychopathology domains. Reconceptualization (depression/mood) and reprioritization (somatic and cognitive problems) response shift types were found in several items. We found no recalibration response shift.

Conclusion: This study provides further evidence that response shift can occur in adult psychiatric patients across their mental health treatment. Future research is needed to determine whether response shift implies an unwanted potential bias in treatment evaluation or a desired cognitive change intended by treatment.


| INTRODUCTION
It is generally assumed that the subjective standard of measurement used in self-report instruments is the same between time points and that comparisons made between them are valid measures of true change. However, there are indications that subjective standards of patients and their interpretations of, and response to, items may change across treatment. For instance, how patients view their symptoms may change due to recovery from the underlying mental disorder, improved cognitive abilities, and psychoeducation, which can affect how patients respond to self-report items (e.g., Fokkema, Smits, Kelderman, & Cuijpers, 2013). This phenomenon is called a "response shift" (Golembiewski, Billingsley, & Yeager, 1976;Nolte, Mierke, Fischer, & Rose, 2016;Wu, 2016).
If persons change their perspective, this may alter the way they conceptualize, prioritize, and calibrate items. Consequently, three main types of response shift have been identified: reconceptualization, reprioritization, and recalibration (Nolte et al., 2016). Reconceptualization means that patients redefine the meaning of a concept such as "depression." For instance, before treatment, patients may never have considered somatic symptoms (e.g., sleep problems) as a component of their depression.
However, after successful treatment, patients may consider somatic symptoms as part of their depression (Oort, 2005). Reprioritization means that the importance of specific symptoms changes in the overall measurement (Oort, 2005). For example, before treatment, when patients do not work due to sick leave, they may score items concerning concentration as not so important. However, after treatment, when patients resume their work, they may score these items as more important because they realize that concentration is crucial to their job performance. Finally, recalibration means a change in the patient's interpretation of response scale values. For example, after treatment, the Likert-score of 1 (rarely) on the suicidal ideation item may represent another level of depression and rumination about suicide than before treatment. Uniform recalibration means a recalibration of the item scale, which influences all response options within an item and all subjects to the same extent and in the same direction. Non-uniform recalibration means that the recalibration of the item scale differs in extent or direction across subjects and/or response options (Fokkema et al., 2013;Oort, 2005).
Studies have already convincingly shown that response shift can occur across the treatment of a chronic somatic disease (for an overview, see, e.g., Schwartz et al., 2006; Vanier, Leplège, Hardouin, Sébille, & Falissard, 2015). To date, only three studies have specifically aimed at response shift testing of pretreatment versus post-treatment self-report scores amongst adult psychiatric patients (Fokkema et al., 2013; Nolte et al., 2016; Smith, Woodman, Harvey, & Battersby, 2016). In the present study, we used the SQ-48 (Carlier, Schulte-Van Maaren, et al., 2012; Carlier et al., 2017) to test the occurrence of all response shift types across treatment in a naturalistic sample of secondary psychiatric care outpatients. We expected response shift above all in the domain depression/mood, because it has been suggested that depression in particular is sensitive to response shift (e.g., Nolte et al., 2016).

| Design and procedure
This study was conducted by the Department of Psychiatry of Leiden University Medical Centre (LUMC), using already available ROM data from a previous Dutch multicenter pre-post treatment study (Carlier et al., 2017). Time between pre- and post-assessments varied (Table 1), depending on how ROM was implemented (e.g., monthly, every 3-4 months, or later). Consequently, the second assessment was not necessarily the end assessment (it may have been an interim assessment), nor was it necessarily prompted by meeting treatment goals or by patient disengagement.
For the purpose of this study, we selected ROM data of outpatients with common mental disorders who had both pretreatment and post-treatment SQ-48 data (Carlier et al., 2017). The general criteria for eligibility for ROM are: all psychiatric inpatients and outpatients who are literate, have sufficient command of the Dutch language, and are willing and able to complete self-report instruments. The most common reason that patients are not eligible for ROM is insufficient command of the Dutch language, in which case they receive treatment without ROM (de Beurs et al., 2011; instruments for non-Dutch-speaking patients are in preparation). Within ROM, patients are enrolled in treatment (rather than research). Dropout or missing data in ROM generally involve patients who stop their treatment or who do not show up at their ROM measurement appointment, respectively. In the present study, we had no information regarding such data.
Patients are administered a battery of measures which continues for as long as the patient is being treated. ROM measures generally may include: a psychiatric interview (optional, Mini-International Neuropsychiatric Interview; Sheehan et al., 1998; Van Vliet & De Beurs, 2006), observer-rated instruments (optional), and self-report questionnaires (generic and disorder-specific). Measures are administered by independent assessors (trained research nurses/psychologists) through computerized self-report, which prevents missing data because item completion is necessary for progression to the next item. For a detailed description of Dutch ROM, see de Beurs et al. (2011). Dutch ROM is fairly comparable with ROM abroad (e.g., USA, UK) in terms of objectives: ROM data are collected systematically to assess treatment effectiveness in everyday clinical practice and to inform clinicians and patients about treatment progress (Carlier, Meuldijk, et al., 2012; Lambert, 2017; Lambert, Whipple, & Kleinstäuber, 2018). Also, the implementation of ROM in clinical practice forms a common challenge in most countries (e.g., Boswell, Kraus, Miller, & Lambert, 2015; Essock, Olfson, & Hogan, 2015; Roe, Drake, & Slade, 2015; Carlier et al., 2017).
The Medical Ethical Committee of the LUMC approved the general study protocol in which ROM is considered as an integral part of treatment process (no written informed consent is required). Patients may refuse ROM measurement and/or the anonymous use of their ROM data for scientific research without consequences (i.e., they receive necessary treatment). If patients refuse to take part in scientific research, their ROM data are removed from the ROM database (Carlier et al., 2017).

| Participants
The study sample consisted of 206 outpatients (see Table 1).

| Symptom Questionnaire-48 (SQ-48)
The SQ-48 is a generic self-report questionnaire that assesses common psychopathological symptoms within seven subscales (seven factors with a total of 37 items): Aggression (four items), Mood/depression (six items), Somatic complaints (seven items), Anxiety (six items), Social phobia (five items), Agoraphobia (four items), and Cognitive complaints (five items). Two additional subscales do not measure psychopathology and were therefore excluded from this study (Carlier et al., 2017): Vitality/optimism (six items) and Work/study functioning (six items). All SQ-48 items are rated for frequency on a 5-point Likert scale. [Table 1 note: Data are expressed as percentages or means ± standard deviation, with range. The sample was used to test the factor invariance of the SQ-48. Other psychopathology: somatoform disorders (most); personality disorders; bipolar disorders; disorders usually first diagnosed in infancy, childhood, or adolescence; adjustment disorders; impulse-control disorders not elsewhere classified; dissociative disorders; eating disorders; mental disorders due to a general medical condition not elsewhere classified. On the basis of Table S1 of Carlier et al. (2017) and adapted for this study.] The seven-factor structure was previously supported in a patient group with mainly depressive and anxiety disorders (n = 242; CFI = 0.97; RMSEA = 0.06; Carlier, Schulte-Van Maaren, et al., 2012). The SQ-48 showed good internal consistency as well as good convergent and divergent validity in psychiatric outpatients and a healthy reference group (Carlier, Schulte-Van Maaren, et al., 2012). It also showed excellent test-retest reliability and good responsiveness to therapeutic change in psychiatric outpatients (Carlier et al., 2017). Detailed information about the development of the SQ-48 is described elsewhere (see Carlier, Schulte-Van Maaren, et al., 2012).
The Dutch SQ-48 was translated into English according to evidence-based guidelines for the translation and cultural adaptation of questionnaires (Carlier, Schulte-Van Maaren, et al., 2012; Wild et al., 2005; see Supporting information). Because this study was conducted in the Netherlands, we used the Dutch SQ-48 version.

| Statistical analyses
First, changes in total and sub-scores of SQ-48 were analysed using a doubly multivariate design with repeated measures in order to understand the impact of treatment and in preparation for response shift testing (Lix & Hinds, 2004). Cohen's d effect size was calculated.
Because the post-treatment SD could be affected by treatment, we used the baseline SD when computing Cohen's d (Cohen, 1992). Bear in mind that these effect sizes can only be interpreted when we find at least partial measurement invariance in most of the items (Byrne, Shavelson, & Muthén, 1989; Reise, Widaman, & Pugh, 1993; Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000; see Appendix). Moreover, effect sizes may be affected by population heterogeneity (Greenland, Schlesselman, & Criqui, 1986).
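As a minimal sketch of this baseline-SD variant of Cohen's d (with fabricated score vectors; the function and variable names are ours, not from the study):

```python
import numpy as np

def cohens_d_baseline(pre, post):
    """Cohen's d for pre-post change, standardized by the
    baseline (pretreatment) SD, because the post-treatment SD
    may itself be affected by treatment."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    return (pre.mean() - post.mean()) / pre.std(ddof=1)

# Fabricated subscale scores for six patients (illustration only)
pre_scores = [14, 10, 12, 16, 11, 13]
post_scores = [11, 9, 10, 14, 10, 12]
d = cohens_d_baseline(pre_scores, post_scores)
```

Dividing by the pretreatment SD rather than a pooled SD keeps the denominator unaffected by treatment-induced variance changes, which matters here because the study found that SDs slightly increased over treatment.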
More detailed information about the CFA (including Models A, B, C, and D) and its interpretation in terms of response shift types can be found in the Appendix and Figure S1.

| RESULTS

All subscale scores showed significant decreases from pretreatment to post-treatment (all p-values <0.001; Table 2), except for the subscale Aggression. The Cohen's d effect sizes ranged from 0.14 to 0.45, which is considered small (Cohen, 1992).

| Change in overall factor structure; reconceptualization
The 7-factor psychopathology structure of the SQ-48 (Figure S1) was analysed at both pretreatment and post-treatment. The factor structure had a poor fit at pretreatment (not shown in Table 4). Looking at the RMSEA, we found that four subscales had a poor fit: Mood, Anxiety, Somatic complaints, and Cognitive problems. Compared with the other fit indices, however, the RMSEA is generally considered less reliable with relatively modest sample sizes and large numbers of parameters (Jackson, 2003).
All item-specific pretreatment and post-treatment factor loadings, threshold estimates for the four item categories, and residuals (residuals of pretreatment fixed to 1 with theta parametrization, see Appendix) of the configural model are demonstrated in Table 3. At pretreatment, all factor loadings were significant with p values of <.001 with the exception of Item 13 ("I considered my death or suicide," p = .46), Item 19 ("I did not want to live anymore," p = .05), Item 38 ("I felt hopeless," p = .48), and Item 43 ("I wanted to hit people if I was provoked," p = .22; see Table 3). These items, all within the Mood and Aggression subscale, loaded in the opposite direction (negative). At post-treatment, we again saw that these items loaded negatively, but only Item 43 (Aggression) was not statistically significant (p = .16). We found that the factor loading of all items increased with the exceptions of Item 10 ("I argued with others"), Item 1 ("I was short of breath with minimal effort"), Item 25 ("I did not dare to go alone to a crowded shop"), and Item 36 ("I felt uncomfortable when other people looked at me").
The factor correlations of the factor model are presented in Table S1. All correlations were generally strong and statistically significant.

The overall factor fit increased significantly over treatment, and the criteria for configural invariance could not be met. We found four items that loaded negatively on their common factor. Item 13 ("I considered my death or suicide"), Item 19 ("I did not want to live anymore"), and Item 38 ("I felt hopeless") seemed to form a separate latent factor consisting of suicidal ideation and hopelessness. Consequently, the model did not fit well within the Mood subscale, which had consequences for the overall model fit. These negative factor loadings increased in magnitude and were nonsignificant at pretreatment but significant at post-treatment. Thus, although Items 13, 19, and 38 were already distinct from the rest of the Mood items, the increase in their factor loadings after treatment indicated a change in item scale meaning, that is, a reconceptualization response shift.

| Factor metrics over time; reprioritization
In order to examine factor loading change in the factor model, the loadings were constrained between time points within the metric invariance model (Model B; Table 2), and the change in factor fit was analysed. The metric variance detected in Items 3, 7, and 40 was due to the reconceptualization in the other Mood items, rather than to reprioritization (see Section 3.2); lifting the constraints on these items resulted in invariant outcomes (Table 4, Table S2). Within the remaining non-invariant items, the factor loadings increased, suggesting a reprioritization response shift.

| Thresholds over time; uniform recalibration
Thresholds were constrained between pretreatment and post-treatment within the partial strong invariance model (Model C; Table S3). Because Items 3, 7, 26, 40, and 47 were not invariant in the metric model, they were excluded from further constraints in the strong model. Differences in thresholds between pretreatment and post-treatment were analysed by testing the change in factor fit between Model B and Model C. The overall model fit remained the same (CFI = 0.90). The chi-square difference tests per factor and for all factors combined were nonsignificant, and ΔCFI did not exceed 0.01 (Model B versus Model C). No uniform change in measurement, that is, no uniform recalibration response shift, could be detected.

| Residual variances over time; non-uniform recalibration
In order to test for change in residual variance between pretreatment and post-treatment, residual variance was constrained between time points within the strict invariance model. Strict invariance assumes that the residual variance does not change during treatment. In order to assess partial strict invariance, residual variance was constrained between pretreatment and post-treatment for all factors combined as well as for each factor separately. Because strict measurement invariance was estimated with theta parameterization, which fixes residual variances to 1, item-specific residuals could not be interpreted (Table S4). No non-uniform recalibration response shift was detected.

| DISCUSSION
We tested the occurrence of response shift concerning self-reported psychopathology in adult psychiatric outpatients across their mental health treatment. We found pretreatment and post-treatment differences in factor structure and item factor loadings. In terms of response shift, it can be concluded that we found reconceptualization within the Mood subscale: the items covering suicidal ideation and hopelessness became more distinct, and patients seemed to approach suicidal ideation after treatment as a concept separate from depression. Thus, it is possible that a considerable proportion of our sample had fewer mood-related symptoms after treatment without experiencing a decrease in suicidal ideation (see also Bringmann, Lemmens, Huibers, Borsboom, & Tuerlinckx, 2015; Nock, Hwang, Sampson, & Kessler, 2010). Second, we found reprioritization within at least two items of the subscales Somatic complaints and Cognitive problems. After treatment, patients seemed to place more value on these problems; perhaps cognitive and somatic problems became more important when patients returned to work after sick leave. In conclusion, our hypothesis that response shift would occur especially in the depression/mood subscale was only partly confirmed, as we also found response shift in other subscales. This may imply that not only depression but also other psychopathology is sensitive to response shift.
Our results are largely in line with the current literature, which indicates that response shift seems to be the rule rather than the exception. Only three studies focused specifically on response shift in psychiatric patients (Fokkema et al., 2013; Nolte et al., 2016; Smith et al., 2016), and all found some level of response shift. Response shift has also been reported in other relevant mental health studies, for example, by Fried et al. (2016) with a clinician-rated scale and by de Beurs, Fokkema, de Groot, de Keijser, and Kerkhof (2015) with a self-report scale.
There is discussion on how strict the requirements should be concerning testing response shift by longitudinal measurement invariance (e.g., Fokkema et al., 2013). A first view states that full measurement invariance is an assumption that is too strict and, therefore, that comparisons of means across treatment are still meaningful when partial invariance is obtained and at least one item within each factor is invariant (Byrne et al., 1989;Steenkamp & Baumgartner, 1998). A second view is more strict and states that most (subscale) items should be invariant in order to make meaningful comparisons of the mean (Reise et al., 1993;Vandenberg & Lance, 2000;Wu, 2016). A third view assumes that true change in scores may be directly linked to respondents' changing perspective as a result of adaption, coping, or treatment (Boucekine et al., 2015;Oort, Visser, & Sprangers, 2009). In this view, response shift should not be considered as a measurement bias but as a true change.
Our study can be approached from all three views. Although we found response shift, it was present within a limited number of items and had no significant effect on the standardized mean difference between pretreatment and post-treatment. Additionally, our patients were mainly treated with CBT, which can cause a shift in cognition and may therefore result in response shift. This is in line with response shift theory, which assumes that changes in a person's health status (e.g., diagnosis and treatment) are the requisite catalyst for response shift (Rapkin & Schwartz, 2004). This was confirmed by Wu, who found response shift across treatment in depressed adolescents (Wu, 2015) but not in nonclinical adolescents (Wu, 2016). Accordingly, Ahmed, Sawatzky, Levesque, Ehrmann-Feldman, and Schwartz (2014) found no response shift in chronically physically ill individuals with stable physical health, which supports the assumption that response shift is not to be expected in patients with relatively stable health conditions (Ahmed et al., 2014). Finally, there may also be other potential explanations for our results than response shift. One alternative explanation is a decrease in the variability of items after treatment (Fried et al., 2016). Due to a decrease in severity, item scores may approach a mean of zero, resulting in small SDs that can no longer exhibit substantial correlations. This could apply to symptoms with low severity in a treated sample (e.g., acute suicidal ideation). However, in our study the SDs slightly increased and the factor fit increased, making it unlikely that this explains our findings.
A strength of this study is that ROM data were collected in a naturalistic sample of real-life patients. We measured a wide range of psychological symptoms in a broad sample of adult psychiatric outpatients.
Also, we examined all response shift types, which so far has been done in only two adult mental health care studies (Fokkema et al., 2013; Nolte et al., 2016). This study also has limitations. We had no detailed individual information on therapists or types of treatment. So, for instance, it is not clear whether response shift varied by treatment type (psychotherapy versus pharmacotherapy). Also, treatment length varied across participants, depending on the specific treatment required and its progress. It may be noted, however, that response shift can already occur after only 1 month of mental health treatment (e.g., Elhai et al., 2013; Latini et al., 2009). Moreover, after dividing our data into treatments longer and shorter than 16 weeks (the median), we found response shifts in the same items for both strata, with the exception of Items 38 and 39 (Tables S5 and S6). Note that these sensitivity analyses should be interpreted with caution because of the limited sample sizes. Our sample size was also not large enough to examine potential subgroup differences in response shift between mental disorders. Finally, the generalization of our results is limited at least to Dutch-speaking patients. Generalization may also be limited by our study population (outpatients), our design (observational pre-post treatment), and our instrument (generic self-report questionnaire). However, this is unlikely, because response shift has also been found in inpatients (e.g., Elhai et al., 2013; Nolte et al., 2016), randomized controlled trials (e.g., Fokkema et al., 2013), disease-specific self-report questionnaires (e.g., Elhai et al., 2013; Fokkema et al., 2013; Fried et al., 2016), and clinician-rated instruments (e.g., Fried et al., 2016).
Future research with multiple follow-ups could specify more exactly what type of response shift occurs at what moment across treatment (soon after the beginning of treatment or after a certain duration of it). Second, further research may evaluate the relative importance of the response shift types (Jakola, Solheim, Gulati, & Sagberg, 2016). For example, it was suggested that recalibration is the only true response shift, because reprioritization and reconceptualization can be seen as coping strategies instead of response shifts (Blome & Augustin, 2015;Gerlich et al., 2016). Third, more research is needed on predictors of which psychiatric patients may experience response shift (Daltroy, Larson, Eaton, Phillips, & Liang, 1999;Wu, 2016;Rapkin, Garcia, Michael, Zhang, & Schwartz, 2017). For instance, response shift seemed more likely to occur in psychotherapy patients than in those treated with medication (Fokkema et al., 2013;Fried et al., 2016;Uher et al., 2008). Additionally, further response shift research is needed to examine possible differences in mental disorders.
On the whole, this study provides additional evidence that response shift may occur in adult psychiatric patients across treatment. The exact meaning of this response shift is not clear: is it an unwanted potential bias in treatment evaluation, or mainly a coping strategy and a desired cognitive change intended by mental health treatment (e.g., CBT)? Future research in this area could give more clarity on this question.

| APPENDIX

Equality was tested with chi-square difference tests. However, as these tests are highly dependent on sample size, the more robust ΔCFI was also calculated to see whether the CFI value differed substantially between CFA models (ΔCFI > 0.01; Gregorich, 2006; Kim, 2005; Rutkowski, 2013). A decrease in model fit was considered significant when the chi-square difference test was significant (p < .05) in conjunction with a ΔCFI above 0.01 (Chen, Curran, Bollen, Kirby, & Paxton, 2008; Cheung & Rensvold, 2002; Hu & Bentler, 1998). When full measurement invariance could not be obtained, the equality constraints on the non-invariant parameters were lifted in order to assess partial measurement invariance (Oort, 2005; Wu, 2016).
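The joint decision rule (a significant chi-square difference together with ΔCFI > 0.01) can be sketched as follows. The fit statistics below are hypothetical, not the study's actual values, and this naive version assumes ordinary chi-square values; with WLSMV estimation a scaled difference test (e.g., Mplus DIFFTEST) would be required in practice:

```python
from scipy.stats import chi2

def invariance_decision(chisq_constrained, df_constrained,
                        chisq_free, df_free,
                        cfi_free, cfi_constrained,
                        alpha=0.05, delta_cfi_cut=0.01):
    """Flag a significant decrease in fit after adding equality
    constraints: a significant chi-square difference (p < alpha)
    in conjunction with a CFI drop greater than delta_cfi_cut."""
    d_chisq = chisq_constrained - chisq_free
    d_df = df_constrained - df_free
    p = chi2.sf(d_chisq, d_df)            # chi-square difference test
    delta_cfi = cfi_free - cfi_constrained
    non_invariant = (p < alpha) and (delta_cfi > delta_cfi_cut)
    return p, delta_cfi, non_invariant

# Hypothetical fit statistics for a free vs. constrained model
p, dcfi, shift = invariance_decision(310.4, 208, 289.0, 196, 0.902, 0.895)
```

With these illustrative numbers the chi-square difference alone is significant, but ΔCFI stays below 0.01, so invariance is retained; this shows why the two criteria are used in conjunction rather than separately.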
These longitudinal measurement invariance tests were used as a framework to test the occurrence of four types of response shifts: reconceptualization, reprioritization, uniform recalibration, and nonuniform recalibration.

Reconceptualization
A factor model is assumed to show configural invariance when the same factor loading pattern is present at both time points (pretreatment and post-treatment): each item should load on the same common factor at both pretreatment and post-treatment. When items load on different latent factors after treatment, this is indicative of a shift in concept; violation of configural invariance is thus indicative of the occurrence of reconceptualization (Oort, 2005).

To test for configural invariance, the 7-factor structure was fitted for both pretreatment and post-treatment (Gregorich, 2006). The factor model is assumed to show configural invariance when the difference in fit indices between the pretreatment and post-treatment models is nonsignificant.
We used bootstrapping (9,999 replicates) to test whether this difference is statistically significant (Canty, 2002;Oort, 2005). Furthermore, we compared the item-specific factor loadings in order to check for salient changes of factor loading directions.
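The bootstrap comparison can be sketched as follows (with fabricated data and our own function names; the real analysis bootstrapped fit indices with 9,999 replicates, whereas this minimal version uses the mean for brevity):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def bootstrap_diff_ci(stat, pre, post, n_boot=9_999):
    """Percentile-bootstrap 95% CI for the pre-post difference in a
    statistic, resampling patients with replacement and keeping each
    patient's pre/post pair intact."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    n = len(pre)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample patients
        diffs[i] = stat(pre[idx]) - stat(post[idx])
    return np.percentile(diffs, [2.5, 97.5])

# Fabricated pre/post scores for 206 patients (illustration only)
pre_scores = rng.normal(12, 3, size=206)
post_scores = rng.normal(10, 3, size=206)
lo, hi = bootstrap_diff_ci(np.mean, pre_scores, post_scores)
significant = not (lo <= 0 <= hi)   # CI excluding zero -> significant
```

If the 95% percentile interval of the bootstrapped differences excludes zero, the pre-post difference is judged statistically significant.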

Reprioritization
The metric invariance model requires corresponding factor loadings to be equal across time points. An increase or decrease in factor loadings after treatment suggests that there is response shift in the form of reprioritization (Oort, 2005); that is, items become more or less indicative of a certain latent factor. We computed a configural model with both time points (pretreatment and post-treatment) combined, without equality constraints (Model A, see Table 4). We then computed a metric model with the factor loadings constrained to be equal between time points (Model B, see Table 4) and compared its fit with that of Model A using a chi-square difference test in conjunction with ΔCFI (Gregorich, 2006; Kim, 2005; Rutkowski, 2013).
Analyses were computed for the whole 7-factor psychopathology model at once as well as per subscale separately. If invariance on subscale level was not met, we further examined partial item-level measurement invariance (Wu, 2016).

Uniform recalibration
To test for uniform recalibration response shift, we must assess whether the regressions of items onto their associated common factors yield similar threshold values across time points. When equality is established, the model shows strong invariance, meaning that there is no indication of uniform recalibration (Oort, 2005); in other words, patients appraise the SQ-48 item response options the same after treatment as before treatment. When there is variance, however, treatment may have changed a patient's idea of the degree of hopelessness indicated by the answer option "often" (Item 38; Oort, 2005). We computed a third model with both the factor loadings (when invariant in the metric model) and the thresholds constrained to be equal between time points (strong invariance model; Model C, see Table 4) and compared its factor fit with that of Model B. If threshold values are equal between pre- and post-treatment, model fit should remain similar after the thresholds are constrained to be equal between time points. A chi-square difference test in conjunction with ΔCFI was calculated in order to quantify the differences (χ², p > .05; ΔCFI < 0.01; Gregorich, 2006; Rutkowski, 2013). Strong invariance tests were computed for the whole 7-factor model at once as well as per subscale separately.
When reprioritization or recalibration was found, the variant items were excluded from further constraints (partial measurement invariance; Wu, 2016).

Non-uniform recalibration
When observed variance estimates across time points are compared, changes should reflect differences in common factor variation rather than contamination by changes in residual variation. Equality between pre- and post-treatment residual variances is called strict invariance. Changes in residual variance indicate the presence of non-uniform recalibration response shift. Non-uniform recalibration means that some SQ-48 item response options become associated with a greater level of that item's specific construct, whereas other response options do not. For example, after treatment the response option "sometimes" may indicate a greater level of hopelessness (Item 38) relative to the response option "rarely" than it did before (Oort, 2005). Such non-uniform recalibrations result in changes in variances that cannot be attributed to changes in common factor variances, that is, residuals (Oort, 2005). In order to test the equality of residual variances, theta parameterization is used. Theta parameterization is presently the most reliable method for constraining residual variance with WLSMV estimation (Hirschfeld & von Brachel, 2014; Muthén & Asparouhov, 2002). The theta approach fixes the residual variance to 1 for all variables in the reference group (pretreatment). In the strict invariance model, the residuals of the post-treatment group are also fixed to 1, in order to test residual equality between pre- and post-treatment (Muthén & Asparouhov, 2002).
Equality constraints on the corresponding factor loadings, thresholds, and residual variances were computed in the strict invariance model (Model D, see Table 4). If residual variances are equal between pre- and post-treatment, the factor fit of Model D should be similar to that of Model C. The differences in factor fit between Model D and Model C were therefore compared in order to detect discrepancies. Equality was assumed when the chi-square difference test was nonsignificant (p ≥ .05) and ΔCFI < 0.01 (Gregorich, 2006; Rutkowski, 2013). Analyses were computed for the whole 7-factor model at once and per subscale separately. Items that were not invariant prior to the strict invariance model were not constrained. If invariance at the subscale level in the strict model was not met, we further examined partial item-level measurement invariance (Wu, 2016).