Measurement invariance and differential item functioning of the PHQ-9 and GAD-7 between working age and older adults seeking treatment for common mental disorders

Background: The nine-item Patient Health Questionnaire (PHQ-9) and seven-item Generalised Anxiety Disorder (GAD-7) scale are widely used clinically and within research, so it is important to determine how the measures, and individual items within them, are answered by adults of differing ages. This study sought to evaluate measurement invariance and differential item functioning (DIF) of the PHQ-9 and GAD-7 between working age and older adults seeking routine psychological treatment. Methods: Data from working age (18-64 years old) and older (≥65 years old) adults in eight Improving Access to Psychological Therapies (IAPT) services were used. Confirmatory factor analysis (CFA) was used to establish unidimensionality of the PHQ-9 and GAD-7, with multiple-group CFA to test measurement invariance and the Multiple Indicators, Multiple Causes (MIMIC) approach to assess DIF. The same methods were applied to a propensity score matched (PSM) sample in sensitivity analyses to control for potential confounding. Results: Data from 166,816 patients (159,325 working age, 7491 older) were used to show measurement invariance for the PHQ-9 and GAD-7, with limited evidence of DIF and similar results found with a PSM sample (n = 5868). Limitations: The localised sample creates an inability to detect geographical variance, and the potential effect of unmeasured confounders cannot be ruled out. Conclusions: The findings support the use of the PHQ-9 and GAD-7 measures for working age and older adults, both clinically and in research settings. This study validates using the measures for these age groups to assess clinically significant symptom thresholds and to monitor treatment outcomes between them.


Introduction
Depression and anxiety disorders are some of the most commonly presenting mental health problems (Craske and Stein, 2016; Malhi and Mann, 2018). As disorders, they cause substantial individual impairment and are associated with increased direct and indirect healthcare costs and reduced productivity (Trautmann et al., 2016). For older adults (≥65 years of age), common mental disorders, defined as depression and anxiety conditions, are particularly problematic as they have been associated with increased disability and use of physical health services (Beekman et al., 2002). As populations age (Harper, 2014), the problematic effect of common mental disorders will only continue to worsen.
Meta-analyses of controlled trials have suggested that psychological interventions are equally effective for depression for both older and working-age adults (Cuijpers et al., 2018; Haigh et al., 2018), and that older adults are less likely to benefit from intervention for anxiety disorders (Gould et al., 2012). Despite that, evidence from routine psychological treatment services suggests that older adults are more likely to benefit from them than working age adults (Saunders et al., 2021). Whilst there are likely to be some differences in the characteristics of older adults taking part in randomised controlled trials compared to those attending routine treatment, there may also be differences in how different age groups interpret measures used to assess treatment effectiveness, resulting in artefactual rather than actual differences in treatment outcomes.
Compared to working age adults, the prevalence of such disorders is generally reported to be lower for older adults (Volkert et al., 2013; Wolitzky-Taylor et al., 2010). Whilst this may reflect real differences, it is possible that these two groups interpret commonly used screening tools for common mental disorder symptom severity differently, which if true would cause issues when comparing scores between age groups. The Patient Health Questionnaire 9-item depression scale (PHQ-9; Kroenke et al., 2001) and 7-item Generalised Anxiety Disorder scale (GAD-7; Spitzer et al., 2006) are two of the most validated and widely used measures for screening common mental disorders and for evaluating treatment efficacy in research (Kroenke et al., 2010) and clinical practice (Clark, 2018). However, to use such measures in group comparisons it is important to establish that the scales measure their respective constructs consistently across different groups of people.
Measurement invariance (Chen, 2008) and differential item functioning (DIF; Ellis, 1989) are tools that can be used to establish this consistency. Measurement invariance assesses whether an instrument or scale is consistently interpreted between different groups of individuals. DIF identifies whether a given item on a scale is answered differently by one group compared to another when the two groups have the same level of the underlying trait of interest. For example, anhedonia appears to be a more common symptom of depression experienced by older adults, compared to tearfulness or sadness (Gum et al., 2010). The PHQ-9 has been shown to exhibit DIF determined by age group (although the cut-off was ≥54 years of age), specifically on items addressing anhedonia, fatigue and low mood (Cameron et al., 2013). Previous research has shown measurement invariance for the PHQ-9 (Patel et al., 2019) and GAD-7 (Shevlin et al., 2022), although not by age group or in a large routine clinical sample. Given the comorbidity of depression and anxiety (Tiller, 2013), there is notably limited research testing DIF of the PHQ-9 and GAD-7 together in a clinical sample, despite the widespread use of these measures.
The aim of this study was to assess the measurement invariance of the PHQ-9 and GAD-7 and DIF of the individual scale items between working age (18-64 years old) and older adults (≥65 years old) seeking psychological therapy for common mental disorders.

Participants
Eight Improving Access to Psychological Therapies (IAPT; now known as NHS Talking Therapies, for anxiety and depression) services provided data on referrals received between January 2011 and August 2020. These services were all members of the North and Central East London IAPT Service Improvement and Research Network (NCEL IAPT SIRN; Buckman et al., 2021; Saunders et al., 2020). These services, grouped together geographically and managed by local NHS Trusts, support the provision of evidence-based psychological treatments for common mental disorders using a stepped-care model (Clark, 2018).
For this analysis, only scores recorded at the initial assessment were included, regardless of whether individuals received formal treatment by the services at later contacts. Further, individuals were included if they had item-level data available for both the PHQ-9 and GAD-7 at their assessment and were at least 18 years of age. Those whose diagnosis (referred to as 'problem descriptor' by services) was recorded as a severe mental illness, such as schizophrenia or substance misuse problems, were also excluded. This is because these primary-care based services do not have standardised treatment protocols for severe mental illness (Clark, 2018), although they can support people with depression or anxiety in the context of a severe mental illness where it is safe to provide care without the input or oversight of a multidisciplinary team. Therefore, if the problem descriptor is recorded as a severe mental illness (indicating it is the focus of treatment), those individuals would have had a different pathway into the services and so will be different from the main analytic sample. To determine the age group comparison in all the analyses, individuals who were 18-64 were recorded as being 'working age' and those ≥65 were considered 'older'.

Measures
The Patient Health Questionnaire-9 (PHQ-9; Kroenke et al., 2001) is used to measure the degree of depression symptom severity. The nine items within the measure address: anhedonia, low mood, sleep, fatigue, appetite, low self-esteem, concentration, psychomotor disturbance and suicidal ideation. Each item is scored between 0 ('not at all') and 3 ('nearly every day'), so total scores range between 0 and 27.
The Generalised Anxiety Disorder-7 (GAD-7; Spitzer et al., 2006) is a measure used to assess the severity of symptoms of generalised anxiety disorder, as classified by DSM-IV. The seven items address: nervousness, uncontrollability of worrying, pervasiveness of worrying, issues relaxing, restlessness, irritability and anticipatory fear. The items are scored in the same manner as the PHQ-9, with total scores ranging between 0 and 21.
Both the PHQ-9 and GAD-7 are used by services to measure symptom severity at assessment to identify clinical need, but are also collected on a sessional basis as part of routine outcome measurement to monitor treatment progress. Within the initial assessment, patients answered a range of additional questions to provide sociodemographic and clinical information. As part of this, their age, gender, ethnicity, employment status and whether they were currently prescribed or taking psychotropic medication were recorded. The Index of Multiple Deprivation (IMD; Noble et al., 2006) was also calculated based on the lower layer super output area (LSOA) and collapsed into quintiles, where a lower quintile indicated greater local area deprivation.

Analysis
To explore the latent factors of depression and anxiety, the evidenced unidimensional structures of each measure were considered (Bianchi et al., 2022; Rutter and Brown, 2017). This is how the factors are commonly considered in clinical practice and research, as positive summative correlations between them exist (Boothroyd et al., 2018; Smith et al., 2020).
In the first instance, confirmatory factor analysis (CFA) was undertaken using the R package 'lavaan' (Rosseel, 2012) for the proposed model (see Fig. 1). Two latent variables that distinctly represented depression and anxiety were constructed using both the PHQ-9 and GAD-7, as well as their correlation (Shevlin et al., 2022).
Three commonly used metrics were calculated to estimate the fit of the CFA model. These included the comparative fit index (CFI), with a threshold for 'good' fit defined as a value of at least 0.95 and an 'acceptable' fit as >0.90 (Hu and Bentler, 1999). The root mean squared error of approximation (RMSEA) was also used, whereby a value of <0.05 was taken to indicate a 'close' fit, 0.05-0.08 an 'acceptable' fit and 0.08-0.10 a 'moderate' fit (Schermelleh-Engel et al., 2003). Finally, the standardised root mean square residual (SRMR) was estimated, with <0.05 used to indicate 'good' fit (Hu and Bentler, 1999) and <0.10 'acceptable' fit (Schermelleh-Engel et al., 2003).
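The cut-offs above can be written down as a small set of labelling rules. The following is a minimal Python sketch rather than the authors' code: the function names, the 'poor' label for values outside the quoted ranges, and the handling of values falling exactly on a boundary (e.g. RMSEA = 0.08) are assumptions made for illustration.

```python
def classify_cfi(cfi: float) -> str:
    """Label a CFI value: >= 0.95 'good', > 0.90 'acceptable'."""
    if cfi >= 0.95:
        return "good"
    if cfi > 0.90:
        return "acceptable"
    return "poor"  # label assumed; the text names no category below 0.90


def classify_rmsea(rmsea: float) -> str:
    """Label an RMSEA value: < 0.05 'close', 0.05-0.08 'acceptable', 0.08-0.10 'moderate'."""
    if rmsea < 0.05:
        return "close"
    if rmsea <= 0.08:  # boundary assigned to the better category (assumption)
        return "acceptable"
    if rmsea <= 0.10:
        return "moderate"
    return "poor"  # label assumed


def classify_srmr(srmr: float) -> str:
    """Label an SRMR value: < 0.05 'good', < 0.10 'acceptable'."""
    if srmr < 0.05:
        return "good"
    if srmr < 0.10:
        return "acceptable"
    return "poor"  # label assumed
```

Applied to the whole-sample values reported in the results (RMSEA = 0.079, CFI = 0.907, SRMR = 0.049), these helpers would return 'acceptable', 'acceptable' and 'good' respectively.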
Multiple-group CFA (MGCFA) was used to assess measurement invariance across different groups (Chen, 2008). To do so, several models with increasing strictness were constructed to estimate the level of invariance: configural invariance (M1), metric invariance (M2), scalar invariance (M3), residual invariance (M4), equality of factor means (M5) and equality of factor variances (M6). Evidence of measurement invariance was determined by comparing the change in model fit statistics between the given model (M) and the preceding model (M-1). When measurement invariance had been established, the model could be adopted. To determine this, changes in fit were compared against predetermined tolerance ranges: change in CFI (ΔCFI) <0.01, ΔRMSEA <0.015 and ΔSRMR <0.03 (Cheung and Rensvold, 2002; Shevlin et al., 2022). χ² values were recorded, but were not used in deciding whether to adopt a model because of known issues with doing so in larger samples (Cheung and Rensvold, 2002).
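The model-comparison rule can be sketched as a single check on the change in fit between a model and its predecessor. This is an illustrative Python sketch, not the authors' implementation: the names `FitChange` and `within_tolerance` are hypothetical, and comparing absolute changes (so that a drop in CFI and a rise in RMSEA or SRMR are treated alike) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitChange:
    """Change in fit indices between model M and the preceding model M-1."""
    delta_cfi: float
    delta_rmsea: float
    delta_srmr: float


# Tolerance ranges quoted in the text.
TOLERANCE = FitChange(delta_cfi=0.01, delta_rmsea=0.015, delta_srmr=0.03)


def within_tolerance(change: FitChange, tol: FitChange = TOLERANCE) -> bool:
    """True when every index changes by less than its tolerance (absolute values assumed)."""
    return (
        abs(change.delta_cfi) < tol.delta_cfi
        and abs(change.delta_rmsea) < tol.delta_rmsea
        and abs(change.delta_srmr) < tol.delta_srmr
    )


# Hypothetical values for illustration only (not taken from the study's tables).
example = FitChange(delta_cfi=-0.004, delta_rmsea=0.002, delta_srmr=0.010)
example_ok = within_tolerance(example)
```

Under this rule a model is retained only if all three changes fall inside their ranges, so a single index outside its range is enough to reject the more constrained model.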
Differential item functioning (Ellis, 1989) determines differences in individual item scores among certain groups (i.e. age groups) or levels of a variable (such as a depression item as individual PHQ-9 question score, or an anxiety item as individual GAD-7 question score) while considering the overall construct being measured. The Multiple Indicators, Multiple Causes Models (MIMIC; Jöreskog and Goldberger, 1975) approach has been applied to assess DIF in similar previous research (MacIntosh and Hashim, 2003; Shevlin et al., 2022). MIMIC is adapted in the current analyses to explore individual item differences between working age and older adults.
The MIMIC models seek to provide information on (1) the factor loadings of the PHQ-9/GAD-7, (2) the regression coefficients between the predictor variables and the latent variables (which show mean differences in the latent variable, based on different levels of the predictor variables) and (3) the direct effects between predictor variables and PHQ-9/GAD-7 items, unaffected by latent variable variability (significant direct effects suggest the presence of DIF). The MIMIC model included two correlated latent variables (depression and anxiety), 16 covariates (the individual items of the PHQ-9/GAD-7) and a single predictor (age group).
Modification indices (MIs) and standardised expected parameter change (SEPC) values were used to decide which direct effects to include within the model. MIs indicate which path could substantially improve the model fit if it were freely estimated, as shown by a reduction in chi-square of >3.84 (the critical value for one degree of freedom, p < .05). However, to avoid adding trivial parameters, a more conservative value of 10 was used to determine potential direct effects based on MI scores. The SEPC indicates the estimated value of a fixed parameter if it were freed and reflects the expected standardised regression coefficient. Since MIs are partially a product of sample size (Chou and Bentler, 1990), the SEPC was used in combination to determine which parameters should be added to the model (Kaplan, 1989). The criteria for adding a direct effect to the model were MI >10 and SEPC >0.20 (Shevlin et al., 2022). The model was repeatedly re-estimated, adding the path with the largest MI/SEPC each time, until no remaining path had MI >10 and SEPC >0.20. The package 'MplusAutomation' (Hallquist and Wiley, 2018) was used for the MIMIC/DIF related analysis.
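The iterative procedure can be sketched as a simple forward-selection loop. This Python sketch is illustrative only and does not call Mplus or MplusAutomation: `refit` is a hypothetical callback standing in for re-estimating the model with the chosen direct effects and returning the MI and SEPC of each remaining fixed path. Two further assumptions are made: the SEPC threshold is applied to the absolute value (the reported SEPCs are negative), and candidates are ranked by MI.

```python
from typing import Callable, Dict, List, Tuple

# (modification index, standardised expected parameter change) for one path
Candidate = Tuple[float, float]


def meets_criteria(mi: float, sepc: float,
                   mi_min: float = 10.0, sepc_min: float = 0.20) -> bool:
    """Apply the MI > 10 and |SEPC| > 0.20 rule (absolute value is an assumption)."""
    return mi > mi_min and abs(sepc) > sepc_min


def select_direct_effects(
    refit: Callable[[List[str]], Dict[str, Candidate]],
) -> List[str]:
    """Add one direct effect at a time until no path meets the criteria.

    `refit` (hypothetical) re-estimates the model with the direct effects
    added so far and returns {path: (MI, SEPC)} for the remaining paths.
    """
    added: List[str] = []
    while True:
        candidates = refit(added)
        eligible = {
            path: stats
            for path, stats in candidates.items()
            if path not in added and meets_criteria(*stats)
        }
        if not eligible:
            return added
        # Rank by MI; ranking by SEPC instead would be an equally valid reading.
        best = max(eligible, key=lambda path: eligible[path][0])
        added.append(best)
```

Re-estimating after every addition matters because freeing one path changes the MIs and SEPCs of the others, so the candidates cannot all be chosen from a single initial run.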
Parameters of the model were estimated using robust maximum likelihood estimation (MLR; Tucker and Lewis, 1973), and fit was assessed with the same statistics as the MGCFA. For the comparison of baseline characteristics between age groups, chi-square tests of independence were used for categorical variables (with Cramér's V to indicate the magnitude), and independent samples t-tests for continuous variables (with Hedges' g to indicate the magnitude).
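For readers less familiar with the two effect sizes, the following Python sketch shows how each is commonly computed. It uses only the standard library and is not the authors' code; the function names are hypothetical, the chi-square statistic is computed without a continuity correction, and Hedges' g uses the usual small-sample approximation 1 - 3/(4N - 9), all of which are assumptions about details the text does not specify.

```python
import math
from statistics import mean, variance
from typing import Sequence


def hedges_g(x: Sequence[float], y: Sequence[float]) -> float:
    """Standardised mean difference with the small-sample bias correction."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    d = (mean(x) - mean(y)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


def cramers_v(table: Sequence[Sequence[float]]) -> float:
    """Cramér's V from a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))
```

Both statistics are less sensitive to sample size than the accompanying p-values, which is useful in a sample of this size where even very small group differences reach statistical significance.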
Differences have been observed in sociodemographic and clinical variables between older and working age individuals attending psychological treatment services at assessment in previous research (Saunders et al., 2021). As such, sensitivity analyses were conducted in which older adults were matched on covariates (excluding age) to working age individuals using propensity score matching (Austin, 2011). Matching was conducted using the ethnicity, gender, mental health service trust, psychotropic medication, IMD quintile, referral year and problem descriptor variables. The PHQ-9 and GAD-7 scores were not used for matching, to avoid impacting the measurement invariance analyses (Saunders et al., 2023). Individuals with missing data on matching covariates were excluded from these sensitivity analyses and matching with replacement was employed, using a narrow caliper of 0.0001 (Gruber et al., 2022). The MGCFA and DIF procedures described above were then repeated for the matched sample of older adults and their working age matches.
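The matching step can be illustrated with a minimal nearest-neighbour sketch in Python. It is not the authors' procedure: it assumes propensity scores have already been estimated (for example from a logistic regression on the matching covariates), that the caliper is applied on the raw propensity score scale, and that each older adult receives a single match. The function name and the identifiers in the example are hypothetical.

```python
from typing import Dict, Optional


def match_with_replacement(
    treated: Dict[str, float],
    controls: Dict[str, float],
    caliper: float = 0.0001,
) -> Dict[str, Optional[str]]:
    """Nearest-neighbour matching on the propensity score, with replacement.

    `treated` and `controls` map individual IDs to pre-estimated propensity
    scores. A treated individual with no control within the caliper is
    mapped to None (i.e. left unmatched and excluded).
    """
    matches: Dict[str, Optional[str]] = {}
    for t_id, t_score in treated.items():
        # Controls are never removed, so one control can serve several matches.
        best_id = min(
            controls,
            key=lambda c_id: abs(controls[c_id] - t_score),
            default=None,
        )
        if best_id is not None and abs(controls[best_id] - t_score) <= caliper:
            matches[t_id] = best_id
        else:
            matches[t_id] = None
    return matches


# Hypothetical scores for illustration only.
older = {"o1": 0.30000, "o2": 0.50000, "o3": 0.30002}
working_age = {"w1": 0.30005, "w2": 0.90000}
pairs = match_with_replacement(older, working_age)
```

A caliper this narrow trades sample size for closeness of matches: individuals without a near-identical counterpart are dropped rather than paired with a poor match.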

Ethics
NHS ethical approval was not needed for this study, as confirmed by the Health Research Authority July 2020 #81/81. The IAPT services provided data as part of a service improvement project, and the research adhered to procedures specified by the data hosting providers and was registered with the relevant NHS Trusts overseeing the IAPT services (project reference: 00519-IAPT).

Descriptive statistics
There were 173,578 people with PHQ-9 and GAD-7 item-level data available. Of these, 1315 individuals were <18 years of age or did not have any age data available and were excluded. Additionally, 5447 were treated for a mental health disorder for which there was no standardised IAPT treatment protocol and so were also excluded. The analytic sample was 166,816 individuals, of whom 159,325 (95.5 %) were working age adults (18-64 years old) and 7491 (4.5 %) were older adults (≥65 years old). This is shown in Fig. 2.
The descriptive statistics for the sample used within the analysis are presented in Table 1, separated by age category. Significant group differences were observed on all baseline variables except for gender.

Confirmatory factor analysis
CFA was applied to the whole sample and then stratified by age range groups. Within the whole sample, model fits were within the acceptable range (RMSEA = 0.079, CFI = 0.907, SRMR = 0.049). Similarly, acceptable metrics were obtained for the working age and older adult age range groups (working age: RMSEA = 0.079, CFI = 0.906, SRMR = 0.049; older adult: RMSEA = 0.074, CFI = 0.917, SRMR = 0.045). Consequently, unidimensionality of the GAD-7 and of the PHQ-9 (as independent scales) was indicated within the models.

Multiple-group confirmatory factor analysis
The results from the MGCFA are presented in Table 2. Measurement invariance was tested with incremental increases of model strictness from M1 to M6, with changes to model fit statistics being below the criteria values. In the initial model to be tested, configural invariance (M1), fit statistics were similar to those seen within the CFA conducted on the whole sample. In the metric invariance (M2) model there were sub-criteria changes in fit statistics, indicating that loadings were similar between the working and older age categories. In the scalar invariance (M3) and residual invariance (M4) models, minimal changes were observed in fit statistics. The factor mean (M5) and factor variance (M6) models likewise led to minimal changes in model fit statistics. Consequently, measurement invariance of the GAD-7 and PHQ-9 between working age and older adults was indicated within the model.

Matched sample multiple-group confirmatory factor analysis
Propensity score matching was undertaken to create a matched sample of working age and older adults who did not have missing covariate data. This led to n = 24,940 (15.7 %) working age and n = 1525 (20.4 %) older adults being excluded. Within the remaining sample of 5966 older adults, there were 98 individuals (1.6 %) for whom adequate matches could not be found and so they were subsequently excluded from these analyses. Therefore, there were 5868 older adults with matched controls (aged 18-64) included as part of the analysis. The significant group differences that were found pre-matching (see Table 1) were non-significant post-matching, as shown in Supplementary Materials 1 Table 1.
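The sample flow for the matched analysis can be checked with a few lines of arithmetic. The Python sketch below only restates the counts reported in the text (the variable names are illustrative) and makes explicit which denominator each percentage uses.

```python
# Counts reported in the text.
working_age_total = 159_325
older_total = 7_491
working_age_missing_covariates = 24_940
older_missing_covariates = 1_525
older_unmatched = 98

# Older adults with complete matching covariates, then those with a match.
older_eligible = older_total - older_missing_covariates
older_matched = older_eligible - older_unmatched

# Exclusion percentages: missing-covariate exclusions are relative to each
# age group's total; unmatched exclusions are relative to eligible older adults.
pct_working_age_missing = round(100 * working_age_missing_covariates / working_age_total, 1)
pct_older_missing = round(100 * older_missing_covariates / older_total, 1)
pct_older_unmatched = round(100 * older_unmatched / older_eligible, 1)
```

These reproduce the reported figures of 5966 eligible and 5868 matched older adults, and the exclusion percentages of 15.7 %, 20.4 % and 1.6 %.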
The measurement invariance results of the matched sample are presented within Supplementary Materials 1 Table 2 and show different patterns of change. The greatest contrast is the difference between Scalar Invariance (M3) and Residual Invariance (M4), with ΔCFI -0.018 in the matched sample, which exceeds the tolerance range. Further, the move from Residual Invariance (M4) to the restriction of factor means led to a ΔCFI of -0.008 and a ΔSRMR of 0.023 which, while within their respective tolerance ranges, are both greater differences than observed anywhere in the non-matched sample. These results indicate that there is not sufficient evidence of measurement invariance between the working age and older samples when they are matched on covariates.

Differential item functioning
Initially, the greatest MI and SEPC was a direct effect between age group and the sixth PHQ-9 item ('Feeling bad about yourself - or that you are a failure or have let yourself or your family down': MI = 593.561, SEPC = −0.266). The direct effect was added and then the model was re-estimated. This showed the next greatest MI/SEPC values to be between the age group variable and the sixth GAD-7 item ('Becoming easily annoyed or irritable': MI = 443.709, SEPC = −0.241). Once this direct effect had also been added to the model, no variables met the MI/SEPC criteria.
The model with all direct effects added indicated that they all had small magnitudes (Age -> PHQ-9 item 6 = −0.051, p < .001; Age -> GAD-7 item 6 = −0.048, p < .001). Additionally, the overall model differences in R-square with the two items included as direct effects were small. The R-square for PHQ-9 item 6 increased from 0.461 to 0.462, suggesting that DIF accounted for 0.1 % of the variance in that item. For GAD-7 item 6, the R-square increased from 0.311 to 0.312, indicating that DIF accounted for 0.1 % of the variance for that item. Table 3 shows the DIF model fit statistics.
As part of a sensitivity analysis, the same DIF process was undertaken within the propensity score matched sample and produced similar results. PHQ-9 item 6 and GAD-7 item 6 were identified as items in which there was evidence of DIF; these results are shown in Supplementary Materials 1 Table 3.

Discussion
This study has demonstrated measurement invariance of the PHQ-9 and GAD-7 between working age and older adults in a large sample of individuals seeking treatment for common mental disorders. Differential item functioning has been shown for one item of the PHQ-9 (item 6) and one item of the GAD-7 (item 6), although the effect was minimal. The findings indicate that the same underlying constructs (depression and anxiety) exist for working age and older adults, supporting the use of the measures within clinical practice. In the matched sample analysis some potential measurement non-invariance was detected, as was differential item functioning in the same items that were identified in the primary analysis.
The demonstrated measurement invariance of the PHQ-9 and GAD-7 in the unmatched sample provides further validation of the measures and supports their use for adults in clinical practice. This validation is important for the use of these measures as screening and outcome monitoring tools (necessary for service evaluation and performance estimation), as well as their use within research. However, it is noteworthy that in the matched sample analysis, fit statistics exceeded the tolerance ranges, indicating a degree of measurement variance. While limited comparable research exists for the specific age split (≥65 years), similar research with predominantly clinical samples has established measurement invariance for the PHQ-9 (Lamela et al., 2020) and GAD-7 (Moreno et al., 2019).
The results indicate that when controlling for the overall level of depression, older adults scored lower on PHQ-9 item 6 (low self-esteem) compared to working age adults. However, the effect size was small and the minimal variance explained suggests that DIF is unlikely to account for group differences in responses to the item. Additionally, when controlling for the overall level of anxiety, older adults scored lower on GAD-7 item 6 (irritability), but the effect size and variance explained were again minimal, so DIF is unlikely to account for group differences on this item either. The detection of potential DIF for these specific items could possibly be partly explained by the larger group difference effect sizes recorded for GAD-7 item 6 and PHQ-9 item 6 relative to the other items in the measures. The findings are generally supported by research showing measurement invariance (and absence of DIF) across multiple sociodemographic variables (Lamela et al., 2020; Moreno et al., 2019).

Limitations
This study should be considered within the context of some limitations. Despite the substantial size of the sample, it is drawn from a localised area and so other sources of geographical variance may not have been detected, limiting the generalisability of the findings. Additionally, although propensity score matching was undertaken, the effect of potential confounding by unmeasured factors cannot be ruled out. Further, the sample was matched on problem descriptors and that may have indirectly constrained GAD-7/PHQ-9 scores, although significant differences were still reported for each continuous variable. The analyses also only tested for uniform rather than non-uniform DIF (where the effect of the independent variable on the item varies depending on the level of the latent variable), but this is in line with prior research that has consistently demonstrated unidimensionality of the measures.

Implications
Measurement invariance of the PHQ-9 and GAD-7 in the unmatched sample supports their clinical use for adults of all ages. However, the variance that arose in the matched sample indicates that there is the potential for bias when comparing scores across groups of working age and older adults. Despite this, given the very limited magnitude of the effect sizes and the minimal variability in the measurement invariance and DIF analyses, any such bias may not be clinically or individually meaningful to any given patient (Bauer-Staeb et al., 2021), although such thresholds have not been tested across age groups. As the PHQ-9 and GAD-7 are tools used in clinical decision making, the presence of measurement invariance and the near absence of DIF mean that uniform thresholds on the measures can be applied across age groups with greater confidence. The limited magnitude of any potential effect also suggests that the measures are suitable tools for routine outcome measurement, as the results do not indicate that alternative measures are needed to compare across age groups.

Conclusions
This study observed measurement invariance in an unmatched sample for the PHQ-9 and GAD-7 between working age and older adults, with potential variance detected in a propensity score matched sample. Minimal differential item functioning was detected for one item of each measure, and this finding was replicated in the matched sample. These results support the use of the measures within clinical practice and research, although future work could test differential item functioning with different covariates such as ethnicity or gender, where intersectionality may impact findings. In addition, to increase the robustness of the findings it would be valuable to use geographically different samples, both nationally (in the UK) and internationally (Cromarty et al., 2016; Knapstad et al., 2018).

Table 1
Sample demographics with group differences.

Table 2
Multiple-group confirmatory factor analysis and fit indices (full sample).

Table 3
Differential item functioning model fit statistics for depression and anxiety.