Measurement Invariance Testing of the Patient Health Questionnaire-9 (PHQ-9) across People with and without Diabetes Mellitus from the NHANES, EMHS and UK Biobank datasets

Background: The prevalence of depression is higher among those with diabetes than in the general population. The Patient Health Questionnaire-9 (PHQ-9) is commonly used to assess depression in people with diabetes, but measurement invariance of the PHQ-9 across groups of people with and without diabetes has not yet been investigated. Methods: Data from three independent cohorts from the USA (n = 1,886 with diabetes, n = 4,153 without diabetes), Quebec, Canada (n = 800 with diabetes, n = 2,411 without diabetes), and the UK (n = 4,981 with diabetes, n = 145,570 without diabetes) were used to examine measurement invariance between adults with and without diabetes. A series of multiple group confirmatory factor analyses were performed, with increasingly stringent model constraints applied to assess configural, equal thresholds, and equal thresholds and loadings invariance, respectively. One-factor and two-factor (somatic and cognitive-affective items) models were examined. Results: Results demonstrated that the most stringent models, testing equal loadings and thresholds, had satisfactory model fit in the three cohorts for one-factor models (RMSEA = .063 or below and CFI = .978 or above) and two-factor models (RMSEA = .042 or below and CFI = .989 or above). Limitations: Data were from Western countries only and we could not distinguish between types of diabetes. Conclusions: Results provide support for measurement invariance between groups of people with and without diabetes, using either a one-factor or a two-factor model. While the two-factor solution has a slightly better fit, the one-factor solution is more parsimonious. Depending on research or clinical needs, both factor structures can be used.


Introduction
People with diabetes have an increased risk of developing clinical and sub-clinical depression (Nouwen et al., 2010; Albertorio et al., 2017). Because depression in diabetes has been associated with adverse consequences, including a higher risk of diabetes complications (Nouwen et al., 2019), mortality (van Dooren et al., 2013), and cognitive decline (Schmitz et al., 2018), regular screening for depressive symptoms among people with diabetes has been recommended to clinicians (Young-Hyman et al., 2016). Originally developed in the US to screen for depressive disorders in primary care (Spitzer et al., 2001), the Patient Health Questionnaire-9 (PHQ-9) has become a widely used self-report depression screening tool for clinicians and researchers across the world. The scale is based on the diagnostic criteria of the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV; American Psychiatric Association, 1994) and assesses the affective, cognitive, and somatic symptoms of depression.
Although the PHQ-9 was developed and validated for use in primary care (Kroenke et al., 2010; Sung et al., 2013), it is also commonly used in samples of people with diabetes. Some studies have provided psychometric support for use of the PHQ-9 in people with type 1 and type 2 diabetes. For instance, the PHQ-9 has been shown to have good internal consistency and convergent validity in people with diabetes (Janssen et al., 2016). However, questions remain regarding measurement invariance and the PHQ-9's underlying factor structure, using one or two factors, in diabetes samples (Boothroyd et al., 2019). In primary care samples, studies relying on exploratory factor analysis tend to report a single-factor solution (i.e., all items loading on the same single factor) (e.g., Cameron et al., 2008; Dum et al., 2008; Hanssen et al., 2009; Huang et al., 2006), while studies using confirmatory factor analysis have found that a two-factor solution, with one factor reflecting the somatic items of the PHQ-9 and the other factor the cognitive-affective items, has better model fit (Beard et al., 2016; Chilcot et al., 2013; Petersen et al., 2015). The only study examining the PHQ-9 factor structure in people with diabetes, to our knowledge, found a better fit for a two-factor solution than for a one-factor solution in people both with and without diabetes (Janssen et al., 2016).
Measurement invariance can be used to compare the stability of a scale's factor structure between groups (van de Schoot et al., 2015). A number of studies have shown that the PHQ-9 is scale or measurement invariant, that is, that the PHQ-9 items measure the same underlying factor structure across different groups and settings for both the one- and two-factor models. Specifically, invariance has been shown for different demographic groups (age, sex, education level and marital status; González-Blanch et al., 2018; Leung et al., 2020; Harry et al., 2019; Petersen et al., 2015; Villarreal-Zegarra et al., 2019), cross-cultural and ethnic groups (e.g., US student populations, and Dutch and ethnic minority groups in the Netherlands; Keum et al., 2018; Miranda and Scopetta, 2018; Galenkamp et al., 2017; Arthurs et al., 2012; Merz et al., 2011), and patient groups (e.g., primary and secondary care) in the US and elsewhere. However, only one study has reported assessing whether the PHQ-9 is scale invariant between people with and without diabetes, and it provided no detailed statistical information on measurement invariance testing. Research demonstrating measurement invariance is needed, as it currently remains unclear whether the meaning of the PHQ-9 items is similar for people with and without diabetes. This is particularly relevant for a scale assessing depressive symptoms in people with diabetes, given the overlap between the somatic symptoms of depression (e.g., trouble falling/staying asleep, sleeping too much; feeling tired or having little energy; poor appetite or overeating) and those of diabetes (Harding et al., 2019; McDade & Watson, 2011; Roy et al., 2012).
Taken together, depression is a common comorbidity of diabetes that is often assessed in research and clinical practice using the PHQ-9. However, the PHQ-9 has not yet been validated in terms of measurement invariance in this population. Therefore, the aim of this study was to determine, first, whether the PHQ-9 is measurement invariant between people with and without diabetes and, second, whether a one-factor or a two-factor solution was a better fit for diabetes samples, using three large data sets, namely the National Health and Nutrition Examination Survey (NHANES; USA), the Emotional Well-Being, Metabolic Factors and Health Status (EMHS) study and the Health, Inflammation, and Depression (HID) study (Quebec, Canada), and the UK-Biobank (UK).

Samples
Participants were from three data banks, namely the US NHANES 2009-2014, the EMHS-HID from Canada, and the UK-Biobank.
NHANES-USA
The National Health and Nutrition Examination Survey (NHANES) is a multipurpose cross-sectional health survey that measures the health and nutritional status of the civilian noninstitutionalized U.S. population. NHANES uses a complex, stratified, multistage probability cluster design to select a representative sample of the population. In the U.S., the NHANES is the only national health survey that contains a dual protocol for the collection of self-reported health information and of clinical, physical examination, and laboratory data. The data are collected systematically, allowing the identification of both diagnosed and undiagnosed health conditions (Boltri et al., 2005).
NHANES data collection is completed in two phases. The first phase is a face-to-face interview in the participant's household. The second phase consists of a series of private interviews, and physical and laboratory examinations held in a mobile examination center (MEC). Both the household and MEC interviews are performed using a standard protocol with trained staff and recorded using computer-assisted personal interviewing (CAPI) protocols. The NHANES data collection protocol has been approved by the National Center for Health Statistics Research Ethics Review Board. All participants provided written informed consent. More information about NHANES collection protocols can be found elsewhere (National Health and Nutrition Examination Survey, 2013, 2014).
Analyses were performed using NHANES public-use data files. NHANES data files are based on independent two-year cycles. Each data cycle contains its respective complex survey elements, such as weight, primary sampling unit, and strata, to allow extrapolation to the general non-institutionalized adult population. To increase the sample size, and thereby the statistical power and precision for the diabetes subdomain (National Health and Nutrition Examination Survey, 2011), three NHANES data cycles were merged for this study (i.e., 2009-2010, 2011-2012, and 2013-2014) (National Health and Nutrition Examination Survey, 2013).
For the present study, a total of 8,221 eligible adults aged 20 and older who visited the MEC and provided complete case data on diabetes and depression were selected. The PHQ-9 was administered by trained interviewers. Of those eligible, 1,886 participants self-reported having a diagnosis of diabetes and 4,153 participants self-reported never having received a diagnosis of diabetes. Participants who had diabetes as confirmed by a positive lab test (glucose and HbA1c) but were unaware of having the condition (n = 2,182) were excluded from the analysis.
Diabetes status in NHANES was based on self-reported physician diagnosis. Physician-diagnosed diabetes was obtained during the household interview. Women who reported having diabetes only during pregnancy were excluded from the sample. For the current study, sample weighting was not included in the invariance analysis since we aimed to study the measurement invariance and not derive non-institutionalized population estimates.
EMHS-HID-Canada
The EMHS and HID cohorts were combined for use in the present study to examine people with (HID) and without (EMHS) self-reported diabetes. Both samples were recruited from the original CARTaGENE study (www.cartagene.qc.ca), a large health survey of 20,004 French- and English-speaking residents of the Canadian province of Quebec between the ages of 40 and 69 years in 2009 (Awadalla et al., 2013). The principal aim of the EMHS study was to examine the combined role of metabolic dysregulations and depressive symptoms in the risk of type 2 diabetes (Schmitz et al., 2016). The principal aim of the HID study was to examine the role of systemic inflammation in depression among people with diabetes. The EMHS and HID study procedures were approved by the Douglas Mental Health University Institute Ethics Board and the St Justine Hospital Research Ethics Board.
A total of 2,525 participants without diabetes at baseline from the CARTaGENE study participated in the EMHS follow-up in 2014-2015, and a total of 719 participants with diabetes from the CARTaGENE study participated in the HID follow-up in 2017. There were 87 participants who did not have diabetes at the CARTaGENE baseline assessment but had developed diabetes by the EMHS assessment, and they were thus included in the diabetes sample for the present analyses. For the EMHS study, individuals with depressive symptoms and metabolic factors were oversampled. For the HID study, individuals with depressive symptoms were oversampled. The original CARTaGENE study collected survey information on depressive symptoms, lifestyle, and demographic information, as well as health-related information, in addition to the collection of blood samples. CARTaGENE participants who participated in EMHS and HID completed a phone interview 5 years and 7 years, respectively, following the initial CARTaGENE assessment. These follow-up studies included the administration of the PHQ-9, used in the present study's analyses. Data were collected by CAPI.
Diabetes was assessed by self-report based on a diagnosis of diabetes made by a physician (EMHS) or by either a self-reported diagnosis of diabetes or HbA1c levels equal to or above 6.5% during the CARTaGENE baseline assessment (HID). Demographic variables were assessed by self-report.
For the present study, a total of 2,411 EMHS participants without diabetes and 800 HID participants with diabetes were included (total n = 3,211), based on having complete available data on all PHQ-9 items. French- and English-speaking participants were included, though the majority (n = 2,913; 91%) were French speakers.
UK Biobank
The UK Biobank includes a population-based cohort consisting of 501,726 individuals aged 40-69 years in 2006-2010, who were recruited through direct mailing invitations sent to the National Health Service contact details of 9.2 million people living in reasonable proximity to one of 22 assessment centres throughout the United Kingdom. Assessment included demographic, socio-economic, psychosocial, and environmental factors, health status, and a range of physical measures (Sudlow et al., 2015). Participants entered their responses using a touchscreen. Support was available at all times, and a mouse and keyboard were available for those not comfortable with touchscreen computers (see https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/Touchscreen.pdf).
In 2016, an online link for a mental health questionnaire including the PHQ-9 was sent to all UK Biobank participants with an email address (n = 339,092), of whom 46% (n = 157,367) responded (31% of the total cohort) (Davis & Hotopf, 2019). The UK Biobank and subsequent amendments, including the added mental health questionnaire, were approved by the North West Multi-centre Research Ethics Committee.
For this study, we used a data set that was created to examine the genetic aspects of depression in people with diabetes and that only included participants with white European heritage (n = 487,320). Of those, 25,474 participants self-reported having diabetes mellitus, while 459,727 participants self-reported never having received a diagnosis of diabetes.
PHQ-9 data were available for 5,122 participants with diabetes and for 148,311 participants without diabetes. Among those, 4,981 participants with diabetes (97%) and 145,570 controls without diabetes (98%) had complete PHQ-9 data and were included in the current study (total n = 150,551).

Measures
Depressive symptoms. In all cohorts, the 9-item Patient Health Questionnaire (PHQ-9; Kroenke et al., 2001) was used to assess depressive symptoms. Item names, along with their frequencies across the three cohorts, are presented in Supplementary Table 1. Participants scored each item on a 4-point scale: 0 ("not at all"), 1 ("several days"), 2 ("more than half of the days"), or 3 ("nearly every day"). The total score ranges from 0 to 27. Because the PHQ-9 response options in the UK Biobank ranged from 1 to 4, these were recoded to 0-3.
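The recoding and scoring described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names `recode_ukb_item` and `phq9_total` are hypothetical, the example respondent is invented, and rejecting incomplete or out-of-range responses is an assumption that mirrors the complete-case inclusion rule.

```python
from typing import List, Sequence

N_ITEMS = 9                    # the PHQ-9 has nine items
MIN_SCORE, MAX_SCORE = 0, 3    # item response range after recoding


def recode_ukb_item(raw: int) -> int:
    """Map a response coded 1-4 (UK Biobank style) onto the 0-3 scale."""
    if raw not in (1, 2, 3, 4):
        raise ValueError(f"expected a response in 1-4, got {raw!r}")
    return raw - 1


def phq9_total(items: Sequence[int]) -> int:
    """Sum nine item scores (each 0-3) into a total score of 0-27."""
    if len(items) != N_ITEMS:
        raise ValueError("complete data on all nine items are required")
    if any(not (MIN_SCORE <= x <= MAX_SCORE) for x in items):
        raise ValueError("item scores must lie between 0 and 3")
    return sum(items)


# Hypothetical respondent with 1-4 coding (not real study data).
raw_responses: List[int] = [1, 2, 1, 4, 3, 1, 1, 2, 1]
recoded = [recode_ukb_item(r) for r in raw_responses]
total = phq9_total(recoded)
print(recoded, total)
```

Subtracting 1 from every response shifts the scale without changing the spacing between categories, so the ordinal information used in the analyses is preserved.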
Demographics included age, sex at birth, education level, and ethnic background (for details see Table 1).
Note: Unweighted percentages for NHANES education level do not add to 100% due to missing data.

Statistical analysis
A confirmatory factor analysis was first conducted to determine the fit of the PHQ-9 one-factor and two-factor solutions in each of the three cohorts for participants with and without diabetes. For the two-factor structure, the first factor included items reflecting a cognitive-affective domain of depression (items 1, 2, 6, 9) and the second factor included items reflecting a somatic domain of depression (items 3, 4, 5, 7, 8) (Galenkamp et al., 2017). To examine PHQ-9 measurement invariance across groups, we conducted a series of measurement invariance tests using multigroup confirmatory factor analysis (CFA) with the weighted least squares mean and variance adjusted (WLSMV) estimator to account for the ordinal nature of the data (Svetina et al., 2020). Measurement invariance tests are performed in a hierarchical manner, with increasingly stringent constraints applied to the parameters to determine if they are equal across groups. Support for measurement invariance at each hierarchical level would lend strong support to the notion that the instrument is indeed measuring the same construct in a similar way across diabetes status groups. For the present study, we tested whether the one-factor and the two-factor structures of the PHQ-9 were invariant across groups of adults with diabetes and without diabetes.
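The two factor structures can be written down as item-to-factor mappings. The sketch below is illustrative only: the dictionary and function names are hypothetical, the respondent is invented, and the simple within-cluster sums are observed sub-scores, not the latent factor scores estimated by the CFA.

```python
from typing import Dict, Sequence, Tuple

# Item clusters as described in the text (1-based PHQ-9 item numbers).
TWO_FACTOR_MODEL: Dict[str, Tuple[int, ...]] = {
    "cognitive_affective": (1, 2, 6, 9),
    "somatic": (3, 4, 5, 7, 8),
}
ONE_FACTOR_MODEL: Dict[str, Tuple[int, ...]] = {
    "depression": tuple(range(1, 10)),
}


def is_partition(model: Dict[str, Tuple[int, ...]], n_items: int = 9) -> bool:
    """True if every item 1..n_items is assigned to exactly one factor."""
    assigned = sorted(i for items in model.values() for i in items)
    return assigned == list(range(1, n_items + 1))


def subscores(responses: Sequence[int],
              model: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
    """Sum item scores (0-3) within each factor of the given model."""
    if len(responses) != 9:
        raise ValueError("nine item responses are required")
    return {factor: sum(responses[i - 1] for i in items)
            for factor, items in model.items()}


# Hypothetical respondent (items 1..9, already coded 0-3).
responses = [0, 1, 0, 3, 2, 0, 0, 1, 0]
two_factor = subscores(responses, TWO_FACTOR_MODEL)
one_factor = subscores(responses, ONE_FACTOR_MODEL)
print(two_factor, one_factor)
```

Because both mappings partition the same nine items, the two sub-scores always add up to the one-factor total.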
For each test, PHQ-9 items were modelled as categorical variables. The recommended guidelines outlined by Wu and Estabrook (2016) and Svetina et al. (2020) for invariance testing of ordinal data were followed. This approach differs from the traditional approach to measurement invariance for continuous outcome variables (e.g., Vandenberg and Lance, 2000), but is better suited to ordinal variables such as the PHQ-9 items. Three increasingly restrictive models were hierarchically examined to determine the degree of measurement invariance between participants with and without diabetes within each cohort. Configural, threshold, and threshold and loading invariance were tested using the WLSMV estimator for ordinal data. In the first step, configural invariance was tested by examining whether the underlying factor structures are measured by the same items across those with and without diabetes, with an equal number and pattern of parameters across groups. In the second step, threshold invariance was tested by imposing an additional constraint on item thresholds, which were constrained to be equal across diabetes groups. In a final step, threshold and loading invariance was tested by additionally constraining factor loadings to be equal across diabetes groups. An additional exploratory test was carried out to examine measurement invariance of the PHQ-9 across the three cohorts included in the present study (NHANES, EMHS-HID, and the UK Biobank). Analyses were conducted using Mplus version 7.4 (Muthén and Muthén, 2018).
Goodness-of-fit statistics were estimated for each CFA and measurement invariance model and were evaluated against standard criteria (Chen, 2007). While a non-statistically significant chi-square value is often used to indicate model fit, this value is sensitive to sample size. Given the large sample sizes of the present study, we focused on indicators of model fit that are less sensitive to sample size (Chen, 2007), although we report chi-square statistics for completeness. A Root Mean Square Error of Approximation (RMSEA) equal to or below .08 and a Comparative Fit Index (CFI) value equal to or above .90 are indicative of acceptable model fit. For measurement invariance testing, an additional set of criteria was examined, based on changes in CFI and RMSEA values across the increasingly stringent measurement invariance models. Changes in RMSEA < 0.015 and changes in CFI < 0.01 indicate support for that level of measurement invariance (Chen, 2007). A chi-square difference test was also conducted, comparing each more stringent model with the model from the previous step (i.e., the thresholds only model was compared to the configural invariance model, and the thresholds and loadings model was compared to the thresholds only model), with a non-statistically significant chi-square difference test indicating measurement invariance. However, the chi-square difference test is also sensitive to sample size, and therefore the focus for determining measurement invariance remained on changes in RMSEA and CFI.
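The decision rule described above can be sketched in a few lines. This is an illustrative reading of the criteria, not the authors' procedure: the class and function names are hypothetical, the fit values in the example are invented rather than taken from the tables, and treating a "change" as a worsening of fit (an increase in RMSEA or a decrease in CFI), so that improvements always pass, is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

RMSEA_MAX = 0.08         # acceptable absolute fit: RMSEA <= .08
CFI_MIN = 0.90           # acceptable absolute fit: CFI >= .90
DELTA_RMSEA_MAX = 0.015  # change in RMSEA must stay below this
DELTA_CFI_MAX = 0.01     # change in CFI must stay below this


@dataclass(frozen=True)
class Fit:
    label: str
    rmsea: float
    cfi: float


def acceptable(fit: Fit) -> bool:
    """Absolute fit check against the RMSEA and CFI cut-offs."""
    return fit.rmsea <= RMSEA_MAX and fit.cfi >= CFI_MIN


def invariance_steps(models: List[Fit]) -> List[Tuple[str, bool]]:
    """Compare each model with the previous, less constrained one."""
    results = []
    for previous, current in zip(models, models[1:]):
        rmsea_worsening = current.rmsea - previous.rmsea  # assumption: signed
        cfi_worsening = previous.cfi - current.cfi        # assumption: signed
        supported = (acceptable(current)
                     and rmsea_worsening < DELTA_RMSEA_MAX
                     and cfi_worsening < DELTA_CFI_MAX)
        results.append((current.label, supported))
    return results


# Invented fit values for illustration only (not results from this study).
sequence = [
    Fit("configural", rmsea=0.060, cfi=0.985),
    Fit("thresholds", rmsea=0.058, cfi=0.984),
    Fit("thresholds + loadings", rmsea=0.055, cfi=0.982),
]
print(acceptable(sequence[0]), invariance_steps(sequence))
```

Each step is judged only against the step immediately before it, which matches the sequential comparison used for the chi-square difference tests.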

Results
Sample characteristics are presented in Table 1. The frequencies of responses on all PHQ-9 items, stratified by cohort, are presented in Supplementary Table 1. Overall, individuals with diabetes in all cohorts tended to report a greater severity of depressive symptoms than those without diabetes.
Results of the CFA tests demonstrated good model fit for both the one-factor and the two-factor solutions (Table 2) in each cohort. However, there was stronger support for a two-factor structure in all cohorts, and particularly for the UK Biobank according to the RMSEA value. In this cohort, RMSEA in the one-factor solution was .076 for those with diabetes and .074 for those without diabetes, whereas in the two-factor solution RMSEA was .052 for those with diabetes and .049 for those without diabetes.
Table 3 presents the results of the measurement invariance tests per cohort for the one-factor solution models. We found that applying increasingly stringent constraints across the three levels of invariance testing did not meaningfully reduce model fit (all ΔCFI values were below 0.01 and all ΔRMSEA values were below 0.015), and model fit was acceptable for the configural, thresholds only, and loadings and thresholds models in all three cohorts. The most stringent model, testing equal loadings and thresholds, demonstrated satisfactory model fit in the three cohorts (RMSEA = .063 or below and CFI = .978 or above), providing support for measurement invariance between groups of people with and without diabetes using a one-factor solution.
Table 4 presents the results of the measurement invariance tests per cohort for the two-factor solution models. Similarly, applying increasingly stringent constraints did not meaningfully reduce model fit according to the critical values (all ΔCFI values were below 0.01 and all ΔRMSEA values were below 0.015), and model fit for the configural, thresholds only, and loadings and thresholds models in all three cohorts was acceptable. We also found that the most stringent model, testing equal loadings and thresholds, was satisfactory across the three cohorts (RMSEA = .042 or below and CFI = .990 or above). Model fit indices and chi-square difference tests provided greater support for the models testing a two-factor solution than for the models testing a one-factor solution.
Finally, results indicated support for measurement invariance in the models comparing the PHQ-9 items in participants with diabetes across the three cohorts included in the present study (Table 5). Although the chi-square difference tests for the more stringent models were statistically significant, this was not surprising given the large sample sizes. Configural, equal thresholds, and equal thresholds and loadings measurement invariance was supported by the changes in RMSEA and CFI values across the increasingly stringent models (all ΔCFI values were below 0.01 and all ΔRMSEA values were below 0.015), and model fit was acceptable for the one-factor and the two-factor solutions, though models with a two-factor solution fitted better (Table 5).

Discussion
The present study examined whether the PHQ-9 was measurement invariant across people with and without diabetes in three large independent datasets from the USA (NHANES), Quebec, Canada (EMHS-HID), and the UK (UK-Biobank). The results showed that the PHQ-9 was measurement invariant across people with and without diabetes in each of the datasets. Measurement invariance was found for both the one-factor and the two-factor solutions of the PHQ-9. The findings suggest that the meaning of the PHQ-9 items is similar for people with and without diabetes. Thus, despite differences in the level of depression between people with diabetes and without diabetes, and the potential overlap of some depression symptoms (e.g., trouble falling asleep, sleeping too much; feeling tired or having little energy; poor appetite or overeating) with symptoms of diabetes, these differences do not seem to be attributable to differences in interpretation of the PHQ-9 items.
Comparing the diabetes samples across the three datasets also showed measurement invariance of the PHQ-9 despite differences in ethnic make-up, with NHANES being a multi-ethnic cohort, the EMHS-HID cohort predominantly composed of French Canadians, and the UK-Biobank cohort in this study entirely composed of white Europeans. Our results extend those of earlier studies showing PHQ-9 measurement invariance across various demographic, cultural and ethnic, and patient groups and settings (Merz et al., 2011; Arthurs et al., 2012; Petersen et al., 2015; Galenkamp et al., 2017; González-Blanch et al., 2018; Keum et al., 2018; Miranda and Scopetta, 2018; Harry et al., 2019; Villarreal-Zegarra et al., 2019; Leung et al., 2020) and demonstrate the robustness of the PHQ-9 across groups and settings.
It is important to note that despite measurement invariance indicating that any differences between cohorts cannot be attributed to differences in interpretation of the PHQ-9 items, gender, socio-economic and cultural differences may exist between countries affecting the way depressive symptoms are perceived and expressed. Further research is needed to address this issue.
We also examined whether the factor structure of the PHQ-9 was better explained by a one-factor solution or a two-factor solution with one factor reflecting the somatic items and the other factor the cognitive-affective items. The results demonstrated that both factor solutions had good fit to the data, with the two-factor solution showing a somewhat better fit. Our results are consistent with those of Janssen et al. (2016), who also found support for both one-factor and two-factor solutions in a sample of people with diabetes, and with research in non-diabetes samples (e.g., Chilcot et al., 2013; Petersen et al., 2015; Beard et al., 2016). While the two-factor solution seems to better fit PHQ-9 data across populations, the one-factor solution is more parsimonious. The implication for using the PHQ-9 in people with diabetes is that both factor structures can be used for measuring depressive symptoms, and the factor solution can be selected according to research or clinical needs. The two-factor solution can provide additional information on the cognitive and somatic domains of depression in people with diabetes, whereas the one-factor solution can be used to screen for depression, with recommended cut-off scores that can be applied to the total scale summary score to indicate potentially high levels of depression (Spitzer et al., 2001).
Note. The two-factor models include a somatic symptom cluster (items 3, 4, 5, 7, 8) and a cognitive-affective symptom cluster (items 1, 2, 6, 9).
The results of the current study should be interpreted in light of its limitations. First, the three datasets did not include information to distinguish between types of diabetes, although the cohorts were likely composed predominantly of people with type 2 diabetes. For example, using treatment-based algorithms, Mosslemi et al. (2020) showed that 94% of NHANES participants had type 2 diabetes, while Thomas et al. (2018) estimated that 4% of white Europeans with diabetes in the UK-Biobank had type 1 diabetes based on genetic risk scores. Further studies are needed to establish whether the factor structure of the PHQ-9 and measurement invariance differ between the two types of diabetes. Second, the study was based on data from three culturally closely linked Western countries (USA; Quebec, Canada; UK) and, despite the availability of the PHQ-9 in many countries and languages across the globe, the results may not be generalisable to other countries, cultures, and languages. While previous research has shown that the PHQ-9 is measurement invariant between different ethnic groups in the Netherlands (Baas et al., 2011; Galenkamp et al., 2017), between Germans and native Russians living in Germany (Hirsch et al., 2013), and across sex, race/ethnicity and education within the NHANES 2015-2016 data set (Patel et al., 2019), only partial measurement invariance was found between Chinese and German student samples (Zhou et al., 2020). Further research is needed to examine whether the results can be extended to other cultures and languages. Third, two of the study samples (EMHS-HID and the UK Biobank) had a limited age range and consisted only of middle-aged participants. However, despite the differences in age range between these two cohorts and the NHANES cohort, the PHQ-9 showed invariance across the three cohorts. Moreover, our findings are in line with those of González-Blanch et al. (2018), who found measurement invariance of the PHQ-9 across younger (20-39 years old) and older (40-65 years old) adults in primary care patients in Spain. Fourth, the NHANES and UK Biobank assessed diabetes status by self-report, whereas the EMHS-HID cohort included a combination of self-report and HbA1c in the diabetes assessment. However, given that measurement invariance was found in all three cohorts, the nature of the diabetes assessment in this study did not seem to impact the overall findings. Fifth, there were differences in the way the PHQ-9 was administered between the cohorts. In NHANES and EMHS-HID, trained interviewers were used, while in the UK-Biobank participants entered their responses via a touchscreen. However, despite these differences, the PHQ-9 was measurement invariant between the cohorts for people with diabetes. Finally, participants in the EMHS-HID with high depressive symptoms were oversampled, resulting in higher mean levels of depression compared to the NHANES and UK Biobank cohorts.
Note. Indications for change in goodness-of-fit statistics: ΔCFI < 0.01 and ΔRMSEA < 0.015 indicate support for that level of invariance.
It can be concluded that the meaning of the PHQ-9 items is similar for people with and without diabetes. Both the two-factor and the one-factor solutions show good fit to the data, suggesting that both factor structures can be used for measuring depressive symptoms among people with diabetes, and that the factor solution can be selected based on research or clinical need. Comparisons of PHQ-9 scores between these two groups are likely to be meaningful, whether presented as a total score or as sub-scores of somatic and/or cognitive-affective items.
A. Nouwen et al.

Author statement
AN and SD conceptualized the research question, conducted the statistical analyses, and wrote the original draft of the manuscript. NS, ZB, IP, and JAD provided assistance with methodology. All authors reviewed and edited drafts of the manuscript and approved the final manuscript.

Role of the Funding Source
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest
All authors declare that they have no conflicts of interest.