Valid group comparisons can be made with the Patient Health Questionnaire (PHQ-9): A measurement invariance study across groups by demographic characteristics

Objective To analyze the measurement invariance and the factor structure of the Patient Health Questionnaire-9 (PHQ-9) in the Peruvian population. Method Secondary data analysis performed using cross-sectional data from the Health Questionnaire of the Demographic and Health Survey in Peru. Variables of interest were the PHQ-9 and demographic characteristics (sex, age group, level of education, socioeconomic status, marital status, and area of residence). Factor structure was evaluated by standard confirmatory factor analysis (CFA), and measurement invariance by multi-group CFA, using standard goodness-of-fit criteria to interpret the results of both CFAs. Internal consistency (α and ω) was also analyzed. Results Data from 30,449 study participants were analyzed: 56.7% were women, the average age was 40.5 years (standard deviation (SD) = 16.3), 65.9% lived in urban areas, 74.6% were married, and participants had 9 years of education on average (SD = 4.6). From standard CFA, a one-dimensional model presented the best fit (CFI = 0.936; RMSEA = 0.089; SRMR = 0.039). From multi-group CFA, all progressively restricted models had ΔCFI<0.01 across almost all groups by demographic characteristics. PHQ-9 reliability was optimal (α = ω = 0.87). Conclusions The evidence supports the one-dimensional model and the measurement invariance of the PHQ-9, allowing for reliable comparisons between sex, age groups, education level, socioeconomic status, marital status, and residence area, and supports its use within the Peruvian population.

Evidence on the PHQ-9's measurement invariance according to sex is less consistent than in other group comparisons. Some studies support a strong invariance for sex comparisons [11,16,17], whereas other studies report weak or even no measurement invariance across sexes [18,19], indicating that men and women could be interpreting the PHQ-9 items differently. In regard to other demographic variables, strong invariance has been evidenced across races/ethnicities [11], age, marital status, and educational level [16]. However, there are no studies on PHQ-9's measurement invariance across urban/rural areas or socioeconomic status. Despite this scant evidence, several studies make comparisons between these groups using the PHQ-9 [13,14]. Ultimately, the empirical evidence on the measurement invariance of the PHQ-9 across these demographic characteristics is still insufficient.
Indeed, it remains unclear if the PHQ-9 is consistently one-dimensional as originally developed [20]. Some studies have found only one underlying dimension that summarizes depressive symptoms as a whole [16,18,21]. However, depressive symptomatology is a multidimensional construct which includes cognitive, emotional, social, sexual, and other disruptions, and not all instruments have been designed for detecting and scaling each of its dimensions [22]. Other studies have identified at least two dimensions in the PHQ-9, including somatic, non-somatic or cognitive-affective dimensions [23][24][25][26][27]. Indeed, a recent systematic review identified that evidence on the dimensional structure of the PHQ-9 is still inconclusive [28]. Therefore, an important first step is to consider the dimensionality of the PHQ-9.
In view of the deficiencies in the evidence base relating to the PHQ-9's psychometric properties, we sought to evaluate the measurement invariance of the PHQ-9 across groups by selected demographic characteristics, following three steps: 1) to identify the most appropriate factor structure for the PHQ-9 in the Peruvian population (one-dimensional or two-dimensional model); 2) to assess PHQ-9 measurement invariance by sex, age, education level, socioeconomic status, marital status and rural-urban area; and 3) to estimate the PHQ-9 reliability.

Study design
A secondary data analysis was conducted using data from the Peruvian Demographic and Health Survey (ENDES in Spanish), a nationally representative survey conducted annually. Since 2014, the ENDES has included a Health Questionnaire that assesses different aspects of health, such as mental health, oral health, and chronic diseases. Only cross-sectional information from the 2016 ENDES Health Questionnaire was used, which is available on the website of the National Institute of Statistics and Informatics (INEI, in Spanish) [29].
The ENDES design uses a two-stage random sampling technique, differentiated for rural and urban areas. In rural areas, the primary sampling units were groups of 500-2,000 individuals and the secondary sampling units were the households within each of these groups. In urban areas, the primary sampling units consisted of blocks or groups of blocks with more than 2,000 individuals and an average of 140 households, and the secondary sampling units were the same as in rural settings [30].
items of PHQ). Initially there were 31,622 participants; however, after excluding participants with incomplete information, data of 30,449 participants were analyzed (see Fig 1).

Measurements
The PHQ-9 is a self-report Likert-type questionnaire consisting of nine items based on the nine DSM-IV criteria for major depressive disorder (MDD). Each item has four response options (0 = not at all; 1 = several days; 2 = more than half the days; 3 = nearly every day), with total scores ranging from 0 to 27. The instrument captures indicators of depressive symptomatology during the last two weeks [12]. In other samples, the PHQ-9 has presented adequate levels of reliability (α = 0.84) [31], and adequate levels of specificity (>0.90) but low sensitivity, between 0.39 and 0.73 [32].
Other variables were added to describe the characteristics of the population and to assess the measurement invariance of the model. These variables were: sex, education level (primary education [up to 6 years], secondary education [7-11 years , marital status (married, never married and previously married), residence area (urban and rural) and natural region (coast, mountains and jungle); the last of these was not used in the measurement invariance analysis.

Statistical analysis
A polychoric correlation matrix was calculated using sampling weights and used for the subsequent analysis (see S1 Table). Subsequently, a confirmatory factor analysis determined the dimensionality of the PHQ-9 in our target population. After identifying the number of dimensions, the measurement invariance was assessed to establish the PHQ-9 equivalence across groups by demographic characteristics. Finally, we performed the reliability analysis to determine the internal consistency of the PHQ-9 measures.
Confirmatory factor analysis. One-dimensional and two-dimensional measurement models that have been shown to be feasible for the PHQ-9 [20,23,24,25] were evaluated to identify optimal fit in the target population (see M1, M2, M3, and M4 in S1 Fig). The estimator used was weighted least squares mean and variance adjusted (WLSMV), which can handle non-normal data in the confirmatory factor analysis (CFA) [34].
The fit of the models was evaluated in two successive steps. First, model fit was compared using the Comparative Fit Index (CFI) and the Tucker-Lewis Index (TLI), both with appropriate values ≥0.90; the Standardized Root Mean Square Residual (SRMR); and the Root Mean Square Error of Approximation (RMSEA) with a 90% confidence interval, with adequate values <0.08 [35,36]. Second, for the two-dimensional models, the correlation between the somatic and affective-cognitive dimensions was evaluated, since a very high correlation would indicate that the two dimensions overlap. The dimensions can be considered clearly differentiated when the correlation is less than 0.80 [37].
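The two-step evaluation described above can be expressed as a pair of simple checks. The following Python sketch is illustrative only: the names `FitIndices`, `meets_fit_criteria` and `factors_are_distinct` are hypothetical, the example values are invented rather than taken from our results, and applying the <0.08 threshold to both RMSEA and SRMR is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class FitIndices:
    """Goodness-of-fit indices for one fitted CFA model (hypothetical container)."""
    cfi: float
    tli: float
    rmsea: float
    srmr: float


def meets_fit_criteria(fit, incremental_min=0.90, residual_max=0.08):
    """Step 1: CFI and TLI >= 0.90; RMSEA and SRMR < 0.08 (SRMR cutoff assumed)."""
    return (
        fit.cfi >= incremental_min
        and fit.tli >= incremental_min
        and fit.rmsea < residual_max
        and fit.srmr < residual_max
    )


def factors_are_distinct(factor_correlation, max_correlation=0.80):
    """Step 2: two latent dimensions count as distinct if |r| < 0.80."""
    return abs(factor_correlation) < max_correlation


# Invented values, used only to show how the checks are called.
example_fit = FitIndices(cfi=0.95, tli=0.94, rmsea=0.06, srmr=0.04)
print(meets_fit_criteria(example_fit))
print(factors_are_distinct(0.97))
```

A two-dimensional model would be retained only if both checks pass; a model that fits well but fails the second check is treated as redundant with the one-dimensional solution.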
Measurement invariance. Measurement invariance was evaluated with multi-group CFA across groups defined by relevant variables (sex, age group, education level, socioeconomic status, marital status, and residence area). Four measurement models with progressive restrictions were compared between the categories of each variable (e.g. between females and males) [10,38]. Change in the CFI (ΔCFI) was used as the main criterion for comparing models with more restrictions against models with fewer restrictions. Simulation evidence suggests that ΔCFI < .01 between successively more restricted models provides evidence for measurement invariance [10]. Models first assumed configural invariance (i.e. similar factor structure across groups) as the base model, progressing to metric invariance (i.e. similar factor loadings and factor structure across groups), strong invariance (i.e. similar thresholds, factor loadings and factor structure across groups), and strict invariance (i.e. similar residual item variances, thresholds, factor loadings and factor structure across groups). Between each pair of successive models, the ΔCFI was examined to establish whether the more restricted model was appropriate. We preferred ΔCFI over χ² comparisons, since the former is not sensitive to large sample sizes [10,38].
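The sequential ΔCFI rule can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the function name `highest_invariance_level` and the CFI values are hypothetical, and the sketch assumes that ΔCFI is computed as the drop in CFI from the less restricted to the more restricted model.

```python
# Invariance levels in order of increasing restriction.
LEVELS = ["configural", "metric", "strong", "strict"]


def highest_invariance_level(cfi_by_level, threshold=0.01):
    """Return the most restricted level reached before CFI drops by >= threshold.

    cfi_by_level maps each level name to the CFI of the corresponding
    multi-group model. The configural model is the baseline.
    """
    reached = LEVELS[0]
    for previous, current in zip(LEVELS, LEVELS[1:]):
        delta_cfi = cfi_by_level[previous] - cfi_by_level[current]
        if delta_cfi >= threshold:
            break  # the added restrictions worsen fit too much
        reached = current
    return reached


# Invented CFI values for a single grouping variable.
example = {"configural": 0.950, "metric": 0.948, "strong": 0.945, "strict": 0.930}
print(highest_invariance_level(example))
```

In this invented example the drop from the strong to the strict model exceeds the threshold, so only strong invariance would be supported.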

Participants characteristics
The sample consisted of men (n = 13,196, 43.3%) and women (n = 17,253, 56.7%). Ages ranged from 18 to 98 years, the mean age was 40.5 (SD = 16.3), and participants had 9 years of education on average (SD = 4.6) (see Table 1). The participants of our study are also compared with the results of the last Peruvian census (see S2 Table).

Confirmatory factor analysis
Both the one-dimensional and the two-dimensional models presented adequate goodness-of-fit indices (see S3 Table). However, the correlations between the dimensions in the two-factor models (somatic and cognitive-affective) ranged between 0.97 and 0.99. Therefore, the one-dimensional model was carried forward for measurement invariance testing (see Fig 2).

Measurement invariance
The values of ΔCFI were <0.01 when all models, with progressive restrictions, were compared across age groups, sex, level of education, socioeconomic status, marital status, and residence area (see Table 2). All groups reported strict invariance.

Reliability
The reliability of the PHQ-9 scores was high, reaching coefficients of internal consistency of α = 0.870 and ω = 0.873. On the other hand, the item-test correlation fluctuated between 0.62 and 0.77 (see Table 3).
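For readers less familiar with these coefficients, the sketch below shows how α and ω are commonly computed. It is an illustration under stated assumptions, not the procedure used for the values above: the function names are hypothetical, α is computed here from raw item scores, and ω is computed from the standardized loadings of a one-factor model with uncorrelated residuals.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha from a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def mcdonald_omega(loadings):
    """McDonald's omega from standardized loadings of a one-factor model."""
    common = sum(loadings) ** 2
    unique = sum(1 - loading ** 2 for loading in loadings)
    return common / (common + unique)


# Invented data: four respondents answering three items identically.
toy_scores = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(toy_scores))

# Invented loadings for nine items.
print(mcdonald_omega([0.7] * 9))
```

When all loadings are equal the two coefficients coincide, which is consistent with α and ω being nearly identical for an essentially one-dimensional scale.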

Main findings
The PHQ-9 showed consistently good measurement invariance, allowing comparisons between groups by age, sex, educational level, socioeconomic status, marital status, and residence area. Measurement invariance provides confidence that any difference between PHQ-9 one-dimension measures across these groups comes from a real difference in depressive symptomatology and not from group-specific properties of the instrument itself. Additionally, our evidence supported an optimal reliability of PHQ-9.

Factorial structure
Though goodness-of-fit indices indicated that two-dimensional models fit the data better than the one-dimensional model, the correlation between the two factors ("somatic" and "cognitive-affective") was consistently very high across all models (.967 to .988). This indicates a substantial overlap between the two factors, which complicates the interpretation of the test results [37] and points to the value of a more parsimonious unidimensional solution. It should be noted that the single-factor model is the most studied and used in applications of the PHQ-9 [28,32], and indeed that the PHQ-9 was designed as a one-dimensional screening tool to evaluate the nine DSM diagnostic indicators [11]. Other evaluation instruments, such as the Beck Depression Inventory (BDI-II) and the Center for Epidemiological Studies Depression scale (CES-D), treat depression as a multidimensional construct and include additional items on sexual problems, indecision, self-criticism, and feelings of anxiety, among others. These instruments' additional dimensions do not imply that the PHQ-9 collects only partial information on the construct, since these additional indicators are not part of the main diagnostic criteria for major depressive disorder. Several studies conducted in the general population and primary care support the one-dimensional model of the PHQ-9. For instance, a study in primary care centers in Spain (n = 836), another in primary care patients with different ethnic origins and risk of depression in the Netherlands (n = 1,772), and another in the general population of Hong Kong (n = 6,028) coincide with our findings that the one-dimensional model of the PHQ-9 is the most parsimonious and stable [16,18,21].
However, two studies in American samples have found two dimensions: one study using a representative sample (n = 26,202) and one study in a sample of soldiers (n = 2,615) [11,26]. Yet the correlation between the two factors was very high (0.87 in both cases). Both studies coincide with our results, suggesting an overlap between the two dimensions. On the other hand, a German study in patients with major depression (n = 626) and another in cancer patients receiving palliative care in the UK (n = 300) identified the somatic and affective-cognitive dimensions as related but distinct, with correlations between the latent dimensions of 0.58 and 0.30, respectively [25,27]. A possible explanation for the heterogeneity of the results on the internal structure of the PHQ-9 is the population evaluated. Investigations that report a clear differentiation between the somatic and cognitive-affective dimensions predominantly draw on clinical populations [25,27], whereas those that report an overlap between the two dimensions were performed in the general population [11,26]. Living several years with depression or with a chronic disease that significantly affects physical health could cause people to differentiate physical or somatic indicators from affective-cognitive ones. For example, it is possible that cancer patients might have a high score on the items on sleep disturbance, fatigue, and appetite changes (often associated with the somatic dimension), because these are side effects of treatment, but score low on cognitive-affective items. This would diminish the correlation between the two dimensions and give rise to the appearance of differentiation. In terms of behavior, the dimensionality of the detection of depressive symptoms would be mediated by contingencies associated with physical comorbidity [16].
Finally, it is possible that cultural factors play a role in whether the depressive symptomatology is perceived as a single construct, or as two related elements (somatic and cognitive-affective) [46]. However, it is not possible to identify the psychological mechanisms that would generate an overlap between the two dimensions in these culturally different studies. Likewise, it is necessary to point out the practical disadvantages of a model with two subdimensions compared with a one-dimensional model. In addition, the original scoring method is based on a one-dimensional model that sums the direct scores of all items [12]. It should be noted that the original cutoff points for determining levels of morbidity (≥5, ≥10, ≥15 and ≥20) have proven to be more appropriate compared to alternative classification methods [32].
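The one-dimensional scoring method and the original cutoff points can be illustrated with a short sketch. The function names are hypothetical, and the severity labels attached to each cutoff are the conventional ones for the PHQ-9; they are an assumption of the sketch, since only the cutoff values are stated above.

```python
def phq9_total(responses):
    """Sum the nine item responses (each 0-3) into a total score from 0 to 27."""
    if len(responses) != 9:
        raise ValueError("the PHQ-9 has exactly nine items")
    if any(response not in (0, 1, 2, 3) for response in responses):
        raise ValueError("each response must be 0, 1, 2 or 3")
    return sum(responses)


# Cutoffs from the text, checked from highest to lowest; labels are assumed.
CUTOFFS = [
    (20, "severe"),
    (15, "moderately severe"),
    (10, "moderate"),
    (5, "mild"),
]


def phq9_severity(total):
    """Map a total score to a severity band using the >=5, >=10, >=15, >=20 cutoffs."""
    for lower_bound, label in CUTOFFS:
        if total >= lower_bound:
            return label
    return "minimal"


# Invented responses for one participant.
answers = [1, 2, 0, 1, 3, 0, 1, 2, 0]
score = phq9_total(answers)
print(score, phq9_severity(score))
```

A two-dimensional model would instead require separate subscale scores and separate cutoffs, which is one of the practical disadvantages noted above.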

Measurement invariance
Our results support that the PHQ-9 presents convincing measurement invariance across groups defined by age, sex, educational level, socioeconomic status, residence area, and marital status, allowing meaningful group comparisons. Other studies, in university students from the United States (n = 857) and primary care in Spain (n = 836), support our results as they report strong measurement invariance according to sex [16,17]. On the other hand, a study in primary care patients at high risk of depression in the Netherlands found that measurement invariance at the level of factor loadings was violated [18] because women presented higher loadings for "sleep disturbance" and men for "loss of interest". These results suggest that measurement invariance between men and women is not met in a population with depression or at high risk of it. That is to say, as the depressive symptomatology increases, the differences between the sexes accrue, meaning that women and men tend to score higher on different groups of items (i.e. sleep disturbance and loss of interest). Different studies also report that the prevalence of sleep and appetite problems is greater in women than in men [47,48]. Our results support that it is possible to make comparisons between men and women in the Peruvian population. Similarly, accumulated international evidence supports the possibility of making comparisons between men and women using the one-dimensional model in the general population [49].
Our results support the presence of strong invariance according to age, educational level, and marital status, allowing comparisons between groups in the Peruvian population. A Spanish study conducted in a small group of primary care patients also found strong invariance between age groups, marital status, and educational level [16]. Regarding measurement invariance between age groups, there is evidence that depression among young people and older adults is qualitatively different, in addition to the fact that older adults have a higher prevalence of depressive symptoms [48]. However, this does not seem to affect the factor structure or how they understand the construct. With respect to measurement invariance by educational level and marital status, it is not possible to identify a plausible psychological or biological mechanism that could justify a possible violation of invariance. However, these analyses are necessary because of their impact on practice, since several studies make comparisons between these groups [13,14]. Our results, like those of the Spanish study, support the validity of comparisons across age, educational level, and marital status.
The present study supports the possibility of making comparisons between socioeconomic status and area of residence (urban and rural); however, our results could not be compared with other studies since no other research was found that evaluated the measurement invariance according to these groups using the PHQ-9. Despite the limited evidence on measurement invariance in these groups, different studies have already made comparisons between the direct scores of the PHQ-9 in people from urban and rural areas, as well as between people of different socioeconomic status [14]. Theories of social disadvantage and structural determinants of health could explain a possible difference between direct scores, due to limited access to opportunities and limited access to specialized health services [50].

Reliability
Our results present optimal reliability coefficients of the scores derived from the measure of the PHQ-9 in the Peruvian population. This is consistent with other findings in the literature. In particular, two studies with a similar sample size carried out in the general population of China [21] and Germany [20] identified very similar values of internal consistency (classical alpha) of 0.82 and 0.86, respectively. Therefore, despite being studies in culturally different populations, the measurements of depression symptoms using the PHQ-9 present a similar internal consistency. Some differences between the characteristics of the ENDES participants and the 2017 Peruvian census are identified, especially in the level of education, marital status, natural region, and area of residence. This may be because ENDES was originally designed to be representative of women of fertile age, which would explain why the characteristics of the participants differ between ENDES and the Peruvian census. It should be noted that this should not affect the conclusions of the study.

Relevance in public health
The PHQ-9 is one of the instruments most used by researchers and mental health professionals around the world to evaluate depressive symptoms [32]. Within its applications in public health, its use is recommended to evaluate depressive symptomatology in clinical trials and research in general [24,32], since it is an instrument with solid evidence of validity and reliability. Likewise, different countries promote its use in primary care [51,52], owing to its brevity, easy scoring (add the nine items), and applicability across heterogeneous sociodemographic characteristics. Our results support the use of the Spanish version of the PHQ-9 in the Peruvian population. This evidence suggests that it is possible to use the PHQ-9 in other Spanish-speaking countries in Latin America.

Strengths and limitations
Among the strengths of the study are the large sample size and the representativeness of the study sample. This is the only study reported in Latin America that evaluates the measurement invariance of the PHQ-9. However, the study is not free of limitations. Our results and conclusions are focused on the Peruvian population, so they should be extrapolated only with caution to potentially similar populations. Additionally, neither the inter-rater effect (different levels of experience among the interviewers) nor the inter-family effect (participants from the same family group) was controlled. However, this should not change our results, since all the evaluators received intensive training for several weeks before conducting the evaluations, and such training is expected to homogenize the evaluation and information-gathering process [30]. In addition, although there are cases where two or more participants belong to the same family group, this number should be minimal compared to the total evaluated. Despite these limitations, our results are still valid and reliable.

Conclusions
The evidence presents support for the one-dimensional model and measurement invariance of the PHQ-9 measure, allowing for reliable comparisons between sex, age groups, education level, socioeconomic status, marital status, and residence area, and recommends its use within the Peruvian population.

Table. Confirmatory factor analysis and reliability in the Patient Health Questionnaire-9 (two weeks).