Measuring depression severity in global mental health: comparing the PHQ-9 and the BDI-II [version 1; peer review: 1 not approved]

Background: We recently completed a randomised controlled trial in Goa, India, in which we observed a pattern of discordance between our two primary outcome measures: the Beck Depression Inventory (BDI-II) classified patients as moderately severe at the end of treatment, whilst the Patient Health Questionnaire (PHQ-9) classified these same patients as being only mildly depressed. The aim of this study is to explore whether the disparity between these two measures is seen in other settings.


Introduction
We recently completed a randomised controlled trial in Goa, India, comparing a culturally-adapted version of behavioural activation called the Healthy Activity Program (HAP) (Chowdhary et al., 2016) plus Enhanced Usual Care (EUC) delivered by lay counsellors to EUC alone (Patel et al., 2017). The authors found that HAP plus EUC was superior to EUC alone in treating moderate to severe depression in both the short term (3 months post-randomization) (Patel et al., 2017) and the long term (12 months post-randomization) (Weobong et al., 2017) in general practice settings. Both primary outcome measures of depression, the revised Beck Depression Inventory (BDI-II) (Beck et al., 1996) and the Patient Health Questionnaire (PHQ-9) (Spitzer et al., 1999), showed superiority of HAP plus EUC over EUC at both of these time-points. However, we observed a pattern of discordance in depression severity between our two depression measures at 3 months and again at 12 months; the modal patient was at the low end of the moderate range of severity on the BDI-II, whereas the same patient was indicated as having only mild residual symptoms on the PHQ-9. The aim of this study is to explore this discrepancy, since it has implications for how effective HAP is seen to be in absolute terms and because both measures are widely used.
We therefore searched the literature for other studies that administered both measures to the same participants and found two, one conducted in the United Kingdom (UK) (Cameron et al., 2011) and the other in the United States (US) (Kung et al., 2013). We contacted the lead authors of both studies and invited them to join us in investigating this discrepancy by sharing their patient-level data, and both complied. If one of the measures is problematic, then perhaps it should not be used in other cultures. This is particularly important given that there are growing concerns regarding the validity of measures for assessing severity of depression (Cameron et al., 2008; Cameron et al., 2011; Reddy et al., 2010), and little evidence on the objective psychometric comparison of these outcome measures.
Global mental health depends on the use of culturally appropriate measures if we are to accurately assess the burden of depression and, more importantly, improve treatment plans and decision-making. In this paper, we address two questions: whether the discrepancy in absolute scores observed in the India trial is also seen in the UK and US studies, and whether the proportion of patients whose BDI-II score falls in a higher severity category than their PHQ-9 score differs across the studies.

Methods
Only studies that used both the BDI-II and PHQ-9 as measures were eligible for the analysis in this paper. Both measures are endorsed by the National Institute for Health and Clinical Excellence to measure baseline depression severity and responsiveness to treatment in primary care (Smarr & Keefer, 2011).
Approvals were obtained for the collection and use of the primary data (including additional studies such as this study) for each of the studies. Consent was also provided by all participants in each of the studies involved in this analysis. For India, ethics approval was sought from the Indian Council of Medical Research, the Sangath Institutional Review Board (IRB), and the London School of Hygiene and Tropical Medicine. For the UK, ethics approval was sought from the North of Scotland Research Ethics Committee. For the US, ethics approval was sought from the Mayo Clinic Department of Psychiatry and Psychology IRB.

Participants
The Indian study consisted of 438 participants (a subset seen at the 3- and 12-month outcome time-points) of either gender, aged 18-65, with probable diagnoses of moderately severe and severe depression based on PHQ-9 scores greater than 14 at baseline (Patel et al., 2017). The BDI-II was not administered at baseline. Participants were all drawn from a parallel arm comparison of HAP plus EUC to EUC alone conducted in 10 primary health centres in the state of Goa on the west coast of India. All scores were drawn from the 3- and 12-month post-treatment assessments at the end of the trial.
The UK sample consisted of 267 participants of either gender, aged 16 and above, with diagnoses of depression as ascertained by their general practitioner (Cameron et al., 2011). The study compared the performance of self-report measures of depression (including the BDI-II and the PHQ-9) with a widely used clinician-rated instrument, the Hamilton Rating Scale for Depression (Hamilton, 1960). Participants who could not read the self-report measures because they were illiterate were ineligible for the study.
The US sample consisted of 625 depressed participants of either gender, aged 18-76 years (338 inpatients and 287 outpatients) (Kung et al., 2013). The BDI-II and PHQ-9 were collected as part of routine clinical care and analysed retrospectively to compare their performance in that setting. As in the UK sample, both scales were self-administered in English by participants who could read.

Measures
The BDI-II consists of 21 items covering a number of symptoms of depression. Each of the 21 items assesses a different symptom with four response options, each a full sentence long. For example, the first item "Sad" is followed by response options ranging from "0 - I do not feel sad", "1 - I feel sad much of the time", "2 - I am sad all the time" to "3 - I am so sad or unhappy that I can't stand it", with total scores found by summing the highest response to each given item. The BDI-II has strong psychometric properties and historically is the most widely used self-report outcome measure of depression in trials. The BDI-II defines symptom severity at four levels recommended by Beck (Beck et al., 1996), and in reference to the structured clinical interview for the Diagnostic and Statistical Manual of Mental Disorders, Third edition (Spitzer et al., 1999): 0-13 Minimal Depression; 14-19 Mild Depression; 20-28 Moderate Depression; 29-63 Severe Depression. However, these were based on a sample drawn from a primary care site at the University of Pennsylvania and may not generalize effectively to other primary care settings, particularly in Low and Middle-Income Countries (LMIC) (Cameron et al., 2011).
The PHQ-9 is a structured questionnaire that enquires after the nine symptom-based criteria for a diagnosis of DSM-IV and DSM-5 depression. The instrument presents a common stem, "Over the past two weeks how often have you been bothered by any of the following problems?", and then follows with nine specific questions such as "Little interest or pleasure in doing things". Each item is rated on a single four-point scale from "not at all" to "nearly every day" and total scores are summed across the items. Like the BDI-II, the PHQ-9 has been found to have good sensitivity and specificity (Kroenke et al., 2001) and is coming into increasingly widespread use as a measure of depression severity. The PHQ-9 defines symptom severity at five levels recommended by Kroenke (Kroenke et al., 2001): 1-4 Minimal Depression; 5-9 Mild Depression; 10-14 Moderate Depression; 15-19 Moderately to Severe Depression; 20-27 Severe Depression.
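The banding described above can be illustrated with a minimal sketch. The cut-offs are those quoted in the text; the shortened band labels, the function name `severity_band`, and the decision to place a PHQ-9 total of 0 in the lowest band are assumptions made only for illustration.

```python
# Cut-offs as quoted in the text; band labels are shortened for illustration.
BDI_II_BANDS = [
    (0, 13, "minimal"),
    (14, 19, "mild"),
    (20, 28, "moderate"),
    (29, 63, "severe"),
]

# Assumption: a total of 0 is placed in the lowest band,
# although the text lists that band as 1-4.
PHQ_9_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately to severe"),
    (20, 27, "severe"),
]


def severity_band(score, bands):
    """Return the label of the band whose inclusive range contains the score."""
    for low, high, label in bands:
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside the range of the scale")


# Hypothetical patient: PHQ-9 total of 9 and BDI-II total of 21.
print(severity_band(9, PHQ_9_BANDS))
print(severity_band(21, BDI_II_BANDS))
```

A patient with these hypothetical totals would be banded as mild on the PHQ-9 but moderate on the BDI-II, which is the kind of discordance examined in this paper.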

Procedures
Both scales were administered as self-report instruments in the UK and US studies, the standard means of administration, and included all 21 items on the BDI-II. In the India study, because the vast majority of the participants were illiterate, study personnel read the items to the participants and recorded their responses in the three major local languages in the study area (Konkani/Marathi/Hindi). This followed a rigorous forward and back translation process consistent with the five major criteria for cross-cultural equivalence in psychiatric research: content equivalence, semantic equivalence, technical equivalence, criterion equivalence and conceptual equivalence (Flaherty et al., 1988). A forward translation was first completed by trained and experienced field researchers, and at the second stage these translations were reviewed by a clinical psychologist fluent in the three local languages, together with senior and more experienced research team members. Where there were disagreements between the clinician and senior research team members on the quality of the forward translation, these were discussed with a psychiatrist with experience of working in both the UK and Goa, India, who advised on the concepts captured by the original English wording of each item to guide the choice of local language expressions. The draft consensus translation was then back-translated into English by a bilingual independent non-mental health professional, following which further modifications were made on the basis of the back-translation, if required. The item inquiring about interest in sex was omitted from the BDI-II in India so as not to offend participants.

Statistical analyses
We first estimated the reliability of each measure using Cronbach's alpha. Following this, we compared scores using Pearson product-moment correlation statistics in order to ascertain whether both measures were assessing the same construct of depression. In order to address our first objective regarding the observed discrepancy between the BDI-II and PHQ-9 scores in the India trial, we first examined the association between scores on the two measures within each study using linear regression. Following this, we assessed whether there was evidence of moderation by study by fitting an interaction term. We then used the fitted model to predict the BDI-II score for participants with a PHQ-9 score of 10 (moderate depression) in each study. Finally, we assessed whether the intercepts differed between the three studies and generated scatter plots of the fitted values for the BDI-II and PHQ-9 for each sample. Effect sizes are reported as regression coefficients (with 95% CI) for the increase in BDI-II score for each unit increase in PHQ-9 score. In addition, to address our second objective, patients were classified with respect to the prespecified depression severity categories on each measure, rates of discordance were compared across the studies, and the association was assessed with the chi-square statistic. We further explored the number and proportion with a higher category on the BDI-II than the PHQ-9 for each PHQ-9 category. We ruled out the possibility of temporal effects on the observed discrepancy in the India trial at the 3-month endpoint by repeating the regression analysis using follow-up data of the same participants on the BDI-II and PHQ-9 at 12 months post-enrolment. Sensitivity analyses were conducted after dropping the sex item from the UK and US studies. Statistical analyses were conducted using Stata 15.
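The main steps of this analysis can be sketched in Python with NumPy, offered only as an illustration and not as the Stata code used in the study. The function names `cronbach_alpha` and `fit_interaction_model`, the study labels "A", "B" and "C", and all simulated scores are hypothetical; the sketch fits the interaction model by ordinary least squares and does not compute confidence intervals or p-values.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for a (participants x items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


def fit_interaction_model(phq9, bdi, study, reference):
    """OLS of BDI-II on PHQ-9, study indicators and PHQ-9 x study terms.

    Returns {study: (intercept, slope)} so each study has its own fitted line.
    """
    phq9 = np.asarray(phq9, dtype=float)
    bdi = np.asarray(bdi, dtype=float)
    study = np.asarray(study)
    others = [s for s in np.unique(study).tolist() if s != reference]
    columns = [np.ones_like(phq9), phq9]
    for s in others:
        indicator = (study == s).astype(float)
        columns.append(indicator)         # shift in intercept for study s
        columns.append(indicator * phq9)  # shift in slope for study s
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, bdi, rcond=None)
    lines = {reference: (coef[0], coef[1])}
    for i, s in enumerate(others):
        lines[s] = (coef[0] + coef[2 + 2 * i], coef[1] + coef[3 + 2 * i])
    return lines


# Simulated data for three hypothetical studies (not the trial data):
# same slope everywhere, higher intercept in study "A".
rng = np.random.default_rng(0)
study = np.repeat(["A", "B", "C"], 100)
phq9 = rng.integers(0, 28, size=300).astype(float)
true_intercepts = {"A": 8.0, "B": 5.0, "C": 5.0}
intercepts = np.array([true_intercepts[s] for s in study])
bdi = intercepts + 1.5 * phq9 + rng.normal(0, 2, size=300)

print("Pearson r:", np.corrcoef(phq9, bdi)[0, 1])
lines = fit_interaction_model(phq9, bdi, study, reference="A")
for s, (intercept, slope) in lines.items():
    print(s, "slope:", slope, "predicted BDI-II at PHQ-9 = 10:", intercept + 10 * slope)
```

In this simulated example the slopes are similar across studies while the predicted BDI-II score at a PHQ-9 score of 10 is higher in study "A", which mirrors the kind of pattern the analysis is designed to detect: comparable association, different absolute level.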

Results
A detailed description of the conduct of each study is provided in the respective publications (Cameron et al., 2011; Kung et al., 2013; Patel et al., 2017).

Score distribution
At the 3-month end-point for the India study, the regression coefficients were similar for the three studies (India: β=1.58, 95% CI 1.47-1.70; UK: β=1.58, 95% CI 1.47-1.70; US: β=1.48, 95% CI 1.39-1.58), and there was no evidence of moderation by study (p=0.32). As would be expected given differences in the scales, scores on the BDI-II were higher than on the PHQ-9 in each of the studies, but more so in the India study than in the other two (Figure 1). For example, at a PHQ-9 score of 10 (moderate depression) in the India study, the BDI-II mean score was 24.3 (95% CI 23.5, 25.1), and this was significantly different from the UK study, 20.8 (95% CI 19.6, 21.9), and the US study, 20.5 (95% CI 19.5, 21.4). Similar results were observed at the 12-month end-point for the India sample; the regression coefficient increased slightly (β=1.67, 95% CI 1.56, 1.78) but the greater discrepancy between scores in the India study compared to the UK and US was maintained (Figure 2). At a PHQ-9 score of 10 (moderate depression) in the India sample, the BDI-II mean score was 23.2 (95% CI 22.4, 23.9), still significantly different from the UK and US studies.

Severity banding
Table 1a-Table 1c show the cross-classification of individual participants on each of the categorical values used to describe absolute outcomes on the respective measures. As can be seen from Table 1a-Table 1c, the participants in the India sample were more likely to be classified as having poorer outcomes on the BDI-II than the PHQ-9 for each severity band of the PHQ-9, and this was significantly different between both the India and UK samples (prevalence difference (PD): -15.9%, 95% CI -23.2%, -8.7%; p<0.0001) and the India and US samples (PD: -15.8%, 95% CI -21.9%, -9.5%; p<0.0001). Results were similar when the sex item was dropped from the UK and US studies.
*Excluded the highest severity band of the PHQ-9 from this analysis because there was a disproportionate distribution of severe depression as assessed by the PHQ-9; the proportion was much higher in the US sample (40%) compared to the India (11%) and UK (17%) studies.
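For readers unfamiliar with the prevalence difference, the following minimal sketch shows how a difference between two proportions and an approximate 95% confidence interval can be computed. It assumes a simple Wald interval, which may not be the method used in this analysis, and the counts are hypothetical and not taken from Table 1a-Table 1c; the order of subtraction is likewise an assumption.

```python
import math


def prevalence_difference(events_a, n_a, events_b, n_b, z=1.96):
    """Difference in proportions (a minus b) with a Wald 95% confidence interval."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 50 of 100 discordant in sample a, 70 of 100 in sample b.
diff, lower, upper = prevalence_difference(50, 100, 70, 100)
print(f"PD = {diff:.1%} (95% CI {lower:.1%}, {upper:.1%})")
```

An interval that excludes zero, as in this hypothetical case, corresponds to a statistically significant difference in the proportion of discordant classifications between the two samples.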

Discussion
Patients reported higher severity scores on the BDI-II relative to the PHQ-9 in our India sample than they did in either the UK or the US. We think this reflects differences in the method of administration across the studies; in India, we read the translated local language version of the items to our patients, whereas in both the UK and the US studies literate patients read the items themselves. The BDI-II is a relatively complex instrument that requires participants to hold four different options in memory before giving a response to each item, whereas the PHQ-9 requires only that the participants respond with the same simple frequency rating to each of its nine items. The BDI-II is sometimes criticized for being too transparent to respondents and thus easily faked by those wishing to present themselves in a favourable or unfavourable light, but that same critique is as likely to apply to the PHQ-9 as the BDI-II (Wang & Gorenstein, 2013).
The fact that correlations were high and comparable across the samples suggests that both measures were assessing the same underlying construct of depression, but the fact that scores on the BDI-II were higher relative to the PHQ-9 in our India study than in the other two studies suggests that absolute scores on the BDI-II are inflated relative to the PHQ-9. Given that participants in LMICs are often illiterate and would require interviewer administration, the PHQ-9 might be preferred over the BDI-II as a measure of depression severity. The PHQ-9 is also easily accessible as it is free, whereas the BDI-II is only available on purchase.
The strengths of this investigation include the cross-cultural approach and large sample sizes from well-designed studies. We acknowledge some limitations, including our inability to account for the potential confounding effect of order of administration of the two measures (for the India and US studies) and other factors such as social desirability, educational attainment and sex of respondents (Cronbach, 1990). That being said, order of administration was largely constant in the India study, and though this may not have been the case for the US study, the large sample size and the randomness of which measure was completed first mean it is unlikely that order effects accounted for the differences. Moreover, comparison of the discordance among categorical responses on the two measures was complicated by the fact that the PHQ-9 defines five categories of depression while the BDI-II defines only four (the former adds a "moderately severe to severe" category). However, that difference in categorization was consistent across the studies and should not have contributed to differences in concordance. Additionally, the level of depression severity in the three studies may have influenced our findings. For example, the patients from the US study were either from the "Mood Clinic" or "Mood Disorder Unit", meaning they were referred or admitted for depression treatment. This might explain why the US sample had a disproportionately high share of patients in the severe PHQ-9 band. We dealt with this, however, by dropping this category from the analysis comparing the severity bands between the BDI-II and PHQ-9 in all three studies. Furthermore, we adjusted for study in our regression analysis. Finally, even though we strictly adhered to principles of cross-cultural psychiatric research in adapting the BDI-II and PHQ-9 in the India study, we are unable to completely rule out loss of meaning in translation. Admittedly, this limitation would apply to both measures, though the BDI-II would pose more translation challenges given its relative complexity. Dropping the BDI-II item on sex (sexual desire) in the India study could have affected its psychometric properties and, more importantly for this analysis, meant that the samples may not have been comparable. However, dropping one item (sex) would likely lower the mean score on the BDI-II in the India study, yet in sensitivity analysis we observed in the prediction model that in the India study the BDI-II scored people higher compared to the PHQ-9, and this was significantly higher compared to the UK/US studies. This suggests the robustness of our findings without the sex item, and we posit that the differences observed would have been much stronger if the sex item were maintained in the India study.
It is possible that it was the PHQ-9 that was problematic in our sample and not the BDI-II. Ours is the first study to examine head-to-head the severity categorisation of the PHQ-9 and BDI-II, comparing studies from high versus low and middle-income settings. The PHQ-9 is the simpler measure and places fewer demands on short-term memory than the BDI-II.
Administering both measures orally in literate samples and seeing if that inflates absolute scores on the BDI-II relative to the PHQ-9 could resolve this issue. Such a study would be relatively easy to conduct and is encouraged given the reported concerns regarding the validity of measures for assessing severity of depression (Cameron et al., 2008; Hansson et al., 2009; Reddy et al., 2010). Until such a study is done, we have reservations about interpreting absolute values on the BDI-II and prefer to use the PHQ-9 instead. It may be the case that the PHQ-9 is more suitable for interviewer administration in illiterate populations, given findings from a study in Spain that the PHQ-9 performed similarly when read out over the phone compared to self-administration (Pinto-Meza et al., 2005). This could mean that in studies where illiteracy is a concern, particularly in LMICs, researchers might be well advised to use the less complicated PHQ-9 rather than the BDI-II if the scales must be read to illiterate participants. Both appear to be valid measures of the underlying construct when participants read and complete the scales themselves. Further work is required to assess their performance when read out to participants.
self-administered) were answered via question and answer. The authors state in the limitations that socially desirable responses may have been elicited, which is good, but it remains a big concern.
2) In the statistical analyses, please define "CI."
Results:
1) When comparing within-group scores on the PHQ-9 and BDI-II, why not mean center them? That would presumably show that the difference between scores within the India study sample is true and not simply a function of range of scores for each measure.
2) For figures 1 and 2, please use a different legend as it is very difficult to distinguish groups in black and white printouts.
3) For the PHQ-9, what is the clinical/practical significance of falling into the minimal depression versus mild depression categories?
Discussion:
1) The conclusions drawn, from the reader's standpoint, feel like they are hard to justify without further detail of the samples from which the data are derived.
2) The points made about the BDI-II requiring more working memory because of 4-item choices require justification/citation.
3) Beginning of second paragraph: correlation coefficients do not say anything about the underlying construct. This is problematic in present form.
4) Similarly, higher BDI-II versus PHQ-9 scores in the India study but not others may be sample-dependent.
5) The comment that if the "sex" item on the BDI-II were included in the India study that observed differences would be much stronger seems like a stretch. Do people, in general, commonly endorse the "sex" item? Are there high rates of sexual dysfunction in the India study sample? I would think in practice people would be reluctant to answer honestly on that item, particularly in an interview format.
I have read this submission. I believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Figure 1. Scatter plot with fitted regression lines of BDI-II and PHQ-9 scores of the three studies (comparison with 3-month outcome data in India trial). Plot of regression model fitted with interaction term, i.e. allowing slope of PHQ-9 with BDI-II to differ by study.

Figure 2. Scatter plot with fitted regression lines of BDI-II and PHQ-9 scores of the three studies (comparison with 12-month outcome data in India trial). Plot of regression model fitted with interaction term, i.e. allowing slope of PHQ-9 with BDI-II to differ by study.
Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? No
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results?
Depression and bipolar disorder, with a recent publication examining the measurement invariance of the BDI-II across race/ethnicity in inpatient adolescents from a U.S. sample.