Original article
Assessing validity of a depression screening instrument in the absence of a gold standard

https://doi.org/10.1016/j.annepidem.2014.04.009

Abstract

Purpose

We evaluated the extent to which use of a hypothesized imperfect gold standard, the Composite International Diagnostic Interview (CIDI), biases the estimates of diagnostic accuracy of the Patient Health Questionnaire-9 (PHQ-9). We also evaluated how statistical correction can be used to address this bias.

Methods

Structured interviews were conducted among 926 adults to collect information about participants' current major depressive disorder using the PHQ-9 and CIDI instruments. First, we evaluated the relative psychometric properties of the PHQ-9 using the CIDI as a gold standard. Next, we used a Bayesian latent class model to correct for the bias.

Results

In comparison with the CIDI, the relative sensitivity and specificity of the PHQ-9 for detecting major depressive disorder at a cut point of 10 or more were 53.1% (95% confidence interval, 45.4%–60.8%) and 77.5% (95% confidence interval, 74.5%–80.5%), respectively. Using a Bayesian latent class model to correct for the bias arising from the use of an imperfect gold standard increased the sensitivity and specificity of the PHQ-9 to 79.8% (95% Bayesian credible interval, 64.9%–90.8%) and 79.1% (95% Bayesian credible interval, 74.7%–83.7%), respectively.

Conclusions

Our results provided evidence that the diagnostic validity of mental health screening instruments can be assessed with appropriate statistical methods when a gold standard is not available.

Introduction

Screening for mental health problems in clinical- and community-based settings is an important component in overall public health disease prevention and health promotion strategies. The success of screening procedures, however, is largely dependent on the accuracy of the diagnostic procedure, used as the gold standard or criterion standard, to evaluate the screening instruments [1], [2], [3]. Psychometric properties such as sensitivity and specificity are common measures used to evaluate the quality of a screening test. These psychometric properties are unbiased if the screening test results are compared with a gold standard measure [2], [3]. We recognize that there is no perfect gold standard and we use the term to refer to the best available method used to determine the presence or absence of the condition or disease of interest. However, verification of the true status using the gold standard may be impossible to obtain because of cost and human resources. In some instances, the gold standard may be invasive, impractical to obtain, or unethical to conduct. For example, gold standard diagnosis of Alzheimer's disease cannot be ascertained until a patient dies and an autopsy is performed. In epidemiologic studies, where a comparison with a gold standard is not possible, validation studies often compare screening instruments with instruments that are imperfect but more precise than the screening instrument. The key assumption is that the measurement error for the imperfect reference or gold standard is unlikely to be correlated with the screening instrument [4]. If an imperfect standard is used as if it were a gold standard, the estimated accuracy of the tests would be biased due to misclassification [3], [5]. Zhou et al. [3] call this type of bias “imperfect gold standard bias.” A number of authors have proposed model-based estimates or estimates that make use of prior information to reduce or correct this imperfect gold standard bias without retesting subjects [1].
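The direction and size of imperfect gold standard bias can be illustrated with a short calculation. The sketch below assumes that the errors of the screening test and of the reference are independent given true disease status (the key assumption mentioned above). The function name `apparent_accuracy` and all numeric inputs are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
def apparent_accuracy(se_test, sp_test, se_ref, sp_ref, prevalence):
    """Sensitivity and specificity of a test as they would appear when it is
    scored against an imperfect reference instead of true disease status.

    Assumes test and reference errors are independent given true status.
    """
    pi = prevalence
    # Joint probabilities of (test, reference) results, summed over true status.
    both_pos = pi * se_test * se_ref + (1 - pi) * (1 - sp_test) * (1 - sp_ref)
    both_neg = pi * (1 - se_test) * (1 - se_ref) + (1 - pi) * sp_test * sp_ref
    # Marginal probabilities of the reference result.
    ref_pos = pi * se_ref + (1 - pi) * (1 - sp_ref)
    ref_neg = 1 - ref_pos
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical inputs: a test with 80% sensitivity and specificity, scored
# against a reference with 70% sensitivity and 90% specificity, at 15% prevalence.
se_app, sp_app = apparent_accuracy(0.80, 0.80, 0.70, 0.90, 0.15)
print(f"apparent sensitivity = {se_app:.3f}, apparent specificity = {sp_app:.3f}")
```

With a perfect reference (`se_ref = sp_ref = 1`) the function returns the true sensitivity and specificity, which is why the relative estimates are unbiased only when a true gold standard is used.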

The Patient Health Questionnaire-9 (PHQ-9) is a very brief depression screening instrument that is easy to administer and interpret [6]. Because of its brevity, the PHQ-9 is widely used as a depression screening instrument in primary care settings among racially and ethnically diverse populations. Additionally, it has been reported to be a valuable tool for the detection and management of depression [6]. Use of the PHQ-9 instrument in new clinical and research settings requires evaluating the validity of the instrument in comparison with a diagnostic “gold standard.” The Schedules for Clinical Assessment in Neuropsychiatry (SCAN), a semistructured clinical interview widely considered a gold standard, is used to assess and diagnose psychiatric disorders including depression among adults [7]. The instrument offers flexibility for clinicians to phrase questions about particular symptoms taking into account local context. However, it requires that clinicians make their clinical decisions following the definitions and criteria provided in the Diagnostic and Statistical Manual of Mental Disorders (DSM) [8], [9]. Although the SCAN has been reported to have excellent accuracy in depression diagnosis, it is time demanding and expensive and requires a trained clinician, thus limiting its use in resource-limited clinical settings [10].

To address these limitations, alternative diagnostic tools for the measurement of depression have been developed. One of these tools is the Composite International Diagnostic Interview (CIDI). The CIDI is a fully structured lay-administered diagnostic interview that can be used to diagnose major depressive disorder (MDD) according to DSM criteria, although it is regarded as a less optimal gold standard compared with the SCAN [11], [12]. In an effort to use a less burdensome assessment, some investigators have used the CIDI as the criterion standard to evaluate the diagnostic accuracy of the PHQ-9 [13], [14]. Use of the CIDI as a gold standard (an imperfect but more precise instrument than the PHQ-9) is expected to bias estimates of the validity of the PHQ-9. However, no study has systematically evaluated the impact of using the CIDI as a gold standard. Therefore, we conducted this study to assess the extent to which the use of what we hypothesized to be an imperfect gold standard [3] biases the estimates of diagnostic accuracy of a screening instrument (the PHQ-9). We further sought to demonstrate how statistical methods can be used to correct the bias and improve psychometric properties of the PHQ-9.
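One common way to set up such a correction is a two-test latent class model fitted by Gibbs sampling, in which true depression status is treated as unobserved. The sketch below is a generic version of that idea and should not be read as the authors' model: it assumes conditional independence of the two instruments, uses only the Python standard library, and relies on hypothetical Beta priors (`PRIORS`) invented for illustration. The cell counts are derived from the margins reported in the Results (258 PHQ-9 positive, 668 PHQ-9 negative, 86 positive and 592 negative on both instruments).

```python
import random
import statistics


def binomial(n, p, rng):
    """Binomial(n, p) draw using only the standard library."""
    return sum(rng.random() < p for _ in range(n))


def p_diseased(pi, like_pos, like_neg):
    """P(truly depressed | cell), from the two status-specific cell likelihoods."""
    num = pi * like_pos
    return num / (num + (1 - pi) * like_neg)


def gibbs_two_tests(n11, n10, n01, n00, priors, iters=2000, burn=500, seed=0):
    """Gibbs sampler for two imperfect tests applied to one population.

    nXY = count with test 1 result X and test 2 result Y (1 = positive).
    priors maps 'pi', 's1', 'c1', 's2', 'c2' to Beta (a, b) parameters, where
    s = sensitivity and c = specificity. Returns posterior medians.
    """
    rng = random.Random(seed)
    pi, s1, c1, s2, c2 = 0.2, 0.7, 0.8, 0.7, 0.8  # arbitrary starting values
    kept = {name: [] for name in ("pi", "s1", "c1", "s2", "c2")}
    n = n11 + n10 + n01 + n00

    def beta(name, successes, failures):
        a, b = priors[name]
        return rng.betavariate(a + successes, b + failures)

    for it in range(iters):
        # Step 1: impute the number of truly depressed subjects in each cell.
        y11 = binomial(n11, p_diseased(pi, s1 * s2, (1 - c1) * (1 - c2)), rng)
        y10 = binomial(n10, p_diseased(pi, s1 * (1 - s2), (1 - c1) * c2), rng)
        y01 = binomial(n01, p_diseased(pi, (1 - s1) * s2, c1 * (1 - c2)), rng)
        y00 = binomial(n00, p_diseased(pi, (1 - s1) * (1 - s2), c1 * c2), rng)
        y = y11 + y10 + y01 + y00
        # Step 2: update each parameter from its conjugate Beta full conditional.
        pi = beta("pi", y, n - y)
        s1 = beta("s1", y11 + y10, y01 + y00)
        c1 = beta("c1", (n01 - y01) + (n00 - y00), (n11 - y11) + (n10 - y10))
        s2 = beta("s2", y11 + y01, y10 + y00)
        c2 = beta("c2", (n10 - y10) + (n00 - y00), (n11 - y11) + (n01 - y01))
        if it >= burn:
            for name, value in zip(kept, (pi, s1, c1, s2, c2)):
                kept[name].append(value)
    return {name: statistics.median(values) for name, values in kept.items()}


# Hypothetical priors, for illustration only (NOT the priors used in the study):
# flat priors for prevalence and the PHQ-9 (test 1), informative priors for
# the CIDI (test 2).
PRIORS = {"pi": (1, 1), "s1": (1, 1), "c1": (1, 1), "s2": (14, 6), "c2": (18, 2)}

# Cells: PHQ-9+/CIDI+ = 86, PHQ-9+/CIDI- = 172, PHQ-9-/CIDI+ = 76, PHQ-9-/CIDI- = 592.
print(gibbs_two_tests(86, 172, 76, 592, PRIORS))
```

A 2 × 2 table supplies only three degrees of freedom for five unknowns, so a model of this kind is not identifiable from the data alone and its output depends heavily on the priors. The numbers printed by this sketch therefore should not be expected to match the corrected estimates reported in the abstract.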

Section snippets

Participants

The study was conducted at Saint Paul General Specialized Hospital, a major referral and teaching hospital in Addis Ababa, Ethiopia, between July and December 2011. We used a two-stage study design in which a total of 926 adults (aged 18 years or older) attending outpatient departments were first interviewed by research nurses using the PHQ-9 and CIDI depression scales. Then those who screened positive for depression on the PHQ-9 questionnaire (PHQ-9 score ≥ 10) and a randomly

Results

The median PHQ-9 score was 5 (range 0–27). A total of 258 participants fulfilled DSM-IV criteria for MDD on the PHQ-9 (27.8%; 95% CI, 24.9%–30.7%) using a score of 10 or more, whereas 668 were classified as nondepressed (Table 1). When PHQ-9 results were cross-classified with the CIDI, a total of 86 subjects were classified as depressed and 592 as nondepressed on both instruments. Distributions for other PHQ-9 score cutoffs are also displayed in Table 1. As shown in Figure 1, PHQ-9 scores were significantly higher (Wilcoxon
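The counts reported above are enough to rebuild the PHQ-9 by CIDI 2 × 2 table and recompute the relative sensitivity and specificity. The sketch below does this arithmetic; the function name `relative_accuracy` is hypothetical, and the Wald (normal approximation) interval is an assumption, since the snippet does not say which interval method was used, so the limits may differ slightly from the reported ones.

```python
import math


def relative_accuracy(both_pos, both_neg, test_pos, test_neg, z=1.96):
    """Relative sensitivity and specificity of a test against a reference.

    both_pos / both_neg: subjects positive / negative on both instruments.
    test_pos / test_neg: subjects positive / negative on the screening test.
    """
    # Fill in the discordant cells of the 2x2 table from the margins.
    test_pos_ref_neg = test_pos - both_pos
    test_neg_ref_pos = test_neg - both_neg
    ref_pos = both_pos + test_neg_ref_pos
    ref_neg = both_neg + test_pos_ref_neg

    def wald(k, n):
        # Proportion with a Wald interval (an assumed method, see lead-in).
        p = k / n
        half = z * math.sqrt(p * (1 - p) / n)
        return p, p - half, p + half

    return {
        "ref_pos": ref_pos,
        "ref_neg": ref_neg,
        "sensitivity": wald(both_pos, ref_pos),
        "specificity": wald(both_neg, ref_neg),
    }


result = relative_accuracy(both_pos=86, both_neg=592, test_pos=258, test_neg=668)
for name in ("sensitivity", "specificity"):
    est, low, high = result[name]
    print(f"{name}: {100 * est:.1f}% ({100 * low:.1f}%, {100 * high:.1f}%)")
```

The derived margins are 162 CIDI-positive and 764 CIDI-negative participants, giving point estimates of 86/162 and 592/764, which agree with the 53.1% and 77.5% reported in the abstract.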

Discussion

Overall, the psychometric properties of the PHQ-9 estimated against the CIDI (an imperfect gold standard) improved when Bayesian latent class modeling approaches were used instead of the relative estimates. The sensitivity estimates calculated using Bayesian modeling approaches were closer to the values obtained using the SCAN gold standard (regarded as the true or perfect gold standard in this study), whereas the relative estimates of the specificity were closer to the

Conclusions

Our findings underscore the importance of evaluating the feasibility and validity of using an imperfect gold standard. In addition, using Bayesian statistical approaches in the absence of a gold standard may result in improvements of psychometric properties of mental health screening instruments. The methods presented here are useful in assessing the diagnostic validity of screening tests in the absence of a gold standard. The Bayesian modeling framework allowed incorporation of additional

Acknowledgments

This research was supported, in part, by an award from the National Institute of Minority Health and Health Disparities (T37-MD001449). The authors wish to thank the staff of Addis Continental Institute of Public Health for their expert technical assistance. The authors would also like to thank Saint Paul Hospital for granting access to conduct the study. This research was done as partial fulfillment for the requirements of a PhD degree by one of the authors (B.G.) in the Department of

References (32)

  • J.K. Wing et al. SCAN. Schedules for Clinical Assessment in Neuropsychiatry. Arch Gen Psychiatry (1990)
  • A. Janca et al. New versions of World Health Organization instruments for the assessment of mental disorders. Acta Psychiatr Scand (1994)
  • G. Andrews et al. A comparison of two structured diagnostic interviews: CIDI and SCAN. Aust N Z J Psychiatry (1995)
  • S.R. Hirsch et al. Schizophrenia (1995)
  • T.S. Brugha et al. A general population comparison of the Composite International Diagnostic Interview (CIDI) and the Schedules for Clinical Assessment in Neuropsychiatry (SCAN). Psychol Med (2001)
  • B. Arroll et al. Validation of PHQ-2 and PHQ-9 to screen for major depression in the primary care population. Ann Fam Med (2010)

Conflict of interest: None declared.
