Validity of the Chinese Language Patient Health Questionnaire 2 and 9: A Systematic Review

Introduction: Chinese Americans with limited English proficiency have higher mental health needs than English speakers but are more likely to be undiagnosed and undertreated for depression. Increasing anti-Asian hate crimes during the COVID-19 pandemic has increased the urgency to accurately detect depressive symptoms in this community. This systematic review examines the validity of the Patient Health Questionnaire (PHQ)-2/9 for depression screening in Chinese-speaking populations. Methods: We queried PubMed, Web of Science, Embase, and PsycINFO databases, examining studies through September 2021. Studies were included if they evaluated the Chinese language PHQ-2 or PHQ-9 and diagnosed depression using a clinical interview. Two investigators independently extracted study data and assessed quality using the QUADAS-2. Study sensitivities and specificities were combined in random effects meta-analyses. Results: Of 513 articles, 20 met inclusion criteria. All examined the PHQ-9; seven also examined the PHQ-2. Studies were conducted in Mainland China (17), Hong Kong (1), Taiwan (1), and the United States (1). Fourteen studies were published in English; six in Chinese. Studies were diverse in setting, participant age, and comorbidities. For the Chinese language PHQ-9, Cronbach's alpha ranged from 0.765 to 0.938 for included studies (optimal cutoff scores ranged from 6 to 11). For the PHQ-2, Cronbach's alpha ranged from 0.727 to 0.785 (optimal cutoff scores 1–3). Overall, the PHQ-9 pooled sensitivity was 0.88 (95% CI 0.86–0.90), and pooled specificity was 0.87 (95% CI 0.83–0.91). Similarly, the pooled PHQ-2 sensitivity was 0.84 (95% CI 0.80–0.87), and pooled specificity was 0.87 (95% CI 0.78–0.93). The overall risk of bias was low (12 studies) or indeterminate (8 studies). Discussion: While limited by missing study information, the Chinese language PHQ-9 appears to be a valid depression screening tool among Chinese-speaking populations across geographic and clinical settings. Further research should explore optimal cutoff scores for this population for routine depression screening and the validity of the tool to measure response to depression treatment.


Introduction
Depression is a major public health concern, which affects 19.4 million adults in the United States 1 and 280 million people worldwide. 2 Depression leads to poor quality of life, 3 worse health outcomes with increased morbidity and mortality, 4,5 and increased health care costs. 6 While patients with limited English proficiency (LEP) are more likely to present with more severe depressive symptoms compared with Englishonly speakers, [7][8][9] clinicians are less likely to diagnose these patients with depression. 10 This exacerbates existing disparities in access to mental health care among individuals with LEP. [11][12][13] Almost 3 million U.S. residents speak Chinese at home, making it the third most spoken language in the nation. 14 Past studies have found that Chinese Americans with LEP have high mental health burden, with the prevalence of depressive symptoms among Chinese monolingual primary care patients in the United States as high as 20%; however, Asian patients in the United States face disparities in mental health care access and have lower odds of receiving needed services than patients from other ethnic groups. 13,15 Furthermore, Asian patients with LEP who are able to access the health care system may find that their symptoms go unrecognized compared with their English-proficient counterparts. [16][17][18][19] In fact, one study of English, Spanish, and Chinese-speaking primary care patients found that physicians were least likely to diagnose depressive symptoms in Chinese-speaking patients. 10 Thus, one pathway to improving access to depression treatment and specialty mental health services for Chinese patients with LEP is ensuring that physicians are using evidence-based tools to better identify patients with depressive symptoms.
Of particular note, the heightened anti-Asian racism throughout the COVID-19 pandemic has been associated with an increase in depression and anxiety in the Asian American community, further highlighting the need for physicians to screen effectively for mental health symptoms in this population. 20 The Patient Health Questionnaire (PHQ)-9 has long been recognized as an effective screening instrument for depression among English-proficient adults. 21 It is commonly used in primary care settings as a first-line measure for detecting depressive symptoms in adults, 22 as recommended by the U.S. Preventive Services Task Force. 23 The PHQ-2 is a briefer version of the PHQ-9 with similar sensitivity but higher specificity when paired with the PHQ-9 to follow up on positive screens, 24 which is commonly used due to its efficiency. While the original PHQ-9 was developed and validated in English, it has since been translated and used in many other languages, including Chinese. 25 However, given that the presentation of depression can vary across cultures and languages, 26,27(p.1) we must determine the validity of the PHQ-2 and PHQ-9 in Chinese languages.
Two systematic reviews of Chinese language depression screening tools have been previously conducted but both had limitations that affect generalizability to our population of interest. These reviews 28,29 focused on a variety of screening tools, with fewer studies specifically evaluating the PHQ-9 (a maximum of four studies in one review). Additionally, both research teams excluded studies conducted outside of China, limiting their applicability to Chinese-speaking immigrants in the United States. Both reviews included studies that compared the PHQ-9 with a variety of different instruments, including more widely used research tools such as Center for Epidemiologic Studies Depression Scale; neither conducted a clinical interview as the gold standard for diagnosis of depression. Furthermore, the systematic review by Sun et al, which concluded that the PHQ-9 was ''acceptable,'' was published in Chinese only, and thus remains inaccessible to Englishspeaking clinicians who may wish to apply this evidence to their practice.
Chiu and Chin concluded that the PHQ-9 was sensitive and ''highly effective'' for screening for depression in Chinese primary care; however, they only looked at articles published in English, with only four studies included in final review. Since these reviews were conducted in 2016, multiple studies evaluating the PHQ-2 and PHQ-9 have been published, warranting re-evaluation of the evidence.
We conducted a systematic review of the current literature evaluating the validity of both the Chinese language PHQ-2 and PHQ-9 for depression screening, specifically in comparison to a clinical interview as the gold standard for diagnosing depression, across geographic and practice settings.

Publication search
To find relevant articles, we performed comprehensive searches in PubMed, Web of Science, Embase, and Psy-cINFO databases with a university librarian (author P.T.). Searches were developed around these concepts: the PHQ-2 and PHQ-9, screening for depression and depressive disorders, and the validity and efficacy of the questionnaires, with a focus on the tool in Chinese languages. We chose to include multiple spoken Chinese languages (Mandarin, Cantonese, etc.) as written traditional/simplified Chinese does not distinguish between them and the PHQs are usually administered in written form. We used multiple synonyms for the different concepts to create sensitive searches that would not miss any eligible articles. We used both index terms (Mesh, Emtree) and keywords to develop the searches, and limited the search in PsycINFO to peer-reviewed articles (because this database includes non-peer-reviewed sources such as news articles and dissertations).
The full search strategies for each database can be found in the search appendix (Appendix A1). The initial searches were performed in December 2020, and a search update was performed in September 2021 to capture any relevant studies in the interval period. We also searched reference lists of retrieved articles and systematic reviews for relevant articles.

Study selection
Studies that met all of the following criteria were included in this systematic review: (1) participants were 16 years of age or older; (2) participants were primarily Chinese speakers of any language or dialect; (3) studies specifically examined either the PHQ-2 or PHQ-9; (4) questionnaire validity was studied for the purpose of screening specifically for major depression; (5) the questionnaire(s) studied were validated against a clin-ical interview as the gold standard for diagnosing depression; and (6) outcomes included biometric properties of the questionnaire(s).
We chose criterion five following best practices of diagnostic research, wherein validity studies should utilize the gold standard for comparison if one exists; in the field of psychiatry, the gold standard for depression diagnosis is the clinical interview, structured or semistructured, performed by a trained health professional or researcher. Examples include the Structured Clinical Interview for DSM (SCID), Mini-International Neuropsychiatric Interview (MINI), Composite International Diagnostic Interview (CIDI), or Schedules for Clinical Assessment in Neuropsychiatry (SCAN). [30][31][32] We excluded studies on the basis of one or more of the following: (1) inappropriate population (e.g., children, English-or other non-Chinese language spoken); (2) studies conducting factor analysis alone; (3) inappropriate gold standard (i.e., studies evaluating the PHQ-2 or PHQ-9 against another clinical scale or questionnaire only); (4) studies examining other versions of the PHQ (e.g., PHQ-15 or PHQ-8); and (5) studies examining diagnosis or screening for disorders other than major depression, such as postpartum depression.
Two investigators (L.Y. and H.P.) reviewed the titles and abstracts for all citations to identify studies that met inclusion criteria. If the reviewers could not determine from the abstract whether a particular study met inclusion criteria, the article advanced to a full-text review. Articles that were selected for inclusion based on the title and abstract also advanced to full-text review.

Data extraction
Three investigators (L.Y., S.T., and R.L.) independently used a standardized data extraction form to collect the following: first author name, publication year, country and setting (community, outpatient clinics, inpatient), participant characteristics (age, study inclusion criteria), sample size, study design, years of study, depression screening tool (PHQ-2 and/or PHQ-9), gold standard comparison, screening tool and gold standard administration protocol (e.g., timing and blinding procedures), outcome measures, and main results (with a focus on biometric properties and internal consistency). For any missing data, if valid author contact information was available, we reached out directly to request information, allowing for a response time of 2 months. Data from articles published only in Chinese were abstracted by two bilingual investigators (L.Y. and R.L.).

Biometrics and meta-analysis
Sensitivity and specificity from the studies were combined in random effects meta-analyses separated by PHQ-9 and PHQ-2. Subgroups within these groups were analyzed by the ideal cutoff for the questionnaires (a cutoff of 10 for PHQ-9 and a cutoff of 3 for PHQ-2). Results are presented in forest plots with the random pooled effect size (sensitivity or specificity) and 95% confidence bounds. As studies did not provide complete sets of their original data, a meta-analysis could not be performed on the Cronbach's alpha or area under the curve (AUC). We, therefore, present a range of Cronbach's alpha and AUCs for the PHQ-9 and PHQ-2 for all included studies, as well as test/retest reliability for those studies with this information.

Quality assessment
Two investigators (L.Y. and S.T.) independently assessed the methodological quality of the studies using the QUADAS-2, a tool specifically developed to assess the quality of diagnostic accuracy studies inclu-ded in systematic reviews. The tool investigates potential for bias in four domains: (1) patient selection, (2) index test, (3) reference standard (including blinding), and (4) flow and timing. As recommended by the tool development team, 34 we made several modifications according to relevance to our research question, and classified each study as overall low, high, or indeterminate risk for bias after taking all four domains into consideration. For the full version of our modified QUADAS-2, please see Appendix A2. This systematic review is considered exempt by the University of California, San Francisco IRB criteria.

Study characteristics
Our search strategy yielded 513 articles, of which 20 were included (see Fig. 1 for PRISMA flow diagram). Table 1 summarizes the characteristics of the included studies. As we were looking specifically for screening tool validation studies, all included studies had a cross-sectional design. Six studies [35][36][37][38][39] were available only in Chinese; data were abstracted and translated  by our team (L.Y., R.L.) for analysis. Seventeen studies were set in mainland China, one in Hong Kong, one in Taiwan, and one in the United States. Across studies, we observed a wide range in sample size (n = 148-2639) as well as clinical setting (primary care vs. specialty outpatient care vs. hospital inpatients). Samples ranged from patients with specific medical conditions such as cardiac disease or psoriasis, to general primary care patients, to individuals in the community. All samples consisted of patients who were Chinese speaking only, except for the study set in the United States. That study stated that the majority of their patient population were ''less acculturated Chinese immigrants,'' although they did not identify the proportion of their sample that truly had LEP. 25 All 20 included studies examined the PHQ-9. Of these, Yeung et al utilized a bilingual (English and Chinese) PHQ-9, which the investigator team translated themselves. 25 All other teams examined only the Chinese language PHQ-9; Liu et al 40 44 For all studies, except one, the PHQ-9 was completed before the gold standard (therefore blinded to the results of the clinical interview); the remaining article, Chen et al, 35 did not specify the details of the study protocol and we could not confirm the details by reaching out to the investigators. For 13 studies, we were able to identify the intervals between conducting the PHQ-9 and the gold standard, which ranged from immediate to four weeks. We were able to confirm that investigators conducting the gold standard were blinded to PHQ-9 results for 10 studies. For eight studies, only a select subset of the study sample (selected a priori) was asked to complete the gold standard for comparison.

Biometrics and meta-analysis
For the Chinese language PHQ-9, internal consistency varied across studies, with the Cronbach's alpha ranging from 0.765 to 0.938. Five studies preset a cutoff value and calculated the Chinese language PHQ-9 sensitivity/specificity using that cutoff value-two stud-ies chose 10 as the cutoff, 37,39 two studies chose 15, 25,38 and one study chose 9. 45 Studies that did not use a preset cutoff value and conducted receiver operating characteristic (ROC) curve analyses found that the area under the ROC curve ranged from 0.78 to 0.977. These studies identified ideal cutoff values ranging from 6 to 11, with 10 being the most common (6/15 studies).
For the nine studies that identified or preset a cutoff value of less than 10, the meta-analysis demonstrated a pooled sensitivity of 0.91 (95% CI 0.86-0.94) and a pooled specificity of 0.85 (95% CI 0.82-0.88; see Fig. 2). For the eleven studies that identified or preset a cutoff value of greater than or equal to 10, the pooled sensitivity was 0.86 (95% CI 0.83-0.89) and the pooled specificity was 0.88 (95% CI 0.82-0.93). Overall, the pooled sensitivity of studies evaluating the Chinese language PHQ-9 was 0.88 (95% CI 0.86-0.90), and the pooled specificity was 0.87 (95% CI 0.83-0.91).
Ten studies additionally analyzed the test/retest reliability for the Chinese language PHQ-9. Of these, four retested their patients after an interval of 1 week, resulting in coefficients ranging from 0.824 to 0.955 38,44,45 ; five retested their patients after 2 weeks, resulting in coefficients ranging from 0.70 to 0.87 41,42,46,48,49 ; and one study retested their patients after 4 weeks, resulting in a coefficient of 0.873. 50 Patient health questionnaire-2 Seven of our included studies used a subset of their data to examine the Chinese language PHQ-2. Cronbach's alpha ranged from 0.727 to 0.785. The area under the ROC curve (AUC) ranged from 0.802 to 0.94. Five studies identified 3 as the ideal cutoff score; at this cutoff, the pooled sensitivity was 0.84 (95% CI 0.79-0.88) and the pooled specificity was 0.89 (95% CI 0.81-0.96; Fig. 3). For the two remaining studies that identified the ideal cutoff value as 2, the pooled sensitivity was 0.87 (95% CI 0.85-0.88) and the pooled specificity was 0.81 (95% CI 0.79-0.83). Overall, the pooled sensitivity of studies evaluating the Chinese language PHQ-2 was 0.84 (95% CI 0.80-0.87), and the pooled specificity was 0.87 (95% CI 0.78-0.93).
Only two studies evaluated the test/retest reliability for the Chinese language PHQ-2. One study evaluated the reliability after 1week, resulting in a coefficient of 0.813; another study evaluated the reliability after 4 weeks, resulting in a coefficient of 0.829. 42 Quality assessment After assessment with our modified QUADAS-2 tool, none of the included studies had a high risk of bias  (Table 2). Twelve included studies had a low risk of bias, while eight studies had an indeterminate risk of bias, attributed to missing key information, including whether study team members conducted the PHQ while blinded to the gold standard results or vice versa. In particular, Ye et al 44 did not describe their exclusion criteria when enrolling patients; this limited our ability to evaluate, for example, how much of their sample had pre-existing psychiatric illness that would invoke bias when studying the efficacy of a depression screening tool.  Overall risk of bias

Discussion
In this systematic review, we found that the available literature supports the use/validity of the Chinese PHQ-9 and PHQ-2 as a tool for screening for depression in monolingual Chinese patients. We found high sensitivity and specificity for depression for both the PHQ-2 and PHQ-9 among individuals who spoke Chinese languages, across a variety of clinical settings and with a range of clinical comorbidities. Our findings are consistent with the two previous systematic reviews that have been conducted in this area. Our review has unique strengths, including a greater number of studies, comparison to gold standard clinical interviews as an inclusion criterion, studies encompassing broad geographic settings and patient populations and, therefore, better generalizability, and examination of both English-and Chinese-language articles. The studies included in our review that evaluated the validity of the Chinese PHQ-9 at multiple cutoff scores identified different ideal cutoffs, ranging from 6 to 11, with 10 identified as the optimal cutoff score in 6 of 15 studies. This is consistent with how the PHQ-9 is currently used in primary care settings, with a score of 10 as the cutoff for a positive screen across languages. 21 Notably, it is also comparable to the English language PHQ-9 at this cutoff. 52 However, not all studies in our review agreed on this cutoff, with many identifying lower scores as the optimal cutoff for diagnosing depression. This points to the need for further investigation to ensure that we are not missing depression in Chinese patients with LEP, who are already at high risk of depression under-recognition and undertreatment.
Additionally, the English language PHQ-9 can also be used to evaluate symptom severity, with scores of 5, 10, 15, and 20 indicating mild, moderate, moderately severe, and severe depression, respectively. 21 Of the studies we found, only Chen et al identified score cutoffs for different levels of symptom severity: 6, 12, and 15 for mild, moderate, and severe depression. 34 Yeung et al indirectly acknowledged this by setting the cutoff for a positive screen at the higher score of 15 instead of 10, to identify subjects whose depression was significant enough to warrant treatment; Xu et al also set their cutoff at 15 and did not state their justification, but presumably had similar reasoning. 25,38 Although our review did not explicitly address this question, for providers who wish to use the PHQ-9 to monitor response to treatment, further research could help confirm ideal cutoffs for depression symptom severity.
Less than half the articles we found evaluated the PHQ-2 in addition to the PHQ-9; all seven of these articles validated the PHQ-2 as a screening tool, with four out of seven studies agreeing on 3 as the best cutoff value for screening positive for depression, as is used for the English language PHQ-2.
We recognize several limitations to our systematic review. First, despite a rigorous search for relevant articles, it is possible that some were missed; in particular, although we were able to include six Chinese language studies that were identified through our search, we did not specifically examine the Chinese language literature or databases and may have missed studies that were published only in Chinese. However, as our purpose is to apply these findings to monolingual Chinese speakers in the United States, we felt it was appropriate to limit to Chinese language articles in English language databases for this review.
Second, we did not target any specific practice setting for our search; our ability to make strong recommendations for clinicians may thus be limited by the variability in patient comorbidities or countries of residence among the included studies. However, the broad range of populations represented in our review improves generalizability for the PHQ-2/9 as a broad screening tool. Third, although our search was internationally targeted, most studies that fit our inclusion criteria were conducted in mainland China.
Although a single study was conducted in the United States, Yeung et al, similarly found high sensitivity and specificity, the dearth of studies around the use of the Chinese PHQ in settings with patient/provider language discordance points to the need for more research in this direction. In the United States, while Asian Americans account for 5.7% of the population, less than 1% of National Institutes of Health funding goes to research on Asian American health. 54 Additionally, for immigrant populations, the preferred language is frequently used as a measure of acculturation 55 ; U.S. patients preferring the Chinese language PHQ are therefore more likely to be recent immigrants and/or less acculturated to the United States, implying some crossapplicability to research conducted with nondiasporic Chinese patients. Fourth, the quality of the diagnoses made through clinical interviews may vary depending on the individual investigator and the specific clinical interview used, which could affect the internal validity of our included studies.
Fifth, as the PHQ-2 and PHQ-9 are usually given as written questionnaires, we did not choose to distinguish between the various dialects of spoken Chinese (as all literate speakers read the same written form). However, in studies where some questionnaires were verbally administered by a research assistant, variation between the spoken dialects may have impacted tool validity. Finally, although we contacted study authors when possible to inquire about missing information, in several cases, we were unable to ascertain the exact translation of the Chinese PHQ-9 or PHQ-2 used, whether the studies appropriately excluded patients with pre-existing psychiatric illness, or whether the investigators were double blinded to the PHQ and gold standard results. This particularly impacted our evaluation of the six studies published only in Chinese, which did not provide contact information for the study authors.

Conclusion
Chinese patients with LEP and depression are more likely to be underdiagnosed and undertreated, leading to worse health outcomes and quality of life. As the mental health burden for the Asian American community has increased during the COVID-19 pandemic with the rise in racism and violence, it is more urgent than ever for us to ensure we are using the right tools to identify patients with depression. Despite the limitations of our review, we found strong evidence supporting the accuracy of Chinese language versions of the PHQ-9 and PHQ-2 for screening for depression across practice settings. However, studies reported a wide range of cutoff scores for the PHQ-9, with many demonstrating high sensitivity and specificity at lower cutoff scores, alluding to the possibility that the ideal cutoff score for Chinese monolingual patients may differ from the score used for English speakers. If so, the PHQ-9 as currently used in practice may miss depressive symptoms in some Chinese monolingual patients.
To effectively address mental health disparities for patients with LEP in the United States, more research is necessary to investigate this possibility specifically among Chinese monolingual patients living in the United States and to establish the validity of depression screening tools in other commonly spoken non-English languages. Finally, once the research is robust, medical institutions and professional bodies must standardize the uptake of evidence-based depression screening tools and interventions to truly impact patient care.