Diagnostic accuracy of deep learning using speech samples in depression: a systematic review and meta-analysis

Abstract
Objective: This study aims to conduct a systematic review and meta-analysis of the diagnostic accuracy of deep learning (DL) using speech samples in depression.
Materials and Methods: This review included studies reporting diagnostic results of DL algorithms in depression using speech data, published from inception to January 31, 2024, on PubMed, Medline, Embase, PsycINFO, Scopus, IEEE, and Web of Science databases. Pooled accuracy, sensitivity, and specificity were obtained by random-effects models. The Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool was used to assess the risk of bias.
Results: A total of 25 studies met the inclusion criteria and 8 of them were used in the meta-analysis. The pooled estimates of accuracy, specificity, and sensitivity for depression detection models were 0.87 (95% CI, 0.81-0.93), 0.85 (95% CI, 0.78-0.91), and 0.82 (95% CI, 0.71-0.94), respectively. When stratified by model structure, the highest pooled diagnostic accuracy was 0.89 (95% CI, 0.81-0.97) in the handcrafted group.
Discussion: To our knowledge, our study is the first meta-analysis on the diagnostic performance of DL for depression detection from speech samples. All studies included in the meta-analysis used convolutional neural network (CNN) models, which makes it difficult to assess the performance of other DL algorithms. The handcrafted model performed better than the end-to-end model in speech depression detection.
Conclusions: The application of DL in speech provided a useful tool for depression detection. CNN models with handcrafted acoustic features could help to improve the diagnostic performance.
Protocol registration: The study protocol was registered on PROSPERO (CRD42023423603).


Background and objective
Depressive disorder is a common mental disorder, involving low mood, loss of interest in everyday life, and other symptoms, which lead to burden, disability, and even suicide. 1 The World Health Organization reports that 280 million people were diagnosed with depression in 2019, including almost 10% of children and adolescents. 2 Early recognition of depression reduces the complications of treatment, shortens the course of the disease, and provides positive treatment outcomes. 3 Currently, clinical symptoms, supplemented with objective physiological indicators and questionnaires, are used to diagnose depression. Clinical symptoms must last for at least 2 weeks to confirm a diagnosis of depression, leaving patients with limited care or treatment during the early stage of the disorder. 4 Moreover, subjective factors, such as patients' expressions, cultures, and attitudes, may make the diagnosis of depression more complex, with a greater probability of misdiagnosis. Therefore, recent studies suggest using signal processing methods, including audio, 5 videos, 6 and electroencephalogram (EEG), 7 to increase the diagnostic accuracy of depression.
Speech has been shown to be an important biomarker for depression detection, since people with depression tend to speak at a lower rate, pause for longer, and vary their pitch less than people without depression. 8,9 Compared with other biomarkers, such as videos, EEG, and skin conductance, speech has many advantages. First, it is easy and noninvasive to collect using smartphones or computers. Second, it contains various information related to depression symptoms, and this information is difficult to hide. Third, it reduces privacy exposure for patients. 5 A neural network is a series of connected weighted nodes that models the biological nervous system function of the human brain. 10 Neural networks provide effective tools in speech processing since they can automatically learn useful features from raw speech, reducing the subjectivity of manual feature selection. 11 The successful applications of deep learning (DL) algorithms in speech signal processing and classification present a novel opportunity to improve the performance of automatic depression detection. 12 Recent reviews addressed various psychiatric disorders and artificial intelligence techniques, and they gave a comprehensive explanation of the importance of applying artificial intelligence to support clinical diagnosis. 5,11,13,14 However, to the best of our knowledge, few reviews to date have focused on the use of DL algorithms to detect depression in speech. Thus, we aim to provide a systematic review and meta-analysis to evaluate the diagnostic performance of DL algorithms in detecting and classifying depression using speech samples.

Methods
This review was conducted according to the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies statement. 15,16 The study protocol was registered on PROSPERO (CRD42023423603).

Search strategy
We searched the following databases: PubMed, Medline, Embase, PsycINFO, Scopus, IEEE, and Web of Science, up to January 31, 2024, using keywords including, but not limited to, combinations of the following: depressi*, depressive disorder*, deep learning, machine learning, artificial intelligence, neural network, automat*, sound, speech, voice, acoustic*, audio, vowel, vocal, pitch, prosody. The complete search strategy is presented in the Supplementary Material.

Inclusion and exclusion criteria
This review includes studies evaluating the diagnostic accuracy of DL algorithms in depression using speech samples. After screening, we excluded studies for which no full text was available.

Study selection and data extraction
Titles and abstracts of the retrieved literature were screened for eligibility. Relevant articles were read in full, and data were extracted from the articles that met all inclusion criteria. Two authors (L.L. and L.L.) conducted all these steps individually, and a third researcher (Y.W.) resolved disagreements and uncertainties in the study selection and data extraction processes through discussion.
The following data were extracted from the included studies independently: title, authors, year of publication, diagnosis standard (scales), features, classification methods, model structure, and diagnostic test results (TP, TN, FP, and FN).

Statistical analysis
Sensitivity, specificity, and accuracy were calculated with a 95% CI based on the TP, TN, FP, and FN values that were extracted from the included studies for meta-analysis. The accuracy can be calculated as:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

The interpretation of accuracy could be affected by the prevalence of depression because, in cases of very high or very low prevalence, accuracy might not provide a complete picture of a test's performance. In our included studies, the prevalence is neither too high nor too low. Therefore, our pooled estimates of accuracy, especially their interpretation, are unlikely to be affected by the varying prevalence of the condition. We used I² to measure the heterogeneity across studies and subgroups, with 25%, 50%, and 75% considered as thresholds indicating low, moderate, and high heterogeneity, respectively. 17 P-values were used to assess statistical significance, and P < .05 was considered statistically significant. A funnel plot was used to assess publication bias. All the analyses were performed using RStudio version 12.0 with the meta package. 18 Pooled estimates of depression detection in speech using DL algorithms were obtained. A leave-one-out analysis and subgroup analysis were used to assess the sensitivity of the pooled estimates and to explain the heterogeneity among the studies. A summary receiver operating characteristic (SROC) curve, which represents the performance of a diagnostic test, was also built to describe the relationship between test sensitivity and specificity. 19

Assessment of bias
QUADAS-2, recommended by the Cochrane Collaboration, was used by two authors (L.L. and L.L.) to evaluate the risk of bias in each study, and the uncertainties were discussed with a third researcher (Y.W.). QUADAS-2 evaluates 4 key domains: patient selection, index test, reference standard, and flow and timing. Each domain is analyzed in terms of risk of bias, with particular attention given to concerns about applicability in the first three domains. The assessment of bias was conducted using the Review Manager Software version 5.3. 20
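The sensitivity, specificity, and accuracy calculations described in the statistical analysis can be illustrated with a minimal Python sketch. The confusion-matrix counts are hypothetical, and the Wald (normal-approximation) interval is an assumption made for illustration; the text does not state which interval method was used for individual studies.

```python
import math

def proportion_ci(successes, total, z=1.96):
    """Point estimate and Wald 95% CI for a proportion, clipped to [0, 1].

    The Wald interval is an illustrative assumption, not necessarily the
    method used in the review.
    """
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

def diagnostic_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": proportion_ci(tp + tn, tp + tn + fp + fn),
        "sensitivity": proportion_ci(tp, tp + fn),   # TP / (TP + FN)
        "specificity": proportion_ci(tn, tn + fp),   # TN / (TN + FP)
    }

# Hypothetical counts for a single study (not taken from any included paper).
metrics = diagnostic_metrics(tp=40, tn=45, fp=5, fn=10)
for name, (estimate, low, high) in metrics.items():
    print(f"{name}: {estimate:.2f} (95% CI, {low:.2f}-{high:.2f})")
```

Each metric is a simple proportion, so the same interval routine serves all three; only the numerator and denominator change.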

Literature search
A total of 2013 records remained after duplicate removal. After screening the title and abstract, 1804 articles were excluded, leaving 209 articles for full-text review. Of these, 56 articles investigated speech signal processing in detecting depression but were excluded because they did not use DL algorithms. Another 57 articles were excluded because they used not only speech samples but also other formats of data, including texts, sentiments, images, videos, and EEG. Finally, 71 articles were excluded since they did not report the TP, TN, FP, and FN values used in the meta-analysis, and the remaining 25 articles were included in the systematic review. Some of them used the same dataset to explore the performance of different DL models, so we selected the study with the highest accuracy score for each dataset for the meta-analysis, and 8 studies were included (Figure 1). Of all included studies, the Distress Analysis Interview Corpus-Wizard-of-Oz (DAIC-WOZ) set is the most-used (n = 16) dataset, and we used these studies for the qualitative analysis from a technical perspective. 21 Publications showed an upward trend over the past 3 years. Twenty papers (80%) were published since 2022, and all eligible papers were published after 2019. Table 1 summarizes the main characteristics of all eligible papers; the ones in bold font were selected for the meta-analysis.

Datasets and languages
Several speech depression datasets were used to train models, and the speech was generally recorded during the diagnostic conversation between clinicians and participants. DAIC-WOZ, which is part of the Distress Analysis Interview Corpus developed in the 2016 Audio-Visual Emotion Challenge (AVEC), is the most commonly used dataset in speech depression detection. 47 In addition, the Multimodal Open Dataset for Mental-disorder Analysis (MODMA), 48 Hungarian Depression Speech Database, 42 Sonde Health Free Speech (SH2-FS), 49 three Mandarin datasets, 40,43,44 and one Thai dataset recruited by researchers were also used in the included studies. 39 Some studies used more than 1 dataset to test the performance of their proposed model. Among all included studies, 17 studies (68%) used English datasets, 5 studies (20%) used Mandarin datasets, 2 studies (8%) used Hungarian datasets, and only one used a Thai dataset.

Diagnostic scales
In speech depression datasets, diagnostic scale scores are set as training labels. Patient Health Questionnaire-8 (PHQ-8) scores of each participant were recorded in DAIC-WOZ, and a score of 10 was set as the threshold to decide whether the participant was diagnosed with depression or not. 50 Other questionnaires, such as the Hamilton Rating Scale for Depression (HAMD), 51 Beck Depression Inventory-II (BDI-II), 52 and PHQ-9, 53 were also used as assessments to detect depression.
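As a minimal sketch of how a scale score becomes a binary training label, the following Python snippet applies the PHQ-8 threshold of 10 mentioned above. Treating a score equal to the threshold as positive (`>=`) is an assumption, as is the function name; the 0-24 range follows from the 8 items each being scored 0-3.

```python
PHQ8_THRESHOLD = 10  # cut-off stated in the text for DAIC-WOZ

def phq8_label(score):
    """Map a PHQ-8 total score (0-24) to a binary label: 1 = depressed.

    Assumption: a score equal to the threshold counts as depressed.
    """
    if not 0 <= score <= 24:
        raise ValueError(f"PHQ-8 total must be between 0 and 24, got {score}")
    return 1 if score >= PHQ8_THRESHOLD else 0

# Hypothetical participant scores.
scores = [3, 9, 10, 17]
labels = [phq8_label(s) for s in scores]
print(labels)
```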

Speech processing
In the development of an automatic speech recognition system, preprocessing is considered the first phase to train a robust and efficient model. 54 Fourteen studies (87.5%) mentioned at least one speech preprocessing procedure, with 50% of the papers applying various methods to tackle data imbalance, as shown in Figure 2A. This result is unsurprising, given that the DAIC-WOZ dataset consists of 146 depressed subjects and only 43 healthy participants, highlighting the critical issue of data imbalance in achieving good performance. Speech segments of varying lengths are used as inputs to the model, enriching the dataset and accommodating DL models (Figure S12A). Yin and colleagues segmented the speech into 9-second fragments, achieving the highest performance across all studies with 0.94 in accuracy and 0.92 in sensitivity. 31 Moreover, fragments over 10 seconds in length exhibited the highest specificity (Table 2). Figure S12B shows that all studies employed either a train-test split or a train-validation-test split to prevent overfitting to the training data and to accurately assess the model's performance.
Feature engineering is one of the most crucial steps of traditional machine learning-based speech depression detection research, and the main purpose of many studies considered in this review is to avoid this step by developing DL for automatic feature learning. 26,27,35,55 Based on the results shown in Table 2, models based on low-level descriptors (LLDs) and Mel-frequency cepstral coefficients (MFCCs) achieved over 80% accuracy, surpassing other types of features. In addition, 3 included studies compared the performance of depression detection using acoustic features alone with that of multimodal depression detection. 24,25,35 All 3 studies show that using multimodal features enhances the performance of speech depression detection models.
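Two of the preprocessing ideas above, fixed-length segmentation and handling class imbalance, can be sketched in a few lines of Python. This is an illustrative sketch only: the non-overlapping windowing, the inverse-frequency class weights, and all names are assumptions, and the included studies may have used other segmentation or rebalancing methods.

```python
def segment(samples, sample_rate, seconds=9.0):
    """Split a waveform into non-overlapping fixed-length fragments.

    A trailing remainder shorter than one fragment is discarded.
    """
    size = int(sample_rate * seconds)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class frequency (one common remedy)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}

# Hypothetical 20-second recording at a toy sampling rate of 10 Hz.
waveform = list(range(200))
fragments = segment(waveform, sample_rate=10, seconds=9.0)
print(len(fragments), [len(f) for f in fragments])

# Hypothetical imbalanced labels: three of one class, one of the other.
weights = inverse_frequency_weights([1, 1, 1, 0])
print(weights)
```

Segmenting multiplies the number of training samples per recording, which is one reason sample counts differ so widely between studies that segment and those that do not.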

Deep learning methodology
Compared with clinical diagnosis, DL algorithms can learn high-level features automatically. In this review, we divided the DL models used in the included studies into the following groups: convolutional neural network (CNN), CNN-long short-term memory (LSTM), CNN-support vector machine (SVM), LSTM, and CNN-Transformer. Figure 2B shows that CNN is the most commonly used DL algorithm, with 56.25% of the studies using it directly as the depression detection model. Additionally, 25% of studies employed CNN as a feature extraction or dimension reduction method, followed by the use of LSTM (12.5%) or SVM (12.5%) as a classifier for depression detection. The CNN-Transformer architecture shows the highest performance among all studies (Table 2), which indicates that the transformer holds promising potential for depression detection using speech data. 31 In Figure 3, we present a visualization of the distribution of hyperparameters in DL models. As shown in Figure 3, 50% of studies did not report batch size, 50% did not report epochs, 50% did not report learning rate, 56.25% did not report loss functions, and 43.75% did not report optimizers. These hyperparameters may affect the model's performance to some extent, highlighting the importance of selecting appropriate hyperparameters. The number of neural network layers did not exceed 5 in most studies (62.5%) under consideration (Figure 3A), and these studies achieved the highest accuracy, which is 0.79 (Table 2). Cross-entropy was the most commonly used (31.25%) loss function, outperforming mean square error in terms of accuracy (0.84), sensitivity (0.70), and specificity (0.91) (Figure 3E and Table 2).

Evaluation measures
The types of performance metrics used by the included studies focusing on speech depression detection are shown in Figure 4A. Most studies (over 80%) used F1-score, accuracy, recall, and precision, which were derived from confusion matrices, to evaluate the performance of the DL models, but these metrics were not commonly used by clinicians in evaluating diagnostic tests. Instead, sensitivity (recall), specificity, and ROC AUC, which are also derived from diagnostic test results, are clinically relevant and commonly used performance measures for diagnostic tests. Based on Figure 4B, we found that the included studies achieved good results in terms of accuracy and specificity (over 75%), but slightly lagged in sensitivity.
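The difference between the machine learning metrics and the clinical metrics discussed above can be made concrete with a short Python sketch. The counts are hypothetical. The example shows that precision, recall, and F1-score do not use the true negatives at all, whereas specificity depends on them, which is one reason specificity has to be reported separately.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Precision, recall (sensitivity), F1-score, and specificity from counts.

    Assumes every denominator is non-zero (true for the toy counts below).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # identical to sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {"precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}

# Two hypothetical test sets that differ only in the number of true negatives.
few_negatives = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
many_negatives = confusion_metrics(tp=40, tn=450, fp=5, fn=10)

print(few_negatives)
print(many_negatives)
```

The F1-score is the same in both cases, while specificity changes with the number of healthy participants correctly classified.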

Comparison between deep learning and machine learning
Four studies compared their proposed methods with machine learning methods. 23,27,31,36 SVM is the most commonly used machine learning algorithm, and 3 studies compared their proposed methods with SVM. 23,27,36 In all 4 studies, the proposed DL methods performed better than machine learning methods.

Summary
The papers reviewed displayed varying degrees of performance in speech depression detection. (1) Data preprocessing: segmentation of speech varied in length across papers, with notable performance achieved through longer segments (more than 5 s). (2) Features: the preference for DL models suggests a shift away from traditional feature engineering, with promising results observed particularly with LLDs and MFCCs. (3) Models: CNN emerges as the predominant choice among DL architectures for depression detection, with CNN-Transformer demonstrating the highest performance. While hyperparameters significantly impact model performance, many studies do not report how they were selected, underscoring the importance of fine-tuning for optimal results. (4) Evaluation: overall, the models using the DAIC-WOZ dataset generally achieved good accuracy and specificity (over 75%), while sensitivity lagged slightly.

Diagnostic accuracy of deep learning in depression detection
Overall, 8 studies with 670 585 preprocessed speech samples in the test sets were included in the meta-analysis, and all these studies were published in the last 5 years (2021-2024). Our study reports the evaluation parameters of accuracy, sensitivity, and specificity. The pooled estimate of classification accuracy for depression detection models was 0.87 (95% CI, 0.81-0.93, I² = 99%). Meta-analysis showed the pooled estimate to be specific (0.85, 95% CI, 0.78-0.91, I² = 99%), but with a lower sensitivity (0.82, 95% CI, 0.71-0.94, I² = 100%). The random-effect model was used due to the high heterogeneity in the meta-analysis. Figure 5 represents the accuracy forest plots of all included studies. The forest plots of the pooled sensitivity and specificity can be found in supplementary files (Figures S1 and S2). The SROC curve for the test is also shown in supplementary files (Figure S3).
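To illustrate how a random-effects pooled estimate and I² of the kind reported above are obtained, here is a minimal Python sketch. It assumes the DerSimonian-Laird estimator applied to untransformed proportions, which is one common choice and not necessarily the configuration used with the meta package in this review. The study accuracies and sample sizes are hypothetical.

```python
import math

def pool_random_effects(proportions, sizes, z=1.96):
    """DerSimonian-Laird random-effects pooling of raw proportions.

    Illustrative assumptions: no transformation of the proportions, and
    every proportion strictly between 0 and 1 (otherwise the variance is 0).
    """
    variances = [p * (1 - p) / n for p, n in zip(proportions, sizes)]
    w = [1 / v for v in variances]                       # fixed-effect weights
    fixed = sum(wi * p for wi, p in zip(w, proportions)) / sum(w)
    q = sum(wi * (p - fixed) ** 2 for wi, p in zip(w, proportions))  # Cochran's Q
    df = len(proportions) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0      # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # heterogeneity in %
    w_star = [1 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * p for wi, p in zip(w_star, proportions)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {"pooled": pooled, "ci": (pooled - z * se, pooled + z * se),
            "tau2": tau2, "i2": i2}

# Hypothetical study accuracies and test-set sizes.
result = pool_random_effects([0.94, 0.80, 0.85], [500, 300, 1000])
print(result)

# Identical studies show no heterogeneity.
homogeneous = pool_random_effects([0.85, 0.85], [100, 100])
print(homogeneous["i2"])
```

With very large test sets the within-study variances become tiny, so even modest differences between studies produce an I² close to 100%, as seen in the pooled results above.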

Sensitivity analysis
Considering the high heterogeneity, subgroup analysis was undertaken to explore the potential factors. I² dropped significantly in specificity (from 100% to 0%), accuracy (from 100% to 78%), and sensitivity (from 100% to 88%) for the end-to-end group in the model structure subgroup. In this situation, the accuracy and the specificity of the handcrafted group (accuracy: 0.89, 95% CI, 0.81-0.97, I² = 100%; specificity: 0.87, 95% CI, 0.78-0.96, I² = 99%) were higher than those of the end-to-end group (accuracy: 0.82, 95% CI, 0.75-0.90, I² = 78%; specificity: 0.80, 95% CI, 0.75-0.85, I² = 0%), but the sensitivity of the end-to-end group (0.84, 95% CI, 0.73-0.95, I² = 88%) was higher than that of the handcrafted group (0.81, 95% CI, 0.64-0.99, I² = 100%). The forest plot of the pooled accuracy for the model structure subgroup is shown in Figure 6, and the other forest plots for the pooled sensitivity and specificity for the model structure subgroup can be found in supplementary files (Figures S4 and S5). Since speech samples in some included studies were segmented from audio recordings, the sample size in one study (n = 663 978) was far larger than the others. A leave-one-out test was conducted to minimize the influence of this particular study. 34 When each study was omitted in turn, the pooled estimates of accuracy (0.85-0.89), sensitivity (0.80-0.87), and specificity (0.82-0.87) changed only slightly. The plots for the leave-one-out results of the pooled accuracy, sensitivity, and specificity can be found in supplementary files (Figures S6-S8).
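The leave-one-out procedure described above amounts to re-pooling the estimates once per study with that study removed. The Python sketch below shows the loop; the sample-size-weighted mean is a deliberately simple stand-in for the random-effects model used in the review, and the study names, estimates, and sizes are hypothetical.

```python
def weighted_mean(studies):
    """Sample-size-weighted mean of study estimates (illustrative stand-in
    for a proper random-effects pooling function)."""
    total = sum(n for _, _, n in studies)
    return sum(est * n for _, est, n in studies) / total

def leave_one_out(studies, pool=weighted_mean):
    """Pooled estimate obtained when each study is omitted in turn."""
    return {
        name: pool([s for s in studies if s[0] != name])
        for name, _, _ in studies
    }

# Hypothetical (name, accuracy, test-set size); study C dominates by size.
studies = [("A", 0.90, 100), ("B", 0.80, 100), ("C", 0.70, 800_000)]

print(f"all studies: {weighted_mean(studies):.3f}")
loo = leave_one_out(studies)
for omitted, estimate in loo.items():
    print(f"omitting {omitted}: {estimate:.3f}")
```

Under sample-size weighting a single very large study dominates the pooled value, so omitting it moves the estimate sharply; a random-effects model dampens this effect because the between-study variance evens out the weights.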

Quality assessment
QUADAS-2 was used to rate the overall methodological quality in our study, and Figures S9 and S10 present the plots illustrating the risk of bias and applicability concerns. The included studies achieved an average score of 3.3 out of 4 in the risk of bias section, and 3.1 out of 4 in the applicability concerns section, thereby affirming the high quality of the studies. The funnel plot (Figure S11) was slightly asymmetric, indicating modest publication bias in all the included studies. Specifically, the shape of the plot suggests that studies with smaller data sizes and low accuracy were less likely to be published.

Summary of key findings
To our knowledge, our study is the first review on the diagnostic performance of DL for depression detection from speech samples providing both a systematic review (narrative summary of 25 studies) and a meta-analysis (quantitative assessment of a subset of 8 studies). We found that across all included studies, the pooled estimate of the accuracy for depression detection was 0.87, and the specificity (0.85) was higher than the sensitivity (0.82). The handcrafted model obtained better evaluation results (accuracy: 0.89) in the subgroup analysis than the end-to-end model (accuracy: 0.82).

Speech features for depression detection
A recent review found that a set of bio-acoustic features, including source, spectral, prosodic, and formant features, could improve the classification performance for depression detection. 56 In addition, Zhao et al reported that acoustic characteristics were associated with the severity of depressive symptoms and might be objective biomarkers of depression. 57 These findings are consistent with the present study, in which the handcrafted model structure gave better performance than the end-to-end model structure. This is because the handcrafted model structure contains various kinds of selected acoustic information, such as source and formants. In addition, our results showed that acoustic features were promising, reliable, and objective biomarkers to support depression diagnosis using DL.

Superior performance of deep learning in depression detection
A recent systematic review (but not meta-analysis) suggested that SVM was the most popular classifier used among all machine learning (ML) methods in depression detection. 58 Bhadra and colleagues merged DL techniques into a single classifier group to compare with other ML algorithms owing to the limited studies accessible, which gave a comprehensive description of all ML algorithms but left room for further research on DL. 58 In the present review, some included studies confirmed that DL surpasses previous ML methods for automated diagnosis of depression, such as SVM, Random Forest, and Gradient Boosting Tree. 27,36,40 As mentioned in the present review, the prevailing emphasis lies on CNN models, and it may be beneficial to explore more DL methods in depression detection. Although DL has less interpretability than other computational methods, it has shown great potential to assist in the diagnosis of depression.

Deep learning model structure strategies
Wu and colleagues summarized in their systematic survey that DL for depression detection could be built in two structures: (1) extract handcrafted acoustic features, and then implement classification methods; (2) feed raw audio or spectrograms into an end-to-end DL architecture that performs both feature extraction and classification by itself. 13 To explore the performance of these two structures, we applied a subgroup analysis of the model structure in the meta-analysis. The pooled estimates of depression detection performance in the handcrafted structure were higher than those in the end-to-end structure, which provided evidence that the good performance of DL might rely on the strategies of model structures. Because the end-to-end deep model lacks interpretability, its application to real-world clinical problems remains limited.

Future development of deep learning
Applying DL algorithms to speech samples to support clinical diagnosis of depressive disorders is novel, but still needs further development. First, the performance of automatic speech depression detection models may be influenced by different languages, cultures, and environments. 59-61 Second, due to the difficulties and privacy issues of collecting depression speech, issues of small sample size and data imbalance need to be solved before training a DL model. Third, the superior performance of CNN-related models may be partly explained by the common interest in CNN, since most studies included in the systematic review focused on optimizing parameters for CNN-related algorithms. Therefore, the performance of other DL algorithms remains to be determined. Fourth, the explainability of DL models is a limitation in speech depression detection. It is difficult to understand how decisions are made by DL, and such understanding is crucial for gaining trust and acceptance in clinical settings.

Clinical and research implication
The increasing prevalence of depression is a significant burden that could overwhelm the capacity of mental health services. Although automated depression detection allows wide screening of a larger population and ameliorates the increasing demand placed on health services, these techniques should still be used as supplementary methods to detect early signs of depression. Despite the positive attitudes of clinicians toward diagnosis-support techniques, rolling out such novel applications on a wider scale remains challenging until knowledge of DL is obtained and experience is acquired in using those techniques in the diagnosis of depression. 62 Therefore, future research should better involve physicians to improve the feasibility of techniques and should include clinical trials to further explore the utility of diagnosis-support tools. In addition, since speech is easy to collect using smartphones, future research can focus on implementing remote monitoring on smartphones to obtain valuable information on real-time response and relapse, support physicians' decisions, and generate immediate diagnostic feedback.

Source of heterogeneity
The pooled results in the meta-analysis showed significant heterogeneity among the studies. There may be many reasons, including the various sample sizes based on speech segmentation, different speech languages and cultures, and different methodologies. In this study, we analyzed subgroup and leave-one-out results to explore the sources of heterogeneity. I² dropped significantly in specificity when dividing studies based on model structure (from 100% to 0%), which indicated that model structure might be the major cause of heterogeneity. In addition, heterogeneity was slightly lower in specificity when omitting the study with the biggest sample size, 34 providing evidence that the speech segmentation methods and the speech sample sizes also influenced the heterogeneity.

Limitations
Our study has several limitations. First, only a limited number of studies were included in the systematic review because most studies did not report the original TP, TN, FP, and FN scores, and this may lead to underpowered pooled estimates. An updated meta-analysis could be performed in the future when source studies are sufficient to make the results more robust. Second, most studies included used the same dataset, so we selected the best performance model from each dataset to ensure the validity and reliability of the meta-analysis. The limited number of studies in the meta-analysis made it difficult to stratify the studies into different subgroups to explore the source of heterogeneity. Third, we did not perform the meta-analysis based on AUROC scores, which are usually used to describe the performance of classification models, since only 3 of the included studies reported AUROC scores. 29,41,44

Conclusions
We conducted a comprehensive systematic review and meta-analysis reporting the application of DL algorithms in speech to detect depression. The review confirms that using DL in speech to support the clinical diagnosis of depression is a promising method with excellent performance. A CNN model with handcrafted acoustic features trained on an appropriately balanced dataset was shown to be the best method in depression detection. Further studies could focus on multilingual and cross-lingual speech depression detection, exploration and optimization of DL algorithms, and the combination of multimodal features. In addition, researchers should report diagnostic evaluation measures, such as sensitivity and specificity, to interpret DL results in real-world clinical settings.

Figure 1. PRISMA flowchart. Study selection for systematic review and meta-analysis.

Figure 2. Speech preprocessing and deep learning models. (A) The number of studies that used preprocessing steps, such as removing silence. (B) The number of studies that used different types of DL models.

Figure 3. Hyperparameter choices. (A) Distribution of the number of neural network layers. (B) The number of studies that used different batch sizes. (C) Distribution of the number of epochs. (D) The number of studies that used different learning rates. (E) The number of studies that used different loss functions. (F) The number of studies that used different optimizers.

Figure 4. Model performance evaluation. (A) The number of studies that used different evaluation methods. (B) The boxplot across studies of accuracy, sensitivity, and specificity.

Figure 5. Forest plot for the pooled accuracy.

Figure 6. Forest plot of the pooled accuracy for the model structure subgroup.

Table 1. Characteristics of the included studies.

Table 2. Performance comparison of characteristics of the studies using the DAIC-WOZ dataset. Notes: Acc., Accuracy; Sens., Sensitivity; Spec., Specificity. The best performance for each characteristic of the studies is shown in bold font.

Journal of the American Medical Informatics Association, 2024, Vol. 31, No. 10