ABSTRACT
Though depression is a common mental health problem with significant impact on human society, it often goes undetected. We explore a diverse set of features based only on spoken audio to understand which features correlate with self-reported depression scores according to the Beck depression rating scale. These features, many of which are novel for this task, include (1) estimated articulatory trajectories during speech production, (2) acoustic characteristics, (3) acoustic-phonetic characteristics and (4) prosodic features. Features are modeled using a variety of approaches, including support vector regression, a Gaussian backend and decision trees. We report results on the AVEC-2014 depression dataset and find that individual systems range from 9.18 to 11.87 in root mean squared error (RMSE), and from 7.68 to 9.99 in mean absolute error (MAE). Initial fusion brings further improvement; fusion and feature selection work is still in progress.
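For readers unfamiliar with the two error measures reported above, the following is a minimal sketch of how RMSE and MAE compare predicted depression scores against self-reported ones. The function names and the example scores are illustrative assumptions; they are not taken from the evaluation system or from the AVEC-2014 data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length score sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length score sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical self-reported scores and system predictions (made-up values).
reported = [10, 20, 30, 5]
predicted = [12, 18, 33, 4]

print(f"RMSE = {rmse(reported, predicted):.2f}")
print(f"MAE  = {mae(reported, predicted):.2f}")
```

Because RMSE squares each error before averaging, it penalizes large individual misses more heavily than MAE does, which is why the RMSE figures in the abstract are consistently higher than the MAE figures.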
Index Terms
- The SRI AVEC-2014 Evaluation System