Current approaches to the recognition of emotion in speech usually rely on statistical feature information obtained by applying functionals at the turn or chunk level. Yet it is well known that important information on temporal sub-layers, such as the frame level, is thereby lost. We therefore investigate the benefits of integrating such information into the turn-level feature space. For frame-level analysis we use GMMs for classification, with 39 MFCC and energy features after cepstral mean subtraction (CMS). In a subsequent step, the output scores are fed forward into a turn-level SVM emotion recognition engine operating on a large feature space of roughly 1.4k features. Here we use a variety of low-level descriptors and functionals to cover prosodic, speech quality, and articulatory aspects. Extensive test runs are carried out on the public databases EMO-DB and SUSAS. Speaker-independent analysis is addressed by speaker normalization. Overall, the results strongly emphasize the benefits of feature integration across diverse time scales.
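To make the two-stage fusion concrete, the following is a minimal sketch, not the authors' implementation: it assumes one GMM per emotion class trained on frame-level MFCC+energy vectors, whose per-turn log-likelihood scores are appended to the turn-level functional features before SVM classification. All function and variable names (e.g. train_frame_level_gmms, build_turn_features) are hypothetical.

```python
# Hypothetical sketch of frame/turn-level fusion; not the paper's original code.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_frame_level_gmms(frames_by_class, n_components=8):
    """Fit one GMM per emotion class on frame-level MFCC+energy vectors."""
    return {label: GaussianMixture(n_components=n_components).fit(frames)
            for label, frames in frames_by_class.items()}

def gmm_score_vector(gmms, turn_frames):
    """Per-class average frame log-likelihood for one turn (the 'output scores')."""
    return np.array([gmms[label].score(turn_frames) for label in sorted(gmms)])

def build_turn_features(turn_functionals, gmms, turn_frames):
    """Append the frame-level GMM scores to the turn-level functional features."""
    return np.concatenate([turn_functionals, gmm_score_vector(gmms, turn_frames)])

# Usage (assumed data layout):
#   frames_by_class : dict mapping emotion label -> stacked frame vectors
#   X_func          : list of turn-level functional feature vectors
#   frames_per_turn : list of per-turn frame matrices, y : turn labels
# gmms = train_frame_level_gmms(frames_by_class)
# X = np.vstack([build_turn_features(f, gmms, fr)
#                for f, fr in zip(X_func, frames_per_turn)])
# clf = SVC(kernel="linear").fit(X, y)
```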
Cite as: Vlasenko, B., Schuller, B., Wendemuth, A., Rigoll, G. (2007) Combining frame and turn-level information for robust recognition of emotions within speech. Proc. Interspeech 2007, 2249-2252, doi: 10.21437/Interspeech.2007-611
@inproceedings{vlasenko07_interspeech,
  author={Bogdan Vlasenko and Björn Schuller and Andreas Wendemuth and Gerhard Rigoll},
  title={{Combining frame and turn-level information for robust recognition of emotions within speech}},
  year=2007,
  booktitle={Proc. Interspeech 2007},
  pages={2249--2252},
  doi={10.21437/Interspeech.2007-611}
}