A survey on perceived speaker traits: Personality, likability, pathology, and the first challenge

https://doi.org/10.1016/j.csl.2014.08.003

Abstract

The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state-of-the-art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks.

Introduction

From 2009 to 2012, challenges (Schuller et al., 2009, Schuller et al., 2010, Schuller et al., 2011, Schuller et al., 2012, Schuller et al., 2013a, Schuller et al., 2013) featuring several different aspects of paralinguistics were organised at the INTERSPEECH conferences: the topics of interest were not what the speaker said, i.e., word recognition, or the semantics behind the words, e.g., hot spots or ontologies, but how it was said. Pertinent information can be found between words (vocal, non-verbal events), it can be modulated onto the word chain (typically supra-segmental phenomena such as prosody or voice quality), or it can be encoded in the (types of) words chosen and in the connotations of these words. Catalogues of (short-term) speaker states such as emotions and of (long-term) speaker traits such as gender or personality are given in Schuller et al. (2013) and Schuller and Batliner (2014). In the 2012 challenge, and accordingly in the present article, we address speaker traits whose reference labels were obtained by perceptual annotation and not by some ‘objective’ measurement such as placing subjects on a scale to find out about their weight, or simply by deciding between male and female.

There are different definitions for the field that deals with ‘how’ instead of ‘what’; traditionally, paralinguistics is mostly conceived as dealing with the non-verbal, vocal aspects of communication, sometimes including, sometimes excluding multi-modal behaviour such as facial expression, hand gesture, gait, body posture. Here, we follow the definition given in Schuller and Batliner (2014): paralinguistics is “[...] the discipline dealing with those phenomena that are modulated onto or embedded into the verbal message, be this in acoustics (vocal, non-verbal phenomena) or in linguistics (connotations of single units or of bunches of units).” Thus, we exclude multi-modality but include verbal phenomena: although most of the contributions to our challenges so far concentrated on acoustics, i.e. on vocal phenomena modulated onto or embedded into the verbal message, we do not want to exclude linguistic approaches such as the modelling of interjections, hesitations, part-of-speech, or n-grams.

Speech is produced by speakers, and when we aim at paralinguistics, a specific type of speech (friendly speech, pathological speech) characterises a specific type of speaker: such speakers display friendliness or pathological speech traits. Thus, we could subsume all these phenomena under Speaker Characterisation or Speaker Classification, as was done by Müller (2007, p. V): “[...] the term speaker classification is defined as assigning a given speech sample to a particular class of speakers. These classes could be Women vs. Men, Children vs. Adults, Natives vs. Foreigners, etc.”. Ultimately, it is simply a matter of perspective whether we call the object of our investigation “type of speech” (indicated by specific speech characteristics) or “speaker traits” (indicated by specific speech characteristics extracted from the speech of specific speakers).

Irrespective of the term chosen, it is always about assigning one individual sample (speech or speaker) to one of n groups (classes) of speakers, k = 1, …, n; the larger n is, the more likely we are to employ regression procedures instead of classification. Of course, it is always possible to map more or less continuous attributions such as rating scales onto a few classes. For challenges like the present one, we as organisers have to know which class a speaker in the test set belongs to. As mentioned above, this ‘reference’ (or ‘ground truth’, ‘gold standard’) can be obtained by (sort of) objective measures (for instance, speaker weight classes following the ‘body mass index’) or by perceptive evaluation. In this challenge on perceived speaker traits, we presented three sub-challenges where all speakers were assigned to (two) different classes, based on perceptive evaluation.

Perceptual judgements as the basis for reference classes set specific edge conditions: basically, they mostly result in ranked/ordinal scales; nevertheless, parametric procedures such as Pearson's correlation are often used. Human annotators do not always agree; thus, we need some measure of agreement and some method for ending up with one ‘unified’ label per token. This is normally the mean of the rating scale scores of all annotators. If we aim at classes, we have to partition the scale at appropriate points (mean, median, etc.).
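
As a minimal sketch of this aggregation step (hypothetical ratings on a 7-point scale; the concrete scale and partition point are design decisions of the corpus builders):

```python
import numpy as np

# Hypothetical ratings: rows = tokens (speech samples), columns = annotators,
# values = scores on a 7-point ordinal rating scale.
ratings = np.array([
    [5, 6, 4],
    [2, 3, 2],
    [7, 6, 6],
    [3, 4, 3],
])

# One 'unified' label per token: the mean over all annotators' scores.
unified = ratings.mean(axis=1)

# Mapping the (more or less) continuous scores onto two classes by
# partitioning the scale at the median of the unified labels.
labels = (unified > np.median(unified)).astype(int)  # 1 = 'high', 0 = 'low'
print(unified, labels)  # [5.  2.33 6.33 3.33] -> [1 0 1 0]
```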

When some of the authors started organising challenges back in 2009, the main motivation was to establish a certain standard of comparability in the field of Computational Paralinguistics, by introducing concepts like

  • a partitioning of the database into train, development, and test data; previously, there were often only train and test partitions, and researchers defined the partitions of the very same corpus in different ways

  • a clear-cut stratification of subjects for the partitions, if necessary and feasible, for instance, into male/female, old/young, etc.

  • the ‘open microphone setting’, which means that all data recorded and available should be processed; this pertains especially to realistic data, which often were preselected based on labeller agreement, quality of the recordings, and the like

  • adequate performance measures such as Unweighted Average Recall (UAR), that is, the mean, unweighted by the number of instances in each class, of the per-class recalls on the diagonal of the confusion matrix; especially for more than two classes, this measure is more adequate than the usual Weighted Average Recall (a minimal sketch contrasting the two follows after this list)

  • both feature extraction and machine learning procedures done with open source tools, to guarantee strict comparability (e.g. of different features, using exactly the same learning algorithm, and of various learning algorithms, using exactly the same features) and repeatability (ensuring, also by means of software configuration management, that baseline results can be reproduced by anyone with access to the data and open source software, at any time)

  • comparability between studies both within the setting of the challenge (this is easy to obtain because the organisers can define the settings in a strict way) and later on, after the challenge (this cannot be ascertained in a strict way, of course, but authors often refer to and apply the challenge settings)
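
To make the difference between the two measures concrete, here is a minimal sketch (illustrative numbers, plain NumPy; not the challenge evaluation code) contrasting UAR with Weighted Average Recall (WAR), i.e., overall accuracy:

```python
import numpy as np

# Illustrative confusion matrix for a two-class task
# (rows = reference, columns = hypothesis); class 0 is five times
# more frequent than class 1.
cm = np.array([
    [90, 10],   # class 0: 100 instances, 90 recognised correctly
    [10, 10],   # class 1:  20 instances, 10 recognised correctly
])

# Per-class recall: diagonal divided by the row sums.
recall = np.diag(cm) / cm.sum(axis=1)

uar = recall.mean()                 # unweighted mean: (0.90 + 0.50) / 2 = 0.70
war = np.diag(cm).sum() / cm.sum()  # weighted by class size: 100 / 120 ~ 0.83

print(f"UAR = {uar:.2f}, WAR = {war:.2f}")
```

On such imbalanced data, a trivial classifier that always hypothesises the majority class would reach a WAR of about 0.83 but only 0.50 UAR; chance level for UAR is 1/n for n classes, independent of the class distribution.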

In the later challenges from 2010 to 2012, we basically kept these conditions, with slight modifications: we introduced further performance measures (correlation and the area under the ROC (Receiver Operating Characteristic) curve (AUC)); we employed not only free interaction (as in our Speaker Personality Corpus, see Section 3.2) but also controlled, prompted data (as in our likability and pathology corpora, see Sections 3.3 (Speaker Likability Database, SLD) and 3.4 (NKI CCRT Speech Corpus, NCSC)); and we implemented larger feature vectors, see Section 3.1.
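
As a minimal sketch of how these additional measures can be obtained, assuming scikit-learn and SciPy as stand-ins (the challenge prescribed the measures, not a particular toolkit; all numbers below are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import roc_auc_score

# Hypothetical binary references and classifier scores for ten test instances.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.5])

# AUC: the probability that a randomly drawn positive instance is scored
# higher than a randomly drawn negative one (area under the ROC curve).
auc = roc_auc_score(y_true, scores)

# Correlation between continuous predictions and (hypothetical) mean ratings;
# since perceptual ratings are ordinal, the rank-based Spearman coefficient
# is shown alongside Pearson's r.
mean_ratings = np.array([2.0, 3.1, 3.0, 5.5, 2.4, 5.0, 2.9, 6.2, 4.8, 3.9])
r, _ = pearsonr(scores, mean_ratings)
rho, _ = spearmanr(scores, mean_ratings)

print(f"AUC = {auc:.2f}, Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```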

The three speaker traits dealt with in the challenge are described in Section 2. Previous studies on these traits are summarised below in order to motivate research on their automatic recognition as well as to demonstrate its feasibility. In Section 3, after briefly presenting the challenge and the unified machine learning framework (feature vectors and learning algorithms employed for computing the baseline results), we introduce the three challenge corpora, together with baseline results. Section 4 presents the contributions to each of the three sub-challenges, and the winners – in contrast to the general literature review, this section serves to review state-of-the-art methods in a comparable setting and to provide a form of quantitative meta-analysis. Section 5 aims at summarising what we have learnt from the challenge.


Three speaker traits

In this section, we want to give a short account of the state-of-the-art in research on perceived speaker traits within computational paralinguistics. The recognition of perceived speaker traits is exemplified by personality, likability, and pathology. These three traits have been chosen based on the quantity of available labelled data (a crucial prerequisite for meaningful machine learning experiments) and the existence of feasibility studies on automatic classification.

The first challenge on perceived speaker traits: personality, likability, pathology

Whereas the first open comparative challenges in the field of paralinguistics targeted more ‘conventional’ phenomena such as emotion, age, and gender, there still exists a multiplicity of not yet covered but highly relevant speaker states and traits. In the previous 2011 challenge, we focused on medium-term speaker states, namely sleepiness and intoxication. Consequently, we now wanted to focus on long(er)-term speaker traits. In that regard, the INTERSPEECH 2012 Speaker Trait Challenge

Challenge results

One of the requirements for participation in the challenge was the submission and acceptance of a paper to the INTERSPEECH 2012 Speaker Trait Challenge, which was organised as a special event at the INTERSPEECH conference. Overall, 52 research groups registered for the challenge, and finally, 18 papers were accepted for presentation in the regular review process of the conference. All participants were encouraged to compete in all three sub-challenges. Table 9 shows how many participants took

Summary: challenge setup and results

In this INTERSPEECH 2012 Speaker Trait Challenge, we focused on perceived speaker traits, i.e., on traits that have to be annotated by humans. The recording settings were realistic with respect to specific applications: radio broadcast in the case of personality, mobile and landline phone in the case of likability, and office environment in the case of pathological speech. The types of data were spontaneous, prompted, and read speech. Annotation was carried out using rating scales.

To keep the conditions

Acknowledgement

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreements No. 289021 (ASC-Inclusion) and No. 338164 (ERC Starting Grant iHEARu), and from the German Research Foundation (DFG grant WE 5050/1-1). The authors would further like to thank the sponsors of the challenge, the HUMAINE Association and Telekom Innovation Laboratories, and Catherine Middag for adding phoneme alignments for the Pathology


References (122)

  • B. Rammstedt et al., Measuring personality in one minute or less: a 10-item short version of the Big Five inventory in English and German, J. Res. Personal. (2007)
  • R. Ranganath et al., Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates, Comput. Speech Lang. (2013)
  • A. Rosenberg et al., Charisma perception from text and speech, Speech Commun. (2009)
  • B. Schuller et al., Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge, Speech Commun. (2011)
  • B. Schuller et al., Paralinguistics in speech and language – state-of-the-art and the challenge, Comput. Speech Lang. (2013)
  • A. Afifi et al., Statistical Analysis. A Computer Oriented Approach (1979)
  • G.K. Anumanchipalli et al., Text-dependent pathological voice detection
  • A.E. Aronson et al., Clinical Voice Disorders (2009)
  • E. Aronson et al., Social Psychology (2009)
  • Y. Attabi et al., Anchor models and WCCN normalization for speaker trait classification
  • K. Audhkhasi et al., Speaker personality classification using systems based on acoustic-lexical cues and an optimal tree-structured Bayesian network
  • J. Biesanz et al., Personality coherence: moderating self-other profile agreement and profile consensus, J. Pers. Soc. Psychol. (2000)
  • W.J. Blot et al., Smoking and drinking in relation to oral and pharyngeal cancer, Cancer Res. (1988)
  • T. Bocklet et al., Voice assessment of speakers with laryngeal cancer by glottal excitation modeling based on a 2-mass model
  • T. Bocklet et al., Detection of persons with Parkinson's disease by acoustic, vocal, and prosodic analysis
  • D.H. Brown et al., Postlaryngectomy voice rehabilitation: state of the art at the millennium, World J. Surg. (2003)
  • R. Brueckner et al., Likability classification – a not so deep neural network approach
  • H. Buisman et al., The log-Gabor method: speech classification using spectrogram image analysis
  • K. Bunton et al., Listener agreement for auditory-perceptual ratings of dysarthria, J. Speech Lang. Hear. Res. (2007)
  • F. Burkhardt et al., A database of age and gender annotated telephone speech
  • F. Burkhardt et al., ‘Would you buy a car from me?’ – On the likability of telephone voices
  • N. Campbell et al., Voice quality: the 4th prosodic dimension
  • C. Chastagnol et al., Personality traits detection using a parallelized modified SFFS algorithm
  • R.P. Clapham et al., NKI-CCRT Corpus – speech intelligibility before and after advanced head and neck cancer treated with concomitant chemoradiotherapy
  • S. Cloninger, Conceptual issues in personality theory
  • N. Cummins et al., A comparison of classification paradigms for speaker likeability determination
  • N. Dahlbäck et al., Similarity is more important than expertise: accent effects in speech interfaces
  • P. Ekman et al., Relative importance of face, body, and speech in judgments of personality and affect, J. Pers. Soc. Psychol. (1980)
  • F. Eyben, Real-time speech and music classification by large audio feature space extraction (2014)
  • F. Eyben et al., Recent developments in openSMILE, the Munich open-source multimedia feature extractor
  • F. Eyben et al., openSMILE – the Munich versatile and fast open-source audio feature extractor
  • M.J. Ferguson et al., Likes and dislikes: a social cognitive perspective on attitudes
  • L. Ferrier et al., Dysarthric speakers’ intelligibility and speech characteristics in relation to computer speech recognition, Augment. Altern. Commun. (1995)
  • D. Funder, Personality, Ann. Rev. Psychol. (2001)
  • A. Gravano et al., Acoustic and prosodic correlates of social behavior
  • M. Grimm et al., Evaluation of natural emotions using self assessment manikins
  • T. Haderlein, Automatic Evaluation of Tracheoesophageal Substitute Voices (2007)
  • M. Hall et al., The WEKA data mining software: an update, SIGKDD Explorations Newsletter (2009)
  • A.E. Harrison, Speech Disorders: Causes, Treatment and Social Effects (2010)

    Björn Schuller received his diploma in 1999, his doctoral degree in 2006, and his habilitation in 2012, all in electrical engineering and information technology from TUM in Munich/Germany, where he is tenured, heading the Machine Intelligence & Signal Processing Group. He is further a Senior Lecturer at Imperial College London/U.K. In 2013 he also chaired the Institute for Sensor Systems at the University of Passau/Germany. From 2009 to 2010 he was with the CNRS-LIMSI in Orsay/France and a visiting scientist at Imperial College London. In 2012 he was with Joanneum Research in Graz/Austria, and in 2013 Visiting Professor of the Harbin Institute of Technology in Harbin/P. R. China and of the University of Geneva/Switzerland. Dr. Schuller is president of the Association for the Advancement of Affective Computing (AAAC), elected member of the IEEE SLTC, member of the ACM, IEEE, and ISCA, and has (co-)authored more than 390 peer-reviewed publications leading to more than 6100 citations – his current h-index equals 39.

    Stefan Steidl received his diploma degree in Computer Science in 2002 from Friedrich-Alexander University Erlangen-Nuremberg in Germany (FAU). In 2009, he received his doctoral degree from FAU for his work on Vocal Emotion Recognition. He is currently a member of the research staff of ICSI in Berkeley/USA and the Pattern Recognition Lab of FAU. His primary research interests are the classification of naturally occurring emotion-related states and of atypical speech (children's speech, speech of elderly people, pathological voices). He has (co-)authored more than 40 publications in journals and peer reviewed conference proceedings and been a member of the Network-of-Excellence HUMAINE.

    Anton Batliner received his M.A. degree in Scandinavian Languages and his doctoral degree in phonetics in 1978, both at LMU Munich/Germany. He has been a member of the research staff of the Institute for Pattern Recognition at FAU Erlangen/Germany since 1997. He is co-editor of one book and author/co-author of more than 200 technical articles, with a current h-index of 37 and more than 5000 citations. His research interests are all aspects of prosody and paralinguistics in speech processing. He repeatedly served as Workshop/Session (co)-organiser and has been Associate Editor for the IEEE Transactions on Affective Computing.

    Elmar Nöth is a professor of Applied Computer Science at the University of Erlangen-Nuremberg. He studied in Erlangen and at M.I.T. and received the Dipl.-Inf. (univ.) degree and the Dr.-Ing. degree from the University of Erlangen-Nuremberg in 1985 and 1990, respectively. From 1990 he was an assistant professor at the Institute for Pattern Recognition in Erlangen. Since 2008 he has been a full professor at the same institute and head of the speech group. Since 2013 he has also been an Adjunct Professor at King Abdulaziz University in Saudi Arabia. He is on the editorial board of Speech Communication and the EURASIP Journal on Audio, Speech, and Music Processing. His current interests are prosody, analysis of pathologic speech, computer-aided language learning, and emotion analysis.

    Alessandro Vinciarelli is a Lecturer at the University of Glasgow (UK) and a Senior Researcher at the Idiap Research Institute (Switzerland). His main research interest is Social Signal Processing, the domain aimed at the modelling, analysis, and synthesis of nonverbal behaviour in social interactions. He has published more than 80 works (1700+ citations, h-index 23), organised the IEEE International Conference on Social Computing, and chaired twenty international scientific events. Furthermore, he is or has been Principal Investigator of several national and international projects, including a European Network of Excellence (the SSPNet, www.sspnet.eu). Last, but not least, Alessandro is co-founder of Klewel (www.klewel.com), a knowledge management company recognised with several awards.

    Felix Burkhardt does tutoring, consulting, research, and development in the fields of human-machine dialogue systems, text-to-speech synthesis, speaker classification, ontology-based natural language modelling, and emotional human-machine interfaces. Originally an expert in speech synthesis at the Technical University of Berlin, he wrote his PhD thesis on the simulation of emotional speech by machines, recorded the Berlin acted-emotions database EmoDB, and maintains the open-source emotional speech synthesiser Emofilt. He has been working for Deutsche Telekom AG since 2000, currently for the Telekom Innovation Laboratories in Berlin.

    Rob van Son received a master's degree from Radboud University in Nijmegen and a PhD in Phonetics from the University of Amsterdam. He has worked for the Amsterdam Center for Language and Communication (ACLC, University of Amsterdam) and the NKI-AVL in Amsterdam on a number of projects in the fields of phonetics, psycholinguistics, and speech technology.

    Felix Weninger received his diploma in computer science (Dipl.-Inf. degree) from TUM in 2009. He is currently pursuing his PhD degree as a researcher in the Machine Intelligence & Signal Processing Group at TUM's Institute for Human-Machine Communication. He has (co-)authored more than 60 publications in peer-reviewed books, journals and conference proceedings covering the fields of robust audio analysis, computational paralinguistics and medical informatics. Mr. Weninger serves as a reviewer for the IEEE Transactions on Audio, Speech and Language Processing, IEEE Transactions on Affective Computing and other high-profile journals and international conferences.

    Florian Eyben obtained his diploma in Information Technology from TUM. He is currently pursuing his PhD degree in the Machine Intelligence & Signal Processing Group. His research interests include large scale hierarchical audio feature extraction and evaluation, automatic emotion recognition from the speech signal, recognition of non-linguistic vocalisations, automatic large vocabulary continuous speech recognition, statistical and context-dependent language models, and Music Information Retrieval. He has over 90 publications in peer-reviewed books, journals, and conference proceedings covering many of his areas of research, leading to over 1900 citations and an h-index of 23.

    Tobias Bocklet received his diploma degree in computer science in 2007 and his PhD in 2012 both from the University of Erlangen-Nuremberg. In 2008 he was with the speech group at SRI International working on automatic speaker identification. From 2009 to 2013 he was a member of the research staff of the Institute of Pattern Recognition at the University of Erlangen-Nuremberg and the Department of Phoniatrics and Pedaudiology of the University Clinics Erlangen. In his work he focused on the assessment of speech and language development and pathologies. Tobias is now a researcher at Intel Corporation.

    Gelareh Mohammadi is a postdoctoral researcher at the Idiap Research Institute, Martigny, Switzerland. Her work investigates the effect of nonverbal vocal behaviour on personality perception. She received her BSc in Biomedical Engineering from Amirkabir University of Technology, Iran, in 2003, her MSc in Electrical Engineering from Sharif University of Technology, Iran, in 2006, and her PhD in Electrical Engineering from EPFL in 2013. Her research interests include social signal processing, machine learning, and pattern recognition.

    Benjamin Weiss received his PhD in Linguistics in 2008 from Humboldt University of Berlin, with a dissertation on speech tempo and pronunciation. In the same year, he evaluated embodied conversational agents as a Visiting Fellow at the MARCS Auditory Laboratories, University of Western Sydney. Currently, he is working on the likability of voices and multimodal human-computer interaction at the Telekom Innovation Laboratories of TU Berlin.

    This paper has been recommended for acceptance by L. ten Bosch.
