Speech Communication, Volume 57, February 2014, Pages 63-75

Predicting synthetic voice style from facial expressions. An application for augmented conversations

https://doi.org/10.1016/j.specom.2013.09.003

Highlights

  • A new method for processing emotional content in speech generating devices.

  • Expressive speech synthesis is generated matching the user’s facial expression.

  • Facial expressions are mapped to speech, based on emotional intensity classification.

  • A system is developed to evaluate and personalise the expression classification.

  • Results indicate that the system may improve communication experience for users.

Abstract

The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. For this purpose, a mapping model between facial expressions and speech is needed that is high-level (utterance-based), versatile and personalisable. In the mapping developed in this work, visual and auditory modalities are connected based on the intended emotional salience of a message: the intensity of the facial expressions of the user is mapped to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk that uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation is conducted through an interactive experiment using simulated augmented conversations. The results show that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate and supports the user in feeling more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener.

Introduction

Text-to-speech synthesis systems in devices for people with speech impairment need to function effectively as a human communication tool. Millions of non-speaking individuals rely on speech generating devices (SGDs) to meet their everyday communication needs. SGDs currently face challenges with speed, naturalness and input strategies, and they lack the ability to convey the personality and emotions of the user. These shortcomings may disrupt the natural flow of the conversation or lead to misunderstandings between speaker and listener. One of the most important enhancements required is the integration of expressive synthetic voices that are functional, easy to use, and help the user to express emotion and personality by providing the right tone of voice for the given social situation. Higginbotham (2010) emphasises the need for such synthetic voices, as well as the necessity for input strategies carefully designed to ensure that non-speaking individuals (AAC users) are able to effectively access and control the synthetic voice during social interactions. One of the directions recommended in that paper to address this is to investigate the possibility of coordinating gestures and facial expressions with expressive speech output. Such an integration of modalities has the potential to improve the communication experience for the user of a speech generating device, as well as the conversation partner. This study investigates the possibility of using a functional linking of facial expressions and expressive synthetic voices in order to provide an alternative control strategy over the paralinguistic aspects of the synthetic speech output.

When integrating expressive synthetic speech into an SGD, many aspects of the augmented communication process need to be considered. Any natural conversation involves the use of significant linguistic, social-pragmatic, and discourse skills by both communicators. This can be particularly challenging when using an AAC system, which demands additional skills and knowledge as well as strong psychosocial capabilities, such as motivation and confidence, for competent communication (Light, 2003). Further, it has been shown that many AAC users struggle to effectively coordinate attention between a conversation partner, the AAC system, and the activity at hand (Light and Drager, 2007). Many challenges to both using, and learning how to use, current AAC systems have been documented (Light and Drager, 2007) and it is anticipated that the advent of expressive speech synthesis will pose additional challenges. Specifically, AAC users will soon need to coordinate the creation of linguistic content with the paralinguistic delivery of these messages. If the voice style of the synthetic voice automatically corresponded to the user’s facial gestures, one advantage would be an easier and more efficient delivery of the message, while reducing the risk of being misunderstood.

While introducing the option of expressive speech into a speech generating device adds freedom and flexibility of expression for the user, the user now also has a new cognitive decision to make when composing a message: not only typing what to say, but also choosing how to say it. This kind of dual task and the associated working memory load have recently been identified as core challenges to be considered when designing AAC systems (Beukelman et al., 2012). If the decision as to which expressive voice to use could be made automatically by analysing the emotional state of the user, then the presence of the expressive voices would no longer impose an additional cognitive load on the user.

In the past, there has been considerable interest in investigating the relationship between facial gestures such as eyebrow movements and acoustic prominence (focus) (Cavé et al., 2002, Swerts and Krahmer, 2008, Moubayed et al., 2011). These studies have shown that there is a strong relationship between auditory and visual modalities in perceiving prominence, and that appropriate accompanying facial gestures such as eyebrow movements aid the intelligibility and perceived naturalness of the message. Eyebrow movements are, however, not always reliable cues for prosody and there are differing views on the temporal relationship between these gestures and the place of focus in an utterance (Cvejic et al., 2011). For the particular task of augmented interactions (where one of the conversation partners uses speech synthesis as their communication tool), the mapping model between facial gestures and synthetic voices has to meet a set of practical requirements. One should take into account, for example, that during an augmented communication process there is a temporal shift between the text input and the acoustic realisation of the utterance by the TTS system, which would make facial expression-based control at a lower level, such as syllables or even single words, problematic. The linking between gestures and voice should be at a high level, functional for entire utterances, and not dependent on short temporal relationships. Moreover, the mapping needs to be easily personalisable, to account for individual needs and preferences. Lastly, because the mapping needs to be applicable in an automated system, it is functionally restricted to features that have been shown to produce robust and consistent results in both state-of-the-art gesture recognisers and speech synthesisers.

Previous research has investigated a one-to-one mapping between facial expression and speech style for the application of speech-to-speech translation (Székely et al., 2013). There, a straightforward mapping was used between facial and vocal expression, such as from a happy face to a cheerful voice, or a sad face to a depressed voice. The drawback of this approach is the need for synthetic voices in specific voice styles, and the fact that an erroneous classification of a facial expression is magnified by the choice of a different basic emotion in the speech output. This means that if the user is smiling but the system makes an error and classifies the face as angry, the output will be synthetic speech in a stereotypical angry voice, which is highly undesirable. We believe that this effect can be avoided by stepping away from synthetic voices representing categories of basic emotions, and approaching the visual-auditory emotion mapping space from the viewpoint of emotional intensity instead. In this paper, we propose a high-level mapping between facial expressions and expressive synthetic voice styles based on intensity: how intensely an emotion is displayed in an utterance is determined by the intensity of the detected facial expression. Emotional intensity is a measure that is becoming more widely applied in facial expression recognition (Liao et al., 2012). The use of intensity as a feature of facial expressions is also supported by a study from the field of neuroscience which reports that cognitive processes are responsive to increments of intensity in facial expressions of emotion, but not sensitive to the type of emotion (Sprengelmeyer and Jentsch, 2006).
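
As a minimal illustration of this intensity-based selection (a sketch only: the actual WinkTalk mapping also takes the estimated expression category into account and its thresholds are personalised per user, so the function name and threshold values below are hypothetical), the idea can be expressed as follows:

    # Minimal sketch of intensity-based voice selection, assuming a facial
    # expression recogniser that returns an intensity score in [0, 1].
    # Threshold values are illustrative placeholders only.
    VOICE_STYLES = ("low_intensity", "medium_intensity", "high_intensity")

    def select_voice_style(intensity, low=0.33, high=0.66):
        """Map a facial expression intensity score to one of three voice styles."""
        if intensity < low:
            return VOICE_STYLES[0]
        if intensity < high:
            return VOICE_STYLES[1]
        return VOICE_STYLES[2]

    # A misclassified expression category can at most shift the emotional
    # intensity of the output, rather than selecting a stereotyped voice of a
    # different basic emotion.
    print(select_voice_style(0.8))   # -> high_intensity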

Probably the most fundamental need in a speech generating device is the intelligibility of the synthetic voice. Higher quality synthetic voices reduce the cognitive demands placed on conversation partners who have to listen to the synthetic speech output (Mullennix and Stern, 2010), resulting in a more rewarding communication experience. Besides intelligibility, there are many possible enhancements that can contribute to the success of a communication experience between AAC user and listener. Personalisation of voice is often listed as a highly important feature (Creer et al., 2010). The synthetic voice can influence a listener’s attitude towards the user of the voice (Stern et al., 2002) because people associate synthetic speech with human-like attributes much like they do with natural speech. Personalised voices have the potential to improve the listener’s impression of, and attitudes towards, the user. Customisability of text input strategies is also a relevant feature, AAC being an area where “one size does not fit all”. During an augmented communication process, the temporal dynamics and the collaborative nature of conversation may be disrupted (Mullennix and Stern, 2010). An important goal for a speech generating device is to aim for fluid interaction and to preserve normal conversational rhythm, addressing the social dynamics between the augmented speaker and the listener (Higginbotham, 2010). Higginbotham (2010) also highlights the need for the ability to express emotions with voice, by means of alternative solutions to control prosodic variations of utterance productions. AAC technology is advancing with innovative technologies involving brain-computer interfaces (BCIs) (Guenther et al., 2009), animated visual avatars (Massaro, 2004), and applications using natural language processing such as word prediction and completion algorithms, speech recognition and linguistic context mining (Higginbotham et al., 2002).

This paper describes the development and characteristics of three expressive synthetic voices, and the design and implementation of personalisable mapping rules between these voices and the output of a facial expression recogniser. To help study the practical impact of a high-level linking of visual and auditory cues based on the emotive-expressive intent of a message, the research prototype of a multimodal speech synthesis platform called WinkTalk has been developed. The WinkTalk system connects the results of a facial expression recogniser to three synthetic voices representing increasing degrees of emotional intensity. With the help of this system, an interactive evaluation has been conducted to measure the efficiency of the mapping between facial expressions and voice types, as well as to investigate the pragmatic effects of using facial expression control of synthetic speech in simulated augmented dialogue situations.
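
The utterance-level flow of such a platform can be sketched as below. The recogniser and synthesiser interfaces shown are hypothetical stand-ins introduced for illustration only; the actual WinkTalk architecture is described in Section 5.

    # Hypothetical sketch of one utterance-level cycle: the face is analysed
    # once per utterance (not per syllable or word), the estimated intensity
    # selects one of three expressive voices, and the text is synthesised.
    def speak_utterance(text, recogniser, synthesiser, thresholds=(0.33, 0.66)):
        category, intensity = recogniser.analyse_current_frame()  # e.g. ("happy", 0.72)
        # This sketch uses only the intensity; the full mapping also considers the category.
        styles = ("low_intensity", "medium_intensity", "high_intensity")
        level = sum(intensity >= t for t in thresholds)           # 0, 1 or 2
        return synthesiser.synthesise(text, voice=styles[level])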

The goal of this work is two-fold: firstly, to show that intensity is a functional measure for mapping between facial expressions and features of synthetic voices. Secondly, to show that leveraging this type of correspondence of face and voice has the potential to improve communication experiences in situations where speech synthesis is used as a human communication tool.

An SGD that incorporates automatic selection of voice styles based on the facial expression of the user can be used in two ways: firstly, when an AAC user is in a conversation with another person and composes messages one utterance at a time. The device would analyse the facial expression of the user upon completion of the typing of the utterance and output speech that corresponds to the user’s facial gestures. Secondly, it could be applied when delivering a prepared speech composed of several utterances, where the voice style of each utterance would be adapted on-the-fly to match the facial gestures of the user upon delivery of the speech, thereby giving the user of the device the opportunity to interact with the audience, maintain eye contact, and at the same time influence the delivery of the speech by the TTS system. The software component of such a device is implemented in this work. While integrated facial gesture analysis in a speech generating device may not be appropriate for all AAC users, we estimate that a significant number of users could benefit from it, especially when detailed personalisation options are added for flexibility.

Fig. 1 outlines the main components of the augmented communication process proposed in this work.

This paper is organised as follows: Sections 2 (Expressive synthetic voices), 3 (Facial expression analysis) and 4 (Mapping between facial expressions and synthetic voices) present the details of the underlying study to develop a mapping between facial gestures and synthetic voices. Section 5 introduces the system prototype called WinkTalk and Section 6 describes an interactive experiment conducted to evaluate the mapping model and the prototype system.

Section snippets

Synthesising expressive speech – background

There has been significant recent interest in going beyond the neutral speaking style of traditional speech synthesisers and developing systems that are able to synthesise emotional or expressive speech. As Taylor (2009) describes, the history of expressive speech synthesis systems dates back to formant synthesisers (Cahn, 1990, Murray and Arnott, 1993), where different parameters could be varied to synthesise the same sentence with different types of affect. Later systems used data-driven

Facial expression analysis

One of the most well-known systems for analysing facial expressions is the Facial Action Coding System proposed by Ekman and Friesen (1976). FACS measures facial expressions in Action Units (AUs) related to muscle changes. FACS can be used to discriminate positive and negative emotions, and to some extent indicate the intensity of the expressions. While FACS can offer a more detailed analysis, for the purpose of using facial expression recognition in augmented conversations, a method is

Theoretical foundations of the mapping

The mapping values were initialised based on theoretical considerations about the intensity levels of the basic emotions underlying the four facial expression categories. Cowie et al. (2000) proposed the now widely applied two-dimensional, disk-shaped activation-evaluation space to model emotion dimensions. The outer boundary of the disk represents maximally intense emotions, while its centre represents a neutral state. The radial distance of an emotion from the
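
As a sketch of this geometry (an illustration of the general idea only, not the specific initial values used in this work), the intensity of an emotion can be computed as its radial distance from the neutral centre of the disk:

    import math

    # In the disk-shaped activation-evaluation space the neutral state lies at
    # the origin and maximally intense emotions lie on the boundary, so an
    # emotion's intensity can be read off as its radial distance from the centre.
    def emotion_intensity(evaluation, activation):
        """Both coordinates assumed in [-1, 1]; returns an intensity in [0, 1]."""
        return min(1.0, math.hypot(evaluation, activation))

    print(emotion_intensity(0.0, 0.0))   # neutral centre -> 0.0
    print(emotion_intensity(0.6, 0.6))   # active, positive emotion -> ~0.85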

System architecture and working modes

Fig. 4 provides an overview of the main components and the operation modes of the WinkTalk system. The system can operate in one of three modes: automatic voice selection (Mode 1), personalisation of voice selection (Mode 3) and manual voice selection (Mode 2). Mode 1 is the principal mode, which automatically selects a voice from the facial expression of the user. Mode 3 can be performed initially for the system to adapt the mapping rules between facial parameters and voice types to the user and
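
The personalisation mode can be thought of along the lines of the sketch below: the user supplies a few expression samples labelled with the intended voice level, and the intensity thresholds are re-estimated from them. This is an assumed illustration of the general idea only; the actual adaptation procedure of the WinkTalk system is described later in this section.

    from statistics import mean

    # Hypothetical sketch of threshold personalisation (Mode 3): re-estimate the
    # two intensity thresholds from user-labelled samples, placing each threshold
    # midway between the mean intensities of adjacent voice levels.
    def personalise_thresholds(samples):
        """samples: iterable of (intensity, level) pairs, level in {0, 1, 2}."""
        by_level = {0: [], 1: [], 2: []}
        for intensity, level in samples:
            by_level[level].append(intensity)
        low = (mean(by_level[0]) + mean(by_level[1])) / 2
        high = (mean(by_level[1]) + mean(by_level[2])) / 2
        return low, high

    # Example: a user whose calm expressions are still fairly animated.
    print(personalise_thresholds(
        [(0.2, 0), (0.3, 0), (0.6, 1), (0.7, 1), (0.85, 2), (0.9, 2)]))  # ~ (0.45, 0.76)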

Goal of the experiment

The interactive experiment had two general goals: first, to evaluate the intensity-based mapping between users’ facial expressions and the synthetic voices, and second, to test the functionality of the WinkTalk prototype system by simulating dialogues between the user and a conversation partner. More specifically, we were interested in whether the personalisation phase is successful in improving the user experience, and whether the system facilitates a communication experience that is a

Discussion

Two additional aspects that ought to be discussed when evaluating a communication process facilitated by a multimodal speech synthesis system such as WinkTalk, are the dynamics and timing aspects of the assisted conversation and the role of attention in the communication process. With regard to dynamics and timing, one might argue that during a natural conversation, gestures and linguistic message operate in a tightly connected pattern. A speech synthesis system that uses visual and linguistic

Conclusion and future work

This paper presented an intensity based mapping of facial expressions and synthetic voices that is implemented in a multimodal speech synthesis platform, and evaluated with the help of interactive experiments. The aim of the system is to provide an alternative strategy to control the choice of expressive synthetic voices by automatic selection of voice types through expression analysis of the user’s face. The personalised, high-level mapping of facial expressions and synthetic voices resulted

Acknowledgements

This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at University College Dublin (UCD). The opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Science Foundation Ireland. This research is further supported by the Istituto Italiano di Tecnologia and the Università degli Studi di Genova. The authors

References (50)

  • R. Sprengelmeyer et al., Event related potentials and the perception of intensity in facial expressions, Neuropsychologia (2006).
  • M. Swerts et al., Facial expression and prosodic prominence: effects on modality and facial area, Journal of Phonetics (2008).
  • ASHA, Roles and responsibilities of speech-language pathologists with respect to augmentative and alternative communication: position statement, American Speech-Language-Hearing Association (2005).
  • J. Bedrosian, Limitations in the use of nondisabled subjects in AAC research, Augmentative and Alternative Communication (1995).
  • Beukelman, D., Blackstone, S., Caves, K., De Ruyter, F., Fried-Oken, M., Higginbotham, J., Jakobs, T., Light, J.,...
  • Braunschweiler, N., Gales, M.J.F., Buchholz, S., 2010. Lightly supervised recognition for automatic alignment of large...
  • Breuer, S., Bergmann, S., Dragon, R., Möller, S., 2006. Set-up of a unit-selection synthesis with a prominent voice....
  • Bulut, M., Narayanan, S.S., Syrdal, A.K., 2002. Expressive speech synthesis using a concatenative synthesizer. In:...
  • Cabral, J., Renals, S., Richmond, K., Yamagishi, J., 2007. Towards an improved modeling of the glottal source in...
  • J.E. Cahn, Generation of affect in synthesized speech, Journal of the American Voice I/O Society (1990).
  • Cavé, C., Guaïtella, I., Santi, S., 2002. Eyebrow movements and voice variations in dialogue situations: an...
  • D. Childers et al., Modeling the glottal volume-velocity waveform for three voice types, The Journal of the Acoustical Society of America (1995).
  • Cowie, R., Douglas-Cowie, E., Savvidou, S., McMahon, E., Sawey, M., Schroeder, M., 2000. ‘FEELTRACE’: An instrument for...
  • S.M. Creer et al., Building personalised synthetic voices for individuals with dysarthria using the HTS toolkit.
  • H.D. Critchley et al., Neural activity relating to generation and representation of galvanic skin conductance responses: a functional magnetic resonance imaging study, Journal of Neuroscience (2000).
  • Cvejic, I., Kim, J., Davis, C., 2011. Temporal relationship between auditory and visual prosodic cues. In: Proceedings...
  • M.E. Dawson et al. (2000).
  • P. Ekman et al., Measuring facial movement, Journal of Nonverbal Behavior (1976).
  • G. Fant et al., Frequency domain interpretation and derivation of glottal flow parameters, STL-QPSR (1988).
  • J. Fontaine et al., The world of emotions is not two-dimensional, Psychological Science (2007).
  • L.C. Gallo et al., Cardiovascular and electrodermal responses to support and provocation: interpersonal methods in the study of psychophysiological reactivity, Psychophysiology (2000).
  • Gobl, C., 1989. A preliminary study of acoustic voice quality correlates. STL-QPSR, Royal Institute of Technology,...
  • F.H. Guenther et al., A wireless brain-machine interface for real-time speech synthesis, PLoS ONE (2009).
  • Hennig, S., Székely, E., Carson-Berndsen, J., Chellali, R., 2012. Listener evaluation of an expressiveness scale in...
  • D. Higginbotham, Use of nondisabled subjects in AAC research: confessions of a research infidel, Augmentative and Alternative Communication (1995).