Being bored? Recognising natural interest by extensive audiovisual integration for real-life application
Introduction
Information on a user's interest or disinterest holds great potential for general Human–Computer Interaction [1], [2] and for many commercial applications, such as sales and advertisement systems or virtual guides. Similar to the work introduced in [3], we are interested in curiosity detection, e.g. for topic switching, in infotainment systems, or in customer service systems. Beyond that, interest detection in meetings [4], [5], [6] and in (children's) tutoring systems [7] has also been addressed.
Numerous works exist on the recognition of affective or emotional user states, which are strongly related to interest. Many rely solely on acoustic speech parameters [8], [9], [10], followed by fewer works that use vision-based features (e.g. [11], [12]) or linguistic analysis [5], [10]. Considerably fewer deal with fusion of these input cues (e.g. [6], [13]), even though processing such complementary information is known to be generally advantageous with respect to robustness and reliability [14], [15], [16], [17]. So far, this integration of streams has been carried out exclusively for acoustic and vision cues (e.g. [4], [17], [13]), without fully automatic integration of textual analysis of the spoken content, i.e. by using an Automatic Speech Recognition (ASR) system or by considering non-linguistic vocalisations. Linguistic analysis to date has been performed on ground-truth transcripts rather than on actual transcripts from an ASR engine, which is naturally much more challenging due to the errors inherent in the ASR stage. Further, current fusion models are usually rather simple: majority voting or logical operations at a late fusion level [11], [18].
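To make the simple late-fusion schemes referred to above concrete, the following minimal sketch shows majority voting over per-modality classifier decisions; the function and modality names are our own illustrative assumptions and are not taken from the cited works:

from collections import Counter

def late_fusion_majority_vote(decisions):
    """Combine per-modality class labels by simple majority voting.

    `decisions` maps a modality name (e.g. 'audio', 'video', 'text') to the
    label predicted by that modality's classifier; ties fall to the label
    that Counter.most_common() lists first.
    """
    label, _count = Counter(decisions.values()).most_common(1)[0]
    return label

# Hypothetical example: three independent per-modality decisions.
print(late_fusion_majority_vote(
    {"audio": "interested", "video": "bored", "text": "interested"}))
# -> interested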
As shown in many works (e.g. [14], [15], [11], [17], [16]), audiovisual processing is known to be superior to each single modality. We therefore combine features from practically all facial and spoken information available: facial expression analysis based on Active Appearance Models (AAM), eye activity modelling, acoustic and comprehensive linguistic analysis including non-linguistic vocalisations, and additional contextual history information.
In this respect it is received wisdom that fusing all accessible information at an early feature level is highly beneficial, as it preserves the largest possible information basis for the final decision process [19]. The main problem is the asynchrony of the feature streams: frame-by-frame image analysis operates at 25 frames per second, for example, while speech analysis is term-based and linguistic analysis turn-based [20]. However, so far only the fusion of acoustic and linguistic information [21], [10] and of acoustic and vision-based information [16], [17] has each been realised at an early integration level.
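As an illustration of one way such asynchronous streams could be aligned before feature-level concatenation, the following sketch simply repeats turn-level audio and linguistic features at the 25 fps video rate; the feature dimensions and the repetition-based upsampling are assumptions made for illustration only, not the fusion scheme of this article:

import numpy as np

VIDEO_FPS = 25  # rate of the frame-by-frame image analysis

def upsample_to_video_rate(turn_level_features, n_video_frames):
    """Repeat a turn-level feature vector once per video frame of that turn."""
    return np.tile(turn_level_features, (n_video_frames, 1))

def early_fusion(video_features, acoustic_turn_features, linguistic_turn_features):
    """Concatenate per-frame video features with upsampled turn-level features."""
    n_frames = video_features.shape[0]
    acoustic = upsample_to_video_rate(acoustic_turn_features, n_frames)
    linguistic = upsample_to_video_rate(linguistic_turn_features, n_frames)
    return np.hstack([video_features, acoustic, linguistic])

# Hypothetical dimensions: a 2-second turn (50 frames), 30 facial features per
# frame, 20 acoustic and 10 linguistic features per turn.
fused = early_fusion(np.zeros((2 * VIDEO_FPS, 30)), np.zeros(20), np.zeros(10))
print(fused.shape)  # (50, 60)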
Further, practically all results reported are based on databases rather than on experience with a real-life demonstrator. These data-sets usually employ a number of idealisations: a fixed head position, no occlusions, constant lighting conditions, no background noise, given pre-segmentation, partly known subject samples, all modalities showing one clearly assignable affective state at a time, and basic, mostly discrete emotions that are deliberately displayed rather than spontaneous [22]. If one aims at a system capable of fully automatically responding to spontaneous interest, these simplifications clearly need to be overcome, and more realistic, large-scale databases are required. Moreover, recognition rates such as precision, recall, or accuracy can only report the objective performance of affective computing systems, not how such a system would be accepted by users or whether it will be useful in a real-world scenario. A system with close to 100% accuracy under laboratory conditions (e.g. one relying on prototypical emotions, as is often done) will still, in most cases, perform unsatisfactorily in real-world scenarios. Thus, actual use-case studies [13], [23] must be performed to evaluate the performance and acceptance of such systems in addition to objective measures like accuracy.
In contrast to most works in the field of affective computing and interest recognition, we therefore attempt fully automatic, continuous audiovisual interest recognition on spontaneous data recorded in a real-world scenario, including information from extensive audiovisual sources via early fusion. In a real-life user study we evaluate whether and how the system benefits its users.
The article is structured as follows: in Section 2 the featured approach to multimodal interest recognition is described and discussed in detail. Algorithms implemented for each modality are explained individually and are followed by a description of the multimodal fusion approach and an evaluation of recognition performance using individual modalities as well as various combinations of modalities. The setup and the survey results of the real-life application scenario user-study are discussed in Section 3. The article is concluded by a final discussion in Section 4.
Section snippets
Evaluating multimodal interest recognition
The details of the fully automatic approach to human interest detection are presented in this section. After a short description of the recording process and the resulting database of spontaneous interest data in Section 2.1, we describe the features and algorithms relevant for each modality in Section 2.2. The modalities considered are: facial expressions in Section 2.2.1, eye activity in Section 2.2.2, acoustics in Section 2.2.3, linguistics in Section 2.2.4, and contextual history information in Section 2.2.5.
Evaluating real-life usage
Apart from "in the lab" figures as provided in typical works on interest and human affect recognition, it is important for real-life usage to test systems with the user in the loop. We therefore present the results of a study in which the system described so far (Section 2) is tested in real usage.
The primary aim of this survey is to measure whether a recognition system can indeed already improve a virtual agent-based presentation according to a person’s level of interest in a
Concluding remarks
In this work a fully automatic, real-time-capable system for the recognition of human interest based on extensive audiovisual stream integration was presented. The proposed approach integrates numerous streams to enable the most reliable automatic determination of interest-related user states, such as being bored or being curious. For training and testing, the AVIC database was recorded. The database contains data of a real-world scenario in which an experimenter leads a subject through
References (81)
- et al., Automatic recognition and analysis of human faces and facial expressions: a survey, Pattern Recognition (1992)
- et al., Emotional speech recognition: resources, features, and methods, Speech Communication (2006)
- A. Pentland, A. Madan, Perception of social interest, in: Proc. IEEE Int. Conf. on Computer Vision, Workshop on...
- E. Shriberg, Spontaneous speech: How people really talk and why engineers should care, in: Proc. European Conf. on...
- P. Qvarfordt, D. Beymer, S.X. Zhai, Realtourist – a study of augmenting human–human and human–computer dialogue with...
- et al., Modeling focus of attention for meeting indexing based on multiple cues, IEEE Transactions on Neural Networks (2002)
- L. Kennedy, D. Ellis, Pitch-based emphasis detection for characterization of meeting recordings, in: Proc. ASRU, Virgin...
- D. Gatica-Perez, I. McCowan, D. Zhang, S. Bengio, Detecting group interest-level in meetings, in: Proc. IEEE Int. Conf....
- S. Mota, R. Picard, Automated posture analysis for detecting learners' interest level, in: Proc. Workshop on CVPR for...
- B. Schuller, G. Rigoll, M. Lang, Hidden Markov model-based speech emotion recognition, in: Proc. ICASSP 2003, vol. II,...
- Gaze-X: adaptive affective multimodal interface for single-user office scenarios
- Emotion recognition in human–computer interaction, IEEE Signal Processing Magazine
- Multimodal emotion recognition
- Modeling multimodal expression of user's affective subjective experience, User Modeling and User-Adapted Interaction
- E-motional advantage: performance and satisfaction gains with affective computing
- Assessing agreement on classification tasks: the kappa statistic, Computational Linguistics
- On the necessity and feasibility of detecting a driver's emotional state while driving
- Facial expressions
- Foundations of human computing: facial expression and emotion
- Facial expression analysis
- Adaptive active appearance models, IEEE Transactions on Image Processing
1. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 211486 (SEMAINE).