In this work we present an audiovisual approach to the recognition of spontaneous interest in human conversations. For a maximally robust estimate, information from four sources is combined by a synergistic fusion that is tolerant to the failure of individual streams. First, speech is analyzed with respect to its acoustic properties, based on a high-dimensional prosodic, articulatory, and voice-quality feature space, and with respect to its spoken content, obtained by large-vocabulary continuous speech recognition (LVCSR) and modeled as a bag-of-words vector space that includes non-verbal vocalizations. Second, visual analysis provides patterns of facial expression via Active Appearance Models (AAMs) and of movement activity via eye tracking. Experiments are based on a database of 10.5 h of spontaneous human-to-human conversation with 20 subjects, balanced in gender and age class. Recordings were made with a room microphone, a camera, and close-talk headsets to cover diverse comfort and noise conditions. Three levels of interest were annotated within a rich transcription. We describe each information stream and their early-level fusion in detail. Our experiments aim at a person-independent system for real-life use and show the high potential of such a multimodal approach. Benchmark results based on manual transcription versus fully automatic processing are also provided.
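To make the fusion scheme concrete, the sketch below gives one plausible reading of the early, failure-tolerant feature-level fusion described above: each of the four analyzers (acoustic, linguistic, facial AAM, eye-tracking activity) returns a fixed-length feature vector or nothing when its stream fails, a failed stream is zero-filled and flagged by an availability indicator, and the result is a single supervector for the interest classifier. All names, dimensions, and the masking strategy are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of early (feature-level) fusion with per-stream
    # failure tolerance. Each analyzer is assumed to return a fixed-length
    # feature vector, or None when its stream fails (e.g., no face found).
    # Stream names and dimensions are hypothetical.
    import numpy as np

    STREAM_DIMS = {
        "acoustic": 512,    # prosodic / articulatory / voice-quality features
        "linguistic": 300,  # bag-of-words vector incl. non-verbal tokens
        "facial": 64,       # AAM shape/appearance parameters
        "activity": 8,      # eye-tracking movement statistics
    }

    def fuse_early(streams):
        """Concatenate all stream vectors into one supervector.

        A failed stream is replaced by zeros and marked with a 0/1
        availability flag, so the classifier can learn to discount a
        missing modality instead of the whole frame being dropped.
        """
        parts, flags = [], []
        for name, dim in STREAM_DIMS.items():
            vec = streams.get(name)
            if vec is None:                  # individual stream failure
                parts.append(np.zeros(dim))
                flags.append(0.0)
            else:
                parts.append(np.asarray(vec, dtype=float))
                flags.append(1.0)
        return np.concatenate(parts + [np.array(flags)])

    # Example: the facial stream failed for this frame; the other three
    # streams still contribute to the fused vector.
    fused = fuse_early({
        "acoustic": np.random.rand(512),
        "linguistic": np.random.rand(300),
        "facial": None,
        "activity": np.random.rand(8),
    })
    print(fused.shape)   # (512 + 300 + 64 + 8 + 4,) = (888,)

Zero-filling plus an explicit availability flag is only one way to realize the individual failure tolerance named in the abstract; alternatives such as per-stream imputation or late decision-level weighting would serve the same goal.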