In this work we present an audiovisual approach to the recognition of spontaneous interest in human conversations. For a maximally robust estimate, information from four sources is combined by a synergistic fusion that is tolerant to the failure of individual streams. First, speech is analyzed with respect to its acoustic properties, based on a high-dimensional prosodic, articulatory, and voice-quality feature space, and with respect to its spoken content, obtained by large-vocabulary continuous speech recognition (LVCSR) and modeled as a bag-of-words vector space that includes non-verbal vocalizations. Second, visual analysis provides patterns of facial expression via Active Appearance Models (AAMs) and of movement activity via eye tracking. Experiments are based on a database of 10.5 h of spontaneous human-to-human conversation with 20 subjects, balanced in gender and age class. Recordings were made with a room microphone, a camera, and close-talk headsets to cover diverse comfort and noise conditions. Three levels of interest were annotated within a rich transcription. We describe each information stream and their early-level fusion in detail. Our experiments aim at a person-independent system for real-life use and show the high potential of such a multimodal approach. Benchmark results based on manual transcription versus fully automatic processing are also provided.
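To make the fusion scheme concrete, the sketch below gives one plausible reading of the early, failure-tolerant feature-level fusion described above: each of the four analyzers (acoustic, linguistic, facial AAM, eye-tracking activity) returns a fixed-length feature vector or nothing when its stream fails, a failed stream is zero-filled and flagged by an availability indicator, and the result is a single supervector for the interest classifier. All names, dimensions, and the masking strategy are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of early (feature-level) fusion with per-stream
    # failure tolerance. Each analyzer is assumed to return a fixed-length
    # feature vector, or None when its stream fails (e.g., no face found).
    # Stream names and dimensions are hypothetical.
    import numpy as np

    STREAM_DIMS = {
        "acoustic": 512,    # prosodic / articulatory / voice-quality features
        "linguistic": 300,  # bag-of-words vector incl. non-verbal tokens
        "facial": 64,       # AAM shape/appearance parameters
        "activity": 8,      # eye-tracking movement statistics
    }

    def fuse_early(streams):
        """Concatenate all stream vectors into one supervector.

        A failed stream is replaced by zeros and marked with a 0/1
        availability flag, so the classifier can learn to discount a
        missing modality instead of the whole frame being dropped.
        """
        parts, flags = [], []
        for name, dim in STREAM_DIMS.items():
            vec = streams.get(name)
            if vec is None:                  # individual stream failure
                parts.append(np.zeros(dim))
                flags.append(0.0)
            else:
                parts.append(np.asarray(vec, dtype=float))
                flags.append(1.0)
        return np.concatenate(parts + [np.array(flags)])

    # Example: the facial stream failed for this frame; the other three
    # streams still contribute to the fused vector.
    fused = fuse_early({
        "acoustic": np.random.rand(512),
        "linguistic": np.random.rand(300),
        "facial": None,
        "activity": np.random.rand(8),
    })
    print(fused.shape)   # (512 + 300 + 64 + 8 + 4,) = (888,)

Zero-filling plus an explicit availability flag is only one way to realize the individual failure tolerance named in the abstract; alternatives such as per-stream imputation or late decision-level weighting would serve the same goal.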