Fear-type emotion recognition for future audio-based surveillance systems

  • Authors:
  • C. Clavel; I. Vasilescu; L. Devillers; G. Richard; T. Ehrette

  • Affiliations:
  • Thales Research and Technology France, RD 128, 91767 Palaiseau Cedex, France; LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France; LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France; TELECOM ParisTech, 37 rue Dareau, 75014 Paris, France; Thales Research and Technology France, RD 128, 91767 Palaiseau Cedex, France

  • Venue:
  • Speech Communication
  • Year:
  • 2008

Abstract

This paper addresses the issue of automatic emotion recognition in speech. We focus on a type of emotional manifestation that has rarely been studied in speech processing: fear-type emotions occurring during abnormal situations (here, unplanned events in which human life is threatened). This study is dedicated to a new application of emotion recognition: public safety. The starting point of this work is the definition and collection of data illustrating extreme emotional manifestations in threatening situations. For this purpose we developed the SAFE corpus (situation analysis in a fictional and emotional corpus), based on fiction movies. It consists of 7 h of recordings organized into 400 audiovisual sequences. The corpus contains recordings of both normal and abnormal situations and covers a large range of contexts, and therefore a large range of emotional manifestations. In this way, it not only addresses the lack of corpora illustrating strong emotions, but also provides a valuable resource for studying a wide variety of emotional manifestations. We define a task-dependent annotation strategy whose particularity is to describe simultaneously the emotion and the evolution of the situation in context. The emotion recognition system is built on these data and must handle a large number of unknown speakers and situations in noisy sound environments. It performs a fear vs. neutral classification. The novelty of our approach lies in dissociated acoustic models of the voiced and unvoiced content of speech, which are then merged at the decision step of the classification system. The results are quite promising given the complexity and diversity of the data: the error rate is about 30%.
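To make the decision-level fusion described in the abstract concrete, the sketch below shows one possible way to combine separate class models for the voiced and unvoiced feature streams. It is an illustrative assumption, not the authors' actual system: the use of Gaussian mixture models, the log-likelihood-ratio scoring, the fusion weight `w_voiced`, and all function names are hypothetical choices made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def train_stream_models(feats_fear, feats_neutral, n_components=4):
    """Train one GMM per class (fear / neutral) for a single feature stream.

    feats_* are (n_frames, n_features) arrays of acoustic features extracted
    from either the voiced or the unvoiced portions of the training speech.
    """
    gmm_fear = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm_neutral = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm_fear.fit(feats_fear)
    gmm_neutral.fit(feats_neutral)
    return gmm_fear, gmm_neutral


def classify_segment(voiced_feats, unvoiced_feats,
                     voiced_models, unvoiced_models, w_voiced=0.5):
    """Score a test segment with both streams and fuse at the decision level.

    Each stream contributes an average per-frame log-likelihood ratio
    (fear vs. neutral); the two scores are combined with a weighted sum.
    The weight w_voiced is a hypothetical tuning parameter.
    """
    stream_scores = []
    for feats, (gmm_f, gmm_n) in ((voiced_feats, voiced_models),
                                  (unvoiced_feats, unvoiced_models)):
        llr = gmm_f.score(feats) - gmm_n.score(feats)  # score() = mean log-likelihood
        stream_scores.append(llr)
    fused = w_voiced * stream_scores[0] + (1.0 - w_voiced) * stream_scores[1]
    return "fear" if fused > 0.0 else "neutral"
```

The key design point the example tries to capture is that the voiced and unvoiced contents are modeled independently and only their scores are merged, so each stream can use features suited to its own acoustic characteristics.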