A comparative study on automatic audio-visual fusion for aggression detection using meta-information

  • Authors:
  • I. Lefter; L. J. M. Rothkrantz; G. J. Burghouts

  • Affiliations:
  • Section of Interactive Intelligence, Department of Intelligent Systems, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands and Intelligent Imaging Department, TNO, Oude Waalsdorperweg 63, 2597 AK The Hague, The Netherlands; Section of Interactive Intelligence, Department of Intelligent Systems, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands and Sensor Technology, SEWACO Department, Netherlands Defence Academy, The Netherlands; Intelligent Imaging Department, TNO, Oude Waalsdorperweg 63, 2597 AK The Hague, The Netherlands

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2013

Abstract

Multimodal fusion is a complex topic. For surveillance applications, audio-visual fusion is very promising given the complementary nature of the two streams. However, drawing the correct conclusion from multi-sensor data is not straightforward. In previous work we analysed a database of audio-visual recordings of unwanted behavior in trains (Lefter et al., 2012), focusing on a limited subset of the recorded data. We collected multimodal and unimodal assessments by human annotators, who gave aggression scores on a 3-point scale. We showed that no trivial fusion algorithm can predict the multimodal labels from the unimodal labels, since part of the information is lost when using the unimodal streams alone. We proposed an intermediate step to discover the structure of the fusion process. This step is based on meta-features, and we identified a set of five that have an impact on the fusion process. In this paper we extend the findings of Lefter et al. (2012) to the general case, using the entire database. We show that the meta-features have a positive effect on the fusion process in terms of labels. We then compare three fusion methods that encapsulate the meta-features. They are based on automatic prediction of the intermediate-level variables and of multimodal aggression from state-of-the-art low-level acoustic, linguistic and visual features. The first fusion method applies multiple classifiers to predict the intermediate-level features from the low-level features, and then predicts the multimodal label from these intermediate variables. The other two approaches are based on probabilistic graphical models: one uses (Dynamic) Bayesian Networks and the other Conditional Random Fields. We find that each approach has its strengths and weaknesses in predicting specific aggression classes, and that using the meta-features yields significant improvements in all cases.
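
The first fusion method lends itself to a compact illustration. The sketch below shows a generic two-stage scheme of the kind the abstract describes: stage one predicts each intermediate meta-feature from low-level features, stage two predicts the multimodal aggression label from the predicted intermediates. This is not the authors' implementation; the classifier choice (SVMs via scikit-learn), the feature dimensions, and all variable names are illustrative assumptions, with random placeholder data standing in for the real acoustic, linguistic and visual features.

```python
# Illustrative two-stage fusion sketch (not the authors' implementation).
# Stage 1: predict intermediate meta-features from low-level features.
# Stage 2: predict the multimodal aggression label (3-point scale) from
# the predicted meta-features. All data below is random placeholder data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical dimensions: 500 clips, 60 low-level acoustic/linguistic/visual
# features, 5 binary meta-features, aggression labels in {1, 2, 3}.
n_clips, n_low_level, n_meta = 500, 60, 5
X_low = rng.normal(size=(n_clips, n_low_level))
y_meta = rng.integers(0, 2, size=(n_clips, n_meta))  # placeholder meta-labels
y_aggr = rng.integers(1, 4, size=n_clips)            # placeholder aggression labels

# Stage 1: one classifier per meta-feature.
stage1 = [SVC().fit(X_low, y_meta[:, k]) for k in range(n_meta)]

# Intermediate representation: the stage-1 predictions for each clip.
X_meta = np.column_stack([clf.predict(X_low) for clf in stage1])

# Stage 2: predict the multimodal aggression label from the meta-features.
stage2 = SVC().fit(X_meta, y_aggr)

def predict_aggression(x_low: np.ndarray) -> np.ndarray:
    """Run the full two-stage pipeline on new low-level feature vectors."""
    x_meta = np.column_stack([clf.predict(x_low) for clf in stage1])
    return stage2.predict(x_meta)

print(predict_aggression(X_low[:3]))
```

The key design point, as the abstract notes, is that the multimodal label is not predicted directly from the unimodal streams but via the intermediate meta-feature layer; the two graphical-model variants (DBNs and CRFs) encode the same intermediate variables probabilistically instead of through stacked discriminative classifiers.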