Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
CASSANDRA: audio-video sensor fusion for aggression detection
AVSS '07 Proceedings of the 2007 IEEE Conference on Advanced Video and Signal Based Surveillance
openSMILE: the Munich versatile and fast open-source audio feature extractor
Proceedings of the International Conference on Multimedia
Emotion recognition from speech by combining databases and fusion of classifiers
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
A Unified Framework for Biometric Expert Fusion Incorporating Quality Measures
IEEE Transactions on Pattern Analysis and Machine Intelligence
Audio-Visual fusion for detecting violent scenes in videos
SETN'10 Proceedings of the 6th Hellenic conference on Artificial Intelligence: theories, models and applications
Reasoning About Threats: From Observables to Situation Assessment
IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews
Automatic Audio-Visual Fusion for Aggression Detection Using Meta-information
AVSS '12 Proceedings of the 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance
Multimodal fusion is a complex topic. For surveillance applications, audio-visual fusion is particularly promising given the complementary nature of the two streams. However, drawing the correct conclusion from multi-sensor data is not straightforward. In previous work (Lefter et al., 2012) we analysed a database of audio-visual recordings of unwanted behavior in trains, focusing on a limited subset of the recorded data. We collected multimodal and unimodal assessments from human annotators, who rated aggression on a three-point scale. We showed that no trivial fusion algorithm can predict the multimodal labels from the unimodal labels, since part of the information is lost when only the unimodal streams are used. We therefore proposed an intermediate step to uncover the structure of the fusion process. This step is based on meta-features, of which we identified a set of five that influence the fusion process. In this paper we extend the findings of (Lefter et al., 2012) to the general case, using the entire database. We show that the meta-features have a positive effect on the fusion process in terms of labels. We then compare three fusion methods that encapsulate the meta-features; all predict the intermediate-level variables and the multimodal aggression level from state-of-the-art low-level acoustic, linguistic and visual features. The first method applies multiple classifiers: one set predicts the intermediate-level features from the low-level features, and another predicts the multimodal label from the intermediate variables. The other two approaches are based on probabilistic graphical models, one using (Dynamic) Bayesian Networks and the other Conditional Random Fields. We find that each approach has its strengths and weaknesses in predicting specific aggression classes, and that using the meta-features yields significant improvements in all cases.
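The first fusion method described above, stacking classifiers so that intermediate meta-features are predicted from low-level features and the multimodal label is then predicted from those intermediate variables, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the choice of scikit-learn random forests, the feature dimensions, the use of five binary meta-features, and the synthetic data are all assumptions made for the sake of a runnable example.

```python
# Hypothetical sketch of a two-stage (stacked) classifier fusion scheme.
# All data here is synthetic; dimensions and classifier choice are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
X_low = rng.normal(size=(n, 12))         # low-level acoustic/linguistic/visual features
meta = rng.integers(0, 2, size=(n, 5))   # five binary meta-features (assumed encoding)
y = rng.integers(0, 3, size=n)           # multimodal aggression label on a 3-point scale

# Stage 1: one classifier per intermediate (meta) variable,
# each mapping low-level features to a meta-feature value.
stage1 = [RandomForestClassifier(random_state=0).fit(X_low, meta[:, k])
          for k in range(meta.shape[1])]
meta_pred = np.column_stack([clf.predict(X_low) for clf in stage1])

# Stage 2: predict the multimodal aggression label from the
# predicted intermediate variables.
stage2 = RandomForestClassifier(random_state=0).fit(meta_pred, y)
y_pred = stage2.predict(meta_pred)
```

In practice the two stages would be trained and evaluated on disjoint splits to avoid leaking stage-1 training predictions into stage 2; that bookkeeping is omitted here for brevity.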