Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
CASSANDRA: audio-video sensor fusion for aggression detection
AVSS '07 Proceedings of the 2007 IEEE Conference on Advanced Video and Signal Based Surveillance
openSMILE: the Munich versatile and fast open-source audio feature extractor
Proceedings of the International Conference on Multimedia
Emotion recognition from speech by combining databases and fusion of classifiers
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
A Unified Framework for Biometric Expert Fusion Incorporating Quality Measures
IEEE Transactions on Pattern Analysis and Machine Intelligence
Audio-Visual fusion for detecting violent scenes in videos
SETN'10 Proceedings of the 6th Hellenic conference on Artificial Intelligence: theories, models and applications
Reasoning About Threats: From Observables to Situation Assessment
IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews
Automatic Audio-Visual Fusion for Aggression Detection Using Meta-information
AVSS '12 Proceedings of the 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance
Multimodal fusion is a complex topic. For surveillance applications, audio-visual fusion is particularly promising given the complementary nature of the two streams. However, drawing the correct conclusion from multi-sensor data is not straightforward. In previous work (Lefter et al., 2012) we analysed a database of audio-visual recordings of unwanted behavior in trains, focusing on a limited subset of the recorded data. We collected multimodal and unimodal assessments from human annotators, who rated aggression on a three-point scale. We showed that no trivial fusion algorithm can predict the multimodal labels from the unimodal labels, since part of the information is lost when only the unimodal streams are used. We therefore proposed an intermediate step to uncover the structure of the fusion process. This step is based on meta-features, of which we identified a set of five that influence the fusion process. In this paper we extend the findings of (Lefter et al., 2012) to the general case, using the entire database. We show that the meta-features have a positive effect on the fusion process in terms of labels. We then compare three fusion methods that encapsulate the meta-features; all predict the intermediate-level variables and the multimodal aggression level from state-of-the-art low-level acoustic, linguistic and visual features. The first method applies multiple classifiers: one set predicts the intermediate-level features from the low-level features, and another predicts the multimodal label from the intermediate variables. The other two approaches are based on probabilistic graphical models, one using (Dynamic) Bayesian Networks and the other Conditional Random Fields. We find that each approach has its strengths and weaknesses in predicting specific aggression classes, and that using the meta-features yields significant improvements in all cases.
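The first fusion method described above, stacking classifiers so that intermediate meta-features are predicted from low-level features and the multimodal label is then predicted from those intermediate variables, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the choice of scikit-learn random forests, the feature dimensions, the use of five binary meta-features, and the synthetic data are all assumptions made for the sake of a runnable example.

```python
# Hypothetical sketch of a two-stage (stacked) classifier fusion scheme.
# All data here is synthetic; dimensions and classifier choice are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
X_low = rng.normal(size=(n, 12))         # low-level acoustic/linguistic/visual features
meta = rng.integers(0, 2, size=(n, 5))   # five binary meta-features (assumed encoding)
y = rng.integers(0, 3, size=n)           # multimodal aggression label on a 3-point scale

# Stage 1: one classifier per intermediate (meta) variable,
# each mapping low-level features to a meta-feature value.
stage1 = [RandomForestClassifier(random_state=0).fit(X_low, meta[:, k])
          for k in range(meta.shape[1])]
meta_pred = np.column_stack([clf.predict(X_low) for clf in stage1])

# Stage 2: predict the multimodal aggression label from the
# predicted intermediate variables.
stage2 = RandomForestClassifier(random_state=0).fit(meta_pred, y)
y_pred = stage2.predict(meta_pred)
```

In practice the two stages would be trained and evaluated on disjoint splits to avoid leaking stage-1 training predictions into stage 2; that bookkeeping is omitted here for brevity.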