Comparison of prediction-based fusion and feature-level fusion across different learning models
Proceedings of the 20th ACM international conference on Multimedia
One of the most commonly used audiovisual fusion approaches is feature-level fusion, where the audio and visual features are simply concatenated. Although this approach has been used successfully in several applications, it does not take into account interactions between the features, which can be a problem when one or both modalities contain noisy features. In this paper, we investigate whether feature fusion based on explicit modelling of interactions between audio and visual features can outperform a classifier that fuses features by simple concatenation. To this end, we propose a log-linear model, named Bimodal Log-linear regression, which accounts for interactions between the features of the two modalities. The classifiers are evaluated on the task of laughter-vs-speech discrimination, since both laughter and speech are naturally audiovisual events. Our experiments on the MAHNOB laughter database suggest that feature fusion based on explicit modelling of interactions between the audio-visual features leads to an improvement of 3% over the standard feature-concatenation approach when a log-linear model is used as the base classifier. Finally, the most and least influential features can be easily identified by observing their interactions.
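To make the contrast concrete, the sketch below compares the two feature-construction schemes the abstract describes: plain concatenation of audio and visual features versus concatenation augmented with explicit cross-modal interaction terms. The interaction form used here (pairwise products a_i * v_j) is an assumption for illustration, as is the synthetic data; the paper's actual Bimodal Log-linear regression and its features are not reproduced here. Feeding the augmented features to any standard log-linear (logistic) classifier yields one learned weight per cross-modal pair, which is what makes the influential feature interactions easy to inspect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for audio and visual feature matrices
# (hypothetical dimensions, chosen only for illustration).
n, da, dv = 200, 4, 3
A = rng.normal(size=(n, da))  # audio features, one row per sample
V = rng.normal(size=(n, dv))  # visual features, one row per sample

def concat_features(A, V):
    """Standard feature-level fusion: simple concatenation."""
    return np.hstack([A, V])

def bimodal_features(A, V):
    """Concatenation augmented with pairwise audio-visual interaction
    terms: one product a_i * v_j for every cross-modal feature pair
    (an assumed reading of the interaction model, not the paper's
    exact formulation)."""
    inter = np.einsum('ni,nj->nij', A, V).reshape(len(A), -1)
    return np.hstack([A, V, inter])

print(concat_features(A, V).shape)   # (200, 7)   = 4 + 3
print(bimodal_features(A, V).shape)  # (200, 19)  = 4 + 3 + 4*3
```

With da audio and dv visual features the augmented representation adds da*dv interaction columns, so the log-linear classifier's weight vector directly exposes which audio-visual pairs contribute most (largest-magnitude interaction weights) and least.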