Comparison of prediction-based fusion and feature-level fusion across different learning models
Proceedings of the 20th ACM international conference on Multimedia
One of the most commonly used audiovisual fusion approaches is feature-level fusion, where the audio and visual features are simply concatenated. Although this approach has been used successfully in several applications, it does not take into account interactions between the features, which can be a problem when one or both modalities contain noisy features. In this paper, we investigate whether feature fusion based on explicit modelling of interactions between audio and visual features can outperform a classifier that fuses features by simple concatenation. To this end, we propose a log-linear model, named Bimodal Log-linear regression, which accounts for interactions between the features of the two modalities. The classifiers are evaluated on the task of laughter-vs-speech discrimination, since both laughter and speech are naturally audiovisual events. Our experiments on the MAHNOB laughter database suggest that feature fusion based on explicit modelling of interactions between the audio-visual features leads to an improvement of 3% over the standard feature-concatenation approach when a log-linear model is used as the base classifier. Finally, the most and least influential features can be easily identified by observing their interactions.
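To make the contrast concrete, the sketch below compares the two feature-construction schemes the abstract describes: plain concatenation of audio and visual features versus concatenation augmented with explicit cross-modal interaction terms. The interaction form used here (pairwise products a_i * v_j) is an assumption for illustration, as is the synthetic data; the paper's actual Bimodal Log-linear regression and its features are not reproduced here. Feeding the augmented features to any standard log-linear (logistic) classifier yields one learned weight per cross-modal pair, which is what makes the influential feature interactions easy to inspect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for audio and visual feature matrices
# (hypothetical dimensions, chosen only for illustration).
n, da, dv = 200, 4, 3
A = rng.normal(size=(n, da))  # audio features, one row per sample
V = rng.normal(size=(n, dv))  # visual features, one row per sample

def concat_features(A, V):
    """Standard feature-level fusion: simple concatenation."""
    return np.hstack([A, V])

def bimodal_features(A, V):
    """Concatenation augmented with pairwise audio-visual interaction
    terms: one product a_i * v_j for every cross-modal feature pair
    (an assumed reading of the interaction model, not the paper's
    exact formulation)."""
    inter = np.einsum('ni,nj->nij', A, V).reshape(len(A), -1)
    return np.hstack([A, V, inter])

print(concat_features(A, V).shape)   # (200, 7)   = 4 + 3
print(bimodal_features(A, V).shape)  # (200, 19)  = 4 + 3 + 4*3
```

With da audio and dv visual features the augmented representation adds da*dv interaction columns, so the log-linear classifier's weight vector directly exposes which audio-visual pairs contribute most (largest-magnitude interaction weights) and least.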