There is evidence from neuroscience that prediction of spatial and temporal patterns in the brain plays a key role in perception. This has given rise to prediction-based fusion as a method of combining information from the audio and visual modalities. Models are trained on a per-class basis to learn the mapping from one feature space to the other. When presented with unseen data, each model predicts the respective feature set using its learnt mapping, and the prediction errors are combined within each class. The model that best describes the audiovisual relationship, i.e. the one with the lowest combined prediction error, assigns its label to the input data. Previous studies have evaluated this method of combining modalities only with neural networks; this paper extends it to other learning methods, including Long Short-Term Memory recurrent neural networks (LSTMs), Support Vector Machines (SVMs), Relevance Vector Machines (RVMs), and Gaussian Processes (GPs). Our cross-database experiments on nonlinguistic vocalisation recognition show that feature prediction significantly outperforms feature fusion for neural networks, LSTMs, and GPs, while for SVMs and RVMs the results are more ambiguous and neither approach gains a clear advantage over the other.
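The classification scheme described above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes a simple least-squares linear mapping as each per-class predictor (the paper uses learners such as neural networks, LSTMs, SVMs, RVMs, and GPs), synthetic audio and visual feature matrices, and hypothetical class labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_mapping(audio, visual):
    # Per-class model: least-squares W such that audio @ W ~ visual.
    # (A stand-in for the regressors used in the paper.)
    W, *_ = np.linalg.lstsq(audio, visual, rcond=None)
    return W

def predict_class(models, audio, visual):
    # Each class's model predicts the visual features from the audio;
    # the class whose model has the lowest prediction error wins.
    errors = {label: np.linalg.norm(audio @ W - visual)
              for label, W in models.items()}
    return min(errors, key=errors.get)

# Synthetic training data: each (hypothetical) class has its own
# underlying audio-to-visual mapping.
true_maps = {"speech": rng.normal(size=(4, 3)),
             "laughter": rng.normal(size=(4, 3))}
models = {}
for label, M in true_maps.items():
    audio = rng.normal(size=(200, 4))
    visual = audio @ M + 0.01 * rng.normal(size=(200, 3))
    models[label] = fit_mapping(audio, visual)

# Classify an unseen sample generated by the "laughter" mapping.
a = rng.normal(size=(1, 4))
v = a @ true_maps["laughter"]
print(predict_class(models, a, v))  # -> laughter
```

In practice one can also learn the reverse mapping (visual to audio) and combine both prediction errors per class, as the abstract's "combined within each class" suggests.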