Comparison of prediction-based fusion and feature-level fusion across different learning models

  • Authors:
  • Stavros Petridis; Sanjay Bilakhia; Maja Pantic

  • Affiliations:
  • Imperial College London, London, United Kingdom; Imperial College London, London, United Kingdom; Imperial College London & Univ. Twente, London, United Kingdom

  • Venue:
  • Proceedings of the 20th ACM international conference on Multimedia
  • Year:
  • 2012

Abstract

There is evidence in neuroscience indicating that prediction of spatial and temporal patterns in the brain plays a key role in perception. This has given rise to prediction-based fusion as a method of combining information from the audio and visual modalities. Models are trained on a per-class basis to learn the mapping from one feature space to the other. When presented with unseen data, each model predicts the corresponding feature set using its learnt mapping, and the prediction errors are combined within each class. The class whose model best describes the audiovisual relationship, i.e., has the lowest combined prediction error, provides the label for the input data. Previous studies have evaluated this method of combining modalities only with neural networks; this paper extends the evaluation to other learning methods, including Long Short-Term Memory recurrent neural networks (LSTMs), Support Vector Machines (SVMs), Relevance Vector Machines (RVMs), and Gaussian Processes (GPs). Our results from cross-database experiments on nonlinguistic vocalisation recognition show that feature-prediction significantly outperforms feature-fusion for neural networks, LSTMs, and GPs, whereas for SVMs and RVMs the results are more ambiguous, with neither fusion approach gaining a clear advantage over the other.
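
As a rough illustration of the prediction-based fusion scheme described in the abstract, the sketch below trains one pair of cross-modal regressors per class (audio-to-visual and visual-to-audio) and labels a test sample with the class whose regressors yield the lowest combined reconstruction error. This is a minimal sketch under stated assumptions: the class name, the use of scikit-learn Gaussian-process regressors, and the summed squared-error combination are illustrative choices, not the authors' implementation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    class PredictionFusionClassifier:
        """Hypothetical sketch of per-class prediction-based fusion (not the authors' code)."""

        def __init__(self, make_regressor=GaussianProcessRegressor):
            self.make_regressor = make_regressor  # any multi-output regressor factory
            self.models = {}                      # class label -> (audio->visual, visual->audio)

        def fit(self, audio, visual, labels):
            # One pair of cross-modal mappings is learnt per class.
            for c in np.unique(labels):
                idx = labels == c
                a2v = self.make_regressor().fit(audio[idx], visual[idx])
                v2a = self.make_regressor().fit(visual[idx], audio[idx])
                self.models[c] = (a2v, v2a)
            return self

        def predict(self, audio, visual):
            # Each class's models reconstruct the opposite modality; the class with the
            # lowest combined (here: summed squared) prediction error supplies the label.
            classes = list(self.models)
            errors = np.empty((len(audio), len(classes)))
            for j, c in enumerate(classes):
                a2v, v2a = self.models[c]
                err_visual = np.sum((a2v.predict(audio) - visual) ** 2, axis=1)
                err_audio = np.sum((v2a.predict(visual) - audio) ** 2, axis=1)
                errors[:, j] = err_visual + err_audio
            return np.asarray(classes)[np.argmin(errors, axis=1)]

In this sketch, swapping make_regressor for a different learner (e.g. a neural network, LSTM, SVM, RVM, or GP regressor) corresponds to the comparison across learning models that the paper reports.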