There is evidence from neuroscience that prediction of spatial and temporal patterns in the brain plays a key role in perception. This has given rise to prediction-based fusion as a method of combining information from the audio and visual modalities. Models are trained on a per-class basis to learn the mapping from one feature space to the other. When presented with unseen data, each model predicts the respective feature set using its learnt mapping, and the prediction errors are combined within each class. The model that best describes the audiovisual relationship, i.e. the one with the lowest combined prediction error, assigns its label to the input data. Previous studies have evaluated this method of combining modalities only with neural networks; this paper extends it to other learning methods, including Long Short-Term Memory recurrent neural networks (LSTMs), Support Vector Machines (SVMs), Relevance Vector Machines (RVMs), and Gaussian Processes (GPs). Our cross-database experiments on nonlinguistic vocalisation recognition show that feature prediction significantly outperforms feature fusion for neural networks, LSTMs, and GPs, while for SVMs and RVMs the results are more ambiguous and neither approach gains a clear advantage over the other.
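The classification scheme described above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes a simple least-squares linear mapping as each per-class predictor (the paper uses learners such as neural networks, LSTMs, SVMs, RVMs, and GPs), synthetic audio and visual feature matrices, and hypothetical class labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_mapping(audio, visual):
    # Per-class model: least-squares W such that audio @ W ~ visual.
    # (A stand-in for the regressors used in the paper.)
    W, *_ = np.linalg.lstsq(audio, visual, rcond=None)
    return W

def predict_class(models, audio, visual):
    # Each class's model predicts the visual features from the audio;
    # the class whose model has the lowest prediction error wins.
    errors = {label: np.linalg.norm(audio @ W - visual)
              for label, W in models.items()}
    return min(errors, key=errors.get)

# Synthetic training data: each (hypothetical) class has its own
# underlying audio-to-visual mapping.
true_maps = {"speech": rng.normal(size=(4, 3)),
             "laughter": rng.normal(size=(4, 3))}
models = {}
for label, M in true_maps.items():
    audio = rng.normal(size=(200, 4))
    visual = audio @ M + 0.01 * rng.normal(size=(200, 3))
    models[label] = fit_mapping(audio, visual)

# Classify an unseen sample generated by the "laughter" mapping.
a = rng.normal(size=(1, 4))
v = a @ true_maps["laughter"]
print(predict_class(models, a, v))  # -> laughter
```

In practice one can also learn the reverse mapping (visual to audio) and combine both prediction errors per class, as the abstract's "combined within each class" suggests.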