Audio and video feature fusion for activity recognition in unconstrained videos

  • Authors:
  • José Lopes; Sameer Singh

  • Affiliations:
  • Research School of Informatics, Loughborough University, Loughborough, UK (both authors)

  • Venue:
  • IDEAL'06: Proceedings of the 7th International Conference on Intelligent Data Engineering and Automated Learning
  • Year:
  • 2006

Abstract

Combining audio and image processing for understanding video content has several benefits compared to using each modality on its own. For the task of context and activity recognition in video sequences, it is important to exploit both data streams to gather relevant information. In this paper we describe a video context and activity recognition model. Our work extracts a range of audio and visual features, followed by feature reduction and information fusion. We show that combining audio- and video-based decision making improves the quality of context and activity recognition in videos by 4% over audio data alone and 18% over image data alone.
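The decision-level fusion described in the abstract can be sketched as a weighted combination of per-class posteriors from independently trained audio and video classifiers. The class scores, weights, and function names below are illustrative assumptions, not values or an API from the paper:

```python
import numpy as np

# Hypothetical per-class posterior probabilities over four activity classes,
# as produced by two independently trained classifiers (illustrative values
# only; the paper does not publish its exact scores).
audio_probs = np.array([0.60, 0.20, 0.15, 0.05])
video_probs = np.array([0.30, 0.45, 0.15, 0.10])

def late_fusion(p_audio, p_video, w_audio=0.5):
    """Weighted-sum (decision-level) fusion of two posterior vectors."""
    fused = w_audio * p_audio + (1.0 - w_audio) * p_video
    return fused / fused.sum()  # renormalise to a probability vector

# Slightly favour the audio stream, mirroring the paper's finding that
# audio alone outperformed image alone (weight choice is an assumption).
fused = late_fusion(audio_probs, video_probs, w_audio=0.6)
predicted_class = int(np.argmax(fused))
```

With these example scores the fused vector is [0.48, 0.30, 0.15, 0.07], so the fused decision follows the audio classifier's top class; other fusion rules (product, max, or a trained meta-classifier) are common alternatives at this stage.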