Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Comparison of Different Implementations of MFCC. Journal of Computer Science and Technology.
The AMI Meeting Corpus: A Pre-announcement. MLMI'05: Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction.
Multimodal Integration for Meeting Group Action Segmentation and Recognition. MLMI'05: Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction.
Browsing Recorded Meetings with Ferret. MLMI'04: Proceedings of the First International Conference on Machine Learning for Multimodal Interaction.
Using Audio, Visual, and Lexical Features in a Multi-modal Virtual Meeting Director. MLMI'06: Proceedings of the Third International Conference on Machine Learning for Multimodal Interaction.
In this work, semantic features are used to improve the results of camera selection. These semantic features are group action, person action, and person speaking. For this purpose, low-level acoustic and visual features are combined with high-level semantic ones. After feature fusion, segmentation and classification are performed with Hidden Markov Models. The evaluation shows that an absolute improvement of 6.5% can be achieved: the best model using only low-level features achieves a frame error rate of 44.6%, while using acoustic and all semantic features reduces the frame error rate to 38.1%, which is the best result reported on this data set.
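The HMM segmentation-and-classification step can be illustrated with a toy Viterbi decoder that labels each frame of a meeting with a group-action state. Everything below is a hypothetical sketch: the state names, transition and emission probabilities, and the single binary "fused" observation are invented for illustration and are not the paper's actual model or features.

```python
import numpy as np

# Hypothetical meeting group-action states (illustrative only).
states = ["discussion", "presentation", "monologue"]

# Transition matrix A[i, j] = P(next state j | current state i).
# Strong self-transitions make decoded segments contiguous.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Emission matrix B[i, k] = P(observation k | state i) over a toy
# binary fused feature (0 = one speaker active, 1 = many speakers).
B = np.array([[0.2, 0.8],
              [0.9, 0.1],
              [0.7, 0.3]])

pi = np.array([1/3, 1/3, 1/3])  # uniform initial state distribution

def viterbi(obs, A, B, pi):
    """Return the most likely state index sequence for the observations."""
    n_states, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, n_states))          # best log-prob ending in state j
    psi = np.zeros((T, n_states), dtype=int) # backpointers
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

obs = [1, 1, 1, 0, 0, 0, 0, 1, 1]
decoded = viterbi(obs, A, B, pi)
print([states[s] for s in decoded])
```

The self-transition weight plays the role of a segmentation prior: brief observation noise does not split a segment, so the decoder recovers contiguous group-action regions rather than frame-by-frame labels.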