Learning multi-modal dictionaries: application to audiovisual data

  • Authors:
  • Gianluca Monaci, Philippe Jost, Pierre Vandergheynst, Boris Mailhe, Sylvain Lesage, Rémi Gribonval

  • Affiliations:
  • Signal Processing Institute, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland (Monaci, Jost, Vandergheynst); IRISA-INRIA, Rennes, France (Mailhe, Lesage, Gribonval)

  • Venue:
  • MRCS'06: Proceedings of the 2006 International Conference on Multimedia Content Representation, Classification and Security
  • Year:
  • 2006

Abstract

This paper presents a methodology for extracting meaningful synchronous structures from multi-modal signals. Processing multi-modal data jointly can reveal information that is unavailable when the sources are handled separately. In natural high-dimensional data, however, the statistical dependencies between modalities are rarely obvious. Learning fundamental multi-modal patterns offers an alternative to classical statistical methods. Since recurrent patterns are typically shift invariant, the learning procedure should seek the filters that best match the signal at any position. We present a new algorithm that iteratively learns multi-modal generating functions that can be shifted to all positions in the signal. Applied to audiovisual sequences, the algorithm proves able to discover underlying structures in the data.
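
The shift-invariant matching idea can be made concrete with a small sketch. The following is not the authors' algorithm but an illustrative toy, assuming a 1-D audio track, a 1-D video feature (e.g., a tracked pixel intensity over time), and a simple additive joint score; the function and variable names are hypothetical.

```python
import numpy as np

def best_joint_shift(audio, video_feat, atom_a, atom_v):
    """Score a bimodal atom at every temporal shift and return the
    shift where its audio and video parts jointly match best.
    Both parts share a single translation, which is what makes the
    detected audio/video structures synchronous."""
    # Correlate each modality with its atom part; 'valid' mode gives
    # one score per admissible shift.
    s_a = np.correlate(audio, atom_a, mode="valid")
    s_v = np.correlate(video_feat, atom_v, mode="valid")
    n = min(len(s_a), len(s_v))
    # Illustrative joint score: sum of per-modality magnitudes.
    joint = np.abs(s_a[:n]) + np.abs(s_v[:n])
    k = int(np.argmax(joint))
    return k, s_a[k], s_v[k]

# Toy usage: a bimodal atom planted at shift 40 is recovered.
rng = np.random.default_rng(0)
atom_a = np.sin(np.linspace(0, 4 * np.pi, 32))
atom_v = np.exp(-np.linspace(-2, 2, 32) ** 2)
audio = 0.1 * rng.standard_normal(256)
video = 0.1 * rng.standard_normal(256)
audio[40:72] += atom_a
video[40:72] += atom_v
print(best_joint_shift(audio, video, atom_a, atom_v))  # shift ~ 40
```

A full learning loop would alternate such matching (to locate occurrences of each atom across all shifts) with an update of the atom waveforms themselves, in the spirit of the iterative procedure described above.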