Analysis of multimodal sequences using geometric video representations

  • Authors:
  • Gianluca Monaci; Òscar Divorra Escoda; Pierre Vandergheynst

  • Affiliations:
  • École Polytechnique Fédérale de Lausanne (EPFL), Signal Processing Institute, Lausanne, Switzerland

  • Venue:
  • Signal Processing - Special section: Multimodal human-computer interfaces
  • Year:
  • 2006

Abstract

This paper presents a novel method to correlate audio and visual data generated by the same physical phenomenon, based on a sparse geometric representation of video sequences. The video signal is modeled as a sum of geometric primitives evolving through time that jointly describe the geometric and motion content of the scene. The displacement through time of relevant visual features, such as the mouth of a speaker, can thus be compared with the evolution of an audio feature to assess the correspondence between the acoustic and visual signals. Experiments show that the proposed approach makes it possible to localize and track the speaker's mouth when several persons are present in the scene, in the presence of distracting motion, and without prior face or mouth detection.
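
A minimal sketch of the kind of audio-visual scoring the abstract describes, not the authors' actual algorithm: it correlates the frame-to-frame displacement of one tracked geometric primitive with an audio feature sampled at the video frame rate. The function name, the choice of short-time audio energy, and the use of a Pearson-style correlation are illustrative assumptions.

```python
import numpy as np


def audiovisual_correlation(atom_positions, audio_feature):
    """Score how well one visual primitive's motion matches the audio.

    atom_positions : (T, 2) array with the (x, y) centre of a geometric
                     primitive in each video frame (e.g. one tracking a mouth).
    audio_feature  : (T,) array of an audio feature (e.g. short-time energy)
                     resampled to the video frame rate.

    Returns the absolute normalized correlation between the magnitude of the
    primitive's frame-to-frame displacement and the audio feature.
    """
    positions = np.asarray(atom_positions, dtype=float)
    audio = np.asarray(audio_feature, dtype=float)

    # Magnitude of the displacement between consecutive frames: (T-1,)
    displacement = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    audio = audio[1:]  # align lengths with the differenced signal

    # Standardize both signals before correlating them.
    displacement = (displacement - displacement.mean()) / (displacement.std() + 1e-12)
    audio = (audio - audio.mean()) / (audio.std() + 1e-12)

    return float(np.abs(np.mean(displacement * audio)))


# Hypothetical usage: among candidate primitives, keep the one whose motion is
# most correlated with the audio, treating it as the active speaker's mouth.
# best_atom = max(atoms, key=lambda a: audiovisual_correlation(a.positions, energy))
```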