People and animals fuse auditory and visual information to obtain robust perception. A particular benefit of such cross-modal analysis is the ability to localize visual events associated with sound sources. We aim to achieve this using computer vision aided by a single microphone. Past efforts encountered problems stemming from the huge gap between the dimensionality of the data and the number of available samples, leading to solutions with low spatio-temporal resolution. We present a rigorous analysis of the fundamental problems associated with this task. We then present a stable and robust algorithm that overcomes past deficiencies. It captures dynamic audio-visual events with high spatial resolution and derives a unique solution. The algorithm effectively detects pixels that are associated with the sound while filtering out other dynamic pixels. It is based on canonical correlation analysis (CCA), where we remove the inherent ill-posedness by exploiting the typical spatial sparsity of audio-visual events. The algorithm is simple and efficient thanks to its reliance on linear programming, and it is free of user-defined parameters. To quantitatively assess the performance, we devise a localization criterion. The algorithm's capabilities were demonstrated in experiments, where it overcame substantial visual distractions and audio noise.
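To make the CCA building block concrete, below is a minimal NumPy sketch of plain (non-sparse) canonical correlation analysis between an audio feature matrix and a video feature matrix: whiten each block, then take the top singular vectors of the whitened cross-covariance. Note this is only the classical CCA step; the paper's key contribution, the ℓ1 sparsity constraint solved via linear programming, is not included here, and the function name and regularization parameter are illustrative assumptions.

```python
import numpy as np

def cca_first_pair(A, V, reg=1e-6):
    """First canonical pair between audio features A (n x p) and
    video features V (n x q). Returns the projection vectors and
    the leading canonical correlation. `reg` is a small ridge term
    added for numerical stability (an assumption, not from the paper)."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    n = len(A)
    Ca = A.T @ A / n + reg * np.eye(A.shape[1])   # audio covariance
    Cv = V.T @ V / n + reg * np.eye(V.shape[1])   # video covariance
    Cav = A.T @ V / n                             # cross-covariance

    def inv_sqrt(C):
        # symmetric inverse square root via eigendecomposition
        w, U = np.linalg.eigh(C)
        return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

    Wa, Wv = inv_sqrt(Ca), inv_sqrt(Cv)
    # SVD of the whitened cross-covariance yields the canonical pairs
    U, s, Vt = np.linalg.svd(Wa @ Cav @ Wv)
    wa = Wa @ U[:, 0]   # audio projection vector
    wv = Wv @ Vt[0]     # video projection vector
    return wa, wv, s[0]
```

On synthetic data where one video component carries the audio signal, the leading video weight concentrates on that component; the paper's sparsity prior makes this concentration explicit and resolves the ill-posedness when the number of pixels far exceeds the number of frames.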