Strides in computer technology and the search for deeper, more powerful signal-processing techniques have brought multimodal research to the forefront in recent years. Audio-visual speech processing has become an important part of this research because it holds great potential for overcoming certain limitations of traditional audio-only methods: difficulties due to background noise and multiple speakers in an application environment are significantly reduced by the additional information that visual features provide. This paper presents a new audio-visual database, a feature study on moving speakers, and baseline results for the whole speaker group. Although a few databases have been collected in this area, none has emerged as a standard for comparison, and efforts to date have often been limited to cropped video or stationary speakers. This paper introduces a challenging audio-visual database that is flexible and fairly comprehensive, yet easily distributable to researchers on a single DVD. The Clemson University Audio-Visual Experiments (CUAVE) database is a speaker-independent corpus of connected and continuous digit strings totaling over 7,000 utterances. It contains a wide variety of speakers and is designed to meet several goals discussed in this paper, one of which is to allow testing under adverse conditions such as moving talkers and speaker pairs. A feature study of connected digit strings is also discussed, comparing stationary and moving talkers in a speaker-independent grouping. Three methods are used in this comparison to obtain visual features: an image-processing-based contour technique, an image-transform method, and a deformable-template scheme. The paper also presents methods and results aimed at making these techniques more robust to speaker movement. Finally, initial speaker-independent baseline results using all speakers are included, and conclusions as well as suggested areas of research are given.
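Of the three visual front-ends compared, the image-transform method is the most common in this literature: a grayscale mouth region of interest is transformed (typically with a 2-D DCT) and only a handful of low-frequency coefficients are kept as the visual feature vector. As a rough illustration only (not the authors' implementation; the function names, the 16×16 ROI, and the 15-coefficient zig-zag truncation are assumptions), a minimal NumPy sketch:

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II, built from 1-D transform matrices."""
    def dct_matrix(n):
        # C[k, t] = cos(pi/n * (t + 0.5) * k), with orthonormal scaling
        C = np.cos(np.pi / n * (np.arange(n)[None, :] + 0.5)
                   * np.arange(n)[:, None])
        C[0] *= np.sqrt(1.0 / n)
        C[1:] *= np.sqrt(2.0 / n)
        return C
    h, w = block.shape
    return dct_matrix(h) @ block @ dct_matrix(w).T

def mouth_dct_features(roi, num_coeffs=15):
    """Keep the lowest-frequency DCT coefficients of a grayscale mouth
    ROI in zig-zag order -- a common compact visual-feature vector."""
    coeffs = dct2(roi.astype(float))
    h, w = coeffs.shape
    # zig-zag order: sort indices by anti-diagonal, then by row
    idx = sorted(((i, j) for i in range(h) for j in range(w)),
                 key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([coeffs[i, j] for i, j in idx[:num_coeffs]])

# a flat (constant-intensity) ROI carries only DC energy
feats = mouth_dct_features(np.ones((16, 16)), num_coeffs=6)
```

In practice such vectors are computed per video frame, often mean-normalized per utterance, and concatenated with the acoustic features before HMM training; those details are choices of the individual system, not part of this sketch.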