Strides in computer technology and the search for deeper, more powerful signal-processing techniques have brought multimodal research to the forefront in recent years. Audio-visual speech processing has become an important part of this research because it holds great potential for overcoming certain limitations of traditional audio-only methods: difficulties due to background noise and multiple speakers in an application environment are significantly reduced by the additional information that visual features provide. This paper presents a new audio-visual database, a feature study on moving speakers, and baseline results for the whole speaker group. Although a few databases have been collected in this area, none has emerged as a standard for comparison, and efforts to date have often been limited to cropped video or stationary speakers. This paper introduces a challenging audio-visual database that is flexible and fairly comprehensive, yet easily distributable to researchers on a single DVD. The Clemson University Audio-Visual Experiments (CUAVE) database is a speaker-independent corpus of connected and continuous digit strings totaling over 7,000 utterances. It contains a wide variety of speakers and is designed to meet several goals discussed in this paper, one of which is to allow testing under adverse conditions such as moving talkers and speaker pairs. A feature study of connected digit strings is also discussed, comparing stationary and moving talkers in a speaker-independent grouping. Three methods are used in this comparison to obtain visual features: an image-processing-based contour technique, an image-transform method, and a deformable-template scheme. The paper also presents methods and results aimed at making these techniques more robust to speaker movement. Finally, initial speaker-independent baseline results using all speakers are included, and conclusions as well as suggested areas of research are given.
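Of the three visual front-ends compared, the image-transform method is the most common in this literature: a grayscale mouth region of interest is transformed (typically with a 2-D DCT) and only a handful of low-frequency coefficients are kept as the visual feature vector. As a rough illustration only (not the authors' implementation; the function names, the 16×16 ROI, and the 15-coefficient zig-zag truncation are assumptions), a minimal NumPy sketch:

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II, built from 1-D transform matrices."""
    def dct_matrix(n):
        # C[k, t] = cos(pi/n * (t + 0.5) * k), with orthonormal scaling
        C = np.cos(np.pi / n * (np.arange(n)[None, :] + 0.5)
                   * np.arange(n)[:, None])
        C[0] *= np.sqrt(1.0 / n)
        C[1:] *= np.sqrt(2.0 / n)
        return C
    h, w = block.shape
    return dct_matrix(h) @ block @ dct_matrix(w).T

def mouth_dct_features(roi, num_coeffs=15):
    """Keep the lowest-frequency DCT coefficients of a grayscale mouth
    ROI in zig-zag order -- a common compact visual-feature vector."""
    coeffs = dct2(roi.astype(float))
    h, w = coeffs.shape
    # zig-zag order: sort indices by anti-diagonal, then by row
    idx = sorted(((i, j) for i in range(h) for j in range(w)),
                 key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([coeffs[i, j] for i, j in idx[:num_coeffs]])

# a flat (constant-intensity) ROI carries only DC energy
feats = mouth_dct_features(np.ones((16, 16)), num_coeffs=6)
```

In practice such vectors are computed per video frame, often mean-normalized per utterance, and concatenated with the acoustic features before HMM training; those details are choices of the individual system, not part of this sketch.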