A segment-based audio-visual speech recognizer: data collection, development, and initial experiments

  • Authors:
  • Timothy J. Hazen, Kate Saenko, Chia-Hao La, James R. Glass

  • Affiliation:
  • MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA

  • Venue:
  • Proceedings of the 6th International Conference on Multimodal Interfaces
  • Year:
  • 2004


Abstract

This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of four hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our AVSR system, which incorporates a novel audio-visual integration scheme using segment-constrained hidden Markov models (HMMs). Preliminary experiments demonstrate improvements in phonetic recognition performance when visual information is incorporated into the speech recognition process.
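
To make the integration idea concrete, below is a minimal sketch of one common late-fusion scheme in AVSR: combining per-segment log-likelihoods from independent audio and visual models with a fixed stream weight. This is an illustrative assumption, not necessarily the exact segment-constrained HMM integration described in the paper; the function name, stream weight, and toy scores are all hypothetical.

```python
def fused_segment_score(audio_loglik: float, visual_loglik: float,
                        audio_weight: float = 0.7) -> float:
    """Combine audio and visual log-likelihoods for one speech segment
    using a fixed stream weight (a common AVSR late-fusion scheme)."""
    return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik


# Toy usage: rank phone hypotheses for a single segment by fused score.
# The log-likelihood values here are made up for illustration only.
hypotheses = {
    "b": (-12.3, -4.1),   # (audio log-likelihood, visual log-likelihood)
    "p": (-11.9, -7.8),
    "m": (-14.2, -5.0),
}
best_phone = max(hypotheses, key=lambda ph: fused_segment_score(*hypotheses[ph]))
print(best_phone)
```

In schemes like this, the stream weight controls how much the recognizer trusts each modality; it is typically tuned on held-out data, and visual information tends to help most for phones with visually distinctive articulation.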