Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel classifiers to extract different articulatory attributes from the input images and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database under several image noise conditions, and compare it to the standard viseme-based modeling approach.
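The abstract describes the architecture only at a high level. The following Python sketch is one way such a parallel-classifier scheme could be wired together; it assumes logistic-regression attribute classifiers, an invented three-attribute inventory (lip rounding, lip opening, labio-dental contact), a hypothetical viseme-to-attribute table, and a naive product rule for decision fusion. None of these specifics come from the paper itself.

```python
# Minimal sketch (not the authors' implementation): parallel articulatory-feature
# classifiers whose decisions are combined into viseme scores. The attribute names,
# viseme table, and fusion rule are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical articulatory attributes extracted from mouth-region images.
ATTRIBUTES = ["lip_rounding", "lip_opening", "labiodental_contact"]

# Toy training data: 200 frames of 20-dim visual features, one binary label per attribute.
X_train = rng.normal(size=(200, 20))
y_train = {a: rng.integers(0, 2, size=200) for a in ATTRIBUTES}

# One independent (parallel) classifier per articulatory attribute.
classifiers = {a: LogisticRegression(max_iter=1000).fit(X_train, y_train[a])
               for a in ATTRIBUTES}

# Hypothetical viseme definitions in terms of expected attribute values.
VISEMES = {
    "p/b/m": {"lip_rounding": 0, "lip_opening": 0, "labiodental_contact": 0},
    "f/v":   {"lip_rounding": 0, "lip_opening": 0, "labiodental_contact": 1},
    "o/u":   {"lip_rounding": 1, "lip_opening": 1, "labiodental_contact": 0},
}

def viseme_posteriors(x):
    """Combine per-attribute posteriors into viseme scores with a naive product rule."""
    p_attr = {a: clf.predict_proba(x.reshape(1, -1))[0] for a, clf in classifiers.items()}
    scores = {}
    for vis, spec in VISEMES.items():
        score = 1.0
        for a, target in spec.items():
            score *= p_attr[a][target]  # probability the attribute takes its expected value
        scores[vis] = score
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

print(viseme_posteriors(rng.normal(size=20)))
```

The product rule is only one option for combining the parallel decisions; a weighted sum or a second-stage classifier over the concatenated attribute posteriors would fit the same parallel-classifier scheme, and the same fused scores could feed word-level decoding instead of visemes.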