Visual information from a speaker's mouth region is known to improve the robustness of automatic speech recognition, especially in the presence of acoustic noise. To date, the vast majority of work in this field has treated these visual features holistically, which may fail to capture the various changes that occur during articulation (the process of changing the shape of the vocal tract using the articulators, i.e. the lips and jaw). Motivated by work in audio-visual automatic speech recognition (AVASR) using articulatory features (AFs) and in face recognition using patches, we present a proof-of-concept paper that represents the mouth region as an ensemble of image patches. Our experiments show that by treating the mouth region in this manner, we are able to extract more speech information from the visual domain. For the task of visual-only speaker-independent isolated digit recognition, we achieved a relative word error rate improvement of more than 23% on the CUAVE audio-visual corpus.
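The core idea of representing the mouth region as an ensemble of image patches can be sketched as tiling the mouth region of interest (ROI) with fixed-size windows. The following is a minimal illustration, not the paper's actual implementation: the function name, patch size, and step are hypothetical choices, and the ROI here is a placeholder array standing in for a cropped grayscale mouth image.

```python
import numpy as np

def extract_patches(roi, patch_h, patch_w, step):
    """Slide a fixed-size window over the mouth ROI and collect each patch."""
    patches = []
    h, w = roi.shape
    for y in range(0, h - patch_h + 1, step):
        for x in range(0, w - patch_w + 1, step):
            patches.append(roi[y:y + patch_h, x:x + patch_w])
    # Stack into an array of shape (num_patches, patch_h, patch_w)
    return np.stack(patches)

# Example: a 32x64 mouth ROI tiled into non-overlapping 16x16 patches
roi = np.zeros((32, 64))
patches = extract_patches(roi, patch_h=16, patch_w=16, step=16)
print(patches.shape)  # (8, 16, 16): a 2x4 grid of patches
```

Choosing a step smaller than the patch size would yield overlapping patches instead; each patch can then be fed to its own feature extractor or classifier before the per-patch outputs are fused, which is the general structure an ensemble-of-patches approach implies.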