Visual information from a speaker's mouth region is known to improve the robustness of automatic speech recognition, especially in the presence of acoustic noise. To date, the vast majority of work in this field has treated these visual features holistically, which may fail to capture the various changes that occur during articulation (the process of changing the shape of the vocal tract using the articulators, i.e. the lips and jaw). Motivated by work in audio-visual automatic speech recognition (AVASR) using articulatory features (AFs) and in face recognition using patches, we present a proof-of-concept paper that represents the mouth region as an ensemble of image patches. Our experiments show that by treating the mouth region in this manner, we are able to extract more speech information from the visual domain. For the task of visual-only speaker-independent isolated digit recognition, we achieved a relative word error rate improvement of more than 23% on the CUAVE audio-visual corpus.
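The core idea of representing the mouth region as an ensemble of image patches can be sketched as tiling the mouth region of interest (ROI) with fixed-size windows. The following is a minimal illustration, not the paper's actual implementation: the function name, patch size, and step are hypothetical choices, and the ROI here is a placeholder array standing in for a cropped grayscale mouth image.

```python
import numpy as np

def extract_patches(roi, patch_h, patch_w, step):
    """Slide a fixed-size window over the mouth ROI and collect each patch."""
    patches = []
    h, w = roi.shape
    for y in range(0, h - patch_h + 1, step):
        for x in range(0, w - patch_w + 1, step):
            patches.append(roi[y:y + patch_h, x:x + patch_w])
    # Stack into an array of shape (num_patches, patch_h, patch_w)
    return np.stack(patches)

# Example: a 32x64 mouth ROI tiled into non-overlapping 16x16 patches
roi = np.zeros((32, 64))
patches = extract_patches(roi, patch_h=16, patch_w=16, step=16)
print(patches.shape)  # (8, 16, 16): a 2x4 grid of patches
```

Choosing a step smaller than the patch size would yield overlapping patches instead; each patch can then be fed to its own feature extractor or classifier before the per-patch outputs are fused, which is the general structure an ensemble-of-patches approach implies.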