We present an approach to detecting and recognizing isolated spoken phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting co-articulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of co-articulation in a principled way. We evaluate our visual-only recognition system on a command-utterance task. We show comparative results on lip detection and speech/non-speech classification, as well as recognition performance against several baseline systems.
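To make the loose-synchronization idea concrete, the sketch below implements a toy Viterbi-style decode over two feature streams. It is an illustrative simplification, not the paper's dynamic Bayesian network: all names (`decode_two_streams`, `max_async`, `async_penalty`) are hypothetical, and the per-frame arrays stand in for articulatory feature classifier scores. Each stream advances through its own state sequence, the joint state is constrained so the streams never drift more than `max_async` states apart, and desynchronized frames pay a small penalty, which is one simple way to model varying degrees of co-articulation.

```python
import math

def decode_two_streams(scores_a, scores_b, max_async=1, async_penalty=0.5):
    """Toy multi-stream decode (a sketch, not the paper's DBN).

    scores_a[t][i]: log-score of stream A being in state i at frame t.
    scores_b[t][j]: log-score of stream B being in state j at frame t.
    The joint state (i, j) is constrained to |i - j| <= max_async, and
    each desynchronized frame pays async_penalty per state of lag.
    Returns the best log-score of ending with both streams in their
    final states.
    """
    T, na, nb = len(scores_a), len(scores_a[0]), len(scores_b[0])
    NEG = -math.inf
    dp = [[NEG] * nb for _ in range(na)]
    dp[0][0] = scores_a[0][0] + scores_b[0][0]
    for t in range(1, T):
        new = [[NEG] * nb for _ in range(na)]
        for i in range(na):
            for j in range(nb):
                if abs(i - j) > max_async:
                    continue  # streams may not drift further apart
                # each stream independently stays put or advances one state
                prev = max(dp[pi][pj]
                           for pi in (i, i - 1) if pi >= 0
                           for pj in (j, j - 1) if pj >= 0)
                if prev == NEG:
                    continue
                new[i][j] = (prev + scores_a[t][i] + scores_b[t][j]
                             - async_penalty * abs(i - j))
        dp = new
    return dp[na - 1][nb - 1]
```

With scores that favor one stream transitioning a frame earlier than the other, allowing `max_async=1` recovers a better path than forcing the streams into lockstep (`max_async=0`), which mirrors why a synchronous viseme model underfits semi-independent articulators.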