We present an approach to detecting and recognizing isolated spoken phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting co-articulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of co-articulation in a principled way. We evaluate our visual-only recognition system on a command-utterance task. We show comparative results on lip detection and speech/non-speech classification, as well as recognition performance against several baseline systems.
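To make the loose-synchronization idea concrete, the sketch below implements a toy Viterbi-style decode over two feature streams. It is an illustrative simplification, not the paper's dynamic Bayesian network: all names (`decode_two_streams`, `max_async`, `async_penalty`) are hypothetical, and the per-frame arrays stand in for articulatory feature classifier scores. Each stream advances through its own state sequence, the joint state is constrained so the streams never drift more than `max_async` states apart, and desynchronized frames pay a small penalty, which is one simple way to model varying degrees of co-articulation.

```python
import math

def decode_two_streams(scores_a, scores_b, max_async=1, async_penalty=0.5):
    """Toy multi-stream decode (a sketch, not the paper's DBN).

    scores_a[t][i]: log-score of stream A being in state i at frame t.
    scores_b[t][j]: log-score of stream B being in state j at frame t.
    The joint state (i, j) is constrained to |i - j| <= max_async, and
    each desynchronized frame pays async_penalty per state of lag.
    Returns the best log-score of ending with both streams in their
    final states.
    """
    T, na, nb = len(scores_a), len(scores_a[0]), len(scores_b[0])
    NEG = -math.inf
    dp = [[NEG] * nb for _ in range(na)]
    dp[0][0] = scores_a[0][0] + scores_b[0][0]
    for t in range(1, T):
        new = [[NEG] * nb for _ in range(na)]
        for i in range(na):
            for j in range(nb):
                if abs(i - j) > max_async:
                    continue  # streams may not drift further apart
                # each stream independently stays put or advances one state
                prev = max(dp[pi][pj]
                           for pi in (i, i - 1) if pi >= 0
                           for pj in (j, j - 1) if pj >= 0)
                if prev == NEG:
                    continue
                new[i][j] = (prev + scores_a[t][i] + scores_b[t][j]
                             - async_penalty * abs(i - j))
        dp = new
    return dp[na - 1][nb - 1]
```

With scores that favor one stream transitioning a frame earlier than the other, allowing `max_async=1` recovers a better path than forcing the streams into lockstep (`max_async=0`), which mirrors why a synchronous viseme model underfits semi-independent articulators.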