Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Authors:
Ayaz A. Shaikh;Dinesh K. Kumar;Jayavardhana Gubbi
Affiliations:
School of Electrical and Computer Engineering and Health Innovations Research Institute, RMIT University, Melbourne, Australia 3001;School of Electrical and Computer Engineering and Health Innovations Research Institute, RMIT University, Melbourne, Australia 3001;ISSNIP, Dept of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Australia 3010
Venue:
The Visual Computer: International Journal of Computer Graphics
Year:
2013

Abstract

An appearance-based visual speech recognition technique that uses only video signals is presented. The proposed technique is based on directional motion history images (DMHIs), which are an extension of the popular optical-flow method for object tracking. Zernike moments of each DMHI are computed as features for classification. The technique incorporates automatic temporal segmentation of isolated utterances, which is achieved using pair-wise pixel comparison. A support vector machine is used for classification, and the results are based on a leave-one-out paradigm. Experimental results show that the proposed technique achieves better performance in viseme recognition than other techniques reported in the literature. The benefit of the proposed visual speech recognition method is that it is suitable for real-time applications because of its quick motion tracking and fast classification. It has applications in command and control through lip-movement-to-text conversion, can be used in noisy environments, and can assist speech-impaired persons.
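
The abstract does not give the details of the pair-wise pixel comparison used for temporal segmentation. The following is a minimal sketch of one common form of the idea, not the authors' implementation: consecutive grayscale frames are compared pixel by pixel, a frame pair counts as "moving" when enough pixels change, and each run of moving frames is reported as one utterance. The function names and the parameters `pixel_thresh`, `motion_ratio`, and `min_len` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumption: a frame is a grayscale image stored as rows of intensities (0-255).
Frame = List[List[int]]


def changed_fraction(prev: Frame, curr: Frame, pixel_thresh: int = 20) -> float:
    """Fraction of pixels whose intensity differs by more than pixel_thresh."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_thresh:
                changed += 1
    return changed / total if total else 0.0


def segment_utterances(
    frames: List[Frame],
    pixel_thresh: int = 20,     # hypothetical per-pixel change threshold
    motion_ratio: float = 0.05,  # hypothetical fraction of changed pixels
    min_len: int = 2,            # hypothetical minimum run length in frames
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end inclusive, for each run of motion."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i in range(1, len(frames)):
        # i is the index of the later frame in the compared pair.
        moving = changed_fraction(frames[i - 1], frames[i], pixel_thresh) > motion_ratio
        if moving and start is None:
            start = i
        elif not moving and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the sequence.
    if start is not None and len(frames) - start >= min_len:
        segments.append((start, len(frames) - 1))
    return segments
```

In practice the thresholds would need tuning on the recorded data, and each detected frame range would then be passed on to DMHI computation and feature extraction.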