Voiceless speech recognition using dynamic visual speech features

  • Authors:
  • Wai Chee Yau; Dinesh Kant Kumar; Sridhar Poosapadi Arjunan

  • Affiliations:
  • RMIT University, Melbourne, Victoria, Australia; RMIT University, Melbourne, Victoria, Australia; RMIT University, Melbourne, Victoria, Australia

  • Venue:
  • VisHCI '06 Proceedings of the HCSNet workshop on Use of vision in human-computer interaction - Volume 56
  • Year:
  • 2006

Abstract

This paper describes a voiceless speech recognition technique that uses dynamic visual features to represent the facial movements during phonation. The dynamic features extracted from mouth video are used to classify utterances without any acoustic data. The acoustic signals of consonants are more easily confused than those of vowels, whereas the facial movements involved in pronouncing consonants are more visually discernible. This paper therefore focuses on identifying consonants from visual information. It adopts a visual speech model that categorizes utterances into sequences of the smallest visually distinguishable units, known as visemes. The viseme model used is based on the viseme model of the Moving Picture Experts Group 4 (MPEG-4) standard. The facial movements are segmented from the video data using motion history images (MHI). An MHI is a spatio-temporal template (grayscale image) generated from the video data using an accumulative image-subtraction technique. The proposed approach combines the discrete stationary wavelet transform (SWT) and Zernike moments to extract rotation-invariant features from the MHI. A feed-forward multilayer perceptron (MLP) neural network classifies these features based on the patterns of visible facial movements. Preliminary experimental results indicate that the proposed technique is suitable for recognition of English consonants.
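To make the pipeline in the abstract concrete, the sketch below shows one plausible Python implementation of the two visual-processing stages: building an MHI by accumulative frame differencing, then extracting rotation-invariant features with a single-level stationary wavelet transform followed by Zernike moments. The library choices (NumPy, PyWavelets, mahotas), the difference threshold, the wavelet, and the moment order are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import pywt      # PyWavelets, for the stationary wavelet transform
import mahotas   # provides Zernike moment magnitudes


def motion_history_image(frames, diff_threshold=25):
    """Build a motion history image (MHI) from a grayscale frame sequence.

    Each pixel stores how recently motion was detected there: the most
    recent motion is brightest and older motion decays towards black,
    yielding a single grayscale spatio-temporal template.
    """
    frames = [f.astype(np.float32) for f in frames]
    tau = max(len(frames) - 1, 1)          # decay horizon (illustrative choice)
    mhi = np.zeros_like(frames[0])
    for t in range(1, len(frames)):
        # Binary motion mask from thresholded frame differencing.
        moving = np.abs(frames[t] - frames[t - 1]) >= diff_threshold
        # Refresh moving pixels to the maximum value, decay the rest.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return np.uint8(255.0 * mhi / tau)     # normalise to an 8-bit template


def mhi_features(mhi, wavelet="haar", degree=8):
    """Rotation-invariant descriptor of an MHI: single-level 2-D SWT,
    then Zernike moment magnitudes of the approximation sub-band."""
    h, w = mhi.shape
    mhi = mhi[: h - h % 2, : w - w % 2]    # SWT needs even dimensions
    (approx, _details), = pywt.swt2(mhi.astype(float), wavelet, level=1)
    radius = min(approx.shape) // 2        # illustrative moment radius
    return mahotas.features.zernike_moments(approx, radius, degree=degree)
```

The resulting feature vector could then be fed to a feed-forward MLP classifier (for example, scikit-learn's MLPClassifier) trained on labelled viseme examples, corresponding to the classification stage described in the abstract.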