VALID: a new practical audio-visual database, and comparative results
AVBPA'05 Proceedings of the 5th international conference on Audio- and Video-Based Biometric Person Authentication
In this paper an evaluation of visual speech features is performed specifically for the tasks of speech and speaker recognition. Unlike acoustic speech processing, we demonstrate that in the visual modality the features required for effective speech recognition and for effective speaker recognition are quite different from one another. Area-based features (i.e. raw pixels) rather than contour features (i.e. an atomized parametric representation of the mouth, e.g. the outer and inner labial contours, tongue, teeth, etc.) are investigated due to their robustness and stability. For the task of speech reading we demonstrate empirically that a large proportion of word-unit class distinction stems from the temporal rather than the static nature of the visual speech signal. Conversely, for the task of speaker recognition static representations suffice for effective performance, although modelling the temporal nature of the signal does improve performance. Additionally, we hypothesize that traditional hidden Markov model (HMM) classifiers may, due to their assumptions of intra-state observation independence and stationarity, not be the best paradigm for modelling visual speech for the purposes of speech recognition. Results and discussion are presented on the M2VTS database for the tasks of isolated-digit speech recognition and text-dependent speaker recognition.
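To make the notion of an area-based feature concrete, the following is a minimal sketch, not taken from the paper: a tracked mouth region-of-interest (ROI) is flattened into a vector of grey-level pixels, with a per-frame normalisation and an optional linear projection (e.g. PCA) for dimensionality reduction. The ROI size (32×32), the target dimensionality (10), and the use of a random stand-in projection basis are all illustrative assumptions.

```python
import numpy as np

def area_based_feature(roi, mean=None, basis=None):
    """Flatten a mouth ROI into an area-based (raw-pixel) feature vector.

    roi   : (H, W) grey-level image patch around the mouth
    mean  : optional (H*W,) mean vector for centring (assumed, illustrative)
    basis : optional (H*W, D) projection matrix, e.g. top PCA components
    """
    x = roi.astype(np.float64).ravel()
    # per-frame normalisation as a crude guard against illumination changes
    x = (x - x.mean()) / (x.std() + 1e-8)
    if mean is not None:
        x = x - mean
    if basis is not None:
        x = x @ basis          # project down to D dimensions
    return x

# 32x32 ROI reduced to 10 dimensions with a random stand-in basis
# (a real system would learn the basis from training ROIs).
rng = np.random.default_rng(0)
roi = rng.integers(0, 256, size=(32, 32))
basis = rng.standard_normal((32 * 32, 10))
feat = area_based_feature(roi, basis=basis)
```

Contour features, by contrast, would require an explicit (and often fragile) fit of lip, tongue and teeth boundaries before any such vector could be formed, which is the robustness argument the abstract makes for the area-based representation.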
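The "intra-state observation independence" assumption the abstract questions can be seen directly in the HMM forward recursion: given the hidden state, each frame's emission probability enters the likelihood as an independent factor. Below is a minimal log-space forward algorithm for a toy two-state, two-symbol HMM; the model parameters are arbitrary illustrative values, not from the paper.

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def logsumexp_cols(M):
    # log-sum-exp over axis 0, one result per column
    m = np.max(M, axis=0)
    return m + np.log(np.sum(np.exp(M - m), axis=0))

def forward_log_likelihood(log_pi, log_A, log_B):
    """Forward algorithm in log space.

    log_pi : (S,)   log initial state probabilities
    log_A  : (S, S) log transition probabilities
    log_B  : (T, S) log emission probability of each observed frame per state
    """
    T, S = log_B.shape
    alpha = log_pi + log_B[0]                      # initialise with frame 0
    for t in range(1, T):
        # each frame's emission is an independent factor given the state:
        # alpha_t(j) = B[t, j] * sum_i alpha_{t-1}(i) * A[i, j]
        alpha = log_B[t] + logsumexp_cols(alpha[:, None] + log_A)
    return logsumexp(alpha)

# Toy two-state model with a discrete two-symbol emission table (assumed values).
pi = np.log(np.array([0.6, 0.4]))
A = np.log(np.array([[0.7, 0.3],
                     [0.4, 0.6]]))
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])                      # P(symbol | state)
obs = [0, 1, 1, 0]
log_B = np.log(emit[:, obs].T)                     # shape (T, S)

ll = forward_log_likelihood(pi, A, log_B)
```

Because the per-frame emission term `log_B[t]` is simply added at each step, successive observations within a state contribute independently to the score; the abstract's hypothesis is that visual speech, whose discriminative information is strongly temporal, is poorly served by exactly this factorisation.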