The use of visual speech information has been shown to compensate effectively for the performance degradation of acoustic speech recognition in noisy environments. However, most audio-visual speech recognition systems ignore visual noise, even though it can be introduced into visual speech signals during their acquisition or transmission. In this paper, we present a new temporal filtering technique for extracting noise-robust visual features. In the proposed method, a carefully designed band-pass filter is applied to the temporal pixel-value sequences of lip-region images in order to remove unwanted temporal variations caused by visual noise, illumination conditions, or speakers' appearances. We demonstrate that the method improves not only visual speech recognition performance on clean and noisy images but also audio-visual speech recognition performance under both acoustically and visually noisy conditions.
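The core idea, band-pass filtering each pixel's value over time, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the cutoff frequencies, filter order, and frame rate below are assumptions chosen for demonstration, and a standard Butterworth design from SciPy stands in for the paper's "carefully designed" filter.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_temporal_filter(frames, fps=30.0, low_hz=1.0, high_hz=10.0, order=4):
    """Band-pass filter each pixel's temporal trajectory in a lip-region
    image sequence: the low cutoff suppresses slow variations (illumination,
    speaker appearance), the high cutoff suppresses fast fluctuations
    (visual noise).

    frames: array of shape (T, H, W) -- T video frames of the lip region.
    NOTE: fps, cutoffs, and order are illustrative assumptions, not values
    taken from the paper.
    """
    nyq = fps / 2.0
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="band")
    # filtfilt along axis 0 applies zero-phase filtering to every pixel's
    # value sequence independently, so no temporal lag is introduced.
    return filtfilt(b, a, frames, axis=0)

# Toy usage: 100 frames of a 16x16 lip-region crop.
seq = np.random.rand(100, 16, 16)
filtered = bandpass_temporal_filter(seq)
```

Because the pass-band excludes DC, any constant offset per pixel (e.g. a fixed illumination bias) is removed entirely, leaving only mid-rate temporal variation, which is where articulatory lip motion is expected to live.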