Product HMMs for audio-visual continuous speech recognition using facial animation parameters

Authors:
P. S. Aleksic; A. K. Katsaggelos
Affiliations:
Dept. of Electr. & Comput. Eng., Northwestern Univ., Evanston, IL, USA; Dept. of Electr. & Comput. Eng., Northwestern Univ., Evanston, IL, USA
Venue:
ICME '03 Proceedings of the 2003 International Conference on Multimedia and Expo - Volume 1
Year:
2003

Citing 0
Cited 3

Toward multimodal fusion of affective cues
Proceedings of the 1st ACM international workshop on Human-centered multimedia

Local spatiotemporal descriptors for visual recognition of spoken phrases
Proceedings of the international workshop on Human-centered multimedia

Lipreading with local spatiotemporal descriptors
IEEE Transactions on Multimedia

Abstract

Using visual information in addition to acoustic information can improve automatic speech recognition (ASR). In this paper we compare different approaches to audio-visual information integration and show how they affect ASR performance. As visual features we use facial animation parameters (FAPs), which the MPEG-4 standard supports for visual representation. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration, using a product HMM topology to model the phone-synchronous case. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) at various SNRs with additive white Gaussian noise.
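
The abstract names three ideas without spelling them out: a product HMM, multi-stream score combination, and relative WER reduction. The sketch below shows one common textbook form of each, and is not necessarily the authors' formulation. It assumes the two streams evolve independently inside a phone, so a composite state pairs one audio state with one visual state, and it assumes stream log-likelihoods are combined with weights that sum to 1. The function names, transition matrices, and numeric values are all hypothetical.

```python
from itertools import product


def product_transitions(trans_audio, trans_video):
    """Build composite transitions for two independent per-stream HMMs.

    Composite state (i, j) pairs audio state i with visual state j. The
    probability of moving to (k, l) is the product of the two per-stream
    transition probabilities (an independence assumption).
    """
    states = list(product(range(len(trans_audio)), range(len(trans_video))))
    matrix = [
        [trans_audio[i][k] * trans_video[j][l] for (k, l) in states]
        for (i, j) in states
    ]
    return states, matrix


def multistream_log_likelihood(log_b_audio, log_b_video, audio_weight):
    """Weighted sum of per-stream log-likelihoods (weights sum to 1)."""
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_video


def relative_wer_reduction(wer_baseline, wer_system):
    """Fraction of the baseline WER removed by the new system."""
    return (wer_baseline - wer_system) / wer_baseline


# Hypothetical toy left-to-right models with two states per stream.
A_AUDIO = [[0.6, 0.4], [0.0, 1.0]]
A_VIDEO = [[0.7, 0.3], [0.0, 1.0]]

states, matrix = product_transitions(A_AUDIO, A_VIDEO)
```

With two states per stream the composite model has four states, which lets the audio and visual streams advance at different times within a phone while still starting and ending together. In this toy setup, a baseline WER of 50% falling to 40% is a relative reduction of 0.2, the same order as the roughly 20% figure the abstract reports.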