A real-time prototype for small-vocabulary audio-visual ASR

Authors:
J. H. Connell;N. Haas;E. Marcheret;C. Neti;G. Potamianos;S. Velipasalar
Affiliations:
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Venue:
ICME '03 Proceedings of the 2003 International Conference on Multimedia and Expo - Volume 1
Year:
2003

Citing 0
Cited 2

Speaker localisation using audio-visual synchrony: an empirical study
CIVR'03 Proceedings of the 2nd international conference on Image and video retrieval

An embedded audio-visual tracking and speech purification system on a dual-core processor platform
Microprocessors & Microsystems

Abstract

We present a prototype for the automatic recognition of audio-visual speech, developed to augment the IBM ViaVoice™ speech recognition system. Frontal-face, full-frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera, and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On average, the required computations utilize approximately 67% of a Pentium™ 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model-based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front-end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
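
The abstract does not spell out the fusion step, so the following is only a minimal sketch of one common reading of "discriminant feature fusion": time-aligned audio and visual feature vectors are concatenated frame by frame, and the joint vector is projected to a lower dimension with linear discriminant analysis (LDA). The function names (fit_lda, fuse), the feature dimensions (24 audio, 12 visual, 3 fused), the four class labels, and the synthetic data are all assumptions made for illustration; none of them come from the paper.

```python
import numpy as np


def fit_lda(features, labels, out_dim):
    """Estimate an LDA projection matrix of shape (in_dim, out_dim)."""
    dim = features.shape[1]
    mean_all = features.mean(axis=0)
    s_within = np.zeros((dim, dim))
    s_between = np.zeros((dim, dim))
    for c in np.unique(labels):
        x = features[labels == c]
        mean_c = x.mean(axis=0)
        centered = x - mean_c
        s_within += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        s_between += len(x) * (diff @ diff.T)
    # A small ridge term keeps the within-class scatter invertible.
    s_within += 1e-6 * np.eye(dim)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_within, s_between))
    # Keep the directions with the largest between/within scatter ratio.
    order = np.argsort(-eigvals.real)[:out_dim]
    return eigvecs[:, order].real


def fuse(audio, visual, projection):
    """Concatenate time-aligned audio and visual frames, then project."""
    joint = np.concatenate([audio, visual], axis=1)
    return joint @ projection


# Synthetic, illustrative data: 200 frames, 4 hypothetical classes.
rng = np.random.default_rng(0)
n_frames = 200
labels = rng.integers(0, 4, size=n_frames)
audio = rng.normal(size=(n_frames, 24)) + labels[:, None]
visual = rng.normal(size=(n_frames, 12)) + 0.5 * labels[:, None]

projection = fit_lda(np.concatenate([audio, visual], axis=1), labels, out_dim=3)
fused = fuse(audio, visual, projection)
print(fused.shape)  # (200, 3)
```

The sketch assumes the two streams already share a frame rate, which matches the abstract's statement that the audio features are extracted synchronously. In a real-time setting the projection would be estimated offline, leaving only a concatenation and one matrix product per frame at run time.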