Multiple cameras for audio-visual speech recognition in an automotive environment

Authors:
Rajitha Navarathna;David Dean;Sridha Sridharan;Patrick Lucey
Affiliations:
Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia.;Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia.;Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia.;Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia. and Disney Research Pittsburgh, USA.
Venue:
Computer Speech and Language
Year:
2013

Abstract

Audio-visual speech recognition, the combination of visual lip-reading with traditional acoustic speech recognition, has previously been shown to provide a considerable improvement over acoustic-only approaches in noisy environments such as an automotive cabin. The research presented in this paper extends the established audio-visual speech recognition literature to show that further improvements in speech recognition accuracy can be obtained when multiple frontal or near-frontal views of a speaker's face are available. A series of visual speech recognition experiments using a four-stream visual synchronous hidden Markov model (SHMM) is conducted on the four-camera AVICAR automotive audio-visual speech database. We study the relative contributions of the side and centrally oriented cameras to improving visual speech recognition accuracy. Finally, combining the four visual streams with a single audio stream in a five-stream SHMM demonstrates a relative improvement of over 56% in word recognition accuracy compared with the acoustic-only approach in the noisiest conditions of the AVICAR database.
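
The abstract does not give the model equations, so the following is only a minimal sketch of how a multi-stream synchronous HMM is commonly formulated, not the authors' implementation. It assumes that each state scores a frame by summing the per-stream log-likelihoods, each scaled by a stream weight, and that each stream uses a single diagonal-covariance Gaussian. The function names, the weights, and the Gaussian emission model are all illustrative assumptions.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def shmm_state_loglik(observations, state_params, stream_weights):
    """Stream-weighted emission log-likelihood for one state at one frame.

    observations:   one feature vector per stream (e.g. 4 cameras + 1 audio)
    state_params:   one (mean, var) pair per stream for this state (assumed model)
    stream_weights: one non-negative exponent per stream (hypothetical values)
    """
    if not (len(observations) == len(state_params) == len(stream_weights)):
        raise ValueError("every stream needs an observation, parameters and a weight")
    # All streams share the same state at every frame (synchronous model),
    # so the combined score is a weighted sum of per-stream log-likelihoods.
    return sum(
        w * diag_gaussian_logpdf(obs, mean, var)
        for obs, (mean, var), w in zip(observations, state_params, stream_weights)
    )


# Five streams: four visual (one per camera) and one audio, with toy 1-D features.
frame = [[0.0], [0.0], [0.0], [0.0], [0.0]]
params = [([0.0], [1.0])] * 5
weights = [0.2, 0.2, 0.2, 0.2, 0.2]  # hypothetical equal weighting
score = shmm_state_loglik(frame, params, weights)
```

Setting a stream's weight to zero removes its influence entirely, which is one simple way to compare the contribution of individual cameras under this assumed formulation.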