Multiple cameras for audio-visual speech recognition in an automotive environment

  • Authors:
  • Rajitha Navarathna; David Dean; Sridha Sridharan; Patrick Lucey

  • Affiliations:
  • Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia (all authors). Patrick Lucey also with Disney Research Pittsburgh, USA.

  • Venue:
  • Computer Speech and Language
  • Year:
  • 2013


Abstract

Audio-visual speech recognition, the combination of visual lip-reading with traditional acoustic speech recognition, has previously been shown to provide a considerable improvement over acoustic-only approaches in noisy environments such as an automotive cabin. The research presented in this paper extends the established audio-visual speech recognition literature to show that further improvements in speech recognition accuracy can be obtained when multiple frontal or near-frontal views of a speaker's face are available. A series of visual speech recognition experiments using a four-stream visual synchronous hidden Markov model (SHMM) is conducted on the four-camera AVICAR automotive audio-visual speech database. We study the relative contributions of the side-mounted and centrally oriented cameras in improving visual speech recognition accuracy. Finally, combining the four visual streams with a single audio stream in a five-stream SHMM demonstrates a relative improvement of over 56% in word recognition accuracy compared to the acoustic-only approach in the noisiest conditions of the AVICAR database.
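
The abstract's five-stream SHMM follows the standard synchronous multi-stream HMM formulation, in which each state's emission likelihood is a weighted product of per-stream likelihoods (a weighted sum in the log domain). The sketch below illustrates only that fusion rule; the function name, feature values, and stream weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fuse_stream_loglikes(stream_loglikes, stream_weights):
    """Combine per-stream emission log-likelihoods for one HMM state.

    Standard synchronous multi-stream fusion:
        log b_j(o_t) = sum_s lambda_s * log b_{j,s}(o_{t,s})

    stream_loglikes: per-stream log-likelihoods, e.g. 4 visual + 1 audio.
    stream_weights:  stream exponents lambda_s, typically constrained to sum to 1.
    """
    stream_loglikes = np.asarray(stream_loglikes, dtype=float)
    stream_weights = np.asarray(stream_weights, dtype=float)
    return float(np.dot(stream_weights, stream_loglikes))

# Hypothetical example: equal weighting of four camera streams and one audio stream.
visual_and_audio = [-12.3, -11.8, -14.1, -13.0, -9.5]  # 4 visual views + 1 audio
weights = [0.2, 0.2, 0.2, 0.2, 0.2]                     # illustrative values only
print(fuse_stream_loglikes(visual_and_audio, weights))
```

In practice the stream weights would be tuned per noise condition, shifting reliance toward the visual streams as the acoustic signal degrades.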