Robust multi-modal speech recognition in two languages utilizing video and distance information from the Kinect

  • Authors:
  • Georgios Galatas; Gerasimos Potamianos; Fillia Makedon

  • Affiliations:
  • Georgios Galatas: Heracleia Human Centered Computing Lab, Computer Science and Engineering Dept., University of Texas at Arlington; Institute of Informatics and Telecommunications, NCSR
  • Gerasimos Potamianos: Dept. of Computer and Communication Engineering, University of Thessaly, Volos, Greece; Institute of Informatics and Telecommunications, NCSR
  • Fillia Makedon: Heracleia Human Centered Computing Lab, Computer Science and Engineering Dept., University of Texas at Arlington

  • Venue:
  • HCI'13: Proceedings of the 15th International Conference on Human-Computer Interaction: Interaction Modalities and Techniques - Volume Part IV
  • Year:
  • 2013

Abstract

We investigate the performance of our audio-visual speech recognition system in both English and Greek under the influence of audio noise. We present the architecture of our recently built system that utilizes information from three streams including 3-D distance measurements. The feature extraction approach used is based on the discrete cosine transform and linear discriminant analysis. Data fusion is employed using state-synchronous hidden Markov models. Our experiments were conducted on our recently collected database under a multi-speaker configuration and resulted in higher performance and robustness in comparison to an audio-only recognizer.
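The feature-extraction and fusion pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ROI size, number of DCT coefficients, LDA matrix, and stream weights are all hypothetical placeholders. It shows a 2-D DCT of a mouth region-of-interest reduced by an LDA-style linear projection, and a weighted combination of per-stream log-likelihoods in the spirit of state-synchronous multi-stream HMM fusion over the three streams (audio, video, 3-D distance).

```python
# Illustrative sketch of DCT + LDA visual features and weighted
# multi-stream log-likelihood fusion. All dimensions, weights, and the
# LDA matrix below are assumptions for demonstration only.
import numpy as np
from scipy.fft import dctn


def dct_features(roi, n_coeffs=100):
    """Keep the low-frequency 2-D DCT coefficients of a grayscale
    mouth ROI (square top-left block as a simple approximation of
    zig-zag coefficient selection)."""
    coeffs = dctn(roi, norm="ortho")
    k = int(np.sqrt(n_coeffs))
    return coeffs[:k, :k].flatten()


def lda_project(features, W):
    """Reduce the DCT features with a (pre-trained, here random
    stand-in) LDA projection matrix W."""
    return features @ W


def fused_log_likelihood(log_a, log_v, log_d, weights=(0.6, 0.25, 0.15)):
    """State-synchronous fusion: per-state log-likelihoods of the
    audio, video, and distance streams are combined with stream
    exponents (weights), i.e. a weighted product in the linear domain."""
    wa, wv, wd = weights
    return wa * log_a + wv * log_v + wd * log_d


rng = np.random.default_rng(0)
roi = rng.random((64, 64))          # stand-in for a 64x64 mouth ROI frame
feats = dct_features(roi, 100)      # 100-dim DCT feature vector
W = rng.random((100, 30))           # hypothetical LDA transform to 30 dims
visual_obs = lda_project(feats, W)  # 30-dim visual observation
print(visual_obs.shape)             # (30,)

# Hypothetical per-state stream log-likelihoods for one HMM state:
score = fused_log_likelihood(-12.0, -20.0, -25.0)
```

In an actual state-synchronous HMM, the fused score above would be computed per state and per frame, with all streams sharing the same state sequence; the stream weights are typically tuned to reflect the relative reliability of each modality under noise.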