Robust multi-modal speech recognition in two languages utilizing video and distance information from the Kinect

  • Authors:
  • Georgios Galatas; Gerasimos Potamianos; Fillia Makedon

  • Affiliations:
  • Georgios Galatas: Heracleia Human Centered Computing Lab, Computer Science and Engineering Dept., University of Texas at Arlington; Institute of Informatics and Telecommunications, NCSR
  • Gerasimos Potamianos: Dept. of Computer and Communication Engineering, University of Thessaly, Volos, Greece; Institute of Informatics and Telecommunications, NCSR
  • Fillia Makedon: Heracleia Human Centered Computing Lab, Computer Science and Engineering Dept., University of Texas at Arlington

  • Venue:
  • HCI'13: Proceedings of the 15th International Conference on Human-Computer Interaction: Interaction Modalities and Techniques - Volume Part IV
  • Year:
  • 2013

Abstract

We investigate the performance of our audio-visual speech recognition system in both English and Greek under the influence of audio noise. We present the architecture of our recently built system that utilizes information from three streams including 3-D distance measurements. The feature extraction approach used is based on the discrete cosine transform and linear discriminant analysis. Data fusion is employed using state-synchronous hidden Markov models. Our experiments were conducted on our recently collected database under a multi-speaker configuration and resulted in higher performance and robustness in comparison to an audio-only recognizer.
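The feature-extraction and fusion pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ROI size, number of DCT coefficients, LDA matrix, and stream weights are all hypothetical placeholders. It shows a 2-D DCT of a mouth region-of-interest reduced by an LDA-style linear projection, and a weighted combination of per-stream log-likelihoods in the spirit of state-synchronous multi-stream HMM fusion over the three streams (audio, video, 3-D distance).

```python
# Illustrative sketch of DCT + LDA visual features and weighted
# multi-stream log-likelihood fusion. All dimensions, weights, and the
# LDA matrix below are assumptions for demonstration only.
import numpy as np
from scipy.fft import dctn


def dct_features(roi, n_coeffs=100):
    """Keep the low-frequency 2-D DCT coefficients of a grayscale
    mouth ROI (square top-left block as a simple approximation of
    zig-zag coefficient selection)."""
    coeffs = dctn(roi, norm="ortho")
    k = int(np.sqrt(n_coeffs))
    return coeffs[:k, :k].flatten()


def lda_project(features, W):
    """Reduce the DCT features with a (pre-trained, here random
    stand-in) LDA projection matrix W."""
    return features @ W


def fused_log_likelihood(log_a, log_v, log_d, weights=(0.6, 0.25, 0.15)):
    """State-synchronous fusion: per-state log-likelihoods of the
    audio, video, and distance streams are combined with stream
    exponents (weights), i.e. a weighted product in the linear domain."""
    wa, wv, wd = weights
    return wa * log_a + wv * log_v + wd * log_d


rng = np.random.default_rng(0)
roi = rng.random((64, 64))          # stand-in for a 64x64 mouth ROI frame
feats = dct_features(roi, 100)      # 100-dim DCT feature vector
W = rng.random((100, 30))           # hypothetical LDA transform to 30 dims
visual_obs = lda_project(feats, W)  # 30-dim visual observation
print(visual_obs.shape)             # (30,)

# Hypothetical per-state stream log-likelihoods for one HMM state:
score = fused_log_likelihood(-12.0, -20.0, -25.0)
```

In an actual state-synchronous HMM, the fused score above would be computed per state and per frame, with all streams sharing the same state sequence; the stream weights are typically tuned to reflect the relative reliability of each modality under noise.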