Microphone array driven speech recognition: influence of localization on the word error rate

Authors:
Matthias Wölfel; Kai Nickel; John McDonough
Affiliations:
Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany; Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany; Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany
Venue:
MLMI'05: Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction
Year:
2005

Abstract

Interest within the automatic speech recognition (ASR) research community has recently focused on recognizing speech captured with one or more microphones located in the far field, rather than with a headset microphone positioned next to the speaker's mouth. Far-field ASR is a natural application for beamforming techniques using an array of microphones. A prerequisite for applying such techniques, however, is a reliable means of speaker localization. In this work, we compare the accuracy of source localization systems based on audio features only, on video features only, and on a combination of audio and video features, using speech data collected during seminars held by actual speakers. We also investigate the influence of source localization accuracy on the word error rate (WER) of a far-field ASR system, comparing the WERs obtained with position estimates from several automatic source localizers with those obtained from true speaker positions. Our results reveal that accurate speaker localization is crucial for minimizing the error rate of a far-field ASR system.
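
The word error rate used as the evaluation measure above is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition and not the authors' scoring tool; the function name word_error_rate and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption, real scoring tools also normalize case and punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the speaker is here", "the speaker was here"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.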