Speaker localization for microphone array-based ASR: the effects of accuracy on overlapping speech

Authors:
Hari Krishna Maganti; Daniel Gatica-Perez
Affiliations:
IDIAP Research Institute, Martigny, Switzerland and University of Ulm, Ulm, Germany; IDIAP Research Institute, Martigny, Switzerland and Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
Venue:
Proceedings of the 8th international conference on Multimodal interfaces
Year:
2006

Abstract

Accurate speaker location is essential for optimal performance of distant speech acquisition systems using microphone array techniques. However, to the best of our knowledge, no comprehensive studies on the degradation of automatic speech recognition (ASR) as a function of speaker location accuracy in a multi-party scenario exist. In this paper, we describe a framework for evaluation of the effects of speaker location errors on a microphone array-based ASR system, in the context of meetings in multi-sensor rooms comprising multiple cameras and microphones. Speakers are manually annotated in videos in different camera views, and triangulation is used to determine an accurate speaker location. Errors in the speaker location are then induced in a systematic manner to observe their influence on speech recognition performance. The system is evaluated on real overlapping speech data collected with simultaneous speakers in a meeting room. The results are compared with those obtained from close-talking headset microphones, lapel microphones, and speaker location based on audio-only and audio-visual information approaches.
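
To illustrate why location accuracy matters for array processing, the following is a minimal sketch of how an induced location error changes the steering delays a beamformer would apply. It is not the authors' implementation: the abstract does not specify the array geometry, the beamformer, or the error pattern. The 8-microphone circular array of radius 0.1 m, the delay-and-sum style steering delays, the speed of sound, the source position, and the error offsets along the y axis are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def circular_array(num_mics=8, radius=0.1):
    """Hypothetical planar circular array centred at the origin (metres)."""
    return [
        (radius * math.cos(2 * math.pi * k / num_mics),
         radius * math.sin(2 * math.pi * k / num_mics),
         0.0)
        for k in range(num_mics)
    ]


def steering_delays(source, mics):
    """Per-microphone propagation delays (s), relative to the earliest arrival."""
    times = [math.dist(source, m) / SPEED_OF_SOUND for m in mics]
    earliest = min(times)
    return [t - earliest for t in times]


def perturb(location, offset):
    """Shift a 3-D location by a fixed offset vector (the induced error)."""
    return tuple(c + d for c, d in zip(location, offset))


def max_delay_mismatch(true_loc, est_loc, mics):
    """Largest per-microphone difference between the two sets of steering delays."""
    true_delays = steering_delays(true_loc, mics)
    est_delays = steering_delays(est_loc, mics)
    return max(abs(a - b) for a, b in zip(true_delays, est_delays))


mics = circular_array()
true_location = (1.0, 0.5, 0.3)  # assumed reference speaker position (metres)

# Induce errors of increasing size along one axis and report the mismatch.
for err in (0.0, 0.05, 0.10, 0.20, 0.50):
    estimate = perturb(true_location, (0.0, err, 0.0))
    mismatch = max_delay_mismatch(true_location, estimate, mics)
    print(f"error {err:4.2f} m -> max delay mismatch {mismatch * 1e6:6.2f} us")
```

In a full evaluation, the perturbed location would drive the beamformer and the enhanced signal would be passed to the recognizer, so that recognition performance can be reported as a function of the induced error. The sketch stops at the delay mismatch, which is the quantity through which the location error reaches the array output.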