Accurate speaker location is essential for optimal performance of distant speech acquisition systems using microphone array techniques. However, to the best of our knowledge, no comprehensive study exists of how automatic speech recognition (ASR) degrades as a function of speaker location accuracy in a multi-party scenario. In this paper, we describe a framework for evaluating the effects of speaker location errors on a microphone array-based ASR system, in the context of meetings in multi-sensor rooms comprising multiple cameras and microphones. Speakers are manually annotated in videos from different camera views, and triangulation is used to determine an accurate speaker location. Errors in the speaker location are then induced in a systematic manner to observe their influence on speech recognition performance. The system is evaluated on real overlapping speech data collected with simultaneous speakers in a meeting room. The results are compared with those obtained from close-talking headset microphones, from lapel microphones, and from speaker locations estimated with audio-only and audio-visual approaches.
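The two geometric steps named in the abstract, triangulating a speaker location from several camera views and then displacing that location in a controlled way, can be illustrated with a minimal sketch. This is not the procedure used in the paper, which the abstract does not specify. It assumes each annotated image point has already been back-projected to a 3D viewing ray (camera centre plus direction), uses the least-squares "closest point to all rays" form of triangulation, and displaces the result on a horizontal circle as one possible systematic error pattern. The names `triangulate` and `induce_errors`, the camera positions, and the 0.5 m radius are all hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / M[i][i]
    return x


def triangulate(centers, directions):
    """Least-squares 3D point closest to all viewing rays.

    Each ray is given by a camera centre and a direction (assumed inputs).
    Solves sum_i (I - d_i d_i^T) x = sum_i (I - d_i d_i^T) c_i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        d = unit(d)
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += p
                b[i] += p * c[j]
    return solve3(A, b)


def induce_errors(location, radius, n_angles=8):
    """Displace a location on a horizontal circle (hypothetical error pattern)."""
    x, y, z = location
    return [
        [x + radius * math.cos(2 * math.pi * k / n_angles),
         y + radius * math.sin(2 * math.pi * k / n_angles),
         z]
        for k in range(n_angles)
    ]


# Hypothetical setup: two cameras 4 m apart, both looking at one speaker.
centers = [[0.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
speaker = [2.0, 3.0, 1.2]
directions = [[s - c for s, c in zip(speaker, ctr)] for ctr in centers]

estimate = triangulate(centers, directions)
perturbed = induce_errors(estimate, radius=0.5)
print(estimate)
print(len(perturbed), "displaced locations")
```

In an evaluation of this kind, each displaced location would be passed to the microphone array front end in place of the triangulated one, and the resulting recognition performance compared against the undisplaced case. Real annotations are noisy, so the rays would not intersect exactly; the least-squares form returns the point nearest to all of them.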