Speech Enhancement and Recognition in Meetings With an Audio–Visual Sensor Array

Authors:
H. K. Maganti;D. Gatica-Perez;I. McCowan
Affiliations:
Inst. of Neural Inf. Process., Univ. of Ulm, Ulm;-;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2007

Citing 0
Cited 4

Maximum a posteriori multimodal 3D object localization with a depth sensor and stereo microphones
Proceedings of the 2nd International Conference on Immersive Telecommunications

Microphone array beamforming approach to blind speech separation
MLMI'07 Proceedings of the 4th international conference on Machine learning for multimodal interaction

Online blind speech separation using multiple acoustic speaker tracking and time-frequency masking
Computer Speech and Language

Capturing and reproducing spatial audio based on a circular microphone array
Journal of Electrical and Computer Engineering

Abstract

This paper addresses the problem of distant speech acquisition in multiparty meetings using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and comparable to that of a lapel microphone in some of the scenarios. The results also indicate that the audio-visual system performs significantly better than an audio-only system, in terms of both enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array improves recognition performance in a microphone array-based speech recognition system.
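
To illustrate why beamforming depends on knowing the speaker location, the following is a minimal sketch of a delay-and-sum beamformer, the simplest form of spatial filtering. It is not the method of the paper, which uses beamforming followed by a novel postfilter whose details are not given in the abstract. The function names, the 343 m/s speed of sound, the whole-sample delay rounding, and the assumption that a tracker supplies the source position are all illustrative choices.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def steering_delays(mic_positions, source_position, sample_rate):
    """Return each microphone's arrival delay in whole samples,
    relative to the microphone closest to the source.

    The source position is assumed to come from a tracker.
    """
    distances = [math.dist(mic, source_position) for mic in mic_positions]
    nearest = min(distances)
    return [
        round((d - nearest) / SPEED_OF_SOUND * sample_rate)
        for d in distances
    ]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay so the target speech lines up,
    then average the aligned samples across microphones."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

Sound from the steered position adds coherently after alignment, while sound from other directions is misaligned and partially averaged out. An error in the assumed source position yields wrong delays and weakens this effect, which is the dependency on localization accuracy that the abstract points to.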