Multimodal human-computer interaction: A survey
Computer Vision and Image Understanding
Audio-visual active speaker tracking in cluttered indoors environments
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics - Special issue on human computing
Feature-Based Face Tracking for Videoconferencing Applications
ISM '09 Proceedings of the 2009 11th IEEE International Symposium on Multimedia
Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings
IEEE Transactions on Audio, Speech, and Language Processing
Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos
IEEE Transactions on Multimedia
This paper proposes a multimodal approach to distinguish silence from speech and, in the latter case, to identify the location of the active speaker. In our approach, a video camera tracks the faces of the participants, and a microphone array estimates the Sound Source Location (SSL) using the Steered Response Power with the Phase Transform (SRP-PHAT) method. The audiovisual cues are combined, and two competing Hidden Markov Models (HMMs) are used to detect either silence or the presence of a person speaking. If speech is detected, the corresponding HMM also provides the spatio-temporally coherent location of the speaker. Experimental results show that incorporating the HMM improves the results over the unimodal SRP-PHAT, and that the inclusion of video cues provides further improvements.
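The SRP-PHAT localization step mentioned in the abstract can be illustrated with a minimal sketch: for each candidate source location, the phase-transformed (whitened) cross-correlations of all microphone pairs are summed at the lags implied by that location's time differences of arrival, and the candidate with the highest steered response power wins. This is a generic illustration of the standard technique, not the paper's implementation; all function names, array shapes, and constants below are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room temperature

def gcc_phat(x, y, fft_len):
    """GCC-PHAT cross-correlation between two microphone frames."""
    X = np.fft.rfft(x, fft_len)
    Y = np.fft.rfft(y, fft_len)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12  # phase transform: keep phase, drop magnitude
    return np.fft.irfft(cross, fft_len)

def srp_phat(signals, mic_pos, candidates, fs):
    """Steered Response Power over candidate source locations.

    signals:    (n_mics, n_samples) synchronized microphone frames
    mic_pos:    (n_mics, 3) microphone coordinates in metres
    candidates: (n_points, 3) candidate source locations in metres
    Returns the index of the candidate with maximal SRP.
    """
    n_mics, n_samples = signals.shape
    fft_len = 2 * n_samples
    power = np.zeros(len(candidates))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(signals[i], signals[j], fft_len)
            for k, p in enumerate(candidates):
                # expected TDOA (seconds) for this pair if the source were at p
                tdoa = (np.linalg.norm(p - mic_pos[i])
                        - np.linalg.norm(p - mic_pos[j])) / SPEED_OF_SOUND
                lag = int(round(tdoa * fs)) % fft_len  # wrap negative lags
                power[k] += cc[lag]
    return int(np.argmax(power))
```

In the paper's system, the per-frame SRP-PHAT evidence is not used directly as the final answer; it becomes an observation for the competing silence/speech HMMs, which enforce spatio-temporal coherence on the speaker location.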