Multimodal human-computer interaction: A survey
Computer Vision and Image Understanding
Audio-visual active speaker tracking in cluttered indoors environments
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics - Special issue on human computing
Feature-Based Face Tracking for Videoconferencing Applications
ISM '09 Proceedings of the 2009 11th IEEE International Symposium on Multimedia
Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings
IEEE Transactions on Audio, Speech, and Language Processing
Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos
IEEE Transactions on Multimedia
This paper proposes a multimodal approach to distinguish silence from speech and, in the latter case, to identify the location of the active speaker. In our approach, a video camera tracks the faces of the participants, and a microphone array estimates the Sound Source Location (SSL) using the Steered Response Power with the Phase Transform (SRP-PHAT) method. The audiovisual cues are combined, and two competing Hidden Markov Models (HMMs) are used to detect either silence or the presence of a person speaking. If speech is detected, the corresponding HMM also provides the spatio-temporally coherent location of the speaker. Experimental results show that incorporating the HMM improves the results over the unimodal SRP-PHAT, and that the inclusion of video cues provides further improvements.
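The SRP-PHAT localization step mentioned in the abstract can be illustrated with a minimal sketch: for each candidate source location, the phase-transformed (whitened) cross-correlations of all microphone pairs are summed at the lags implied by that location's time differences of arrival, and the candidate with the highest steered response power wins. This is a generic illustration of the standard technique, not the paper's implementation; all function names, array shapes, and constants below are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room temperature

def gcc_phat(x, y, fft_len):
    """GCC-PHAT cross-correlation between two microphone frames."""
    X = np.fft.rfft(x, fft_len)
    Y = np.fft.rfft(y, fft_len)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12  # phase transform: keep phase, drop magnitude
    return np.fft.irfft(cross, fft_len)

def srp_phat(signals, mic_pos, candidates, fs):
    """Steered Response Power over candidate source locations.

    signals:    (n_mics, n_samples) synchronized microphone frames
    mic_pos:    (n_mics, 3) microphone coordinates in metres
    candidates: (n_points, 3) candidate source locations in metres
    Returns the index of the candidate with maximal SRP.
    """
    n_mics, n_samples = signals.shape
    fft_len = 2 * n_samples
    power = np.zeros(len(candidates))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(signals[i], signals[j], fft_len)
            for k, p in enumerate(candidates):
                # expected TDOA (seconds) for this pair if the source were at p
                tdoa = (np.linalg.norm(p - mic_pos[i])
                        - np.linalg.norm(p - mic_pos[j])) / SPEED_OF_SOUND
                lag = int(round(tdoa * fs)) % fft_len  # wrap negative lags
                power[k] += cc[lag]
    return int(np.argmax(power))
```

In the paper's system, the per-frame SRP-PHAT evidence is not used directly as the final answer; it becomes an observation for the competing silence/speech HMMs, which enforce spatio-temporal coherence on the speaker location.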