Audiovisual diarization of people in video content
Multimedia Tools and Applications
The following article presents a novel audio-visual approach for unsupervised speaker localization in both time and space and systematically analyzes its unique properties. Using recordings from a single low-resolution room-overview camera and a single far-field microphone, a state-of-the-art audio-only speaker diarization system (speaker localization in time) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically detects the speech regions and estimates "who spoke when"; in a second step, the visual models are used to infer the location of each speaker in the video. We call this process "dialocalization." The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art audio-only speaker diarization system, but also to add visual speaker localization at little incremental engineering and computational cost. The combined algorithm exhibits properties, such as increased robustness, that cannot be observed in algorithms based on a single modality. The article describes the algorithm, presents benchmarking results, explains the algorithm's properties, and systematically discusses the contribution of each modality.
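The two-step idea described above can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): given a diarization output (binary speech activity per speaker per frame) and per-region visual motion energy from the overview camera, each speaker is assigned the video region whose activity correlates best with that speaker's speech turns. All function names and data here are illustrative, and the toy data is synthetic.

```python
# Hypothetical sketch of "dialocalization": associate each diarized
# speaker with the video region whose motion best tracks their turns.

def correlate(a, b):
    # Pearson correlation of two equal-length numeric sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def dialocalize(diarization, region_activity):
    """diarization: {speaker: [0/1 speech activity per frame]}
    (output of an audio-only diarization step).
    region_activity: {region: [motion energy per frame]}.
    Returns {speaker: best-matching region}."""
    return {
        spk: max(region_activity,
                 key=lambda r: correlate(turns, region_activity[r]))
        for spk, turns in diarization.items()
    }

# Toy example: two speakers, two spatial regions, ten video frames.
diar = {"A": [1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
        "B": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]}
motion = {"left":  [0.9, 0.8, 0.7, 0.1, 0.2, 0.1, 0.8, 0.9, 0.2, 0.1],
          "right": [0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.1, 0.2, 0.9, 0.8]}
print(dialocalize(diar, motion))  # {'A': 'left', 'B': 'right'}
```

In the actual system the visual models are estimated jointly with the acoustic models inside the unsupervised optimization, rather than matched after the fact as in this simplified correlation-based sketch.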