Speaker Diarization: A Review of Recent Research

Authors: X. Anguera Miro; S. Bozonnet; N. Evans; C. Fredouille; G. Friedland; O. Vinyals
Affiliations: Multimedia Res. Group, Telefonica Res., Barcelona, Spain (first author; none listed for the other authors)
Venue: IEEE Transactions on Audio, Speech, and Language Processing
Year: 2012

Cited by 6:

Variational conditional random fields for online speaker detection and tracking
Speech Communication

Speaker diarization using low-cost wearable wireless sensors
Proceedings of the 3rd International Conference on Information and Communication Systems

SocioPhone: everyday face-to-face interaction monitoring platform using multi-phone sensor fusion
Proceeding of the 11th annual international conference on Mobile systems, applications, and services

Crowd++: unsupervised speaker count with smartphones
Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing

Structured exploration of who, what, when, and where in heterogeneous multimedia news sources
Proceedings of the 21st ACM international conference on Multimedia

Scalable multimedia content analysis on parallel platforms using python
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)

Abstract

Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. It was initially proposed as a research topic related to automatic speech recognition, where it serves as an upstream processing step. In recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, and higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, which range from broadcast news to lectures and meetings, vary greatly and pose different problems, such as access to multiple microphones, multimodal information, or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper, we review the current state of the art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.