Unfolding speaker clustering potential: a biomimetic approach

Authors:
Thilo Stadelmann;Bernd Freisleben
Affiliations:
University of Marburg, Marburg, Germany;University of Marburg, Marburg, Germany
Venue:
MM '09 Proceedings of the 17th ACM international conference on Multimedia
Year:
2009

Abstract

Speaker clustering is the task of grouping a set of speech utterances into speaker-specific classes. The basic techniques for solving this task are similar to those used for speaker verification and identification. The hypothesis of this paper is that these techniques, originally developed for speaker verification and identification, are not sufficiently discriminative for speaker clustering. The processing chain for speaker clustering is long, however, so there are many potential areas for improvement. The question is: at which stage would changes do the most to improve the final result? To answer this question, this paper takes a biomimetic approach based on a study in which human participants act as an automatic speaker clustering system. Our findings are twofold: the modeling stage has the highest potential, and information about the temporal succession of frames is crucially missing. Experimental results on TIMIT data, obtained with our implementation of a speaker clustering system that incorporates these findings, show the validity of our approach.
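
The second finding can be illustrated with a small sketch. It is not the authors' system: it assumes a single diagonal Gaussian as a stand-in for a frame-independent speaker model, and synthetic random vectors in place of real speech features. The helper names `frame_independent_loglik` and `mean_frame_to_frame_change` are hypothetical. The sketch shows that a model which scores each frame independently gives the same score when the frames are shuffled, so it cannot use their temporal succession, whereas even a simple order-aware statistic changes.

```python
import math
import random


def frame_independent_loglik(frames, mean, var):
    """Sum of per-frame log-likelihoods under one diagonal Gaussian.

    Each frame is scored on its own, so frame order cannot matter.
    """
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def mean_frame_to_frame_change(frames):
    """Average squared difference between consecutive frames (order-aware)."""
    diffs = [
        sum((c - p) ** 2 for c, p in zip(cur, prev))
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)


rng = random.Random(0)

# Synthetic, slowly drifting 3-dimensional "feature frames" (assumption:
# real systems would use spectral features extracted from speech).
frames = []
state = [0.0, 0.0, 0.0]
for _ in range(200):
    state = [s + rng.gauss(0.0, 0.1) for s in state]
    frames.append(list(state))

# The same frames in random order: identical content, no temporal structure.
shuffled = frames[:]
rng.shuffle(shuffled)

mean = [0.0, 0.0, 0.0]
var = [1.0, 1.0, 1.0]

ll_original = frame_independent_loglik(frames, mean, var)
ll_shuffled = frame_independent_loglik(shuffled, mean, var)

change_original = mean_frame_to_frame_change(frames)
change_shuffled = mean_frame_to_frame_change(shuffled)

print("log-likelihood equal up to rounding:",
      math.isclose(ll_original, ll_shuffled, rel_tol=1e-9))
print("frame-to-frame change, original vs shuffled:",
      change_original, change_shuffled)
```

The two log-likelihoods agree up to floating-point rounding, while the frame-to-frame statistic differs between the original and the shuffled sequence. Any model built only from per-frame scores shares this invariance, which is one way to read the statement that temporal information is missing at the modeling stage.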