Modeling prosodic differences for speaker recognition

Authors:
André Gustavo Adami
Affiliations:
Departamento de Informática, Universidade de Caxias do Sul, Rua Francisco Getúlio Vargas, 1130 Caxias do Sul, RS 95070-560, Brazil
Venue:
Speech Communication
Year:
2007

Abstract

Prosody plays an important role in discriminating speakers. Because relevant prosodic information is difficult to estimate, most recognition systems rely on the notion that the distribution statistics of the fundamental frequency (as a proxy for pitch) and speech energy (as a proxy for loudness/stress) can capture prosodic differences between speakers. However, this simplistic notion disregards the temporal aspects and the relationships between prosodic features that determine phenomena such as intonation and stress. We propose an alternative approach that exploits the joint dynamics of the fundamental frequency and speech energy to capture prosodic differences. The aim is to characterize the different intonation, stress, or rhythm patterns produced by variation in the fundamental frequency and speech energy contours. In our approach, the continuous speech signal is converted into a sequence of discrete units that describe the signal in terms of the dynamics of the fundamental frequency and energy contours. Using simple statistical models, we show that the statistical dependency between such discrete units can capture speaker-specific information. On the extended-data speaker detection task of the 2001 and 2003 NIST Speaker Recognition Evaluations, this approach achieves a relative improvement of at least 17% over a system based on the distribution statistics of fundamental frequency, speech energy, and their deltas. We also show that the proposed prosodic systems are more robust to communication channel effects than the state-of-the-art speaker recognition system. Since conventional speaker recognition systems do not fully incorporate different levels of information, we fuse the prosodic systems with the state-of-the-art system and show that the prosodic features provide complementary information. The relative performance improvement over the state-of-the-art system is about 42% and 12% on the extended-data task of the 2001 and 2003 NIST Speaker Recognition Evaluations, respectively.
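
The idea of turning the fundamental frequency and energy contours into discrete units, and then modeling the dependency between consecutive units, can be illustrated with a small sketch. The abstract does not say how the units are derived or which statistical model is used, so everything below is an assumption made for illustration: the frame-level rising/falling/flat labels, the bigram model with add-alpha smoothing, the log-likelihood-ratio score against a background model, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def contour_to_tokens(f0, energy, eps=1e-6):
    """Label each frame transition with a joint token such as '+-'
    (F0 rising, energy falling). Assumed tokenization, for illustration only."""
    def sign(delta):
        if delta > eps:
            return "+"
        if delta < -eps:
            return "-"
        return "0"

    return [
        sign(f0[i] - f0[i - 1]) + sign(energy[i] - energy[i - 1])
        for i in range(1, len(f0))
    ]


def collapse_runs(tokens):
    """Merge consecutive identical tokens so each unit spans one joint trend."""
    units = []
    for tok in tokens:
        if not units or units[-1] != tok:
            units.append(tok)
    return units


def train_bigram(units):
    """Count bigrams and how often each unit appears as a left context."""
    counts = Counter(zip(units, units[1:]))
    context_totals = Counter()
    for (left, _), c in counts.items():
        context_totals[left] += c
    return counts, context_totals


def cond_prob(model, bigram, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(right | left)."""
    counts, context_totals = model
    left, _ = bigram
    return (counts[bigram] + alpha) / (context_totals[left] + alpha * vocab_size)


def llr_score(units, speaker_model, background_model, vocab_size=9):
    """Average per-bigram log-likelihood ratio of speaker vs. background.
    vocab_size=9 because each token pairs one of three F0 labels with
    one of three energy labels."""
    bigrams = list(zip(units, units[1:]))
    if not bigrams:
        return 0.0
    total = sum(
        math.log(cond_prob(speaker_model, bg, vocab_size))
        - math.log(cond_prob(background_model, bg, vocab_size))
        for bg in bigrams
    )
    return total / len(bigrams)
```

In this sketch, a test utterance would be tokenized with `contour_to_tokens`, reduced with `collapse_runs`, and scored with `llr_score`; a positive score means the unit sequence is better explained by the speaker's bigram statistics than by the background's. A real system would also need pitch tracking, handling of unvoiced frames, and contour smoothing, none of which is shown here.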