Prosody plays an important role in discriminating speakers. Because relevant prosodic information is difficult to estimate, most recognition systems assume that the distribution statistics of the fundamental frequency (as a proxy for pitch) and of speech energy (as a proxy for loudness/stress) are enough to capture prosodic differences between speakers. This simplification disregards the temporal aspects of prosody and the relationship between prosodic features that determines phenomena such as intonation and stress. We propose an alternative approach that exploits the joint dynamics of the fundamental frequency and speech energy to capture prosodic differences. The aim is to characterize the different intonation, stress, and rhythm patterns produced by variation in the fundamental frequency and speech energy contours. In our approach, the continuous speech signal is converted into a sequence of discrete units that describe the signal in terms of the dynamics of the fundamental frequency and energy contours. Using simple statistical models, we show that the statistical dependency between these discrete units captures speaker-specific information. On the extended-data speaker detection task of the 2001 and 2003 NIST Speaker Recognition Evaluations, this approach achieves a relative improvement of at least 17% over a system based on the distribution statistics of fundamental frequency, speech energy, and their deltas. We also show that the proposed prosodic features are more robust to communication channel effects than the state-of-the-art speaker recognition system. Because conventional speaker recognition systems do not fully incorporate different levels of information, we fuse the prosodic systems with the state-of-the-art system and show that the prosodic features provide complementary information.
With this fusion, the relative performance improvement over the state-of-the-art system is about 42% on the extended-data task of the 2001 NIST Speaker Recognition Evaluation and about 12% on that of the 2003 evaluation.
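The conversion of contours into discrete units and the modeling of dependencies between those units can be illustrated with a minimal sketch. It is not the implementation evaluated above: the five-symbol inventory (joint rising/falling direction of fundamental frequency and energy, plus an unvoiced symbol), the convention that a fundamental frequency of 0 marks an unvoiced frame, the add-alpha smoothed bigram model, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical symbol inventory: F0 direction followed by energy direction,
# plus "uv" for frames without a usable F0 estimate.
SYMBOLS = ["++", "+-", "-+", "--", "uv"]


def contour_to_units(f0, energy):
    """Label each frame transition by the joint direction of F0 and energy."""
    units = []
    for t in range(1, len(f0)):
        if f0[t] <= 0 or f0[t - 1] <= 0:
            units.append("uv")  # assumption: F0 == 0 means unvoiced
            continue
        f_dir = "+" if f0[t] > f0[t - 1] else "-"
        e_dir = "+" if energy[t] > energy[t - 1] else "-"
        units.append(f_dir + e_dir)
    return units


def collapse_runs(units):
    """Merge consecutive identical labels into one segment-level unit."""
    collapsed = []
    for u in units:
        if not collapsed or collapsed[-1] != u:
            collapsed.append(u)
    return collapsed


def train_bigram(seq, alpha=1.0):
    """Estimate add-alpha smoothed bigram probabilities P(next | current)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    context_counts = Counter(seq[:-1])
    vocab = len(SYMBOLS)
    return {
        (a, b): (pair_counts[(a, b)] + alpha) / (context_counts[a] + alpha * vocab)
        for a in SYMBOLS
        for b in SYMBOLS
    }


def avg_log_likelihood(seq, model):
    """Average per-bigram log-likelihood of a unit sequence under a model."""
    pairs = list(zip(seq, seq[1:]))
    return sum(math.log(model[p]) for p in pairs) / len(pairs)


def detection_score(seq, speaker_model, background_model):
    """Log-likelihood ratio: positive values favor the target speaker."""
    return avg_log_likelihood(seq, speaker_model) - avg_log_likelihood(seq, background_model)


# Toy contours (made-up numbers, not data from the evaluation).
f0 = [100, 110, 120, 115, 0, 0, 130, 125]
energy = [1.0, 1.2, 1.1, 1.0, 0.2, 0.1, 0.9, 1.0]
units = collapse_runs(contour_to_units(f0, energy))
speaker_model = train_bigram(units)
```

In a detection setting, `detection_score` would compare a model trained on the claimed speaker's units against a background model trained on units from many speakers; both the scoring rule and the choice of bigrams are stand-ins for the "simple statistical models" mentioned above.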