Modeling Prosodic Features With Joint Factor Analysis for Speaker Verification

Authors:
N. Dehak;P. Dumouchel;P. Kenny
Affiliations:
CRIM, Montreal;-;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2007

Abstract

In this paper, we introduce the use of continuous prosodic features for speaker recognition, and we show how they can be modeled using joint factor analysis. Similar features have been successfully used in language identification. These prosodic features are pitch and energy contours spanning a syllable-like unit. They are extracted using a basis consisting of Legendre polynomials. Since the feature vectors are continuous (rather than discrete), they can be modeled using a standard Gaussian mixture model (GMM). Furthermore, speaker and session variability effects can be modeled in the same way as in conventional joint factor analysis. We find that the best results are obtained when we use the pitch, the energy, and the duration of the unit together. Testing on the core condition of the NIST 2006 speaker recognition evaluation data gives equal error rates of 16.6% and 14.6%, with prosodic features alone, for all trials and English-only trials, respectively. When the prosodic system is fused with a state-of-the-art cepstral joint factor analysis system, we obtain relative improvements of 8% (all trials) and 12% (English only) over the cepstral system alone.
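
The contour representation described in the abstract can be illustrated with a minimal sketch. It assumes that the pitch and energy tracks of one syllable-like unit are already available as equal-length arrays of frame values, and it fits each track with a least-squares Legendre expansion using NumPy's `legfit`. The function name `contour_features`, the polynomial order of 5, and the 10 ms frame shift are assumptions made for this illustration and are not taken from the abstract.

```python
import numpy as np
from numpy.polynomial import legendre


def contour_features(pitch, energy, order=5, frame_shift_s=0.01):
    """Summarize one syllable-like unit as a fixed-length vector.

    Hypothetical helper: `order` and `frame_shift_s` are illustrative
    defaults, not values stated in the abstract.
    """
    pitch = np.asarray(pitch, dtype=float)
    energy = np.asarray(energy, dtype=float)
    if pitch.ndim != 1 or pitch.shape != energy.shape:
        raise ValueError("pitch and energy must be 1-D arrays of equal length")
    n_frames = pitch.size
    if n_frames <= order:
        raise ValueError("need more frames than the polynomial order")

    # Map the unit onto [-1, 1], the interval on which Legendre
    # polynomials are orthogonal, so units of different length
    # are described on a common time axis.
    t = np.linspace(-1.0, 1.0, n_frames)

    # Least-squares Legendre coefficients of each contour
    # (order + 1 coefficients per contour).
    pitch_coefs = legendre.legfit(t, pitch, order)
    energy_coefs = legendre.legfit(t, energy, order)

    # Duration of the unit in seconds, appended as a final feature.
    duration = n_frames * frame_shift_s

    return np.concatenate([pitch_coefs, energy_coefs, [duration]])
```

Each unit then yields a continuous vector of `2 * (order + 1) + 1` values, which is the kind of input a standard GMM can model. The low-order coefficients have a direct reading: the first is the mean level of the contour and the second its overall slope.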