Pseudo pitch synchronous analysis of speech with applications to speaker recognition

Authors:
R. D. Zilca;B. Kingsbury;J. Navratil;G. N. Ramaswamy
Affiliations:
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA;-;-;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2006

Cited by (7):
Frame length selection in speaker verification task (WSEAS TRANSACTIONS on SYSTEMS)
Optimizing features extraction parameters for speaker verification (ICS'08 Proceedings of the 12th WSEAS international conference on Systems)
An overview of text-independent speaker recognition: From features to supervectors (Speech Communication)
Healthcare audio event classification using hidden Markov models and hierarchical hidden Markov models (ICME'09 Proceedings of the 2009 IEEE international conference on Multimedia and Expo)
Feature selection using singular value decomposition and QR factorization with column pivoting for text-independent speaker identification (Speech Communication)
Pitch mean based frequency warping (ISCSLP'06 Proceedings of the 5th international conference on Chinese Spoken Language Processing)
A multi-resolution multi-classifier system for speaker verification (Expert Systems: The Journal of Knowledge Engineering)

Abstract

The fine spectral structure related to pitch information is conveyed in Mel cepstral features, so variations in pitch cause variations in the features. For speaker recognition systems, this phenomenon, known as "pitch mismatch" between training and testing, can increase error rates. Likewise, pitch-related variability may increase error rates in speech recognition systems for languages such as English, in which pitch does not carry phonetic information. In addition, for both speech recognition and speaker recognition systems, the raw speech signal is traditionally parsed into frames using a constant frame size and a constant frame offset, without aligning the frames to the natural pitch cycles. As a result, the power spectral estimation performed as part of the Mel cepstral computation may include artifacts. Pitch synchronous methods have addressed this problem in the past, at the expense of added complexity from a variable frame size and/or offset. This paper introduces Pseudo Pitch Synchronous (PPS) signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncating pitch cycles while still using a constant frame size and frame offset, in an effort to address the above problems. Text-independent speaker recognition experiments performed on NIST speaker recognition tasks demonstrate a performance improvement when the scores produced by systems using PPS are fused with traditional speaker recognition scores. In addition, a better distribution of errors across trials may be obtained for similar error rates, and some insight into the role of the fundamental frequency in speaker recognition is revealed. Speech recognition experiments run on the Aurora-2 noisy digits task also show improved robustness and better accuracy for extremely low signal-to-noise ratio (SNR) data.
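
The artifact described above, where a constant frame size truncates pitch cycles, can be illustrated with a small sketch. This is not the PPS procedure itself, which the abstract does not specify. It only shows, under simplifying assumptions (a pure sinusoid standing in for voiced speech, a rectangular window, a naive DFT), how a frame that holds a non-integer number of pitch cycles spreads spectral power away from the pitch harmonic. All function names and parameter values here are hypothetical.

```python
import cmath
import math


def cycles_per_frame(frame_size, sample_rate, f0):
    """Number of pitch cycles (possibly fractional) spanned by one frame."""
    return frame_size * f0 / sample_rate


def make_tone(num_samples, sample_rate, f0):
    """Pure sinusoid at f0; a crude stand-in for a voiced speech segment."""
    return [math.sin(2 * math.pi * f0 * n / sample_rate)
            for n in range(num_samples)]


def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) with a rectangular window."""
    n_samples = len(frame)
    spectrum = []
    for k in range(n_samples // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def peak_power_fraction(frame):
    """Fraction of spectral power in the strongest bin.

    Values near 1 mean little leakage; lower values mean the power is
    smeared across neighbouring bins.
    """
    ps = power_spectrum(frame)
    return max(ps) / sum(ps)


if __name__ == "__main__":
    sample_rate = 8000   # Hz, assumed for illustration
    f0 = 100.0           # Hz, assumed pitch; one cycle is 80 samples

    for frame_size in (160, 200):  # 20 ms and 25 ms frames
        frame = make_tone(frame_size, sample_rate, f0)
        print(frame_size,
              cycles_per_frame(frame_size, sample_rate, f0),
              round(peak_power_fraction(frame), 3))
```

With these assumed values, the 160-sample frame holds a whole number of cycles and the 200-sample frame ends mid-cycle, so the second frame shows a visibly lower peak fraction. Real front ends apply a tapered window and operate on speech whose pitch varies over time, so the effect there is less clean than in this toy case.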