Mel cepstral features convey the fine spectral structure related to pitch, so variations in pitch cause variations in the features. For speaker recognition systems, this phenomenon, known as "pitch mismatch" between training and testing, can increase error rates. Likewise, pitch-related variability may increase error rates in speech recognition systems for languages such as English, in which pitch does not carry phonetic information. In addition, both speech recognition and speaker recognition systems traditionally parse the raw speech signal into frames using a constant frame size and a constant frame offset, without aligning the frames to the natural pitch cycles. As a result, the power spectral estimation performed as part of the Mel cepstral computation may include artifacts. Pitch synchronous methods have addressed this problem in the past, at the expense of added complexity from a variable frame size and/or offset. This paper introduces Pseudo Pitch Synchronous (PPS) signal processing procedures that attempt to align each individual frame to its natural pitch cycle and avoid truncating pitch cycles while still using a constant frame size and frame offset, in an effort to address the above problems. Text-independent speaker recognition experiments performed on NIST speaker recognition tasks demonstrate a performance improvement when the scores produced by systems using PPS are fused with traditional speaker recognition scores. In addition, a better distribution of errors across trials may be obtained for similar error rates, and some insight into the role of the fundamental frequency in speaker recognition is revealed. Speech recognition experiments run on the Aurora-2 noisy digits task also show improved robustness and better accuracy for extremely low signal-to-noise ratio (SNR) data.
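The truncation artifact mentioned above can be illustrated with a small sketch. This is not the PPS procedure from the paper, whose details are not given here; it only shows, under simplifying assumptions, why a frame that cuts a pitch cycle smears energy across the power spectrum. The synthetic two-harmonic signal, the 16-sample pitch period, the frame lengths, and the helper names power_spectrum and peak_concentration are all hypothetical choices made for illustration.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum (rectangular window) for bins 0..N//2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def peak_concentration(frame, n_peaks):
    """Fraction of spectral energy held by the n_peaks strongest bins."""
    power = power_spectrum(frame)
    strongest = sorted(power, reverse=True)[:n_peaks]
    return sum(strongest) / sum(power)


# Assumption: a perfectly periodic "voiced" signal with a fundamental and
# one harmonic, and a pitch period of 16 samples.
PERIOD = 16
signal = [
    math.sin(2 * math.pi * t / PERIOD) + 0.5 * math.sin(4 * math.pi * t / PERIOD)
    for t in range(200)
]

# A frame of 72 samples spans 4.5 pitch cycles, so the last cycle is truncated.
truncated_frame = signal[:72]
# A frame of 64 samples spans exactly 4 pitch cycles.
whole_cycle_frame = signal[:64]

# The signal has two harmonics, so ideally two bins hold all the energy.
aligned_score = peak_concentration(whole_cycle_frame, 2)
truncated_score = peak_concentration(truncated_frame, 2)

print(f"whole cycles:    {aligned_score:.3f}")
print(f"truncated cycle: {truncated_score:.3f}")
```

With whole pitch cycles in the frame, essentially all of the energy sits in the two harmonic bins. When the frame truncates a cycle, the fundamental falls between bins and its energy leaks into neighbouring bins, which is one form the spectral artifacts described in the abstract can take. Real speech is only quasi-periodic and is usually windowed before the spectrum is computed, so the effect in practice is less clean than in this sketch.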