Spectral entropy and spectral shape based pre-quantization for real time speaker identification system

Authors:
Gourav Sarkar;Goutam Saha
Affiliations:
Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, Pin, India 721302;Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, Pin, India 721302
Venue:
International Journal of Speech Technology
Year:
2010

Abstract

Pre-processing is one of the vital steps in developing a robust and efficient recognition system. Better pre-processing not only aids in better data selection but also significantly reduces computational complexity. Furthermore, an efficient frame selection technique can improve the overall performance of the system. Pre-quantization (PQ) is the technique of selecting fewer frames in the pre-processing stage to reduce the computational burden in the post-processing stages of speaker identification (SI). In this paper, we develop PQ techniques based on spectral entropy and spectral shape to pick suitable frames containing speaker-specific information, which varies from frame to frame depending on the spoken text and environmental conditions. The approach attempts to exploit the statistical properties of the distributions of speech frames at the pre-processing stage of speaker recognition. Our aim is not only to reduce the frame rate but also to keep identification accuracy reasonably high. We have also analyzed the robustness of the proposed techniques on noisy utterances. To establish the efficacy of the proposed methods, we used two different databases: POLYCOST (telephone speech) and YOHO (microphone speech).
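
As a rough illustration of entropy-based frame selection, the following Python sketch computes the spectral entropy of each frame (the Shannon entropy of its normalized power spectrum) and keeps a fixed fraction of frames. It is a minimal sketch under stated assumptions, not the method of the paper: the abstract does not say whether low- or high-entropy frames are retained, or how many, so the rule of keeping the lowest-entropy frames and the names `select_frames` and `keep_ratio` are hypothetical. A naive DFT is used only to keep the example dependency-free; an FFT routine would normally replace it.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of one frame via a naive DFT (non-negative frequencies only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(frame):
    """Shannon entropy (in bits) of the frame's power spectrum, normalized to sum to 1."""
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0.0:
        # Assumption: treat a silent frame as maximally flat so it ranks last.
        return math.log2(len(power))
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def select_frames(frames, keep_ratio=0.5):
    """Return sorted indices of the lowest-entropy frames.

    Assumption: keeping the most spectrally peaked frames is a hypothetical
    selection rule chosen for illustration, not the criterion of the paper.
    """
    n_keep = max(1, round(keep_ratio * len(frames)))
    ranked = sorted(range(len(frames)), key=lambda i: spectral_entropy(frames[i]))
    return sorted(ranked[:n_keep])


# Toy input: a pure tone (peaked spectrum) and a unit impulse (flat spectrum).
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
impulse = [1.0] + [0.0] * 31
kept = select_frames([impulse, tone, impulse, tone], keep_ratio=0.5)
```

In this toy case the tone frames have near-zero entropy and the impulse frames have the maximum possible entropy for 17 frequency bins, so `kept` contains the indices of the two tone frames. Only the retained frames would then be passed on to feature extraction and speaker modeling, which is where the computational saving comes from.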