A shift-based approach to speaker normalization using non-linear frequency-scaling model

Authors:
Rohit Sinha;S. Umesh
Affiliations:
Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208 016, India;Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208 016, India
Venue:
Speech Communication
Year:
2008

Citing 4
Cited 0

Speaker Normalization Based on Frequency Warping

ICASSP '97 Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97)-Volume 2 - Volume 2
Speaker normalization on conversational telephone speech

ICASSP '96 Proceedings of the Acoustics, Speech, and Signal Processing, 1996. on Conference Proceedings., 1996 IEEE International Conference - Volume 01
A parametric approach to vocal tract length normalization

ICASSP '96 Proceedings of the Acoustics, Speech, and Signal Processing, 1996. on Conference Proceedings., 1996 IEEE International Conference - Volume 01
A Study of Filter Bank Smoothing in MFCC Features for Recognition of Children's Speech

IEEE Transactions on Audio, Speech, and Language Processing

Quantified Score

Hi-index	0.01

Visualization

Abstract

In this work, we present a speaker-normalization method based on the idea that the speaker-dependent scale-factor can be separated out as a fixed translation factor in an alternate domain. We also introduce a non-linear frequency-scaling model motivated by the analysis of speech data. The proposed shift-based normalization approach is implemented using a maximum-likelihood (ML) search for the translation factor in the alternate domain. The advantage of our approach is that we are able to show the relationship between conventional frequency-warping based vocal-tract length normalization (VTLN) methods and the methods based on shifts in psycho-acoustic scale thus providing a unifying frame-work for speaker-normalization. Additionally, in our approach it is simple to show that the shifting required for normalization can be expressed as a linear transformation in the cepstral domain. This is important for computational efficiency since we do not have to recompute the features by re-doing the signal processing for each scale/translation factor as is usually done in conventional normalization. We present recognition results using our proposed approach on a digit recognition task and show that the non-linear scaling model provides relative improvement of 4% for adults and 7.5% for children when compared to the linear-scaling model.