Improved automatic speech recognition through speaker normalization

Authors:
Diego Giuliani;Matteo Gerosa;Fabio Brugnara
Affiliations:
ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive, 18, I-38050 Povo, Trento, Italy;ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive, 18, I-38050 Povo, Trento, Italy and University of Trento, International Graduate School I-38050 Povo, Trento, Italy;ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive, 18, I-38050 Povo, Trento, Italy
Venue:
Computer Speech and Language
Year:
2006

Abstract

In this paper, speaker adaptive acoustic modeling is investigated using a novel method for speaker normalization and a well-known vocal tract length normalization method. With the novel method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations, with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated so as to reduce the mismatch between the speaker's acoustic data and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the speaker's acoustic observations into the normalized acoustic space. Recognition experiments used two corpora, one of adults' speech and one of children's speech. Training and recognition with normalized data consistently reduced the word error rate with respect to the baseline systems trained on unnormalized data. In addition, the novel method always performed better than the reference vocal tract length normalization method adopted in this work. When unsupervised static speaker adaptation was applied in combination with each of the two speaker normalization methods, the two corpora behaved differently: on one, the performance of the two methods became very similar, while on the other the difference remained significant.
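
The mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the target model is a single diagonal-covariance Gaussian (the paper uses a set of hidden Markov models), that features are NumPy arrays with one frame per row, and that the transform (A, b) is already given rather than estimated. The function names apply_affine and normalized_log_likelihood are hypothetical. The sketch shows the affine feature-space map y = A x + b and the objective that a constrained MLLR estimate would seek to maximize, including the log |det A| term contributed by the change of variables.

```python
import numpy as np


def apply_affine(features, A, b):
    """Map each row x of `features` (frames x dim) to A @ x + b."""
    return features @ A.T + b


def normalized_log_likelihood(features, A, b, mean, var):
    """Log-likelihood of the transformed frames under a diagonal Gaussian.

    The term n_frames * log|det A| is the Jacobian of the feature-space
    transform; without it, the likelihood could be inflated simply by
    shrinking the features towards the target mean.
    Assumption: a single Gaussian (mean, var) stands in for the target models.
    """
    y = apply_affine(features, A, b)
    n_frames, dim = y.shape
    sign, logabsdet = np.linalg.slogdet(A)
    if sign == 0:
        raise ValueError("A must be invertible")
    per_frame = -0.5 * (
        dim * np.log(2.0 * np.pi)
        + np.sum(np.log(var))
        + np.sum((y - mean) ** 2 / var, axis=1)
    )
    return float(np.sum(per_frame) + n_frames * logabsdet)
```

As a usage example, for frames drawn around an offset of 2.0 and a zero-mean, unit-variance target, the transform A = I, b = -2 should score higher than the identity transform, since it removes the speaker-specific shift. In an actual system, A and b would be estimated per speaker, and the transformed features would then be used for both training and recognition.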