A hybrid approach to adapting acoustic and pronunciation models for non-native speech recognition

Authors:
Yoo Rhee Oh; Hong Kook Kim
Affiliations:
Dept. of Information and Communications, Gwangju Institute of Science and Technology, Gwangju, Korea (both authors)
Venue:
Asilomar'09: Proceedings of the 43rd Asilomar Conference on Signals, Systems and Computers
Year:
2009

Abstract

In this paper, we propose a hybrid model adaptation approach that combines pronunciation model adaptation and acoustic model adaptation in order to improve the performance of non-native automatic speech recognition (ASR). The hybrid adaptation can be performed in two ways: at the state-tying level or at the triphone-modeling level. In both methods, we first analyze the pronunciation variant rules of non-native speech and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level method then adapts the pronunciation models by adding variant pronunciations derived from the non-native speech, and adapts the acoustic models by tying the states of the triphone acoustic models using the acoustic variants. The triphone-modeling level method, in contrast, adapts the pronunciation models in the same way, re-estimates the triphone acoustic models using the adapted pronunciation models, and then clusters the states of the triphone acoustic models using the acoustic variants. Korean-spoken English speech recognition experiments show that the proposed hybrid approach reduces the average word error rate (WER) by 16.07% with the state-tying level method and by 20.94% with the triphone-modeling level method, relative to a baseline ASR system.
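The abstract does not specify the rule format, the criterion that separates pronunciation variants from acoustic variants, or the exact tying procedure, so the following is only a minimal Python sketch of the data flow under stated assumptions. Everything in it is hypothetical: the `VariantRule` format (a context-dependent phone substitution or deletion with an occurrence count), the helper names `classify_rules`, `apply_rule`, `adapt_lexicon`, `tying_pairs` and `relative_wer_reduction`, the frequency threshold `min_count` used as a stand-in classification criterion, and the toy phone data. It also shows how a relative WER reduction, such as the percentages quoted above, is computed from a baseline and an adapted WER.

```python
from dataclasses import dataclass

# Hypothetical sketch: names, the rule format and the classification
# criterion are illustrative assumptions, not the paper's implementation.


@dataclass(frozen=True)
class VariantRule:
    """In context left_canonical_right, `canonical` is realized as `observed`.

    An empty `observed` string denotes a deletion; "#" marks a word boundary.
    `count` is a hypothetical number of occurrences in non-native data.
    """
    left: str
    canonical: str
    right: str
    observed: str
    count: int


def classify_rules(rules, min_count):
    """Split rules into (pronunciation_variants, acoustic_variants).

    Assumed stand-in criterion: deletions change the phone sequence, so they
    go to the lexicon; substitutions are pronunciation variants if frequent
    enough, otherwise acoustic variants.
    """
    pron, acoustic = [], []
    for r in rules:
        if r.observed == "" or r.count >= min_count:
            pron.append(r)
        else:
            acoustic.append(r)
    return pron, acoustic


def apply_rule(phones, rule):
    """Apply `rule` at every matching position; return None if nothing matched."""
    padded = ["#", *phones, "#"]
    out, changed = [], False
    for i in range(1, len(padded) - 1):
        if (padded[i] == rule.canonical and padded[i - 1] == rule.left
                and padded[i + 1] == rule.right):
            changed = True
            if rule.observed:  # substitution; a deletion appends nothing
                out.append(rule.observed)
        else:
            out.append(padded[i])
    return tuple(out) if changed else None


def adapt_lexicon(lexicon, pron_rules):
    """Return a copy of `lexicon` with rule-generated variant pronunciations added."""
    adapted = {word: set(prons) for word, prons in lexicon.items()}
    for word, prons in lexicon.items():
        for pron in prons:
            for rule in pron_rules:
                variant = apply_rule(pron, rule)
                if variant:  # skip non-matches (None) and empty pronunciations
                    adapted[word].add(variant)
    return adapted


def tying_pairs(acoustic_rules):
    """Map each canonical triphone (HTK-style l-p+r) to the variant triphone
    whose states it is assumed to share."""
    return {
        f"{r.left}-{r.canonical}+{r.right}": f"{r.left}-{r.observed}+{r.right}"
        for r in acoustic_rules
    }


def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: 100 * (base - adapted) / base."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up toy data for illustration only.
    lexicon = {"five": {("f", "ay", "v")}}
    rules = [
        VariantRule("#", "f", "ay", "p", count=40),  # frequent substitution
        VariantRule("ay", "v", "#", "b", count=3),   # rare substitution
        VariantRule("ay", "v", "#", "", count=2),    # deletion
    ]
    pron, acoustic = classify_rules(rules, min_count=10)
    print(sorted(adapt_lexicon(lexicon, pron)["five"]))
    print(tying_pairs(acoustic))
    print(relative_wer_reduction(20.0, 16.0))
```

Under these assumptions, the adapted lexicon would serve both methods. The tying pairs would drive state tying directly in the state-tying level method, and would presumably be consulted when clustering states after re-estimation in the triphone-modeling level method. The WER values in the usage example are invented and are not results from the paper.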