Building a highly accurate Mandarin speech recognizer with language-independent technologies and language-dependent modules

Authors:
Mei-Yuh Hwang;Gang Peng;Mari Ostendorf;Wen Wang;Arlo Faria;Aaron Heidel
Affiliations:
Microsoft Corporation, Redmond, WA and University of Washington, Seattle, WA;Chinese University of Hong Kong, Shatin, NT, Hong Kong and University of Washington, Seattle, WA;Department of Electrical Engineering, University of Washington, Seattle, WA;SRI International, Menlo Park, CA;International Computer Science Institute, University of California at Berkeley, Berkeley, CA;National Taiwan University, Taipei, Taiwan
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2009

Abstract

We describe a system for highly accurate large-vocabulary Mandarin speech recognition. The prevailing hidden Markov model based technologies are essentially language independent and constitute the backbone of our system. These include minimum-phone-error discriminative training and maximum-likelihood linear regression adaptation, among others. In addition, we carefully address Mandarin-specific issues, including lexical word segmentation, tone modeling, phone set design, and automatic acoustic segmentation. Our system comprises two sets of acoustic models for the purpose of cross adaptation. The two subsystems are designed to make complementary errors while achieving similar overall accuracy, by using different phone sets and different combinations of discriminative learning. The outputs of the two subsystems are then rescored by an adapted n-gram language model. Final confusion network combination yielded a 9.1% character error rate on the DARPA GALE 2007 official evaluation, making it the best Mandarin recognition system that year.
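
The headline metric above, character error rate (CER), is commonly computed as the edit distance between the reference and hypothesis character sequences, divided by the number of reference characters. The following is a minimal sketch of that common definition, not the official GALE scoring procedure, which may apply text normalization not shown here. The function names `edit_distance` and `character_error_rate` are hypothetical, and stripping whitespace before scoring is an assumption made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions, and deletions each cost 1.
    """
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance / number of reference characters.

    Assumption: whitespace is ignored, so word segmentation does not
    affect the score.
    """
    ref_chars = [c for c in ref if not c.isspace()]
    hyp_chars = [c for c in hyp if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `character_error_rate("今天天气很好", "今天天汽很好")` counts one substitution against six reference characters. Scoring at the character level rather than the word level sidesteps disagreements about Mandarin word segmentation between the reference and the recognizer output.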