Modeling lexical tones for Mandarin large vocabulary continuous speech recognition

  • Authors: Mari Ostendorf; Xin Lei
  • Affiliations: University of Washington; University of Washington
  • Venue: Modeling lexical tones for Mandarin large vocabulary continuous speech recognition
  • Year: 2006

Abstract

Tones in Mandarin carry lexical meaning and distinguish otherwise ambiguous words. Therefore, some representation of tone is considered an important component of an automatic Mandarin speech recognition system. In this dissertation, we propose several new strategies for tone modeling and explore their effectiveness in state-of-the-art HMM-based Mandarin large vocabulary speech recognition systems in two domains: conversational telephone speech and broadcast news. A study of tonal patterns in the two domains is performed first, showing the differing degrees of tone coarticulation. We then investigate two classes of approaches to tone modeling for speech recognition: embedded and explicit tone modeling. In embedded tone modeling, a novel spline interpolation algorithm is proposed for continuation of the F0 contour in unvoiced regions, and more effective pitch features are extracted from the interpolated F0 contour. Since tones span syllables rather than phonetic units, we also investigate the use of a multi-layer perceptron and long-term F0 windows to extract tone-related posterior probabilities for acoustic modeling. Experiments show that the new tone features significantly improve recognition performance. To address the different natures of spectral and tone features, multi-stream adaptation is also explored. To further exploit the suprasegmental nature of tones, we combine explicit tone modeling with embedded tone modeling via lattice rescoring. Explicit tone models allow the use of variable windows to synchronize feature extraction with the syllable. Oracle experiments reveal substantial room for improvement from adding explicit tone modeling (a 30% reduction in character error rate). Pursuing that potential improvement, syllable-level tone models are first trained and used to provide an extra knowledge source in the lattice. We then extend syllable-level tone modeling to word-level modeling with a hierarchical backoff. Experimental results show that the proposed word-level tone modeling consistently outperforms syllable-level modeling and leads to significant gains over embedded tone modeling alone. An important aspect of this work is that the methods are evaluated in the context of a high-performance continuous speech recognition system. Hence, the development of two state-of-the-art Mandarin large vocabulary speech recognition systems incorporating the tone modeling techniques is also described.
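
To make the embedded tone modeling idea concrete, the sketch below illustrates continuing an F0 contour through unvoiced regions by spline interpolation so that pitch features can be computed on every frame. This is a minimal illustration under stated assumptions, not the dissertation's exact algorithm: the zero-valued convention for unvoiced frames, the constant-edge handling outside the voiced span, and the use of SciPy's CubicSpline are choices made here for demonstration only.

```python
# Minimal sketch: fill unvoiced gaps in an F0 track by cubic-spline
# interpolation over the voiced frames (assumptions: unvoiced frames are
# marked with 0.0, edges are held constant; not the dissertation's algorithm).
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_f0(f0, unvoiced_value=0.0):
    """Return a copy of `f0` with unvoiced frames filled by spline interpolation."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 != unvoiced_value
    if voiced.sum() < 2:
        return f0.copy()                      # too few voiced frames to interpolate
    idx = np.arange(len(f0))
    spline = CubicSpline(idx[voiced], f0[voiced])
    filled = f0.copy()
    # Interpolate only between the first and last voiced frame; hold the
    # edge values constant outside that span.
    first, last = idx[voiced][0], idx[voiced][-1]
    inner = (~voiced) & (idx >= first) & (idx <= last)
    filled[inner] = spline(idx[inner])
    filled[idx < first] = f0[first]
    filled[idx > last] = f0[last]
    return filled

# Example: a short F0 track (Hz) with an unvoiced gap in the middle.
track = [220.0, 225.0, 0.0, 0.0, 0.0, 210.0, 205.0]
print(interpolate_f0(track))
```

The continuous contour produced this way is what makes it possible to extract pitch features (and, with long-term F0 windows, tone posteriors) on every frame, including frames that were originally unvoiced.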