RNN-based prosodic modeling for Mandarin speech and its application to speech-to-text conversion

  • Authors:
  • Wern-Jun Wang;Yuan-Fu Liao;Sin-Horng Chen

  • Affiliations:
  • Department of Communication Engineering, National Chiao Tung University, Taiwan, ROC and Advanced Technology Research Laboratory, Chunghwa Telecommunication Laboratories, Taiwan, ROC;Department of Communication Engineering, National Chiao Tung University, Taiwan, ROC;Department of Communication Engineering, National Chiao Tung University, Taiwan, ROC

  • Venue:
  • Speech Communication
  • Year:
  • 2002

Abstract

In this paper, a recurrent neural network (RNN) based prosodic modeling method for Mandarin speech-to-text conversion is proposed. The prosodic modeling is performed in the post-processing stage of acoustic decoding and aims at detecting word-boundary cues to assist linguistic decoding. It employs a simple three-layer RNN to learn the relationship between input prosodic features and the output word-boundary information of the associated text. The prosodic features are extracted from the input utterance using syllable boundaries pre-determined by the preceding acoustic decoder. Once the RNN prosodic model is properly trained, it can generate word-boundary cues that help the linguistic decoder resolve word-boundary ambiguity. Two schemes for using these word-boundary cues are proposed. Scheme 1 modifies the baseline scheme of the conventional linguistic decoding search by taking the RNN outputs directly as additional scores and adding them to all word-sequence hypotheses to assist in selecting the best recognized word sequence. Scheme 2 extends Scheme 1 by also using the RNN outputs to drive a finite state machine (FSM) that sets path constraints to restrict the linguistic decoding search. Character accuracy rates of 73.6%, 74.6% and 74.7% were obtained for the systems using the baseline scheme, Scheme 1 and Scheme 2, respectively. In addition, Scheme 2 achieved a 17% reduction in the computational complexity of the linguistic decoding search. The proposed prosodic modeling method is therefore promising for Mandarin speech recognition.
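The idea behind Scheme 1 (adding prosodic scores to every word-sequence hypothesis) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the score formula, so the sketch assumes the RNN emits, for each syllable, a probability that a word boundary follows it, and that these are combined as weighted log-probabilities. The names `boundary_score`, `rescore`, `boundary_probs` and `weight` are hypothetical.

```python
import math

def boundary_score(word_lengths, boundary_probs, weight=1.0):
    """Prosodic score of one segmentation hypothesis (assumed form).

    word_lengths: number of syllables in each word of the hypothesis.
    boundary_probs[i]: assumed RNN output, the probability that a word
    boundary follows syllable i.
    """
    # Indices of syllables after which this hypothesis places a boundary.
    ends = set()
    pos = 0
    for n in word_lengths:
        pos += n
        ends.add(pos - 1)
    score = 0.0
    for i, p in enumerate(boundary_probs):
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp to keep log() finite
        score += math.log(p if i in ends else 1.0 - p)
    return weight * score

def rescore(hypotheses, boundary_probs, weight=1.0):
    """Return the (word_lengths, linguistic_log_score) pair with the best
    combined linguistic and prosodic score."""
    return max(
        hypotheses,
        key=lambda h: h[1] + boundary_score(h[0], boundary_probs, weight),
    )

# Toy four-syllable utterance; all numbers are made up for illustration.
probs = [0.1, 0.9, 0.2, 0.95]
hyps = [([2, 2], -10.0), ([1, 3], -9.8)]
best = rescore(hyps, probs)
```

In this toy case the linguistic score alone slightly prefers the 1+3 segmentation, while the assumed prosodic cues point to a boundary after the second syllable, so the combined score selects the 2+2 segmentation. Scheme 2 would go one step further and prune hypotheses that violate FSM-derived path constraints, which is not shown here.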