RNN-based prosodic modeling for Mandarin speech and its application to speech-to-text conversion

Authors:
Wern-Jun Wang;Yuan-Fu Liao;Sin-Horng Chen
Affiliations:
Department of Communication Engineering, National Chiao Tung University, Taiwan, ROC and Advanced Technology Research Laboratory, Chunghwa Telecommunication Laboratories, Taiwan, ROC;Department of Communication Engineering, National Chiao Tung University, Taiwan, ROC;Department of Communication Engineering, National Chiao Tung University, Taiwan, ROC
Venue:
Speech Communication
Year:
2002

Abstract

In this paper, a recurrent neural network (RNN) based prosodic modeling method for Mandarin speech-to-text conversion is proposed. The prosodic modeling is performed in the post-processing stage of acoustic decoding and aims to detect word-boundary cues that assist linguistic decoding. It employs a simple three-layer RNN to learn the relationship between input prosodic features, extracted from the input utterance using syllable boundaries pre-determined by the preceding acoustic decoder, and the word-boundary information of the associated text. Once properly trained, the RNN prosodic model can generate word-boundary cues that help the linguistic decoder resolve word-boundary ambiguity. Two schemes for using these word-boundary cues are proposed. Scheme 1 modifies the baseline scheme of the conventional linguistic decoding search by taking the RNN outputs directly as additional scores and adding them to all word-sequence hypotheses to assist in selecting the best recognized word sequence. Scheme 2 extends Scheme 1 by further using the RNN outputs to drive a finite state machine (FSM) that sets path constraints to restrict the linguistic decoding search. Character accuracy rates of 73.6%, 74.6% and 74.7% were obtained for the systems using the baseline scheme, Scheme 1 and Scheme 2, respectively. In addition, Scheme 2 reduced the computational complexity of the linguistic decoding search by 17%. These results suggest that the proposed prosodic modeling method is promising for Mandarin speech recognition.
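
The two schemes can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the score formula, the weighting, or the FSM design, so everything specific here is an assumption: the prosodic model is taken to output, for each syllable, a probability that a word boundary follows it (`p_boundary`); the additional score is taken to be a weighted sum of log-probabilities; and the FSM of Scheme 2 is replaced by a much simpler threshold test. The names `boundary_score`, `rescore`, `violates_constraints`, `weight`, `low` and `high` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A hypothesis is (word lengths in syllables, base score from the linguistic decoder).
Hypothesis = Tuple[Sequence[int], float]


def word_end_positions(word_lengths: Sequence[int]) -> set:
    """Return the indices of syllables that end a word in this segmentation."""
    ends, pos = set(), 0
    for length in word_lengths:
        pos += length
        ends.add(pos - 1)
    return ends


def boundary_score(word_lengths: Sequence[int],
                   p_boundary: Sequence[float],
                   eps: float = 1e-6) -> float:
    """Assumed additional score: log-likelihood of the segmentation under the
    per-syllable boundary probabilities (hypothetical form, not from the paper)."""
    if sum(word_lengths) != len(p_boundary):
        raise ValueError("segmentation and cue sequence differ in length")
    ends = word_end_positions(word_lengths)
    score = 0.0
    for i, p in enumerate(p_boundary):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        score += math.log(p if i in ends else 1.0 - p)
    return score


def rescore(hypotheses: List[Hypothesis],
            p_boundary: Sequence[float],
            weight: float = 1.0) -> Hypothesis:
    """Scheme 1 in spirit: add the prosodic score to every hypothesis, pick the best."""
    return max(hypotheses,
               key=lambda h: h[1] + weight * boundary_score(h[0], p_boundary))


def violates_constraints(word_lengths: Sequence[int],
                         p_boundary: Sequence[float],
                         low: float = 0.1,
                         high: float = 0.9) -> bool:
    """Simplified stand-in for Scheme 2's path constraints (thresholds, not an FSM):
    reject paths that contradict a confident boundary or non-boundary cue."""
    ends = word_end_positions(word_lengths)
    for i, p in enumerate(p_boundary):
        if p >= high and i not in ends:
            return True
        if p <= low and i in ends:
            return True
    return False


# Toy data: four syllables, three competing segmentations (all values invented).
cues = [0.05, 0.95, 0.2, 0.97]
candidates = [([2, 2], -10.0), ([1, 3], -9.8), ([4], -9.9)]

best = rescore(candidates, cues)
survivors = [h for h in candidates if not violates_constraints(h[0], cues)]
```

With these invented numbers, the base scores alone would favor the `[1, 3]` segmentation, while the combined score selects `[2, 2]`; the threshold test also discards the other two candidates before any scoring, which is the kind of search-space reduction that constraint-based pruning is meant to provide.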