This paper presents a decoder that statically optimizes part of the knowledge sources while handling the others dynamically. The lexicon, phonetic contexts, and acoustic model are statically integrated into a memory-efficient state network, while the language model (LM) is incorporated dynamically, on the fly, by means of extended tokens. The novelties of our approach to constructing the state network are (1) introducing two layers of dummy nodes to cluster the cross-word (CW) context-dependent fan-in and fan-out triphones, (2) introducing a so-called “WI layer” that stores word identities, whose nodes are placed in the non-shared middle part of the network, and (3) optimizing the network at the state level through a thorough forward and backward node-merging process. The state network is organized as a multi-layer structure so that token propagation can be handled differently at each layer. By exploiting the characteristics of the state network, several techniques, including LM look-ahead, an LM cache, and beam pruning, are specially designed for search efficiency. In particular, a layer-dependent pruning method is proposed to further reduce the search space. It exploits the neck-like shape of the WI layer and the reduced variety of word endings, which permits a tighter beam without introducing many search errors. In addition, other techniques, including LM compression, lattice-based bookkeeping, and lattice garbage collection, are employed to reduce the memory requirements. Experiments are carried out on a Mandarin spontaneous speech recognition task in which the decoder uses a trigram LM and CW triphone models. A comparison with HDecode from the HTK toolkit shows that, within a 1% deviation in accuracy, our decoder runs 5 times faster with half the memory footprint.
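The dynamic side of such a design — tokens propagated through a static state network, extended with an LM score whenever they cross a word-identity arc, and pruned against a beam around the best hypothesis — can be sketched as follows. This is a minimal illustrative toy, not the paper's actual implementation: the network, the bigram scores, and all names here are invented for the example, and a real decoder would use a trigram LM, per-frame acoustic scores, and layer-dependent beams.

```python
# Hypothetical sketch of token passing with on-the-fly LM incorporation
# and beam pruning. The tiny network and bigram table below are made up.
from collections import defaultdict

# Toy static state network: state -> list of (next_state, acoustic_score,
# word_or_None). Arcs carrying a word identity play the role of WI arcs.
NETWORK = {
    0: [(1, -1.0, None), (2, -1.5, None)],
    1: [(3, -0.5, "hello")],
    2: [(3, -0.7, "world")],
    3: [],  # final state
}

# Toy bigram scores, with a flat back-off value for unseen pairs.
BIGRAM = defaultdict(lambda: -5.0,
                     {("<s>", "hello"): -0.2, ("<s>", "world"): -2.0})

def decode(beam=3.0):
    # Each token carries (state, total_score, LM_history, word_sequence);
    # the LM history is what "extended tokens" add on top of the network state.
    tokens = [(0, 0.0, "<s>", [])]
    finished = []
    while tokens:
        new_tokens = []
        for state, score, hist, words in tokens:
            arcs = NETWORK[state]
            if not arcs:                      # reached a final state
                finished.append((score, words))
                continue
            for nxt, ac, word in arcs:
                s, h, w = score + ac, hist, words
                if word is not None:          # WI arc: apply the LM dynamically
                    s += BIGRAM[(hist, word)]
                    h, w = word, words + [word]
                new_tokens.append((nxt, s, h, w))
        if new_tokens:
            best = max(t[1] for t in new_tokens)
            # Beam pruning: discard tokens scoring far below the current best.
            new_tokens = [t for t in new_tokens if t[1] >= best - beam]
        tokens = new_tokens
    return max(finished)  # best-scoring (score, word_sequence)
```

A layer-dependent scheme in the spirit of the paper would replace the single `beam` value with a different width per network layer, tightening it at the WI layer where hypotheses converge.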