A study on integrating acoustic-phonetic information into lattice rescoring for automatic speech recognition

Authors:
Sabato Marco Siniscalchi; Chin-Hui Lee
Affiliations:
Department of Electronics and Telecommunications, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway; School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
Venue:
Speech Communication
Year:
2009



Abstract

This paper describes a lattice rescoring approach to integrating acoustic-phonetic information into automatic speech recognition (ASR). Information beyond what conventional log-likelihood based decoding uses is provided by a bank of speech event detectors that score manner and place of articulation events with log-likelihood ratios, which are treated as confidence levels. An artificial neural network (ANN) then transforms the raw log-likelihood ratio scores into terms that are easy to incorporate into rescoring. We refer to the combination of the event detectors and the ANN as the knowledge module. One goal of this study is to design a generic framework that makes it easier to incorporate other sources of information into an existing ASR system. Another is to begin investigating whether a generic knowledge module can be built that plugs into an ASR system without being trained on data specific to the given task. To this end, the proposed approach is evaluated on three diverse ASR tasks: continuous phone recognition, connected digit recognition, and large vocabulary continuous speech recognition. The data-driven knowledge module is trained on a single corpus and used in all three evaluation tasks without further training. Experimental results indicate that in all three cases the proposed rescoring framework achieves better results than those obtained without the confidence scores provided by the knowledge module. The rescoring process is especially effective in correcting erroneous utterances in large vocabulary continuous speech recognition, where constraints imposed by the lexical and language models sometimes produce recognition results that do not strictly observe the underlying acoustic-phonetic properties.
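
The abstract does not say how the knowledge-module scores are combined with the decoder scores, so the following is only a minimal sketch of lattice rescoring in general. It assumes a linear interpolation of the two log-domain scores on each arc, followed by a best-path search over the lattice. The `Arc` type, the `rescore_best_path` function, the `weight` parameter, and the toy scores are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    start: int
    end: int
    label: str
    decoder_score: float    # log-likelihood from the baseline decoder
    knowledge_score: float  # assumed log-domain score from the knowledge module


def rescore_best_path(arcs, start_node, final_node, weight):
    """Return (score, labels) of the best path after interpolated rescoring.

    Assumption: node ids are topologically ordered, so every arc satisfies
    start < end. The interpolation formula is illustrative only.
    """
    incoming = defaultdict(list)
    for arc in arcs:
        if arc.start >= arc.end:
            raise ValueError("arcs must go from a lower to a higher node id")
        incoming[arc.end].append(arc)

    # best[node] holds the highest-scoring partial path that reaches node.
    best = {start_node: (0.0, [])}
    for node in sorted(incoming):
        candidates = []
        for arc in incoming[node]:
            if arc.start not in best:
                continue  # predecessor is unreachable from the start node
            prev_score, prev_labels = best[arc.start]
            combined = ((1.0 - weight) * arc.decoder_score
                        + weight * arc.knowledge_score)
            candidates.append((prev_score + combined, prev_labels + [arc.label]))
        if candidates:
            best[node] = max(candidates, key=lambda c: c[0])

    if final_node not in best:
        raise ValueError("final node is not reachable from the start node")
    return best[final_node]


# Toy lattice: two competing first arcs ("b" vs "p") followed by "ih".
example_arcs = [
    Arc(0, 1, "b", decoder_score=-10.0, knowledge_score=-4.0),
    Arc(0, 1, "p", decoder_score=-9.0, knowledge_score=-8.0),
    Arc(1, 2, "ih", decoder_score=-5.0, knowledge_score=-1.0),
]

baseline = rescore_best_path(example_arcs, 0, 2, weight=0.0)
rescored = rescore_best_path(example_arcs, 0, 2, weight=0.5)
print(baseline[1], rescored[1])
```

In this toy lattice the decoder alone prefers the arc labeled "p", while the interpolated score prefers "b" because the hypothetical knowledge score favors it. This mirrors, in a very simplified way, the idea of letting acoustic-phonetic evidence overturn a hypothesis selected by the baseline models.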