We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively; this initialization can aid optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business-search dataset demonstrate that CD-DNN-HMMs can significantly outperform conventional context-dependent Gaussian mixture model (GMM) HMMs, with absolute sentence accuracy improvements of 5.8% and 9.2% (relative error reductions of 16.0% and 23.2%) over CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
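In the hybrid architecture described above, the DNN's senone posteriors replace GMM likelihoods as HMM emission scores. Below is a minimal sketch in Python/NumPy, not the authors' code: a feed-forward network with sigmoid hidden layers and a senone softmax, followed by the standard hybrid-decoding conversion of posteriors p(s|x) into scaled likelihoods p(s|x)/p(s). The layer sizes, senone count, and uniform priors are illustrative assumptions, not values from the paper.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    class SenoneDNN:
        """Sketch of a DNN mapping acoustic features to senone posteriors."""
        def __init__(self, layer_sizes, seed=0):
            # layer_sizes, e.g. [429, 2048, 2048, 2048, 9000]:
            # stacked input frames -> hidden layers -> senone softmax.
            rng = np.random.default_rng(seed)
            self.weights = [rng.normal(0, 0.01, (m, n))
                            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
            self.biases = [np.zeros(n) for n in layer_sizes[1:]]

        def posteriors(self, x):
            h = x
            for W, b in zip(self.weights[:-1], self.biases[:-1]):
                h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden units
            return softmax(h @ self.weights[-1] + self.biases[-1])

    def scaled_log_likelihoods(post, senone_priors, eps=1e-10):
        # log p(x|s) = log p(s|x) - log p(s) + const; the constant log p(x)
        # is shared across senones and drops out of Viterbi decoding.
        return np.log(post + eps) - np.log(senone_priors + eps)

    # Usage: score one frame of hypothetical stacked features.
    dnn = SenoneDNN([429, 2048, 2048, 2048, 9000])
    frame = np.random.default_rng(1).normal(size=429)
    priors = np.full(9000, 1.0 / 9000)  # in practice, estimated from alignments
    emission_scores = scaled_log_likelihoods(dnn.posteriors(frame), priors)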
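The generative pre-training the abstract refers to trains each adjacent pair of layers as a restricted Boltzmann machine (RBM), typically with one-step contrastive divergence (CD-1), and the stacked weights then initialize the DNN before discriminative backpropagation. The sketch below shows one CD-1 update for a binary-binary RBM; the learning rate, batch size, and layer sizes are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=None):
        """One CD-1 update for a binary-binary RBM on a batch of visible vectors."""
        rng = rng or np.random.default_rng(0)
        # Positive phase: infer hidden units from the data, then sample them.
        h0_prob = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one Gibbs step (reconstruct visibles, re-infer hiddens).
        v1_prob = sigmoid(h0 @ W.T + b_vis)
        h1_prob = sigmoid(v1_prob @ W + b_hid)
        # Approximate log-likelihood gradient: data stats minus model stats.
        W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
        b_vis += lr * (v0 - v1_prob).mean(axis=0)
        b_hid += lr * (h0_prob - h1_prob).mean(axis=0)

    # Usage: pre-train one 64 -> 32 RBM layer on random binary "data".
    rng = np.random.default_rng(1)
    W = rng.normal(0, 0.01, (64, 32))
    b_vis, b_hid = np.zeros(64), np.zeros(32)
    batch = (rng.random((128, 64)) < 0.5).astype(float)
    for _ in range(10):
        cd1_update(batch, W, b_vis, b_hid, rng=rng)

After all layers are pre-trained this way, the RBM weights are copied into the feed-forward network and a softmax output layer over senones is added on top for supervised fine-tuning.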