Exploiting deep neural networks for detection-based speech recognition

Authors:
Sabato Marco Siniscalchi;Dong Yu;Li Deng;Chin-Hui Lee
Affiliations:
Faculty of Engineering and Architecture, Kore University of Enna, Cittadella Universitaria, Enna, Sicily, Italy and School of Electrical and Computer Engineering, Georgia Institute of Technology, ...;Speech Research Group, Microsoft Research, Redmond, WA, USA;Speech Research Group, Microsoft Research, Redmond, WA, USA;School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
Venue:
Neurocomputing
Year:
2013

Abstract

In recent years, deep neural networks (DNNs), that is, multilayer perceptrons (MLPs) with many hidden layers, have been successfully applied to several speech tasks, e.g., phoneme recognition, out-of-vocabulary word detection, and confidence measures. In this paper, we show that DNNs can be used to boost the classification accuracy of basic speech units, such as phonetic attributes (phonological features) and phonemes. This boosting leads to higher flexibility and has the potential to integrate both top-down and bottom-up knowledge into the Automatic Speech Attribute Transcription (ASAT) framework, a new family of lattice-based speech recognition systems grounded on accurate detection of speech attributes. We compare DNNs and shallow MLPs within the ASAT framework on the classification of phonetic attributes and phonemes. Several DNN architectures, ranging from five to seven hidden layers with up to 2048 hidden units per hidden layer, are presented and evaluated. Experimental evidence on the speaker-independent Wall Street Journal corpus demonstrates that DNNs achieve significant improvements over shallow MLPs with a single hidden layer, producing frame-level attribute estimation accuracies greater than 90% for all 21 phonetic features tested. A similar improvement is observed on the phoneme classification task, where DNNs reach a frame-level accuracy of 86.6%. This improved phoneme prediction accuracy, when integrated into a standard large vocabulary continuous speech recognition (LVCSR) system through a word lattice rescoring framework, yields word recognition accuracy that is better than previously reported word lattice rescoring results.
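
To make the terms "hidden layers" and "frame-level accuracy" concrete, the following is a minimal Python sketch of a deep MLP used as a frame classifier. It is an illustration under stated assumptions, not the authors' implementation: the sigmoid hidden units, the softmax output, the random untrained weights, the toy layer sizes, and the function names (build_dnn, forward, predict, frame_accuracy) are all hypothetical choices made here, and no training procedure is shown.

```python
import math
import random


def sigmoid(x):
    # Logistic activation; assumed here, the abstract does not name one.
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Numerically stable softmax producing class posteriors.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases (untrained, illustration only).
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def build_dnn(n_in, n_hidden, n_hidden_layers, n_out, seed=0):
    # A stack of n_hidden_layers hidden layers plus one output layer.
    rng = random.Random(seed)
    sizes = [n_in] + [n_hidden] * n_hidden_layers + [n_out]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(dnn, frame):
    # Map one acoustic feature frame to a posterior over classes.
    h = frame
    for layer in dnn[:-1]:
        h = [sigmoid(z) for z in affine(layer, h)]
    return softmax(affine(dnn[-1], h))


def predict(dnn, frame):
    # Index of the most probable class for this frame.
    posterior = forward(dnn, frame)
    return max(range(len(posterior)), key=posterior.__getitem__)


def frame_accuracy(dnn, frames, labels):
    # Fraction of frames whose predicted class matches the reference label.
    if not frames:
        return 0.0
    correct = sum(1 for f, y in zip(frames, labels) if predict(dnn, f) == y)
    return correct / len(frames)


if __name__ == "__main__":
    # Toy sizes; the paper reports five to seven hidden layers with up to
    # 2048 units each, which would be slow in pure Python.
    dnn = build_dnn(n_in=4, n_hidden=8, n_hidden_layers=5, n_out=3)
    frames = [[0.1, -0.2, 0.3, 0.0], [1.0, 0.5, -0.5, 0.2]]
    labels = [0, 2]
    print(frame_accuracy(dnn, frames, labels))
```

In this sketch a shallow MLP corresponds to n_hidden_layers=1, so the comparison described in the abstract amounts to varying that argument while scoring both models with the same frame-level accuracy measure.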