This article proposes and evaluates several methods for integrating bidirectional Long Short-Term Memory (BLSTM) temporal context modeling into a system for automatic speech recognition (ASR) in noisy and reverberated environments. Building on recent advances in Long Short-Term Memory architectures for ASR, we design a novel front-end for context-sensitive Tandem feature extraction and show how the Connectionist Temporal Classification (CTC) approach can serve as a BLSTM-based back-end, as an alternative to Hidden Markov Models (HMMs). We combine context-sensitive BLSTM-based feature generation and speech decoding techniques with source separation by convolutive non-negative matrix factorization (NMF). Applying our speaker-adapted multi-stream HMM framework, which processes MFCC features from NMF-enhanced speech as well as word predictions obtained via BLSTM networks and non-negative sparse classification (NSC), we achieve an average accuracy of 91.86% on the PASCAL CHiME Challenge task at signal-to-noise ratios ranging from -6 to 9 dB. To our knowledge, this is the best result reported for the CHiME Challenge task to date.
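To illustrate the source-separation component: the paper uses convolutive NMF with trained speech bases, but the core idea — factorizing a non-negative magnitude spectrogram V into bases W and activations H, then reconstructing an enhanced estimate from a subset of components — can be sketched with standard (non-convolutive) NMF via multiplicative updates. Everything below (the toy spectrogram, the rank, the split of components into "speech" vs. "noise") is a hypothetical illustration, not the paper's actual configuration.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Basic NMF minimizing squared error, V ~= W @ H, via multiplicative updates.
    V must be non-negative; W (bases) and H (activations) stay non-negative."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H

# Toy "magnitude spectrogram" (20 frequency bins x 50 frames) -- stand-in data.
rng = np.random.default_rng(1)
V = np.abs(rng.normal(size=(20, 50)))

W, H = nmf(V, rank=4)

# Enhancement sketch: if the first two components were known to model speech
# (e.g. bases pre-trained on clean speech), the enhanced spectrogram would be
# their partial reconstruction. The split chosen here is purely illustrative.
speech_est = W[:, :2] @ H[:2, :]
```

In the actual system, the speech bases are convolutive (each basis spans several consecutive frames) and are trained per speaker on clean data, so only the noise part is adapted to the observed mixture; the enhanced spectrogram then feeds MFCC extraction for the multi-stream HMM.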