Hybrid HMM/BLSTM-RNN for robust speech recognition
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
On the impact of children's emotional speech on acoustic and language models
EURASIP Journal on Audio, Speech, and Music Processing - Special issue on atypical speech
Affective speaker state analysis in the presence of reverberation
International Journal of Speech Technology
Tandem decoding of children's speech for keyword detection in a child-robot interaction scenario
ACM Transactions on Speech and Language Processing (TSLP)
Improving keyword spotting with a tandem BLSTM-DBN architecture
NOLISP'09 Proceedings of the 2009 international conference on Advances in Nonlinear Speech Processing
Keyword spotting exploiting Long Short-Term Memory
Speech Communication
LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework
Image and Vision Computing
In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks to incorporate contextual information into speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, the system can exploit both past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database but only on the TIMIT corpus of read speech. We show that our method outperforms a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been shown to outperform HMM-based techniques.