Systems for keyword and non-linguistic vocalization detection in conversational agent applications need to be robust with respect to background noise and different speaking styles. Focusing on the Sensitive Artificial Listener (SAL) scenario, which involves spontaneous, emotionally colored speech, this paper proposes a multi-stream model that applies the principle of Long Short-Term Memory to generate context-sensitive phoneme predictions for keyword detection. Further, we investigate the incorporation of noisy training material in order to create noise-robust acoustic models. We show that both strategies improve recognition performance when evaluated on spontaneous human-machine conversations as contained in the SEMAINE database.
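The Long Short-Term Memory principle mentioned above refers to gated recurrent cells whose internal state can carry context across many frames, which is what makes the phoneme predictions context-sensitive. The following is a minimal single-unit sketch of one LSTM forward step; the weights and input sequence are illustrative placeholders, not the paper's trained multi-stream model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One forward step of a single-input, single-unit LSTM cell.

    Input, forget, and output gates control how the cell state c
    accumulates context over the frame sequence (toy parameterization).
    """
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g        # cell state: long-term context memory
    h = o * math.tanh(c)          # hidden state passed to the next layer
    return h, c

# Hypothetical constant weights, purely for illustration.
weights = {k: 0.5 for k in
           ["wi", "ui", "bi", "wf", "uf", "bf",
            "wo", "uo", "bo", "wg", "ug", "bg"]}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:        # a toy sequence of acoustic frames
    h, c = lstm_step(x, h, c, weights)
```

In a keyword spotter, the hidden state `h` would feed a phoneme-posterior output layer at each frame, and a decoder would match posterior sequences against keyword phoneme strings; the gating above is what lets predictions at one frame depend on acoustic context many frames earlier.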