A multidimensional dynamic time warping algorithm for efficient multimodal fusion of asynchronous data streams

Authors:
Martin Wöllmer;Marc Al-Hames;Florian Eyben;Björn Schuller;Gerhard Rigoll
Affiliations:
Technische Universität München, Institute for Human-Machine Communication, 80290 München, Germany (all authors)
Venue:
Neurocomputing
Year:
2009

Abstract

To overcome the computational complexity of the asynchronous hidden Markov model (AHMM), we present a novel multidimensional dynamic time warping (DTW) algorithm for hybrid fusion of asynchronous data. We show that the multidimensional DTW concept requires significantly less decoding time than the AHMM while providing the same data fusion flexibility, so it can be applied in a wide range of real-time multimodal classification tasks. Because it optimally exploits mutual information during decoding even when the input streams are not synchronous, our algorithm outperforms late and early fusion techniques in a challenging bimodal speech and gesture fusion experiment.
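
For readers unfamiliar with DTW, the following is a minimal sketch of the classic DTW distance between two one-dimensional sequences. It is not the multidimensional fusion algorithm presented in the paper, whose details the abstract does not give. The function name `dtw_distance`, the absolute-difference local cost, and the three-way step pattern are illustrative assumptions.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic two-sequence DTW distance (illustrative sketch only)."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: absolute difference of the two samples.
            local = abs(a[i - 1] - b[j - 1])
            # Assumed step pattern: advance in a, advance in b, or advance in both.
            cost[i][j] = local + min(
                cost[i - 1][j],
                cost[i][j - 1],
                cost[i - 1][j - 1],
            )

    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` evaluates to `0.0`, because the warping path absorbs the repeated sample. This tolerance to local timing differences is what makes DTW a natural starting point for streams that are not synchronous.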