Super-human multi-talker speech recognition: A graphical modeling approach

Authors:
John R. Hershey;Steven J. Rennie;Peder A. Olsen;Trausti T. Kristjansson
Affiliations:
IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA;Google New York, 75 Ninth Avenue, New York, NY 10011, USA
Venue:
Computer Speech and Language
Year:
2010



Abstract

We present a system that can separate and recognize the simultaneous speech of two people recorded in a single channel. Applied to the monaural speech separation and recognition challenge, the system outperformed all other participants, including human listeners, with an overall recognition error rate of 21.6%, compared to the human error rate of 22.3%. The system consists of a speaker recognizer, a model-based speech separation module, and a speech recognizer. For the separation models we explored a range of speech models that incorporate different levels of constraints on temporal dynamics to help infer the source speech signals. The system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. For inference, we compare a 2-D Viterbi algorithm and two loopy belief-propagation algorithms. We show how belief propagation reduces the complexity of temporal inference from exponential to linear in the number of sources and the size of the language model. The best belief-propagation method results in nearly the same recognition error rate as exact inference.
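
To make the inference problem concrete, the following is a minimal sketch of exact joint (2-D) Viterbi decoding over two independent Markov chains that share one observation stream. It is not the authors' implementation: the function name `joint_viterbi`, the two-state toy chains, and the made-up likelihood table `obs` (standing in for a real acoustic interaction model) are all assumptions for illustration. The joint state space has K^2 entries for two sources with K states each, and K^N for N sources, which is the exponential cost that motivates approximate inference.

```python
import math


def joint_viterbi(log_init_a, log_init_b, log_trans_a, log_trans_b, log_obs):
    """Most likely joint state path for two independent Markov chains.

    log_obs[t][i][j] is the log-likelihood of frame t given that chain A
    is in state i and chain B is in state j (hypothetical toy model).
    Returns (best_log_score, [(i_0, j_0), ..., (i_T-1, j_T-1)]).
    """
    n_a, n_b = len(log_init_a), len(log_init_b)
    delta = [[log_init_a[i] + log_init_b[j] + log_obs[0][i][j]
              for j in range(n_b)] for i in range(n_a)]
    backptrs = []
    for t in range(1, len(log_obs)):
        # Stage 1: for each (new i, previous j), maximize over previous i.
        stage = [[max((delta[ip][jp] + log_trans_a[ip][i], ip)
                      for ip in range(n_a))
                  for jp in range(n_b)] for i in range(n_a)]
        # Stage 2: for each (new i, new j), maximize over previous j.
        new_delta = [[0.0] * n_b for _ in range(n_a)]
        ptr = [[None] * n_b for _ in range(n_a)]
        for i in range(n_a):
            for j in range(n_b):
                best, jp = max((stage[i][jp][0] + log_trans_b[jp][j], jp)
                               for jp in range(n_b))
                new_delta[i][j] = best + log_obs[t][i][j]
                ptr[i][j] = (stage[i][jp][1], jp)
        delta = new_delta
        backptrs.append(ptr)
    # Pick the best final joint state, then follow back-pointers.
    score, i, j = max((delta[i][j], i, j)
                      for i in range(n_a) for j in range(n_b))
    path = [(i, j)]
    for ptr in reversed(backptrs):
        i, j = ptr[i][j]
        path.append((i, j))
    path.reverse()
    return score, path


def log_table(rows):
    """Element-wise log of a 2-D table of positive numbers."""
    return [[math.log(p) for p in row] for row in rows]


# Toy problem: two "speakers" with two states each, three frames.
# All numbers below are made up for illustration.
init_a = [math.log(0.6), math.log(0.4)]
init_b = [math.log(0.5), math.log(0.5)]
trans_a = log_table([[0.7, 0.3], [0.2, 0.8]])
trans_b = log_table([[0.9, 0.1], [0.4, 0.6]])
obs = [log_table(frame) for frame in (
    [[0.9, 0.2], [0.3, 0.1]],
    [[0.1, 0.6], [0.4, 0.2]],
    [[0.2, 0.1], [0.3, 0.8]],
)]

score, path = joint_viterbi(init_a, init_b, trans_a, trans_b, obs)
print(score, path)
```

Because the transition model factorizes across the two chains, the maximization over the previous joint state can be split into two stages, one per chain, without changing the result. Each frame then costs on the order of K^3 operations for K states per chain, rather than the K^4 of a naive search over all pairs of joint states.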