Super-human multi-talker speech recognition: A graphical modeling approach
Computer Speech and Language
Trends and advances in speech recognition
IBM Journal of Research and Development
We address the problem of single-channel speech separation and recognition using loopy belief propagation in a way that enables efficient inference for an arbitrary number of speech sources. The graphical model consists of a set of N Markov chains, each of which represents a language model or grammar for a given speaker. A Gaussian mixture model with shared states is used to model the hidden acoustic signal for each grammar state of each source. The combination of sources is modeled in the log spectrum domain using non-linear interaction functions. Previously, temporal inference in such a model has been performed using an N-dimensional Viterbi algorithm that scales exponentially with the number of sources. In this paper, we describe a loopy message passing algorithm that scales linearly with language model size. The algorithm achieves human levels of performance, and is an order of magnitude faster than competitive systems for two speakers.
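To make two of the abstract's claims concrete, the following is a minimal sketch and not the authors' implementation. It assumes that, ignoring phase, source powers add in the linear domain, so the mixture log power is a log-sum of the source log powers. It also shows the max approximation, one commonly used non-linear interaction function; the abstract does not say which function the paper adopts. The function names and the state counts (`states_per_source`, `num_sources`) are hypothetical, and the counting function only illustrates why a joint N-dimensional Viterbi search grows exponentially with the number of sources.

```python
import math


def log_sum_interaction(log_powers):
    """Mixture log power, assuming source powers add in the linear domain.

    Computes log(sum_i exp(x_i)) with the maximum factored out for stability.
    """
    m = max(log_powers)
    return m + math.log(sum(math.exp(x - m) for x in log_powers))


def max_interaction(log_powers):
    """Max approximation: the loudest source dominates each frequency bin."""
    return max(log_powers)


def joint_state_count(states_per_source, num_sources):
    """Size of the joint state space searched by an N-dimensional Viterbi pass."""
    return states_per_source ** num_sources


def per_chain_state_count(states_per_source, num_sources):
    """Total states when each chain is handled separately (illustrative count only)."""
    return states_per_source * num_sources


if __name__ == "__main__":
    x = [2.0, -1.0, 0.5]  # hypothetical log powers of three sources in one bin
    exact = log_sum_interaction(x)
    approx = max_interaction(x)
    # The max model underestimates the log-sum by at most log(number of sources).
    print(f"log-sum: {exact:.4f}  max: {approx:.4f}  gap bound: {math.log(len(x)):.4f}")

    for n in (2, 3, 4):
        print(n, joint_state_count(500, n), per_chain_state_count(500, n))
```

The gap between the two interaction functions is bounded by the logarithm of the number of sources, which is one reason the max model is a popular simplification. The state counts are not a cost analysis of the paper's algorithm; they only show how quickly the joint state space grows compared with a per-chain count.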