A hearing-inspired approach for distant-microphone speech recognition in the presence of multiple sources

Authors:
Ning Ma; Jon Barker; Heidi Christensen; Phil Green
Affiliation (all authors):
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, UK
Venue:
Computer Speech and Language
Year:
2013

References (12):
Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication.
On noise masking for automatic missing data speech recognition: A survey and discussion. Computer Speech and Language.
Exploiting correlogram structure for robust speech recognition with multiple speech sources. Speech Communication.
Issues with uncertainty decoding for noise robust automatic speech recognition. Speech Communication.
A speech fragment approach to localising multiple speakers in reverberant environments. Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09).
Bounded conditional mean imputation with Gaussian mixture models: A reconstruction approach to partly occluded features. Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09).
Speech fragment decoding techniques for simultaneous speaker identification and speech recognition. Computer Speech and Language.
Mask classification for missing-feature reconstruction for robust speech recognition in unknown background noise. Speech Communication.
Blind Spatial Subtraction Array for Speech Enhancement in Noisy Environment. IEEE Transactions on Audio, Speech, and Language Processing.
Combining Speech Fragment Decoding and Adaptive Noise Floor Modeling. IEEE Transactions on Audio, Speech, and Language Processing.
The PASCAL CHiME speech separation and recognition challenge. Computer Speech and Language.

Abstract

This paper addresses the problem of speech recognition in reverberant multisource noise conditions using distant binaural microphones. Our scheme employs a two-stage fragment decoding approach inspired by Bregman's account of auditory scene analysis, in which innate primitive grouping 'rules' are balanced by learnt schema-driven processes. First, signal-level primitive grouping cues are used to split the acoustic mixture into local time-frequency fragments, each belonging to an individual sound source. Second, a hypothesis-driven stage uses statistical models to select the fragments belonging to the sound source of interest, simultaneously searching for the most probable speech/background segmentation and the corresponding acoustic model state sequence. The paper reports recent advances in combining adaptive noise floor modelling and binaural localisation cues within this framework. By integrating signal-level grouping cues with acoustic models of the target sound source in a probabilistic framework, the system can simultaneously separate the sound of interest from the mixture and recognise it, and it derives significant recognition performance benefits from the various grouping cue estimates despite their inherent unreliability in noisy conditions. Finally, the paper shows that fragment decoding can be used to perform missing data imputation, reconstructing a clean spectrogram that can be further processed and used as input to conventional ASR systems. The best-performing system achieves an average keyword recognition accuracy of 85.83% on the PASCAL CHiME Challenge task.
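The two-stage idea in the abstract (fragments proposed bottom-up, then a model-driven search over which fragments are speech) can be illustrated with a deliberately tiny Python sketch. It is not the authors' system, and several simplifications are assumptions of this example only: stage one is skipped and the fragments are given; each candidate "word model" is one Gaussian per time-frequency cell instead of an HMM with GMM states; the search enumerates all 2^K speech/background labellings instead of being embedded in Viterbi decoding; and background cells are scored by bounded marginalisation over [0, y], divided by the interval width (a normalisation chosen for this toy). The names `cell_score`, `fragment_decode` and `impute` are hypothetical. The last function illustrates bounded conditional mean imputation: each background cell is replaced by the model's expected speech energy given that it cannot exceed the observed mixture energy.

```python
import itertools
import math

LOG_2PI = math.log(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cell_score(y, mu, var, is_speech):
    """Log score of one time-frequency cell under a Gaussian speech model.

    Speech cells use the Gaussian log-density of the observed energy y.
    Background cells only bound the speech energy to [0, y]: the Gaussian is
    integrated over that interval and divided by its width (toy assumption).
    """
    if is_speech:
        return -0.5 * (LOG_2PI + math.log(var) + (y - mu) ** 2 / var)
    sd = math.sqrt(var)
    mass = norm_cdf((y - mu) / sd) - norm_cdf((0.0 - mu) / sd)
    return math.log(max(mass / max(y, 1e-12), 1e-300))


def fragment_decode(obs, fragments, models):
    """Exhaustive toy fragment decoder over all speech/background labellings.

    fragments: disjoint lists of (frame, channel) cells; uncovered cells are
    treated as background. models: name -> (means, variances), per cell.
    Returns (best score, model name, fragment labels, speech mask).
    """
    n_frames, n_chans = len(obs), len(obs[0])
    best = (-math.inf, None, None, None)
    for labels in itertools.product([False, True], repeat=len(fragments)):
        mask = [[False] * n_chans for _ in range(n_frames)]
        for cells, is_speech in zip(fragments, labels):
            for t, f in cells:
                mask[t][f] = is_speech
        for name, (means, variances) in models.items():
            score = sum(
                cell_score(y, means[t][f], variances[t][f], mask[t][f])
                for t, frame in enumerate(obs)
                for f, y in enumerate(frame)
            )
            if score > best[0]:
                best = (score, name, labels, mask)
    return best


def impute(obs, mask, means, variances):
    """Bounded conditional mean imputation: each background cell becomes
    E[x | 0 <= x <= y] under that cell's Gaussian (a truncated-normal mean)."""
    clean = []
    for t, frame in enumerate(obs):
        row = []
        for f, y in enumerate(frame):
            if mask[t][f]:
                row.append(y)  # speech cell: keep the observation
                continue
            mu, sd = means[t][f], math.sqrt(variances[t][f])
            a, b = (0.0 - mu) / sd, (y - mu) / sd
            mass = max(norm_cdf(b) - norm_cdf(a), 1e-300)
            row.append(mu + sd * (norm_pdf(a) - norm_pdf(b)) / mass)
        clean.append(row)
    return clean


# Toy example: 2 frames x 2 channels; fragment 0 is speech, fragment 1 noise.
obs = [[5.0, 8.0], [8.0, 5.0]]
fragments = [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]
ones = [[1.0, 1.0], [1.0, 1.0]]
models = {
    "yes": ([[5.0, 0.0], [0.0, 5.0]], ones),
    "no": ([[0.0, 5.0], [5.0, 0.0]], ones),
}
score, word, labels, mask = fragment_decode(obs, fragments, models)
print(word, labels)  # yes (True, False)
print(impute(obs, mask, *models[word]))
```

On this toy input the decoder labels the first fragment as speech and selects the "yes" model; the reconstructed spectrogram keeps the speech cells and fills each masked cell with a value between 0 and its observed energy.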