Mask estimation for missing data speech recognition based on statistics of binaural interaction

Authors:
S. Harding;J. Barker;G. J. Brown
Affiliations:
Dept. of Comput. Sci., Univ. of Sheffield, UK;-;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2006

Abstract

This paper describes a perceptually motivated computational auditory scene analysis (CASA) system that combines sound separation according to spatial location with the "missing data" approach for robust speech recognition in noise. Missing data time-frequency masks are created using probability distributions based on estimates of interaural time and level differences (ITD and ILD) for mixed utterances in reverberated conditions; these masks indicate which regions of the spectrum constitute reliable evidence of the target speech signal. A number of experiments compare the relative efficacy of the binaural cues when used individually and in combination. We also investigate the ability of the system to generalize to acoustic conditions not encountered during training. Performance on a continuous digit recognition task using this method is found to be good, even in a particularly challenging environment with three concurrent male talkers.
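
The mask-estimation idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that per-unit ITD and ILD estimates are already available, that training units carry a label saying whether the target dominated, and that the probability distributions are approximated by joint ITD/ILD histograms. The function names, the units (milliseconds and decibels), the bin edges, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_target_probability(itd, ild, target_dominant, itd_edges, ild_edges):
    """Estimate P(target dominant | ITD, ILD) for each histogram bin.

    itd, ild        : 1-D arrays of cue estimates from training units
                      (assumed units: ms and dB).
    target_dominant : 1-D boolean array, True where the target dominated.
    Bins with no training data get probability 0 (an assumption).
    """
    bins = [itd_edges, ild_edges]
    all_counts, _, _ = np.histogram2d(itd, ild, bins=bins)
    tgt_counts, _, _ = np.histogram2d(
        itd[target_dominant], ild[target_dominant], bins=bins
    )
    prob = np.zeros_like(all_counts)
    # Divide only where at least one training unit fell into the bin.
    np.divide(tgt_counts, all_counts, out=prob, where=all_counts > 0)
    return prob


def estimate_mask(itd, ild, prob, itd_edges, ild_edges, threshold=0.5):
    """Look up each time-frequency unit's cue pair and threshold the result.

    itd, ild : arrays of identical shape (e.g. channels x frames).
    Returns a boolean mask of the same shape; True marks units treated
    as reliable evidence of the target.
    """
    # digitize returns 1-based bin indices; clip keeps out-of-range
    # cue values in the nearest edge bin.
    i = np.clip(np.digitize(itd, itd_edges) - 1, 0, prob.shape[0] - 1)
    j = np.clip(np.digitize(ild, ild_edges) - 1, 0, prob.shape[1] - 1)
    return prob[i, j] > threshold
```

A binary threshold is only one way to turn the probabilities into a mask. The probabilities could equally be kept as soft reliability values, and a single-cue variant follows by replacing the two-dimensional histogram with a one-dimensional one over ITD or ILD alone.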