IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
A conventional automatic speech recognizer does not perform well in the presence of multiple sound sources, while human listeners are able to segregate and recognize a signal of interest through auditory scene analysis. We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting masks are used in an uncertainty decoding framework for automatic speech recognition. We evaluate our system on a speech separation challenge and show that our system yields substantial improvement over the baseline performance.
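The ideal binary mask described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes access to the premixing target and interference signals in some time-frequency representation (e.g. cochleagram or spectrogram magnitudes), and the helper name and the local criterion parameter `lc_db` are hypothetical.

```python
import numpy as np

def ideal_binary_mask(target_tf, interference_tf, lc_db=0.0):
    """Sketch of the ideal binary time-frequency (T-F) mask.

    A T-F unit is retained (mask = 1) if and only if the target energy
    exceeds the interference energy by more than lc_db decibels; with
    lc_db = 0 this is the "target stronger than interference" criterion
    stated in the abstract. Inputs are magnitude T-F representations of
    the premixing target and interference (hypothetical setup).
    """
    # Convert magnitudes to energies in dB; the small constant avoids log(0).
    target_db = 10.0 * np.log10(np.abs(target_tf) ** 2 + 1e-12)
    interf_db = 10.0 * np.log10(np.abs(interference_tf) ** 2 + 1e-12)
    # Retain a unit only where the local SNR exceeds the criterion.
    return (target_db - interf_db > lc_db).astype(np.uint8)
```

In practice the mask is applied elementwise to the mixture's T-F representation, so retained units pass the mixture through and all other units are zeroed out.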