A tandem algorithm for pitch estimation and voiced speech segregation

Authors:
Guoning Hu; DeLiang Wang
Affiliations:
The Ohio State University, Columbus, OH and AOL Truveo Video Search, San Francisco, CA; Department of Computer Science and Engineering and Center for Cognitive Science, The Ohio State University, Columbus, OH
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

Considerable effort has been made in computational auditory scene analysis (CASA) to segregate speech from monaural mixtures. The performance of current CASA systems on voiced speech segregation is limited by the lack of a robust algorithm for pitch estimation. We propose a tandem algorithm that performs pitch estimation of a target utterance and segregation of the voiced portions of target speech jointly and iteratively. The algorithm first obtains a rough estimate of target pitch and then uses this estimate to segregate target speech using harmonicity and temporal continuity. It then improves both pitch estimation and voiced speech segregation iteratively. Novel methods are proposed for performing segregation with a given pitch estimate and pitch determination with a given segregation. Systematic evaluation shows that the tandem algorithm extracts a majority of target speech without including much interference, and it performs substantially better than previous systems for either pitch extraction or voiced speech segregation.
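
The alternation described in the abstract (rough pitch estimate, segregation given that pitch, pitch re-estimation given that segregation, repeated until stable) can be illustrated with a toy sketch. This is not the authors' method: it assumes each time-frequency unit has already been reduced to a single dominant period in milliseconds, uses a per-frame median as a stand-in for pitch determination, a relative tolerance `tol` as a stand-in for the harmonicity test, and a 3-point median over frames as a stand-in for temporal continuity. All function names and parameters are hypothetical.

```python
from statistics import median


def estimate_pitch(periods, mask):
    """Per-frame pitch period: median of the channel periods labeled target.

    Hypothetical stand-in for pitch determination given a segregation.
    """
    pitch = []
    for frame, labels in zip(periods, mask):
        selected = [p for p, keep in zip(frame, labels) if keep]
        pitch.append(median(selected) if selected else None)
    return pitch


def smooth(pitch):
    """Toy temporal-continuity step: 3-point median over neighbouring frames."""
    out = []
    for i, p in enumerate(pitch):
        window = [q for q in pitch[max(0, i - 1): i + 2] if q is not None]
        out.append(median(window) if p is not None else None)
    return out


def segregate(periods, pitch, tol=0.05):
    """Label a channel as target when its period lies within tol of the pitch.

    Hypothetical stand-in for segregation given a pitch estimate.
    """
    mask = []
    for frame, p0 in zip(periods, pitch):
        if p0 is None:
            mask.append([False] * len(frame))
        else:
            mask.append([abs(p - p0) <= tol * p0 for p in frame])
    return mask


def tandem(periods, max_iter=10):
    """Alternate pitch estimation and segregation until the mask stops changing."""
    # Rough start: treat every unit as target, which yields a coarse pitch.
    mask = [[True] * len(frame) for frame in periods]
    pitch = []
    for _ in range(max_iter):
        pitch = smooth(estimate_pitch(periods, mask))
        new_mask = segregate(periods, pitch)
        if new_mask == mask:
            break
        mask = new_mask
    return pitch, mask


# Toy mixture: target units near 8 ms, interfering units near 5 ms.
toy_periods = [
    [8.0, 8.1, 5.0, 7.9, 5.1],
    [8.1, 8.0, 5.0, 8.2, 8.1],
    [8.2, 5.1, 8.1, 8.2, 5.0],
]
toy_pitch, toy_mask = tandem(toy_periods)
```

On this toy input the first pass gives a coarse pitch contour that is pulled down slightly by the interfering units; once those units are masked out, the second pass re-estimates the pitch from target units only and the mask no longer changes.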