This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at delays corresponding to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a 'speech fragment decoder', which uses 'missing data' techniques with clean speech models to search simultaneously for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported, which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments across different conditions, which leads to significantly better recognition accuracy.
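The two mechanisms summarised above, correlogram-based grouping and missing-data scoring, can be illustrated with a short sketch. The Python below is illustrative only and is not the paper's implementation: it assumes a cochleagram-style front end whose output is an array of band-passed signals per analysis frame, the channel grouping is a crude stand-in for the paper's tree-structure analysis (it simply clusters channels by their dominant autocorrelation lag), and the likelihood implements full marginalisation over unreliable dimensions under a diagonal Gaussian, one common form of the missing-data score. All names (correlogram, group_channels_by_pitch, missing_data_loglik, max_lag, tol) are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def correlogram(frame, max_lag=200):
    """Per-channel autocorrelation of one analysis frame.

    `frame` is assumed to be a (num_channels, num_samples) array of
    band-passed signals, with num_samples > max_lag (hypothetical layout).
    """
    num_channels, n = frame.shape
    acg = np.zeros((num_channels, max_lag))
    for c in range(num_channels):
        x = frame[c]
        for lag in range(max_lag):
            # Normalised inner product of the signal with a lagged copy.
            acg[c, lag] = np.dot(x[:n - lag], x[lag:]) / (n - lag)
    return acg


def group_channels_by_pitch(acg, min_lag=32, tol=2):
    """Group channels whose dominant autocorrelation peak falls on a
    shared lag; a simplification of the paper's pitch-based grouping."""
    peaks = min_lag + np.argmax(acg[:, min_lag:], axis=1)
    groups = {}
    for channel, lag in enumerate(peaks):
        # Merge channels whose peak lags agree within `tol` samples.
        key = next((k for k in groups if abs(k - lag) <= tol), int(lag))
        groups.setdefault(key, []).append(channel)
    return groups  # {pitch lag: [channel indices]} for this frame


def missing_data_loglik(obs, reliable_mask, mean, var):
    """Missing-data log-likelihood of one frame under a diagonal
    Gaussian state model: unreliable dimensions are marginalised out
    entirely (their integral is 1), so only reliable dimensions score."""
    r = reliable_mask.astype(bool)
    return norm.logpdf(obs[r], mean[r], np.sqrt(var[r])).sum()
```

In this sketch, the groups returned per frame would play the role of candidate fragments, and missing_data_loglik would be evaluated inside the decoder's search with each fragment hypothesised as either foreground (reliable) or background (unreliable).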