This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or the other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance that is produced by the interplay between energetic and informational masking. However, at signal-to-noise ratios around 0 dB the performance is generally quite poor. An analysis of the errors shows that a large number of target/masker confusions are being made. The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. This component is combined with the recognition system to produce significant improvements. When the target and masker utterances have the same gender, the recognition system has a performance at 0 dB equal to that of humans; in other conditions the error rate is roughly twice the human error rate.
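The core idea of fragment decoding — jointly choosing which spectro-temporal fragments belong to the target and scoring the remaining channels with missing-data (bounded marginalisation) likelihoods — can be illustrated with a toy sketch. The sketch below is not the authors' decoder: it uses a hypothetical single-state, per-channel Gaussian clean-speech model and an exhaustive search over fragment labellings (the real decoder couples this search with the word-sequence search inside a Viterbi decode). All function names, the fragment representation (lists of channel indices), and the model parameters are illustrative assumptions.

```python
import math
from itertools import product

def gauss_logpdf(x, mu, var):
    # Log-likelihood of a reliable (target-dominated) channel under the
    # toy clean-speech model.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def gauss_logcdf(x, mu, var):
    # Bounded marginalisation for a masked channel: the observed mixture
    # energy is an upper bound on the target energy, so we integrate the
    # model density from -inf up to the observation.
    p = 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))
    return math.log(max(p, 1e-300))

def fragment_score(obs, fragments, labels, mu, var):
    """Missing-data score of one spectral frame given a binary labelling
    of fragments (1 = target-dominated, 0 = masker-dominated)."""
    target_channels = set()
    for frag, lab in zip(fragments, labels):
        if lab == 1:
            target_channels.update(frag)
    score = 0.0
    for ch, x in enumerate(obs):
        if ch in target_channels:
            score += gauss_logpdf(x, mu[ch], var[ch])   # reliable evidence
        else:
            score += gauss_logcdf(x, mu[ch], var[ch])   # bounded/missing
    return score

def best_labelling(obs, fragments, mu, var):
    # Exhaustive search over the 2^n fragment labellings; feasible here
    # because a frame only carries a handful of fragments.
    return max(product([0, 1], repeat=len(fragments)),
               key=lambda labels: fragment_score(obs, fragments, labels, mu, var))

# Toy frame: channels 0-1 match the target model, channels 2-3 carry
# high masker energy that the clean model cannot explain directly.
obs = [5.0, 1.0, 9.0, 1.2]
fragments = [[0, 1], [2, 3]]          # two data-driven fragments
mu = [5.0, 1.0, 0.0, 0.0]             # hypothetical clean-target means
var = [0.01] * 4
print(best_labelling(obs, fragments, mu, var))  # -> (1, 0)
```

The decoder prefers labelling the first fragment as target (its channels fit the clean model well) and the second as masker, where bounded marginalisation explains the high observed energy cheaply; this is exactly the coupling between segregation and recognition that the abstract describes, reduced to one frame and one model state.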