Unsupervised learning of time-frequency patches as a noise-robust representation of speech

Authors:
Maarten Van Segbroeck; Hugo Van hamme
Affiliations:
Katholieke Universiteit Leuven, Dept. ESAT, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium; Katholieke Universiteit Leuven, Dept. ESAT, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Venue:
Speech Communication
Year:
2009

Abstract

We present a self-learning algorithm that uses a bottom-up approach to automatically discover, acquire and recognize the words of a language. First, an unsupervised technique based on non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed from static and dynamic speech features, using a spectral representation of both short and long acoustic events. Describing speech in terms of the discovered time-frequency patches yields patch activations, which express the extent to which each patch is present over time. We then show that speaker-independent patterns appear to recur in these patch activations and that they can be discovered by applying a second NMF-based algorithm to the co-occurrence counts of activation events. By providing information about word identity to the learning algorithm, the retrieved patterns can be associated with meaningful objects of the language. On a small-vocabulary task, the system learns patterns corresponding to words and subsequently detects the presence of these words in speech utterances. Unlike conventional automatic speech recognition, the learning algorithm requires no prior expert knowledge about speech, and we illustrate that it nevertheless achieves promising accuracy and noise robustness.
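
The first stage described above, decomposing speech into time-frequency patches and their activations, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the cost function, the patch width, the number of patches or the feature extraction, so everything here is an assumption made for illustration: the Euclidean cost with standard multiplicative updates, the helper names `stack_frames` and `nmf`, and a random non-negative matrix standing in for a real spectrogram.

```python
import numpy as np


def stack_frames(S, width):
    """Stack `width` consecutive frames of a (bands x frames) spectrogram
    into one column, so that each column covers a time-frequency patch.
    Hypothetical helper: the paper's actual input construction may differ."""
    n_bands, n_frames = S.shape
    cols = [S[:, t:t + width].reshape(-1) for t in range(n_frames - width + 1)]
    return np.stack(cols, axis=1)


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V as W @ H with multiplicative updates
    for the Euclidean cost (an assumed cost, not confirmed by the source).
    Columns of W play the role of patches, rows of H their activations."""
    rng = np.random.default_rng(seed)
    n_features, n_cols = V.shape
    W = rng.random((n_features, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Update activations, then patches; both stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic stand-in for a magnitude spectrogram: 8 bands, 40 frames.
S = np.abs(np.random.default_rng(1).normal(size=(8, 40)))
V = stack_frames(S, width=5)   # each column spans 5 frames x 8 bands
W, H = nmf(V, rank=4)          # 4 patches and their activations over time
```

In this sketch each row of `H` indicates how strongly the corresponding patch in `W` is present at each time position, which is the role the abstract assigns to patch activations. The second stage (NMF on co-occurrence counts of activation events) is not shown.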