Point process models for spotting keywords in continuous speech

Authors:
Aren Jansen;Partha Niyogi
Affiliations:
Department of Computer Science, The University of Chicago, Chicago, IL;Department of Computer Science, The University of Chicago, Chicago, IL
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2009

Abstract

We investigate the hypothesis that the linguistic content underlying human speech may be coded in the pattern of timings of various acoustic "events" (landmarks) in the speech signal. This hypothesis is supported by several strands of research in the fields of linguistics, speech perception, and neuroscience. In this paper, we put these scientific motivations to the test by formulating a point process-based computational framework for the task of spotting keywords in continuous speech. We find that even with a noisy and extremely sparse phonetic landmark-based point process representation, keywords can be spotted with accuracy levels comparable to recently studied hidden Markov model-based keyword spotting systems. We show that the performance of our keyword spotting system in the high-precision regime is better predicted by the median duration of the keyword than simply by the number of its constituent syllables or phonemes. When we are confronted with very few (in the extreme case, zero) examples of the keyword in question, we find that constructing a keyword detector from its component syllable detectors provides a viable approach.