Comparison of methods for language-dependent and language-independent query-by-example spoken term detection

  • Authors:
  • Javier Tejedor, Michal Fapšo, Igor Szöke, Jan “Honza” Černocký, František Grézl

  • Affiliations:
  • Universidad Autónoma de Madrid, Spain; Brno University of Technology, Czech Republic; Brno University of Technology, Czech Republic; Brno University of Technology, Czech Republic; Brno University of Technology, Czech Republic

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2012


Abstract

This article investigates query-by-example (QbE) spoken term detection (STD), in which the query is not entered as text but is selected from speech data or spoken. Two feature extractors based on neural networks (NN) are introduced: the first producing phone-state posteriors and the second making use of a compressive NN layer. They are combined with three different QbE detectors: while the Gaussian mixture model/hidden Markov model (GMM/HMM) and dynamic time warping (DTW) detectors both work on continuous feature vectors, the third one, based on weighted finite-state transducers (WFST), processes phone lattices. QbE STD is compared to two standard STD systems with text queries: acoustic keyword spotting and WFST-based search of phone strings in phone lattices. The results are reported on four languages (Czech, English, Hungarian, and Levantine Arabic) using standard metrics: equal error rate (EER) and two versions of the popular figure-of-merit (FOM). Language-dependent and language-independent cases are investigated; the latter is particularly interesting for scenarios lacking the standard resources needed to train speech recognition systems. In the language-dependent setup, either the DTW or the GMM/HMM approach produces the best results depending on the target language, while in the language-independent setup the GMM/HMM approach performs best. As far as WFSTs are concerned, they are promising because they allow for indexing and fast search.
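As a rough illustration of how a DTW-based QbE detector can compare a spoken query against a speech segment, the sketch below aligns two sequences of per-frame posterior feature vectors with the standard DTW recursion, using cosine distance between frames. This is a minimal, generic sketch, not the authors' implementation; the function name and the choice of cosine distance and length normalization are illustrative assumptions.

```python
import numpy as np

def dtw_distance(query, segment):
    """DTW cost between two posterior-feature sequences (illustrative sketch).

    query, segment: 2-D arrays of shape (frames, dims), e.g. per-frame
    phone-state posteriors. Returns the length-normalized cumulative
    cosine distance along the optimal alignment path.
    """
    n, m = len(query), len(segment)
    # Frame-level cosine distance matrix between all query/segment frame pairs
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = segment / np.linalg.norm(segment, axis=1, keepdims=True)
    dist = 1.0 - q @ s.T

    # Cumulative cost with the standard three-way DTW recursion
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # skip a query frame
                acc[i, j - 1],      # skip a segment frame
                acc[i - 1, j - 1],  # match the two frames
            )
    return acc[n, m] / (n + m)
```

In a detector of this kind, the query is typically slid over the utterance and each window's normalized DTW cost is thresholded to produce candidate hits; low cost indicates a likely occurrence of the spoken term.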