A comparison of grapheme and phoneme-based units for Spanish spoken term detection

Authors:
Javier Tejedor;Dong Wang;Joe Frankel;Simon King;José Colás
Affiliations:
Human Computer Technology Laboratory, Escuela Politécnica Superior UAM Avenue Francisco Tomás y Valiente 11, 28049, Spain and Centre for Speech Technology Research, University of Edinbur ...;Centre for Speech Technology Research, University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom;Centre for Speech Technology Research, University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom;Centre for Speech Technology Research, University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom;Human Computer Technology Laboratory, Escuela Politécnica Superior UAM Avenue Francisco Tomás y Valiente 11, 28049, Spain
Venue:
Speech Communication
Year:
2008



Abstract

The ever-increasing volume of audio data available online through the World Wide Web means that automatic methods for indexing and search are becoming essential. Hidden Markov model (HMM) keyword spotting and lattice search are the two approaches most commonly used by such systems. In keyword spotting, models or templates are defined for each search term before the speech is accessed, and are then used to find matches. Lattice search (referred to as spoken term detection) uses a pre-indexing of the speech data in terms of word or sub-word units, which can then be searched quickly for arbitrary terms without referring to the original audio. In both cases, the search term can be modelled in terms of sub-word units, typically phonemes. For in-vocabulary words (i.e. words that appear in the pronunciation dictionary), letter-to-sound conversion systems are accepted to work well. However, for out-of-vocabulary (OOV) search terms, letter-to-sound conversion must be used to generate a pronunciation for the search term. This is usually a hard decision (i.e. not probabilistic and with no possibility of backtracking), and errors introduced at this step are difficult to recover from. We therefore propose the direct use of graphemes (i.e. letter-based sub-word units) for acoustic modelling. This is expected to work particularly well in languages such as Spanish, where, although the letter-to-sound mapping is very regular, the correspondence is not one-to-one, so there are benefits to avoiding hard decisions at early stages of processing. In this article, we compare three approaches for Spanish keyword spotting or spoken term detection, and within each of these we compare acoustic modelling based on phone and grapheme units. Experiments were performed using the Spanish geographical-domain Albayzin corpus.
Results from the two approaches proposed for spoken term detection show that trigrapheme units for acoustic modelling match or exceed the performance of phone-based acoustic models. In the method proposed for keyword spotting, the results achieved with each acoustic model are very similar.