Web augmentation of language models for continuous speech recognition of SMS text messages

Authors:
Mathias Creutz; Sami Virpioja; Anna Kovaleva
Affiliations:
Nokia Research Center, Helsinki, Finland; Nokia Research Center, Helsinki, Finland and Helsinki University of Technology, Espoo, Finland; Nokia Research Center, Helsinki, Finland
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009

Abstract

In this paper, we present an efficient query selection algorithm for the retrieval of web text data to augment a statistical language model (LM). The number of retrieved relevant documents is optimized with respect to the number of queries submitted. The querying scheme is applied in the domain of SMS text messages. Continuous speech recognition experiments are conducted on three languages: English, Spanish, and French. The web data is used both for augmenting in-domain LMs in general and for adapting the LMs to a user-specific vocabulary. Word error rate reductions of up to 6.6% (in LM augmentation) and 26.0% (in LM adaptation) are obtained in setups where the size of the web mixture LM is limited to the size of the baseline in-domain LM.