Web resources for language modeling in conversational speech recognition

Authors:
Ivan Bulyko;Mari Ostendorf;Manhung Siu;Tim Ng;Andreas Stolcke;Özgür Çetin
Affiliations:
BBN Technologies, Cambridge, MA;University of Washington;Hong Kong University of Science and Technology;Hong Kong University of Science and Technology;SRI International and the International Computer Science Institute;International Computer Science Institute
Venue:
ACM Transactions on Speech and Language Processing (TSLP)
Year:
2007

Abstract

This article describes a methodology for collecting text from the Web to match a target sublanguage in both style (register) and topic. Unlike other work that estimates n-gram statistics from page counts, the approach here is to select and filter documents, which provides more control over the type of material contributing to the n-gram counts. The data can be used in a variety of ways; here, the different sources are combined in two types of mixture models. The experiments focus on conversational speech, where data collection can be quite costly, and demonstrate the positive impact of Web collections on several tasks with varying amounts of data, including Mandarin and English telephone conversations and English meetings and lectures.
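
The abstract does not say which two mixture types the article uses. As a rough illustration of the general idea of combining an in-domain source with a Web source, the sketch below shows ordinary linear interpolation of two unigram models, with weights estimated by EM on held-out text. The toy sentences, the add-alpha smoothing, and all function names are assumptions made for illustration only; they are not taken from the article.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def em_mixture_weights(models, heldout, iters=50):
    """Estimate linear-interpolation weights by EM on held-out tokens."""
    k = len(models)
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(iters):
        expected = [0.0] * k
        for w in heldout:
            # E-step: posterior probability that each source generated w
            joint = [weights[i] * models[i][w] for i in range(k)]
            z = sum(joint)
            for i in range(k):
                expected[i] += joint[i] / z
        # M-step: new weight is the average posterior over held-out tokens
        weights = [e / len(heldout) for e in expected]
    return weights


def perplexity(models, weights, tokens):
    """Perplexity of the interpolated model on a token sequence."""
    log_prob = sum(
        math.log(sum(lam * m[w] for lam, m in zip(weights, models)))
        for w in tokens
    )
    return math.exp(-log_prob / len(tokens))


# Hypothetical toy data: a small conversational sample and a Web-like sample.
in_domain = "yeah i think so um yeah you know i mean".split()
web_text = "i think you know the report said the market".split()
heldout = "yeah you know i think the market".split()
vocab = set(in_domain) | set(web_text) | set(heldout)

models = [unigram_model(in_domain, vocab), unigram_model(web_text, vocab)]
weights = em_mixture_weights(models, heldout)
print("weights:", weights)
print("mixture perplexity:", perplexity(models, weights, heldout))
```

In practice the component models would be higher-order n-gram models trained on much larger collections, and the held-out set would come from the target task; the weight-estimation loop stays the same.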