Text data acquisition for domain-specific language models

  • Authors:
  • Abhinav Sethy, Panayiotis G. Georgiou, Shrikanth Narayanan

  • Affiliations:
  • University of Southern California

  • Venue:
  • EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2006

Abstract

The language modeling community is showing growing interest in using large collections of text mined from the World Wide Web (WWW) to supplement sparse in-domain text resources. However, in most cases the style and content of the harvested text differ significantly from those of the target domain. In this paper we present a relative entropy (r.e.) based method to select relevant subsets of sentences whose distribution, in an n-gram sense, matches the domain of interest. Using simulations, we analyze how the proposed scheme outperforms the filtering techniques proposed in the recent language modeling literature on mining text from the web. A comparative study is presented using a text collection of over 800M words gathered from the WWW. Experimental results show that with the proposed subset selection scheme, using just 10% of the data, we achieve improvements in both Word Error Rate (WER) and Perplexity (PPL) over models built from the entire collection. The improved data selection also translated into a significant reduction in the vocabulary size as well as in the number of estimated parameters of the adapted language model.
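
The abstract's core idea, selecting web sentences whose n-gram distribution matches an in-domain distribution in a relative entropy sense, can be illustrated with a small sketch. The Python below is a simplified stand-in, not the paper's algorithm: it ranks candidates by unigram (rather than n-gram) relative entropy, scoring each sentence by how little it perturbs the in-domain distribution when pooled with it; the names rel_entropy and select_fraction, the additive smoothing constant, and the keep fraction are all assumptions of this sketch.

```python
from collections import Counter
import math

def rel_entropy(p_counts, q_counts, vocab, alpha=1e-3):
    # D(P || Q) over a shared vocabulary, with additive smoothing
    # so that unseen words never produce log(0).
    p_tot = sum(p_counts[w] for w in vocab) + alpha * len(vocab)
    q_tot = sum(q_counts[w] for w in vocab) + alpha * len(vocab)
    div = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_tot
        q = (q_counts[w] + alpha) / q_tot
        div += p * math.log(p / q)
    return div

def select_fraction(in_domain_sents, candidates, keep=0.1):
    # Score each candidate by the divergence between the in-domain
    # unigram distribution P and the distribution of (in-domain text
    # pooled with the candidate). Off-domain words dilute the pooled
    # distribution and raise the score; keep the best fraction.
    p = Counter(w for s in in_domain_sents for w in s.split())
    vocab = set(p) | {w for s in candidates for w in s.split()}
    scored = []
    for sent in candidates:
        q = p + Counter(sent.split())
        scored.append((rel_entropy(p, q, vocab), sent))
    scored.sort(key=lambda t: t[0])
    k = max(1, int(keep * len(candidates)))
    return [sent for _, sent in scored[:k]]

if __name__ == "__main__":
    domain = ["the patient reported chest pain",
              "the patient was given aspirin"]
    web = ["buy cheap designer watches online",
           "the patient complained of chest pain",
           "click here to claim your prize"]
    # Keeps the medically relevant sentence; the off-domain ones
    # score worse because they dilute the pooled distribution.
    print(select_fraction(domain, web, keep=0.34))
```

At the paper's scale (over 800M candidate words), rescoring every sentence against the full vocabulary this way would be impractical; a practical implementation would need incremental count and divergence updates and proper n-gram statistics.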