Selecting relevant text subsets from web-data for building topic specific language models

Authors:
Abhinav Sethy; Panayiotis G. Georgiou; Shrikanth Narayanan
Affiliations:
University of Southern California; University of Southern California; University of Southern California
Venue:
NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
Year:
2006

Abstract

In this paper we present a scheme for selecting relevant subsets of sentences from a large generic corpus, such as text acquired from the web. A relative entropy (R.E.) based criterion is used to incrementally select sentences whose distribution matches the domain of interest. Experimental results show that, using just 10% of the data, the proposed subset selection scheme yields significant improvements in both Word Error Rate (WER) and Perplexity (PPL) over models built from the entire web corpus. In addition, incremental data selection enables a significant reduction in both the vocabulary size and the number of n-grams in the adapted language model. To demonstrate the gains from our method, we provide a comparative analysis against a number of methods for cleaning up text proposed in the recent language modeling literature.
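
The abstract describes incremental selection under a relative entropy criterion but does not give the model order, the smoothing, or the exact acceptance rule. The following is a minimal sketch of the general idea, not the authors' implementation. It assumes a unigram model, add-alpha smoothing, and a single greedy pass that keeps a sentence only if it lowers the relative entropy between the in-domain distribution and the selected set. The names `kl_divergence`, `select_sentences`, and `alpha` are hypothetical.

```python
import math
from collections import Counter


def kl_divergence(p, counts, total, vocab_size, alpha=1.0):
    """Relative entropy D(p || q), where q is an add-alpha smoothed
    unigram model estimated from `counts` (assumed smoothing scheme)."""
    denom = total + alpha * vocab_size
    return sum(
        pw * math.log(pw * denom / (counts.get(w, 0) + alpha))
        for w, pw in p.items()
        if pw > 0
    )


def select_sentences(in_domain, candidates, alpha=1.0):
    """Greedy single pass: keep a candidate sentence only if adding it
    reduces the relative entropy to the in-domain unigram distribution."""
    domain_counts = Counter(w for s in in_domain for w in s.split())
    n = sum(domain_counts.values())
    p = {w: c / n for w, c in domain_counts.items()}

    # Shared vocabulary over in-domain and candidate text, used for smoothing.
    vocab = set(p)
    for s in candidates:
        vocab.update(s.split())
    vocab_size = len(vocab)

    selected = []
    counts = Counter()
    total = 0
    current = kl_divergence(p, counts, total, vocab_size, alpha)

    for s in candidates:
        words = s.split()
        if not words:
            continue
        trial_counts = counts + Counter(words)
        trial_total = total + len(words)
        trial = kl_divergence(p, trial_counts, trial_total, vocab_size, alpha)
        if trial < current:  # accept only if the match to the domain improves
            selected.append(s)
            counts, total, current = trial_counts, trial_total, trial
    return selected


if __name__ == "__main__":
    in_domain = ["the patient has a fever", "the doctor examined the patient"]
    web = [
        "the patient saw the doctor",
        "buy cheap watches online now",
        "the doctor has a fever",
    ]
    for sentence in select_sentences(in_domain, web):
        print(sentence)
```

Because a sentence made up entirely of out-of-domain words only adds probability mass where the in-domain distribution has none, it always raises the relative entropy and is rejected. That is one way to see how selection can shrink both the vocabulary and the n-gram count of the resulting model. A single greedy pass depends on the order of the candidates, so randomizing the order or making several passes is a natural variation on this sketch.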