Automatic web search query generation to create minority language corpora

Authors:
Rayid Ghani;Rosie Jones;Dunja Mladenic
Affiliations:
Carnegie Mellon Univ., Pittsburgh, PA;Carnegie Mellon Univ., Pittsburgh, PA;J. Stefan Inst., Slovenia and Carnegie Mellon Univ., Pittsburgh, PA
Venue:
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2001

Citing 3
Cited 4

Relevance feedback and inference networks

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Learning a monolingual language model from a multilingual text database

Proceedings of the ninth international conference on Information and knowledge management
Feature Selection for Unbalanced Class Distribution and Naive Bayes

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning

Modeling Information in Textual Data Combining Labeled and Unlabeled Data

Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery
Automatic retrieval of similar content using search engine query interface

Proceedings of the 18th ACM conference on Information and knowledge management
Medical query generation by term-category correlation

Information Processing and Management: an International Journal
Language identification for creating language-specific Twitter collections

LSM '12 Proceedings of the Second Workshop on Language in Social Media

Quantified Score

Hi-index	0.00

Visualization

Abstract

The Web is a valuable source of language specific resources but collecting, organizing and utilizing this information is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries to collect documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant and a subset of documents is used to generate new queries. We experiment with various query-generation methods and query-lengths to find inclusion/exclusion terms that are helpful for finding documents in the target language and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Applying the same approach to multiple languages show that our system generalizes to a variety of languages.