Building Minority Language Corpora by Learning to Generate Web Search Queries

Authors:
Rayid Ghani;Rosie Jones;Dunja Mladenic
Affiliations:
Accenture Technology Labs, 161 N. Clark St., 60601, Chicago, IL, USA;Carnegie Mellon University, 5000 Forbes Ave, 15213, Pittsburgh, PA, USA;J. Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Venue:
Knowledge and Information Systems
Year:
2005

Abstract

The Web is a source of valuable information, but collecting, organizing, and effectively using the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries to collect documents matching a minority concept. The concept used in this paper is that of text documents written in a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and this feedback is used to learn which query lengths and which inclusion/exclusion term-selection methods help find previously unseen documents in the target language. Our system learns to select good query terms using a variety of term-scoring methods. Odds ratio scores calculated over the acquired documents were among the most consistently accurate bases for query generation. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution, and we present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether it is initialized with a whole document or with a handful of words elicited from a user. We also present experiments applying the same approach to multiple languages, which show that it generalizes well across several languages regardless of the initial conditions.
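
The term-selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the CorpusBuilder implementation. The function names, the additive smoothing constant, the `+term`/`-term` query syntax, the length cap, and the toy documents are all assumptions made for illustration. It scores terms by a smoothed log odds ratio over documents already labeled relevant or nonrelevant, builds a query from the highest-scoring terms (inclusion) and lowest-scoring terms (exclusion), and draws a query length from a Gamma distribution.

```python
import math
import random
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Return a log odds ratio score per term.

    Each document is a list of tokens. The additive smoothing (an assumed
    value) keeps both probabilities strictly between 0 and 1.
    """
    # Document frequencies: count each term at most once per document.
    rel_df = Counter(t for doc in relevant_docs for t in set(doc))
    non_df = Counter(t for doc in nonrelevant_docs for t in set(doc))
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)

    scores = {}
    for term in set(rel_df) | set(non_df):
        p_rel = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        p_non = (non_df[term] + smoothing) / (n_non + 2 * smoothing)
        scores[term] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores


def build_query(scores, n_include, n_exclude):
    """Build a query string from the best and worst scoring terms.

    The "+term" / "-term" syntax is a hypothetical search-engine convention.
    Ties are broken alphabetically so the result is deterministic.
    """
    include = sorted(scores, key=lambda t: (-scores[t], t))[:n_include]
    candidates = sorted(scores, key=lambda t: (scores[t], t))
    exclude = [t for t in candidates if t not in include][:n_exclude]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


def sample_query_length(shape, scale, rng, max_len=10):
    """Draw a non-negative integer query length from a Gamma distribution.

    Rounding and the max_len cap are assumptions of this sketch.
    """
    return min(max_len, max(0, round(rng.gammavariate(shape, scale))))


# Toy data (hypothetical): target-language documents vs. other documents.
relevant = [["je", "in", "mesto"], ["je", "in", "ali"], ["je", "voda"]]
nonrelevant = [["the", "and", "city"], ["the", "and", "in"], ["the", "water"]]

scores = odds_ratio_scores(relevant, nonrelevant)
query = build_query(scores, n_include=1, n_exclude=1)
length = sample_query_length(shape=2.0, scale=1.5, rng=random.Random(0))
print(query, length)
```

In a feedback loop of the kind the abstract describes, the documents returned for each query would be labeled by the language filter and added to the relevant or nonrelevant set before the scores are recomputed. The two Gamma parameters stand in for a separate parameter per possible query length.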