Relevance feedback and inference networks
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Statistical methods for speech recognition
Statistical methods for speech recognition
Automatic discovery of language models for text databases
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
A Winnow-Based Approach to Context-Sensitive Spelling Correction
Machine Learning - Special issue on natural language learning
Finding related pages in the World Wide Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Document Categorization and Query Generation on the World Wide WebUsing WebACE
Artificial Intelligence Review - Special issue on data mining on the Internet
Learning a monolingual language model from a multilingual text database
Proceedings of the ninth international conference on Information and knowledge management
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Using Reinforcement Learning to Spider the Web Efficiently
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Feature Selection for Unbalanced Class Distribution and Naive Bayes
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Focused Crawling Using Context Graphs
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
On-line Algorithms in Machine Learning
Developments from a June 1996 seminar on Online algorithms: the state of the art
Improving Category Specific Web Search by Learning Query Modifications
SAINT '01 Proceedings of the 2001 Symposium on Applications and the Internet (SAINT 2001)
WebSail: From On-line Learning to Web Search
WISE '00 Proceedings of the First International Conference on Web Information Systems Engineering (WISE'00)-Volume 1 - Volume 1
The mathematics of statistical machine translation: parameter estimation
Computational Linguistics - Special issue on using large corpora: II
Mining the Web for bilingual text
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Using the web as a phonological corpus: a case study from Tagalog
WAC '06 Proceedings of the 2nd International Workshop on Web as Corpus
Usage of XSL stylesheets for the annotation of the Sámi language corpora
LAW '07 Proceedings of the Linguistic Annotation Workshop
Constructing a large scale text corpus based on the grid and trustworthiness
TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
TextGraphs-6 Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing
langid.py: an off-the-shelf language identification tool
ACL '12 Proceedings of the ACL 2012 System Demonstrations
Hi-index | 0.00 |
The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions.