The Web is a valuable source of language-specific resources, but collecting, organizing, and using these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries that collect documents in a minority language. It differs from pseudo-relevance feedback in that an automatic language classifier labels each retrieved document as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion and exclusion terms that help retrieve documents in the target language. We find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Finally, we present experiments applying the same approach to multiple languages, which show that it generalizes to a variety of languages.
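As a rough illustration of the odds-ratio idea mentioned in the abstract, the following Python sketch scores terms over documents already labeled relevant (target language) or irrelevant, then picks the highest-scoring terms for inclusion and the lowest-scoring terms for exclusion. It is a minimal sketch under stated assumptions, not the CorpusBuilder implementation: the whitespace tokenization, the 0.5 smoothing constant, the `+term`/`-term` query syntax, and the function names `doc_frequencies`, `odds_ratio_scores`, and `build_query` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def doc_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        # Assumption: lowercase whitespace tokenization stands in for
        # whatever tokenizer a real system would use.
        counts.update(set(doc.lower().split()))
    return counts


def odds_ratio_scores(relevant, irrelevant):
    """Log odds ratio of each term appearing in relevant vs. irrelevant docs."""
    rel_df = doc_frequencies(relevant)
    irr_df = doc_frequencies(irrelevant)
    scores = {}
    for term in set(rel_df) | set(irr_df):
        # Assumption: additive smoothing keeps both probabilities
        # strictly between 0 and 1 so the ratio is always defined.
        p_rel = (rel_df[term] + 0.5) / (len(relevant) + 1.0)
        p_irr = (irr_df[term] + 0.5) / (len(irrelevant) + 1.0)
        scores[term] = math.log(p_rel * (1 - p_irr) / ((1 - p_rel) * p_irr))
    return scores


def build_query(relevant, irrelevant, k=2):
    """Build a query from the k best inclusion and k best exclusion terms."""
    scores = odds_ratio_scores(relevant, irrelevant)
    # Ties are broken alphabetically so the result is deterministic.
    include = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    exclude = sorted(scores, key=lambda t: (scores[t], t))[:k]
    # Assumption: "+term" requires a term and "-term" excludes it.
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


if __name__ == "__main__":
    relevant = ["danes je lep dan", "lep dan je"]
    irrelevant = ["today is a nice day", "je suis ici"]
    print(build_query(relevant, irrelevant))
```

In a feedback loop of the kind the abstract describes, the query returned by `build_query` would be sent to a search engine, the retrieved documents labeled by the language classifier, and the two document lists extended before the next query is generated. The query length `k` corresponds to the number of inclusion and exclusion terms being varied in the experiments.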