Mining the web to create minority language corpora

Authors:
Rayid Ghani;Rosie Jones;Dunja Mladenić
Affiliations:
Carnegie Mellon Univ., and Accenture Technology Labs;Carnegie Mellon University, Pittsburgh, PA;J. Stefan Inst., Slovenia and Carnegie Mellon Univ.
Venue:
Proceedings of the tenth international conference on Information and knowledge management
Year:
2001

Abstract

The Web is a valuable source of language-specific resources, but collecting, organizing, and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion and exclusion terms that help retrieve documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far is one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents, and show that the methods perform similarly. Finally, we present experiments applying the same approach to multiple languages, showing that it generalizes to a variety of languages.
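
To make the odds-ratio idea concrete, the following is a minimal sketch of how inclusion and exclusion terms might be chosen from documents already labeled by a language classifier. It is an illustration under stated assumptions, not the CorpusBuilder implementation: the function names, the whitespace tokenizer, the add-one smoothing, the document-frequency estimate of word probabilities, and the `+term -term` query syntax are all hypothetical choices made here, and the toy documents are invented.

```python
import math
from collections import Counter


def doc_frequencies(docs):
    """Count, for each word, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        # Assumption: lowercase whitespace tokenization is good enough for a sketch.
        counts.update(set(doc.lower().split()))
    return counts


def odds_ratio_scores(relevant_docs, irrelevant_docs):
    """Log odds ratio of each word appearing in relevant vs. irrelevant documents."""
    pos = doc_frequencies(relevant_docs)
    neg = doc_frequencies(irrelevant_docs)
    n_pos, n_neg = len(relevant_docs), len(irrelevant_docs)
    scores = {}
    for word in set(pos) | set(neg):
        # Assumption: add-one smoothing keeps both probabilities strictly in (0, 1).
        p = (pos[word] + 1) / (n_pos + 2)
        q = (neg[word] + 1) / (n_neg + 2)
        scores[word] = math.log(p * (1 - q) / ((1 - p) * q))
    return scores


def build_query(relevant_docs, irrelevant_docs, length=1):
    """Pick `length` inclusion terms and `length` exclusion terms by odds ratio."""
    scores = odds_ratio_scores(relevant_docs, irrelevant_docs)
    # Highest scores favor the target language; ties are broken alphabetically.
    include = sorted(scores, key=lambda w: (-scores[w], w))[:length]
    # Lowest scores favor other languages; skip anything already included.
    exclude = [w for w in sorted(scores, key=lambda w: (scores[w], w))
               if w not in include][:length]
    # Assumption: the search engine accepts "+term" and "-term" operators.
    return " ".join(["+" + w for w in include] + ["-" + w for w in exclude])


# Invented toy data: documents labeled as target-language (relevant) or not.
relevant = ["dober dan svet", "dober vecer"]
irrelevant = ["good day world", "good evening"]
query = build_query(relevant, irrelevant, length=1)
print(query)
```

In a full loop, the query would be sent to a search engine, the returned documents labeled by the language classifier, and the scores recomputed over the enlarged document sets before generating the next query.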