Babouk: Focused Web Crawling for Corpus Compilation and Automatic Terminology Extraction

Authors:
Clement de Groc
Affiliations:
-
Venue:
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2011

Citing 4
Cited 1

Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
Adaptive on-line page importance computation

WWW '03 Proceedings of the 12th international conference on World Wide Web
Introduction to the special issue on the web as corpus

Computational Linguistics - Special issue on web as corpus

Neoclassical compound alignments from comparable corpora

CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part II

Quantified Score

Hi-index	0.00

Visualization

Abstract

The use of the World Wide Web as a free source for large linguistic resources is a well-established idea. Such resources are keystones to domains such as lexicon-based categorization, information retrieval, machine translation and information extraction. In this paper, we present an industrial focused web crawler for the automatic compilation of specialized corpora from the web. This application, created within the framework of the TTC project, is used daily by several linguists to bootstrap large thematic corpora which are then used to automatically generate bilingual terminologies.