In this paper we introduce the first version of noWaC, a large web-based corpus of Bokmål Norwegian that currently contains about 700 million tokens. The corpus was built by crawling, downloading and processing web documents in the .no top-level domain. The procedure used to collect noWaC is largely based on the techniques described by Ferraresi et al. (2008). In brief, a set of "seed" URLs pointing to documents in the target language is first collected by sending queries to commercial search engines (Google and Yahoo). The resulting seeds (6,900 URLs in total) are then used to start a crawl with the Heritrix web crawler, restricted to the .no domain. The downloaded documents are then processed to build a linguistic corpus; the steps include filtering by document size, language identification, and duplicate and near-duplicate detection.
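The post-processing steps named above can be illustrated with a minimal Python sketch. It is not the noWaC implementation: the size thresholds, the stopword-based language check, the word-trigram Jaccard comparison for near-duplicates, and all function names are assumptions chosen for illustration, since the abstract does not specify the methods or parameter values actually used.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical thresholds; the abstract does not state the values used for noWaC.
MIN_CHARS = 50
MAX_CHARS = 200_000
NEAR_DUP_THRESHOLD = 0.8

# Tiny illustrative stopword list, a stand-in for a real language identifier.
NORWEGIAN_STOPWORDS = {
    "og", "i", "det", "er", "som", "på", "en", "til", "av", "for", "ikke", "med",
}


def in_no_domain(url):
    """Return True if the URL's host is under the .no top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(".no")


def looks_norwegian(text, min_ratio=0.15):
    """Crude language check: share of tokens that are common Norwegian stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in NORWEGIAN_STOPWORDS)
    return hits / len(tokens) >= min_ratio


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a document."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_corpus(pages):
    """Filter an iterable of (url, text) pairs and return the pairs that are kept."""
    kept = []
    seen_hashes = set()   # exact-duplicate detection
    kept_shingles = []    # near-duplicate detection against kept documents
    for url, text in pages:
        if not in_no_domain(url):
            continue
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        if not looks_norwegian(text):
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        doc_shingles = shingles(text)
        if any(jaccard(doc_shingles, other) >= NEAR_DUP_THRESHOLD
               for other in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept_shingles.append(doc_shingles)
        kept.append((url, text))
    return kept
```

The pairwise near-duplicate comparison is quadratic in the number of kept documents, so it only suits small samples; at the scale of a 700-million-token corpus one would expect a fingerprinting or hashing scheme instead, and a trained language identifier in place of the stopword heuristic.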