NoWaC: a large web-based corpus for Norwegian

Authors: Emiliano Guevara
Affiliations: University of Oslo
Venue: WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Year: 2010

Abstract

In this paper we introduce the first version of noWaC, a large web-based corpus of Bokmål Norwegian that currently contains about 700 million tokens. The corpus was built by crawling, downloading and processing web documents in the .no top-level domain. The procedure used to collect noWaC is largely based on the techniques described by Ferraresi et al. (2008). In brief, a set of "seed" URLs pointing to documents in the target language is first collected by sending queries to commercial search engines (Google and Yahoo). The resulting seeds (6,900 URLs in total) are then used to start a crawl with the Heritrix web crawler, restricted to the .no domain. The downloaded documents are then processed in several steps to build a linguistic corpus (e.g. filtering by document size, language identification, and duplicate and near-duplicate detection).
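The post-processing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the size limits, the 5-token shingles and the 0.8 similarity threshold are all assumptions chosen for illustration, and plain Jaccard similarity over shingles stands in for whatever near-duplicate method the paper actually uses. Language identification is left out entirely.

```python
import hashlib
from urllib.parse import urlparse


def in_no_domain(url: str) -> bool:
    """True if the URL's host lies in the .no top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(".no")


def shingles(text: str, n: int = 5) -> set:
    """Set of n-token windows (shingles) over lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_documents(docs, min_bytes=5_000, max_bytes=200_000, threshold=0.8):
    """Keep (url, text) pairs that pass domain, size and duplicate checks.

    All default values are hypothetical, not taken from the paper.
    """
    kept, seen_hashes, kept_shingles = [], set(), []
    for url, text in docs:
        size = len(text.encode("utf-8"))
        if not in_no_domain(url) or not (min_bytes <= size <= max_bytes):
            continue
        # Exact duplicates: identical text yields an identical digest.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near duplicates: high shingle overlap with an already kept document.
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept_shingles.append(sh)
        kept.append((url, text))
    return kept
```

The pairwise comparison is quadratic in the number of kept documents, so it only suits small samples; a corpus of this size would need a scalable scheme such as fingerprint sampling or MinHash.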