NoWaC: a large web-based corpus for Norwegian

  • Authors: Emiliano Guevara

  • Affiliations: University of Oslo

  • Venue: WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop

  • Year: 2010

Abstract

In this paper we introduce the first version of noWaC, a large web-based corpus of Bokmål Norwegian that currently contains about 700 million tokens. The corpus was built by crawling, downloading and processing web documents in the .no top-level internet domain. The procedure used to collect noWaC is largely based on the techniques described by Ferraresi et al. (2008). In brief, a set of "seed" URLs pointing to documents in the target language is first collected by sending queries to commercial search engines (Google and Yahoo). The resulting seeds (6,900 URLs in total) are then used to start a crawling job with the Heritrix web crawler, restricted to the .no domain. Finally, the downloaded documents are processed in several steps to build a linguistic corpus (e.g. filtering by document size, language identification, and duplicate and near-duplicate detection).
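
As a rough illustration of the post-processing stage mentioned in the abstract, the sketch below filters documents by size and then drops near-duplicates by comparing word n-gram ("shingle") sets with Jaccard similarity. It is not the noWaC implementation: the function names, the byte limits, the shingle length and the similarity threshold are all assumptions chosen for the example, and language identification is left out.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) found in a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Too short for a full n-gram: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_documents(docs, min_bytes=5_000, max_bytes=200_000,
                     threshold=0.8, n=5):
    """Keep documents within a size range that are not near-duplicates.

    docs is an iterable of (url, text) pairs. All limits are illustrative
    assumptions, not values taken from the paper.
    """
    kept = []
    kept_shingles = []
    for url, text in docs:
        size = len(text.encode("utf-8"))
        if not (min_bytes <= size <= max_bytes):
            continue  # size filter: too small or too large
        sh = shingles(text, n)
        # Pairwise comparison is quadratic; fine for a sketch, too slow
        # for a corpus of web scale.
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        kept.append((url, text))
        kept_shingles.append(sh)
    return kept
```

Calling `filter_documents(pages)` on a list of `(url, text)` pairs returns the surviving pairs in their original order. A pipeline at the scale described here would need a scalable variant of the duplicate check (for instance hashing or sampling shingles) rather than the exhaustive comparison used above.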