EuroGOV: engineering a multilingual web corpus

  • Authors:
  • Börkur Sigurbjörnsson;Jaap Kamps;Maarten de Rijke

  • Affiliations:
  • ISLA, Faculty of Science, University of Amsterdam;ISLA, Faculty of Science, University of Amsterdam;ISLA, Faculty of Science, University of Amsterdam

  • Venue:
  • CLEF'05 Proceedings of the 6th international conference on Cross-Language Evaluation Forum: accessing Multilingual Information Repositories
  • Year:
  • 2005

Abstract

EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian governmental web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.