CUCWeb: a Catalan corpus built from the web

  • Authors:
  • G. Boleda; S. Bott; R. Meza; C. Castillo; T. Badia; V. López

  • Affiliations:
  • Universitat Pompeu Fabra, Barcelona, Spain (all six authors)

  • Venue:
  • WAC '06: Proceedings of the 2nd International Workshop on Web as Corpus
  • Year:
  • 2006

Abstract

This paper presents CUCWeb, a 166-million-word corpus of Catalan built by crawling the Web. The corpus has been annotated with NLP tools and made available to language users through a flexible web interface. The architecture is general enough to be reused to build corpora for other languages.