"The first million is hardest to get": building a large tagged corpus as automatically as possible

  • Authors: Gunnel Källgren
  • Affiliations: University of Stockholm, Stockholm, Sweden
  • Venue: COLING '90 Proceedings of the 13th conference on Computational linguistics - Volume 3
  • Year: 1990

Abstract

The paper describes a recently started project in Sweden. The goal of the project is to produce a corpus of at least one million words of running text from different genres, in which every word is classified for word class and for a set of morpho-syntactic properties. A set of methods and tools for automating the process is being developed and will be presented, and problems and some solutions in connection with, e.g., homography disambiguation will be discussed.