"The first million is hardest to get": building a large tagged corpus as automatically as possible

Authors:
Gunnel Källgren
Affiliations:
University of Stockholm, Stockholm, Sweden
Venue:
COLING '90 Proceedings of the 13th conference on Computational linguistics - Volume 3
Year:
1990

Abstract

The paper describes a recently started project in Sweden. The goal of the project is to produce a corpus of at least one million words of running text from different genres, in which every word is classified for word class and for a set of morpho-syntactic properties. A set of methods and tools for automating the process is being developed and will be presented. Problems that arise in, for example, homography disambiguation will be discussed, together with some solutions.