Automatic acquisition of a Slovak lexicon from a raw corpus

Authors:
Benoît Sagot
Affiliations:
INRIA-Rocquencourt, Projet Atoll, Le Chesnay Cedex, Rocquencourt, France
Venue:
TSD'05 Proceedings of the 8th international conference on Text, Speech and Dialogue
Year:
2005

Abstract

This paper presents an automatic methodology that we used in an experiment to acquire a morphological lexicon for Slovak, together with the lexicon we obtained. The methodology extends and refines approaches that have proven efficient, e.g., for the acquisition of French verbs or of Croatian and Russian nouns, adjectives and verbs. It relies only on a raw corpus and on a morphological description of the language. The underlying idea is to build all possible lemmas that can explain the words found in the corpus, according to the morphological description, and to rank these hypothetical lemmas by their likelihood given the corpus. Hand validation and iteration of the whole process are still needed to achieve a high-quality lexicon, but the human involvement required is orders of magnitude lower than the cost of developing such a resource entirely by hand. Moreover, the technique can easily be applied to other morphologically rich languages that lack large-coverage lexical resources.
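
The idea of generating candidate lemmas and ranking them can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy paradigms and their endings are hypothetical rather than an actual description of Slovak, the names `PARADIGMS`, `candidate_lemmas` and `rank_hypotheses` are invented here, and the score (the fraction of a paradigm's forms attested in the corpus) is a simple stand-in for the likelihood-based ranking mentioned in the abstract, not the paper's model.

```python
from collections import defaultdict

# Hypothetical toy morphological description: paradigm name -> endings.
# Illustrative only; not a real description of Slovak inflection.
PARADIGMS = {
    "fem_a": ["a", "y", "e", "u", "ou"],
    "masc": ["", "a", "u", "e", "om"],
}


def candidate_lemmas(word):
    """Yield every (stem, paradigm) hypothesis that could explain `word`."""
    for name, endings in PARADIGMS.items():
        for ending in endings:
            # Require a non-empty stem once the ending is stripped.
            if word.endswith(ending) and len(word) > len(ending):
                yield word[: len(word) - len(ending)], name


def rank_hypotheses(corpus_words):
    """Rank hypotheses by the share of their paradigm's forms seen in the corpus.

    This coverage ratio is an assumed stand-in for a likelihood score.
    """
    support = defaultdict(set)
    for word in set(corpus_words):
        for stem, name in candidate_lemmas(word):
            support[(stem, name)].add(word)

    scored = [
        (hypothesis, len(forms) / len(PARADIGMS[hypothesis[1]]))
        for hypothesis, forms in support.items()
    ]
    # Best-supported hypotheses first; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


if __name__ == "__main__":
    corpus = ["ryba", "ryby", "rybu", "rybou", "hrad", "hradom"]
    for (stem, paradigm), score in rank_hypotheses(corpus)[:3]:
        print(stem, paradigm, round(score, 2))
```

In this toy setting the top-ranked hypotheses are the ones a human validator would review first, which is where the reduction in manual effort would come from; a realistic system would also need to account for word frequencies and for competing hypotheses that explain the same forms.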