Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

Authors:
Bart Jongejan;Hercules Dalianis
Affiliations:
CST-University of Copenhagen, København S, Denmark;DSV, KTH - Stockholm University, Kista, Sweden and Euroling AB, Stockholm, Sweden
Venue:
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
Year:
2009

Abstract

We propose a method for automatically training lemmatization rules that handle prefix, infix and suffix changes when generating the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on full form-lemma pairs for Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish. Compared with a suffix-only lemmatizer, we obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish. Results for Icelandic deteriorated by 1.9 percent. We also report an observation on how the number of lemmatization rules produced depends on the number of training pairs.
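
To make the idea of rules that change prefixes, infixes and suffixes alike more concrete, the following is a minimal sketch of how one such rule could be applied. It is not the authors' implementation. The wildcard notation (`*` for an unchanged stretch of the word), the function names `apply_rule` and `lemmatize`, the flat first-match rule list, and the three example rules are all assumptions made for illustration; the abstract does not specify the rule format or how rules are selected.

```python
import re
from typing import Optional

def apply_rule(pattern: str, replacement: str, word: str) -> Optional[str]:
    """Apply one hypothetical wildcard rule; return None if it does not match.

    Each '*' in `pattern` matches an arbitrary substring that is copied,
    in order, into the corresponding '*' slot of `replacement`.
    """
    if pattern.count("*") != replacement.count("*"):
        raise ValueError("pattern and replacement need the same number of '*'")
    # Literal pieces are escaped; every '*' becomes a capturing group.
    regex = "(.*)".join(re.escape(piece) for piece in pattern.split("*"))
    match = re.fullmatch(regex, word)
    if match is None:
        return None
    captured = iter(match.groups())
    return "".join(next(captured) if ch == "*" else ch for ch in replacement)

def lemmatize(word: str, rules: list[tuple[str, str]]) -> str:
    """Return the result of the first matching rule, else the word unchanged.

    Assumption: a flat, ordered rule list stands in for whatever rule
    selection the trained lemmatizer really uses.
    """
    for pattern, replacement in rules:
        lemma = apply_rule(pattern, replacement, word)
        if lemma is not None:
            return lemma
    return word

# Hand-written illustrative rules, not rules produced by training.
RULES = [
    ("ge*t", "*en"),    # prefix and suffix change: gewerkt -> werken
    ("*ge*t", "**en"),  # infix and suffix change: aufgemacht -> aufmachen
    ("*ed", "*"),       # suffix-only change: walked -> walk
]

if __name__ == "__main__":
    for form in ["gewerkt", "aufgemacht", "walked", "house"]:
        print(form, "->", lemmatize(form, RULES))
```

A suffix-only lemmatizer corresponds to restricting every pattern to the shape `*suffix`; the second rule shows the kind of word-internal change that such a restriction cannot express.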