A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI)

Authors:
Anton Karl Ingason;Sigrún Helgadóttir;Hrafn Loftsson;Eiríkur Rögnvaldsson
Affiliations:
Department of Icelandic, University of Iceland, Reykjavik, Iceland;The Árni Magnússon Institute for Icelandic Studies, Reykjavik, Iceland;School of Computer Science, Reykjavik University, Reykjavik, Iceland;Department of Icelandic, University of Iceland, Reykjavik, Iceland
Venue:
GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing
Year:
2008

Abstract

We present a new mixed method lemmatizer for Icelandic, Lemmald, which achieves good performance by relying on IceTagger [1] for tagging and The Icelandic Frequency Dictionary [2] corpus for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we make use of a novel approach: Hierarchy of Linguistic Identities (HOLI), which involves organizing features and feature structures for the machine learning based on linguistic knowledge. Accuracy of the lemmatization is further improved using an add-on which connects to the Database of Modern Icelandic Inflections [3]. Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.