A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI)

  • Authors:
  • Anton Karl Ingason;Sigrún Helgadóttir;Hrafn Loftsson;Eiríkur Rögnvaldsson

  • Affiliations:
  • Department of Icelandic, University of Iceland, Reykjavik, Iceland;The Árni Magnússon Institute for Icelandic Studies, Reykjavik, Iceland;School of Computer Science, Reykjavik University, Reykjavik, Iceland;Department of Icelandic, University of Iceland, Reykjavik, Iceland

  • Venue:
  • GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing
  • Year:
  • 2008

Abstract

We present Lemmald, a new mixed-method lemmatizer for Icelandic. It relies on IceTagger [1] for tagging and is trained on the Icelandic Frequency Dictionary [2] corpus. The system combines data-driven machine learning with linguistic insight. To achieve this, we introduce a novel approach, the Hierarchy of Linguistic Identities (HOLI), which organizes the features and feature structures used for machine learning according to linguistic knowledge. Lemmatization accuracy is further improved by an add-on that connects to the Database of Modern Icelandic Inflections [3]. Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe the method can be fruitfully adapted to other morphologically rich languages.
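The abstract does not spell out the hierarchy or the learning procedure, so the following is only a rough illustration of the general idea, not Lemmald's actual implementation. It assumes a hypothetical ordering of feature bundles (longest word-form suffix plus full tag, then suffix plus word class, then tag alone), learns suffix-replacement rules from (form, tag, lemma) triples, and lets an optional lookup table stand in for the inflection-database add-on. All function names, the ordering, and the toy data are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Return (strip, add): drop `strip` from the end of form, then append `add`."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def identities(form, tag, max_suffix=3):
    """Yield feature bundles from most to least specific.

    This ordering is a hypothetical stand-in for a linguistically
    motivated hierarchy; the source does not define the real one.
    """
    for n in range(max_suffix, 0, -1):
        if len(form) >= n:
            yield ("suffix+tag", form[-n:], tag)
    for n in range(max_suffix, 0, -1):
        if len(form) >= n:
            yield ("suffix+class", form[-n:], tag[:1])
    yield ("tag", "", tag)


def train(examples):
    """Keep the most frequent suffix rule for every feature bundle seen."""
    counts = defaultdict(Counter)
    for form, tag, lemma in examples:
        rule = suffix_rule(form, lemma)
        for ident in identities(form, tag):
            counts[ident][rule] += 1
    return {ident: c.most_common(1)[0][0] for ident, c in counts.items()}


def lemmatize(form, tag, model, lexicon=None):
    """Consult the lexicon first, then back off through the hierarchy."""
    if lexicon and (form, tag) in lexicon:
        return lexicon[(form, tag)]
    for ident in identities(form, tag):
        rule = model.get(ident)
        if rule is not None:
            strip, add = rule
            if form.endswith(strip):
                return form[: len(form) - len(strip)] + add
    return form  # no applicable rule: return the form unchanged


# Toy training data with tag strings in the style of the corpus tagset
# (illustrative only).
examples = [
    ("hesti", "nkeþ", "hestur"),
    ("hestar", "nkfn", "hestur"),
    ("bátar", "nkfn", "bátur"),
]
model = train(examples)
lexicon = {("mönnum", "nkfþ"): "maður"}  # stand-in for a database lookup

print(lemmatize("fiskar", "nkfn", model))           # fiskur
print(lemmatize("mönnum", "nkfþ", model, lexicon))  # maður
```

In this sketch the unseen form "fiskar" finds no rule for the three-letter suffix "kar" and backs off to the two-letter suffix "ar" with the same tag, while the irregular form "mönnum" is resolved by the lookup table before any learned rule is tried.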