We present Lemmald, a new mixed-method lemmatizer for Icelandic, which achieves good performance by relying on IceTagger [1] for tagging and on the Icelandic Frequency Dictionary corpus [2] for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we use a novel approach, the Hierarchy of Linguistic Identities (HOLI), which organizes the features and feature structures used for machine learning according to linguistic knowledge. Lemmatization accuracy is further improved by an add-on that connects to the Database of Modern Icelandic Inflections [3]. Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.
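The abstract does not spell out how HOLI or the inflection-database add-on work internally, so the following is only a toy illustration of the general idea it describes, not Lemmald's actual algorithm. It learns suffix-rewrite rules from (form, tag, lemma) triples, backs off from more specific to less specific features, and lets a lexicon lookup override the learned rules. The class name `BackoffLemmatizer`, the helper functions, the feature ordering, and the tag label `NPL` are all assumptions made for this sketch; `NPL` is not an IceTagger tag.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Describe form -> lemma as (number of chars to strip, suffix to add)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])


def apply_rule(form, rule):
    """Apply a (strip, add) rule to a word form."""
    strip, add = rule
    return form[:len(form) - strip] + add


class BackoffLemmatizer:
    """Toy lemmatizer (hypothetical; not the Lemmald implementation)."""

    def __init__(self, max_suffix=3):
        self.max_suffix = max_suffix
        self.rules = defaultdict(Counter)  # feature key -> rule counts
        self.lexicon = {}                  # (form, tag) -> lemma

    def _keys(self, form, tag):
        # Assumed hierarchy: longest suffix + tag first, tag alone last.
        longest = min(self.max_suffix, len(form))
        keys = [("suffix+tag", form[-n:], tag) for n in range(longest, 0, -1)]
        keys.append(("tag", tag))
        return keys

    def train(self, triples):
        """Count, for every feature key, the rewrite rules seen with it."""
        for form, tag, lemma in triples:
            rule = suffix_rule(form, lemma)
            for key in self._keys(form, tag):
                self.rules[key][rule] += 1

    def add_lexicon(self, entries):
        """Register known (form, tag) -> lemma pairs that override the rules."""
        self.lexicon.update(entries)

    def lemmatize(self, form, tag):
        if (form, tag) in self.lexicon:
            return self.lexicon[(form, tag)]
        for key in self._keys(form, tag):
            if key in self.rules:
                rule = self.rules[key].most_common(1)[0][0]
                return apply_rule(form, rule)
        return form  # nothing learned for this tag: leave the form unchanged


lemmatizer = BackoffLemmatizer()
lemmatizer.train([("hestar", "NPL", "hestur"), ("hundar", "NPL", "hundur")])
print(lemmatizer.lemmatize("fiskar", "NPL"))  # generalizes via the "ar" suffix
lemmatizer.add_lexicon({("menn", "NPL"): "maður"})
print(lemmatizer.lemmatize("menn", "NPL"))    # irregular form, from the lexicon
```

In this sketch the lexicon plays the role that the abstract assigns to the inflection-database add-on: irregular forms such as "menn" cannot be recovered by suffix rules learned from regular nouns, so a direct lookup takes precedence over the learned rules.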