Comparison of different lemmatization approaches through the means of information retrieval performance

Authors:
Jakub Kanis;Lucie Skorkovská
Affiliations:
Univ. of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Pilsen, Czech Republic;Univ. of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Pilsen, Czech Republic
Venue:
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
Year:
2010



Abstract

This paper presents a quantitative performance analysis of two approaches to the lemmatization of Czech text data. The first is based on a manually prepared dictionary of lemmas and a set of derivation rules, while the second automatically infers the dictionary and the rules from training data. The comparison is made by evaluating the mean Generalized Average Precision (mGAP) achieved on lemmatized documents and search queries in a set of information retrieval (IR) experiments. Such a method is suitable for an efficient and rather reliable comparison of lemmatization performance, since correct lemmatization has proven to be crucial for IR effectiveness in highly inflected languages. Moreover, the proposed indirect comparison of the lemmatizers circumvents the need for manually lemmatized test data, which are hard to obtain and also suffer from incompatible sets of lemmas across different systems.
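
The indirect comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two lemmatizers, the three Czech documents, the query, and the relevance judgments are all hypothetical toy data, the retrieval model is a simple term-overlap count, and plain mean average precision stands in for mGAP (which, unlike this stand-in, can also reward partially correct results). The point is only that the same documents and queries are lemmatized by each system in turn, and the resulting retrieval scores are compared without any manually lemmatized reference text.

```python
from typing import Callable, Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Standard average precision of one ranked list (stand-in for GAP)."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def rank_documents(query: str, docs: Dict[str, str],
                   lemmatize: Callable[[str], str]) -> List[str]:
    """Rank documents by how many of their lemmas match a query lemma."""
    query_lemmas = {lemmatize(t) for t in query.lower().split()}
    scored = []
    for doc_id, text in docs.items():
        lemmas = [lemmatize(t) for t in text.lower().split()]
        score = sum(1 for lemma in lemmas if lemma in query_lemmas)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def mean_average_precision(queries: Dict[str, str],
                           relevant: Dict[str, Set[str]],
                           docs: Dict[str, str],
                           lemmatize: Callable[[str], str]) -> float:
    """Mean of the per-query average precision for one lemmatizer."""
    scores = [
        average_precision(rank_documents(q, docs, lemmatize), relevant[qid])
        for qid, q in queries.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical lemmatizers: A covers both inflected forms, B misses one.
LEMMAS_A = {"hradě": "hrad", "hrady": "hrad", "městě": "město"}
LEMMAS_B = {"hrady": "hrad", "městě": "město"}


def lemmatizer_a(token: str) -> str:
    return LEMMAS_A.get(token, token)


def lemmatizer_b(token: str) -> str:
    return LEMMAS_B.get(token, token)


# Toy collection, query, and relevance judgments (all invented).
DOCS = {
    "d1": "na hradě byla zima",
    "d2": "staré hrady stojí",
    "d3": "ve městě prší",
}
QUERIES = {"q1": "hrad"}
RELEVANT = {"q1": {"d1", "d2"}}

score_a = mean_average_precision(QUERIES, RELEVANT, DOCS, lemmatizer_a)
score_b = mean_average_precision(QUERIES, RELEVANT, DOCS, lemmatizer_b)
print(f"lemmatizer A: {score_a:.2f}")  # 1.00
print(f"lemmatizer B: {score_b:.2f}")  # 0.50
```

In this toy setting, lemmatizer B fails to map "hradě" to "hrad", so document d1 is never retrieved and the score drops from 1.00 to 0.50. The retrieval score thus reflects lemmatization quality even though no gold-standard lemmas were consulted.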