A dictionary- and corpus-independent statistical lemmatizer for information retrieval in low resource languages

  • Authors:
  • Aki Loponen; Kalervo Järvelin

  • Affiliations:
  • Department of Information Studies and Interactive Media, University of Tampere, Finland; Department of Information Studies and Interactive Media, University of Tampere, Finland

  • Venue:
  • CLEF'10: Proceedings of the 2010 International Conference on Multilingual and Multimodal Information Access Evaluation: Cross-Language Evaluation Forum
  • Year:
  • 2010

Abstract

We present StaLe, a dictionary- and corpus-independent statistical lemmatizer that addresses the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word form. StaLe can be applied with little effort to languages that lack linguistic resources. We evaluate StaLe both on lemmatization alone and as a component of an IR system, using several datasets and query types in four high-resource languages. StaLe is competitive: in the IR experiments it reaches 88-108% of the performance of a commercial lemmatizer used as the gold standard. It is also compact, efficient, and fast to apply to new languages.
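
To make the idea of generating candidate lemmas without a dictionary concrete, the following is a minimal sketch of suffix-rule-based candidate generation in general. It is not StaLe's actual rule set, scoring, or implementation, none of which the abstract specifies. The `SuffixRule` type, the toy English rules, their confidence values, and the `candidate_lemmas` function are all assumptions invented for illustration.

```python
from typing import Dict, List, NamedTuple


class SuffixRule(NamedTuple):
    """One hypothetical rewrite rule: strip `suffix`, then append `replacement`."""
    suffix: str
    replacement: str
    confidence: float  # made-up score; a real system might estimate it from training pairs


# Toy English rules, invented for this sketch only (not taken from the paper).
RULES = [
    SuffixRule("ies", "y", 0.9),
    SuffixRule("es", "", 0.6),
    SuffixRule("s", "", 0.8),
    SuffixRule("ing", "", 0.7),
    SuffixRule("ing", "e", 0.5),
]


def candidate_lemmas(word: str, rules=RULES, min_confidence: float = 0.0) -> List[str]:
    """Return candidate lemmas for `word`, highest-scoring first.

    No dictionary is consulted: every matching rule yields a candidate,
    so out-of-vocabulary words are treated like any other input.
    """
    # The input form itself is always kept as a lowest-priority fallback.
    scored: Dict[str, float] = {word: 0.0}
    for rule in rules:
        if rule.confidence < min_confidence or not word.endswith(rule.suffix):
            continue
        stem = word[: len(word) - len(rule.suffix)]
        if not stem:  # skip rules that would consume the whole word
            continue
        candidate = stem + rule.replacement
        # Keep the best score when several rules produce the same candidate.
        scored[candidate] = max(scored.get(candidate, 0.0), rule.confidence)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scored, key=lambda c: (-scored[c], c))


if __name__ == "__main__":
    print(candidate_lemmas("queries"))
    print(candidate_lemmas("making"))
```

In a sketch like this, several candidates are returned per word because, without a dictionary, nothing confirms which one is the true lemma. Raising the hypothetical `min_confidence` threshold trades recall for fewer spurious candidates.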