A dictionary- and corpus-independent statistical lemmatizer for information retrieval in low resource languages

Authors:
Aki Loponen; Kalervo Järvelin
Affiliations:
Department of Information Studies and Interactive Media, University of Tampere, Finland; Department of Information Studies and Interactive Media, University of Tampere, Finland
Venue:
CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
Year:
2010

Abstract

We present StaLe, a dictionary- and corpus-independent statistical lemmatizer that addresses the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word form. StaLe can be applied with little effort to languages lacking linguistic resources. We evaluate StaLe both in standalone lemmatization tasks and as a component of an IR system, using several datasets and query types in four high-resource languages. StaLe is competitive, reaching 88-108% of the gold-standard performance of a commercial lemmatizer in IR experiments. Beyond its competitive performance, it is compact, efficient, and fast to apply to new languages.
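
The abstract does not spell out how candidate lemmas are generated, so the following is only a minimal sketch of the general idea of dictionary-free candidate generation. It assumes a small table of suffix rewrite rules with confidence scores. The rule table, the scores, the fallback score, and the function name `candidate_lemmas` are all hypothetical illustrations and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical suffix rewrite rules: (inflected suffix, lemma suffix, confidence).
# Toy English examples for illustration only; not the rules used by StaLe.
RULES: List[Tuple[str, str, float]] = [
    ("ies", "y", 0.9),
    ("es", "", 0.6),
    ("s", "", 0.8),
    ("ing", "", 0.5),
    ("ing", "e", 0.4),
]

# Assumed low score for keeping the input word itself as a candidate.
FALLBACK_CONF = 0.1


def candidate_lemmas(word: str, min_conf: float = 0.0) -> List[Tuple[str, float]]:
    """Return (candidate, confidence) pairs sorted by descending confidence.

    No dictionary lookup is involved, so any input word yields candidates.
    The unchanged word is always included as a fallback candidate.
    """
    best = {word: FALLBACK_CONF}
    for old, new, conf in RULES:
        # Apply a rule only if the suffix matches and a non-empty stem remains.
        if conf >= min_conf and word.endswith(old) and len(word) > len(old):
            cand = word[: len(word) - len(old)] + new
            best[cand] = max(best.get(cand, 0.0), conf)
    # Highest confidence first; ties broken alphabetically for determinism.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for w in ("studies", "making"):
        print(w, candidate_lemmas(w))
```

In an IR setting, several top-scoring candidates could be indexed or added to the query rather than only the best one, trading some precision for robustness to unseen word forms. That choice is likewise an assumption of this sketch.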