Is a morphologically complex language really that complex in full-text retrieval?

Authors:
Kimmo Kettunen; Eija Airio
Affiliations:
Department of Information Studies, University of Tampere, Finland; Department of Information Studies, University of Tampere, Finland
Venue:
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Year:
2006

Abstract

In this paper we show that the keyword variation of a morphologically complex language, Finnish, can be handled effectively for IR purposes by generating only the textually most frequent forms of each keyword. In theory, a Finnish noun has about 2,000 different forms, but most of these forms occur rarely. Corpus statistics showed that about 84–88 per cent of the occurrences of inflected noun forms belong to only six of the 14 possible cases. Six cases in the singular and the plural give at most 2 × 6 = 12 variant forms per keyword, few enough to make it feasible to try them all in a search. The retrieval performance of this frequent-form coverage was tested with three to twelve variant forms per keyword in two test collections, TUTK and the Finnish material of CLEF 2003. The results show that, with nine and twelve variant forms, the frequent keyword form generation method competes well with the gold standard, lemmatization.
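
The idea of searching with a small set of generated forms can be illustrated with a minimal Python sketch. It is not the authors' implementation: the form table `FREQUENT_FORMS`, the functions `expand_keyword` and `search`, and the toy documents are all hypothetical. The six cases used here (nominative, genitive, partitive, inessive, elative, illative) and the singular-first ordering are illustrative assumptions, since the abstract does not name the cases or the order in which forms were added.

```python
# Hypothetical, hand-written form table. A real system would produce these
# forms with a morphological generator. The choice of six cases is an
# illustrative assumption, not taken from the abstract.
FREQUENT_FORMS = {
    "talo": {  # "house"
        "singular": ["talo", "talon", "taloa", "talossa", "talosta", "taloon"],
        "plural": ["talot", "talojen", "taloja", "taloissa", "taloista", "taloihin"],
    },
}


def expand_keyword(lemma, max_forms=12):
    """Return up to max_forms variant forms (assumed order: singular first)."""
    entry = FREQUENT_FORMS.get(lemma)
    if entry is None:
        return [lemma]  # unknown keyword: fall back to the form as typed
    forms = entry["singular"] + entry["plural"]
    return forms[:max_forms]


def search(documents, lemma, max_forms=12):
    """Return sorted ids of documents containing any generated variant form."""
    variants = set(expand_keyword(lemma, max_forms))
    hits = []
    for doc_id, text in documents.items():
        # Naive tokenization: split on whitespace, strip punctuation, lowercase.
        tokens = {t.strip(".,;:!?").lower() for t in text.split()}
        if tokens & variants:
            hits.append(doc_id)
    return sorted(hits)


# Toy collection indexed by inflected surface forms (no lemmatization).
DOCS = {
    1: "Vanha talo paloi.",
    2: "Asumme taloissa.",
    3: "Hän käveli talolle.",  # allative: outside the assumed six cases
    4: "Auto on kadulla.",
}

if __name__ == "__main__":
    print(search(DOCS, "talo"))               # all twelve variant forms
    print(search(DOCS, "talo", max_forms=3))  # only the first three forms
```

Document 3 shows the trade-off: a form outside the generated set is missed, which is the price of covering only the most frequent cases instead of lemmatizing the whole index.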