Restricted inflectional form generation in management of morphological keyword variation

Authors:
Kimmo Kettunen;Eija Airio;Kalervo Järvelin
Affiliations:
Department of Information Studies, University of Tampere, Tampere, Finland 33014;Department of Information Studies, University of Tampere, Tampere, Finland 33014;Department of Information Studies, University of Tampere, Tampere, Finland 33014
Venue:
Information Retrieval
Year:
2007


Abstract

Word form normalization through lemmatization or stemming is a standard procedure in information retrieval because morphological variation needs to be accounted for and several languages are morphologically non-trivial. Lemmatization is effective but often requires expensive resources. Stemming is also effective in most contexts, generally almost as good as lemmatization and typically much less expensive; it also has a query expansion effect. However, both approaches turn many inflectional word forms into a single lemma or stem, both in the database index and in queries, which means extra effort in creating database indexes. In this paper we take the opposite approach: we leave the database index un-normalized and enrich the queries to cover the surface form variation of keywords. A potential penalty of the approach would be long queries and slow processing. However, we show that covering only a very small number of the possible surface forms, even in morphologically complex languages, is enough to arrive at a performance that is almost as good as that delivered by stemming or lemmatization. Moreover, we show that, at least for typical test collections, it is enough to cover nouns and adjectives in queries. Furthermore, we show that our findings are particularly good for short queries that resemble normal searches of web users. Our approach is called FCG (for Frequent Case (form) Generation). It can be relatively easily implemented for Latin/Greek/Cyrillic alphabet languages by examining their (typically very skewed) nominal form statistics in a small text sample and by creating surface form generators for the 3-9 most frequent forms. We demonstrate the potential of our FCG approach for several languages of varying morphological complexity: Swedish, German, Russian, and Finnish in well-known test collections. Applications include in particular Web IR in languages poor in morphological resources.
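
The general idea of the abstract (rank case forms by frequency in a small sample, then generate only the top few forms for each query keyword) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the toy suffix table, and the hand-tagged sample are all hypothetical, and the plain suffix concatenation ignores stem alternation and vowel harmony, which a real generator for Finnish would have to handle.

```python
from collections import Counter

# Hypothetical toy suffix table for one Finnish noun type (assumption:
# the lemma needs no stem change, as is the case for "talo", 'house').
TOY_SUFFIXES = {
    "nominative": "",
    "genitive": "n",
    "partitive": "a",
    "inessive": "ssa",
}


def top_case_forms(tagged_sample, k):
    """Return the k most frequent case labels in a list of (word, case) pairs."""
    counts = Counter(case for _, case in tagged_sample)
    return [case for case, _ in counts.most_common(k)]


def generate_forms(lemma, cases, suffixes=TOY_SUFFIXES):
    """Generate surface forms of a lemma for the given cases by suffixing."""
    return [lemma + suffixes[case] for case in cases]


def expand_query(keywords, cases):
    """Map each query keyword to its generated surface forms."""
    return {kw: generate_forms(kw, cases) for kw in keywords}


# Hypothetical hand-tagged sample, for illustration only.
sample = [
    ("talo", "nominative"),
    ("talon", "genitive"),
    ("talossa", "inessive"),
    ("talon", "genitive"),
    ("talo", "nominative"),
    ("taloa", "partitive"),
    ("talo", "nominative"),
]

frequent_cases = top_case_forms(sample, 2)
expanded = expand_query(["talo"], frequent_cases)
print(frequent_cases)
print(expanded)
```

In this sketch the expanded forms would be matched against an un-normalized index as alternatives for the original keyword; how the alternatives are combined in the query language depends on the retrieval system and is left open here.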