Word form normalization through lemmatization or stemming is a standard procedure in information retrieval, because morphological variation needs to be accounted for and several languages are morphologically non-trivial. Lemmatization is effective but often requires expensive resources. Stemming is also effective in most contexts, generally almost as good as lemmatization and typically much less expensive; it also has a query expansion effect. In both approaches, however, the idea is to reduce many inflectional word forms to a single lemma or stem, both in the database index and in queries. This means extra effort in creating database indexes. In this paper we take the opposite approach: we leave the database index un-normalized and enrich the queries to cover the surface form variation of keywords. A potential penalty of this approach would be long queries and slow processing. However, we show that covering only a very small share of the possible surface forms is enough, even in morphologically complex languages, to reach a performance almost as good as that delivered by stemming or lemmatization. Moreover, we show that, at least for typical test collections, only nouns and adjectives in queries need to be covered. Furthermore, we show that the results are particularly good for short queries that resemble the normal searches of web users. Our approach is called FCG (Frequent Case (form) Generation). It can be implemented relatively easily for languages written in the Latin, Greek, or Cyrillic alphabets by examining their (typically very skewed) nominal form statistics in a small text sample and by creating surface form generators for the 3 to 9 most frequent forms. We demonstrate the potential of the FCG approach for several languages of varying morphological complexity (Swedish, German, Russian, and Finnish) in well-known test collections. Applications include, in particular, Web IR in languages that are poor in morphological resources.
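The query-side idea can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix table, the function names, and the sample counts are all hypothetical, and the generator simply appends a suffix, whereas a real generator would have to handle stem alternations. It assumes that a small text sample has already been annotated with case labels, picks the most frequent labels, and expands each query keyword into the corresponding surface forms for matching against an un-normalized index.

```python
from collections import Counter

# Hypothetical toy suffix table, for illustration only. Real surface form
# generators must handle stem changes that plain suffixing ignores.
TOY_SUFFIXES = {
    "nominative": "",
    "genitive": "n",
    "partitive": "a",
    "inessive": "ssa",
    "elative": "sta",
}


def most_frequent_forms(sample_labels, k):
    """Return the k most frequent case labels seen in a small text sample."""
    counts = Counter(sample_labels)
    return [label for label, _ in counts.most_common(k)]


def generate_forms(lemma, labels, suffixes=TOY_SUFFIXES):
    """Generate one surface form per selected case label by suffixing."""
    return [lemma + suffixes[label] for label in labels]


def expand_query(lemmas, labels):
    """Map each query keyword to the list of generated surface forms."""
    return {lemma: generate_forms(lemma, labels) for lemma in lemmas}


# Invented label counts standing in for skewed nominal form statistics.
sample = (
    ["nominative"] * 5
    + ["genitive"] * 4
    + ["partitive"] * 3
    + ["inessive"] * 2
    + ["elative"]
)

top_labels = most_frequent_forms(sample, 3)
expanded = expand_query(["talo"], top_labels)
print(expanded)
```

In this sketch the expanded forms would be combined disjunctively (for example, as synonyms) in the query, so the index itself needs no normalization; how the forms are combined and weighted is a design choice the sketch leaves open.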