Stemming and decompounding for German text retrieval

Authors:
Martin Braschler; Bärbel Ripplinger
Affiliations:
Eurospider Information Technology AG, Zürich, Switzerland and Université de Neuchâtel, Institut Interfacultaire d'Informatique, Neuchâtel, Switzerland; Eurospider Information Technology AG, Zürich, Switzerland
Venue:
ECIR'03 Proceedings of the 25th European conference on IR research
Year:
2003

Abstract

The stemming problem, i.e., finding a common stem for different forms of a term, has been studied extensively for English, but considerably less is known about other languages. It has previously been claimed that stemming is essential for highly declensional languages. We report on stemming experiments for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Studies on stemming, for any language, rarely cover more than one or two approaches. This paper makes a major contribution that extends beyond its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when a simple approach is used, and that carefully designed decompounding (the splitting of compound words) markedly boosts performance. All findings are based on a thorough analysis using a large, reliable test collection.
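
To make the notion of decompounding concrete, the following is a minimal sketch of a dictionary-based compound splitter. It is not the method evaluated in the paper, which the abstract does not specify. The toy lexicon, the list of linking elements, the minimum part length, and the function name split_compound are all assumptions chosen for illustration.

```python
from typing import List, Optional, Set

# Toy lexicon (assumption): a real system would use a large German word list.
LEXICON: Set[str] = {"bund", "kanzler", "amt", "wahl", "kampf", "haus"}

# Common German linking elements that may sit between compound parts.
# The empty string covers compounds joined without any linker.
LINKERS = ("", "s", "es", "n", "en")


def split_compound(
    word: str, lexicon: Set[str] = LEXICON, min_len: int = 3
) -> Optional[List[str]]:
    """Split `word` into lexicon entries, allowing linking elements.

    Returns the list of parts if the whole word can be covered,
    or None if no complete split is found.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try every prefix of at least `min_len` characters as the first part.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        # Optionally skip a linking element before the next part.
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if len(tail) < min_len:
                continue
            parts = split_compound(tail, lexicon, min_len)
            if parts is not None:
                return [head] + parts
    return None


if __name__ == "__main__":
    for w in ("Bundeskanzleramt", "Wahlkampf", "Kanzler", "Xylophon"):
        print(w, split_compound(w))
```

With this toy lexicon, "Bundeskanzleramt" is split into "bund", "kanzler", and "amt" (skipping the linker "es"), while a word with no lexicon coverage yields None. A sketch like this takes the first split it finds, so it can over-split or choose a poor segmentation on real vocabulary, which is one reason the abstract stresses that decompounding must be carefully designed.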