Indexing and stemming approaches for the Czech language

Authors:
Ljiljana Dolamic;Jacques Savoy
Affiliations:
Computer Science Department, University of Neuchatel, 2009 Neuchâtel, Switzerland;Computer Science Department, University of Neuchatel, 2009 Neuchâtel, Switzerland
Venue:
Information Processing and Management: an International Journal
Year:
2009

Abstract

This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we have designed and evaluated two stemming approaches, a light one and a more aggressive one. We have compared them with a no-stemming scheme as well as a language-independent approach (n-gram). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM) and the classical tf-idf vector-space approach. We found that the Divergence from Randomness paradigm tends to yield better retrieval effectiveness than the Okapi, LM or tf-idf models; however, the performance differences were statistically significant only with the last two IR approaches. Ignoring stemming generally reduces the mean average precision (MAP) by more than 40%, and these differences are always significant. Finally, although our more aggressive stemmer tends to show the best performance, its performance differences from the light stemmer are not statistically significant.
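As an illustration of two notions used above, the following minimal Python sketch shows one way to produce overlapping character n-grams as language-independent indexing units, and one way to compute MAP over a set of ranked result lists. It is not the authors' implementation: the function names, the default n = 4, the rule of keeping short words whole, and the toy document identifiers are all assumptions made for this example.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Return the overlapping character n-grams of a word.

    Assumption for this sketch: words no longer than n are kept whole.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against the relevant document ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the Czech test collection.
    print(char_ngrams("praha"))
    runs = [
        (["d1", "d2", "d3"], {"d1", "d3"}),
        (["d2", "d1"], {"d1"}),
    ]
    print(round(mean_average_precision(runs), 4))
```

Under this definition, a relative MAP drop of more than 40% means that the no-stemming run scores below 0.6 times the MAP of the stemmed run on the same queries.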