Don't have a stemmer?: be un+concern+ed

Authors:
Paul McNamee; Charles Nicholas; James Mayfield
Affiliations:
Johns Hopkins University, Baltimore, MD, USA; University of Maryland Baltimore County, Baltimore, MD, USA; Johns Hopkins University, Baltimore, MD, USA
Venue:
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2008

References (4):
Viewing morphology as an inference process. SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval.
Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems.
Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval.
Unsupervised discovery of morphemes. MPL '02 Proceedings of the ACL-02 workshop on Morphological and phonological learning - Volume 6.

Cited by (3):
JHU ad hoc experiments at CLEF 2008. CLEF'08 Proceedings of the 9th Cross-language evaluation forum conference on Evaluating systems for multilingual and multimodal information access.
Query-based text normalization selection models for enhanced retrieval accuracy. SS '10 Proceedings of the NAACL HLT 2010 Workshop on Semantic Search.
EMMA: a novel Evaluation Metric for Morphological Analysis. COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics.

Abstract

The choice of indexing terms used to represent documents crucially determines how effective subsequent retrieval will be. IR systems commonly use rule-based stemmers to normalize surface word forms to combat the problem of not finding documents that contain words related to query terms by inflectional or derivational morphology. But such stemmers are not available in all languages. In this paper we explore the effectiveness of unsupervised morphological segmentation as an alternative to stemming using test sets in thirteen European languages. We find that unsupervised segmentation is significantly better than unnormalized words, in several cases by more than 20%. However, rule-based stemming, if available, is better in low complexity languages. We also compare these methods to the use of character n-grams, finding that on average n-grams yield the best performance.
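To illustrate why character n-grams can stand in for a stemmer, the following is a minimal sketch of n-gram tokenization. It is not the authors' implementation: the function name `char_ngrams`, the default length `n=4`, and the choice to split on whitespace and keep short words whole are all assumptions made for this example.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of each word in text.

    Hypothetical helper for illustration; n=4 is an assumed default.
    Words of length n or less are kept whole.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            # Slide a window of width n across the word.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


# Two morphologically related surface forms share several index terms
# even though no stemmer or segmenter was applied.
query_terms = set(char_ngrams("unconcerned"))
doc_terms = set(char_ngrams("concerns"))
shared = sorted(query_terms & doc_terms)
print(shared)  # ['cern', 'conc', 'ncer', 'once']
```

Because the two words overlap in four n-grams, a query containing one form can still match a document containing the other, which is the effect a stemmer achieves by mapping both to a common stem.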