Don't have a stemmer?: be un+concern+ed

  • Authors:
  • Paul McNamee;Charles Nicholas;James Mayfield

  • Affiliations:
  • Johns Hopkins University, Baltimore, MD, USA;University of Maryland Baltimore County, Baltimore, MD, USA;Johns Hopkins University, Baltimore, MD, USA

  • Venue:
  • Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

The choice of indexing terms used to represent documents crucially determines how e ective subsequent retrieval will be. IR systems commonly use rule-based stemmers to normalize surface word forms to combat the problem of not finding documents that contain words related to query terms by inflectional or derivational morphology. But such stemmers are not available in all languages. In this paper we explore the effectiveness of unsupervised morphological segmentation as an alternative to stemming using test sets in thirteen European languages. We find that unsupervised segmentation is significantly better than unnormalized words, in several cases by more than 20%. However, rule-based stemming, if available, is better in low complexity languages. We also compare these methods to the use of character n-grams, finding that on average n-grams yield the best performance.