Viewing morphology as an inference process
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Stemming algorithms: a case study for detailed evaluation
Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems
Character N-Gram Tokenization for European Language Text Retrieval
Information Retrieval
Unsupervised discovery of morphemes
MPL '02 Proceedings of the ACL-02 workshop on Morphological and phonological learning - Volume 6
JHU ad hoc experiments at CLEF 2008
CLEF'08 Proceedings of the 9th Cross-language evaluation forum conference on Evaluating systems for multilingual and multimodal information access
Query-based text normalization selection models for enhanced retrieval accuracy
SS '10 Proceedings of the NAACL HLT 2010 Workshop on Semantic Search
EMMA: a novel Evaluation Metric for Morphological Analysis
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Hi-index | 0.00 |
The choice of indexing terms used to represent documents crucially determines how e ective subsequent retrieval will be. IR systems commonly use rule-based stemmers to normalize surface word forms to combat the problem of not finding documents that contain words related to query terms by inflectional or derivational morphology. But such stemmers are not available in all languages. In this paper we explore the effectiveness of unsupervised morphological segmentation as an alternative to stemming using test sets in thirteen European languages. We find that unsupervised segmentation is significantly better than unnormalized words, in several cases by more than 20%. However, rule-based stemming, if available, is better in low complexity languages. We also compare these methods to the use of character n-grams, finding that on average n-grams yield the best performance.