SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Diachronic stemmed corpus and dictionary of Galician language
CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing
The paper describes a word stemming algorithm for the Spanish language. Document retrieval experiments on English text suggest that word stemming based on morphological analysis does not generally or consistently outperform ad hoc, hand-tuned algorithms such as the one proposed by M. Porter (1980). It is difficult to produce a Porter-style algorithm for a Romance language such as Spanish, however, because of its greater grammatical complexity and because inflection often changes the root of a word, not just its ending (as is mostly the case in English). In general terms, the difficulty lies in producing an algorithm that can cope with the additional complexity of Spanish morphology whilst preserving the simplicity of a Porter-style algorithm. One such algorithm is presented. It combines dictionary look-ups with some 300 stemming and intermediate reduction rules.
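The general shape of such a hybrid stemmer can be illustrated with a minimal sketch. It assumes a small exception dictionary for irregular forms whose root changes, followed by an ordered list of suffix-stripping rules with a minimum stem length. The names `EXCEPTIONS`, `RULES` and `stem`, and every entry in the two tables, are hypothetical and chosen only for illustration; they are not the paper's dictionary or its roughly 300 rules, and the intermediate reduction step is not modelled.

```python
# Hypothetical exception dictionary: irregular forms whose root changes
# under inflection, so suffix stripping alone cannot recover the stem.
EXCEPTIONS = {
    "tuvo": "ten",
    "tienen": "ten",
    "tengo": "ten",
}

# Hypothetical ordered rules: (suffix, replacement, minimum stem length).
# Longer suffixes come first so they are tried before their shorter tails.
RULES = [
    ("aciones", "", 3),
    ("mente", "", 3),
    ("iendo", "", 2),
    ("ando", "", 2),
    ("es", "", 3),
    ("s", "", 3),
    ("ar", "", 2),
    ("er", "", 2),
    ("ir", "", 2),
]


def stem(word):
    """Return an illustrative stem: dictionary look-up first, then rules."""
    word = word.lower()
    # Step 1: dictionary look-up handles root-changing irregular forms.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # Step 2: apply the first suffix rule that leaves a long enough stem.
    for suffix, replacement, min_len in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)]
            if len(candidate) >= min_len:
                return candidate + replacement
    # No rule applied: the word is returned unchanged.
    return word
```

Under these assumptions, `stem("tienen")` is resolved by the dictionary, while `stem("cantando")` and `stem("cantar")` are both reduced by the suffix rules to the same stem. The minimum-length check keeps short words such as `mes` from being stripped to a fragment.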