Stemming Indonesian

Authors:
Jelita Asian;Hugh E. Williams;S. M. M. Tahaghoghi
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
ACSC '05 Proceedings of the Twenty-eighth Australasian conference on Computer Science - Volume 38
Year:
2005

Abstract

Stemming, which reduces words to a common root (usually by removing suffixes), has applications in text search, machine translation, document summarisation, and text classification. For example, English stemming reduces the words "computer", "computing", "computation", and "computability" to their common morphological root, "comput-". In text search, this permits a search for "computers" to find documents containing any word with the stem "comput-". In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. In this paper, we investigate the performance of five Indonesian stemming algorithms through a user study. Our results show that, given a reasonable dictionary, the unpublished algorithm of Nazief and Adriani stems around 93% of word occurrences to the correct root word. With the improvements we propose, this figure rises to almost 95%. We conclude that stemming for Indonesian should be performed using our modified Nazief and Adriani approach.
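
To make the idea of dictionary-backed affix stripping concrete, the following is a minimal sketch in Python. It is not the Nazief and Adriani algorithm, nor the modification proposed in the paper: it simply tries combinations of a prefix and a suffix and accepts the first candidate found in a root-word dictionary. The function name `toy_stem`, the tiny `ROOTS` dictionary, and the short affix lists are all assumptions made for illustration, and infixes and spelling changes at affix boundaries are ignored.

```python
from itertools import product

# Hypothetical toy dictionary of root words (assumption for illustration);
# a real stemmer needs a large lexicon.
ROOTS = {"baca", "makan", "buku", "jalan"}

# Illustrative subsets of Indonesian affixes, not a complete inventory.
# The empty string means "strip nothing on this side".
PREFIXES = ["", "mem", "me", "di", "ke", "se", "ber", "per", "ter"]
SUFFIXES = ["", "lah", "kah", "nya", "ku", "mu", "kan", "an", "i"]


def toy_stem(word, roots=ROOTS):
    """Return a dictionary root for `word`, or `word` unchanged if none is found."""
    word = word.lower()
    if word in roots:
        return word
    # Try every prefix/suffix pair; a pair with both parts non-empty
    # covers the confix case (for example per-...-an).
    for prefix, suffix in product(PREFIXES, SUFFIXES):
        if not (word.startswith(prefix) and word.endswith(suffix)):
            continue
        if len(prefix) + len(suffix) >= len(word):
            continue  # the affixes would consume the whole word
        candidate = word[len(prefix):len(word) - len(suffix)]
        if candidate in roots:
            return candidate
    return word  # leave unknown words untouched


for w in ["membaca", "dibaca", "makanan", "bukunya", "perjalanan", "komputer"]:
    print(w, "->", toy_stem(w))
```

The dictionary check is what keeps this sketch from over-stemming: an affix is removed only when the remainder is a known root, which illustrates why the availability of a reasonable dictionary matters to the reported accuracy. Words with no matching root, such as "komputer" here, are returned unchanged.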