Stemming Indonesian: A confix-stripping approach

Authors:
Mirna Adriani;Jelita Asian;Bobby Nazief;S. M. M. Tahaghoghi;Hugh E. Williams
Affiliations:
University of Indonesia;RMIT University;University of Indonesia;RMIT University;Microsoft
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2007

Abstract

Stemming, which usually removes suffixes from words, has applications in text search, machine translation, document summarization, and text classification. For example, English stemming reduces the words "computer," "computing," "computation," and "computability" to their common morphological root, "comput-." In text search, this permits a search for "computers" to find documents containing any word with the stem "comput-." In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. This work surveys existing techniques for stemming Indonesian words to their morphological roots, presents our novel and highly accurate confix-stripping (CS) algorithm, and explores the effectiveness of stemming in the context of general-purpose text information retrieval through ad hoc queries.
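
To make the idea of stem-based matching concrete, the following is a minimal sketch in Python. It is not the CS algorithm described in the paper: the root dictionary, the affix lists, and the function names `toy_stem` and `matches` are hypothetical and chosen only for illustration. The sketch strips at most one suffix and one prefix and accepts a candidate only if it appears in the small root dictionary.

```python
# Toy illustration only: NOT the CS algorithm from the paper.
# ROOTS, PREFIXES, and SUFFIXES are small hypothetical samples.
ROOTS = {"makan", "minum", "baca"}
PREFIXES = ["di", "ber", "ter", "me", "mem"]
SUFFIXES = ["kan", "an", "i", "nya"]


def toy_stem(word):
    """Return a dictionary root for `word`, or the lowercased word if none is found."""
    word = word.lower()
    if word in ROOTS:
        return word

    # Candidate forms: the word itself, plus the word with one suffix removed.
    candidates = [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidates.append(word[: -len(suffix)])

    # Accept a candidate if it is a root, or becomes one after removing one prefix.
    for candidate in candidates:
        if candidate in ROOTS:
            return candidate
        for prefix in PREFIXES:
            if candidate.startswith(prefix) and candidate[len(prefix):] in ROOTS:
                return candidate[len(prefix):]

    # No root found: leave the word unchanged.
    return word


def matches(query, document):
    """True if the stemmed query term equals the stem of any term in the document."""
    document_stems = {toy_stem(term) for term in document.split()}
    return toy_stem(query) in document_stems


if __name__ == "__main__":
    for w in ["makanan", "dimakan", "membaca", "dibacakan"]:
        print(w, "->", toy_stem(w))
    print(matches("makanan", "nasi itu dimakan"))
```

Under these assumptions, a query for "makanan" matches a document containing "dimakan" because both reduce to the root "makan". A real stemmer needs a full dictionary, ordered rules, and handling of spelling changes at affix boundaries, none of which this sketch attempts.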