Automatic Stemming for Indexing of an Agglutinative Language

  • Authors:
  • Sehyeong Cho; Seung-Soo Han

  • Venue:
  • ADVIS '02 Proceedings of the Second International Conference on Advances in Information Systems
  • Year:
  • 2002

Abstract

Stemming is an essential process in information retrieval. Although extremely simple stemming algorithms exist for inflectional languages, the situation is quite different for agglutinative languages, and stemming becomes even more difficult when a significant portion of the vocabulary is new or unknown. This paper explores the possibility of stemming an agglutinative language, specifically Korean, by unsupervised morphology learning. We use only a raw corpus and no dictionary. Unlike heuristic algorithms that lack theoretical grounding, this method is based on widely accepted statistical methods. Although the method is currently applied only to Korean, it can be adapted to other agglutinative languages with similar characteristics, since it uses no language-specific knowledge.
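
The abstract does not specify the statistical model, so the following is only an illustrative sketch of the general idea of dictionary-free stemming from a raw corpus, not the authors' method. It assumes a simple frequency heuristic: a split point is plausible when the stem is seen with several different endings and the ending is seen after several different stems. The function names `learn_counts` and `stem`, the `min_support` threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter


def learn_counts(words):
    """Count candidate stems and suffixes over every split of every word type."""
    stem_counts, suffix_counts = Counter(), Counter()
    for word in set(words):  # count word types, not tokens
        for i in range(1, len(word) + 1):
            stem_counts[word[:i]] += 1    # how many word types start with this prefix
            suffix_counts[word[i:]] += 1  # how many word types end with this suffix
    return stem_counts, suffix_counts


def stem(word, stem_counts, suffix_counts, min_support=2):
    """Return the best-scoring stem, or the word itself if no split is supported."""
    best, best_score = word, 0
    for i in range(1, len(word)):  # only splits that leave a non-empty suffix
        head, tail = word[:i], word[i:]
        # Assumed heuristic: both parts must recur in the corpus.
        if stem_counts[head] < min_support or suffix_counts[tail] < min_support:
            continue
        score = stem_counts[head] * suffix_counts[tail]
        if score > best_score:
            best, best_score = head, score
    return best


# Toy raw corpus: two nouns with the particles 가, 는, 를, plus one bare noun.
corpus = ["학교가", "학교는", "학교를", "친구가", "친구는", "친구를", "학교"]
stem_counts, suffix_counts = learn_counts(corpus)

print(stem("학교가", stem_counts, suffix_counts))  # prints 학교
print(stem("친구는", stem_counts, suffix_counts))  # prints 친구
```

Nothing in this sketch refers to Korean grammar; the particles are recovered only because they recur after different stems, which is the property the abstract relies on when it suggests the approach could carry over to other agglutinative languages. A realistic system would need a far larger corpus and a better-founded scoring model than this raw product of counts.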