Web-based acquisition of Japanese katakana variants

  • Authors:
  • Takeshi Masuyama; Hiroshi Nakagawa

  • Affiliations:
  • University of Tokyo, Tokyo, Japan; University of Tokyo, Tokyo, Japan

  • Venue:
  • Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2005

Abstract

This paper describes a method for detecting Japanese Katakana variants in a large corpus. Katakana words, which are mainly used for loanwords, cause problems in information retrieval and related tasks because transliteration produces several spelling variations, all of which can be orthographically acceptable. Previous work manually defined Katakana rewrite rules for generating variants, such as ベ (be) and ヴェ (ve) being interchangeable, and also manually defined the weight of each operation for editing one string into another in order to detect these variants. However, these manual approaches have not been able to keep up with the ever-increasing number of loanwords and their variants. In the method proposed in this paper, the weight of each edit operation is assigned automatically based on Web data. In experiments, it performed almost as well as a method with manually determined weights. The advantages of our method are therefore that 1) it requires no expertise in linguistics to determine the weight of each operation, and 2) it can keep up with new Katakana loanwords simply by collecting text data from the Web and automatically acquiring new weights for the edit operations. It also achieved 98.6% recall and 86.3% precision in the task of extracting Katakana variant pairs from 38 years' worth of Japanese newspaper article corpora.
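
To make the idea of weighted edit operations concrete, the following is a minimal sketch of an edit distance in which certain Katakana rewrites are cheaper than ordinary edits. It is not the authors' implementation: the abstract does not say how the weights are derived from Web data, so the table `REWRITE_COST`, its numeric values, the long-vowel rule, and the function name `weighted_edit_distance` are all assumptions introduced only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical weights (assumption): cost of rewriting one Katakana substring
# into another. The values are invented for illustration; the paper assigns
# its weights automatically from Web data.
REWRITE_COST: Dict[Tuple[str, str], float] = {
    ("ベ", "ヴェ"): 0.1,   # be -> ve
    ("ヴェ", "ベ"): 0.1,   # ve -> be
    ("ー", ""): 0.2,       # drop a long-vowel mark (assumed rule)
    ("", "ー"): 0.2,       # add a long-vowel mark (assumed rule)
}
DEFAULT_COST = 1.0  # any single-character insert, delete, or substitution


def weighted_edit_distance(src: str, dst: str) -> float:
    """Cheapest total cost of editing src into dst with the rules above."""
    n, m = len(src), len(dst)
    inf = float("inf")
    # dist[i][j] = cheapest cost of turning src[:i] into dst[:j]
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = dist[i][j]
            if here == inf:
                continue
            # Identical characters are copied at no cost.
            if i < n and j < m and src[i] == dst[j]:
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], here)
            # Ordinary single-character operations at the default cost.
            if i < n:  # delete src[i]
                dist[i + 1][j] = min(dist[i + 1][j], here + DEFAULT_COST)
            if j < m:  # insert dst[j]
                dist[i][j + 1] = min(dist[i][j + 1], here + DEFAULT_COST)
            if i < n and j < m:  # substitute src[i] with dst[j]
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], here + DEFAULT_COST)
            # Weighted rewrite rules, which may span more than one character.
            for (old, new), cost in REWRITE_COST.items():
                if src.startswith(old, i) and dst.startswith(new, j):
                    ti, tj = i + len(old), j + len(new)
                    dist[ti][tj] = min(dist[ti][tj], here + cost)
    return dist[n][m]


if __name__ == "__main__":
    # Two spellings of "Venice": only the cheap be/ve rewrite is needed.
    print(weighted_edit_distance("ベネチア", "ヴェネチア"))
    # Two spellings of "computer": differ by a final long-vowel mark.
    print(weighted_edit_distance("コンピューター", "コンピュータ"))
```

Under these assumed weights, a pair of strings could be treated as variant candidates when the distance falls below some threshold, while unrelated words that need ordinary substitutions score much higher.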