Web-based acquisition of Japanese katakana variants

Authors:
Takeshi Masuyama;Hiroshi Nakagawa
Affiliations:
University of Tokyo, Tokyo, Japan;University of Tokyo, Tokyo, Japan
Venue:
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2005

Citing 5
Cited 1

Approximate String Matching

ACM Computing Surveys (CSUR)
Translation of web queries using anchor text mining

ACM Transactions on Asian Language Information Processing (TALIP)
Weakly supervised learning for cross-document person name disambiguation supported by information extraction

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Detecting transliterated orthographic variants via two similarity metrics

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Automatic construction of Japanese KATAKANA variant list from large corpus

COLING '04 Proceedings of the 20th international conference on Computational Linguistics

Machine transliteration survey

ACM Computing Surveys (CSUR)

Quantified Score

Hi-index	0.01

Visualization

Abstract

This paper describes a method of detecting Japanese Katakana variants from a large corpus. Katakana words, which are mainly used as loanwords, cause problems with information retrieval and so on, because transliteration creates several variations in spelling and all of these can be orthographic. Previous works manually defined Katakana rewrite rules such as %Y (be) and %t%' (ve) being replaceable with each other, for generating variants and also defined the weight of each operation to edit one string into another to detect these variants. However, these previous researches have not been able to keep up with the ever-increasing number of loanwords and their variants. With our method proposed in this paper, the weight of each edit operation is mechanically assigned based on Web data. In experiments, it performed almost as well as one with manually determined weights. Thus, the advantages of our method are: 1) need no expertise in linguistics to determine weight of each operation, and 2) able to keep up with new Katakana loanwords only by collecting text data from Web and acquiring new weights of edit operations automatically. It also achieved 98.6% recall and 86.3% precision in the task of extracting Katakana variant pairs from 38 year's worth of corpora of Japanese newspaper articles.