Automatic construction of Japanese KATAKANA variant list from large corpus
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
This paper describes a method of detecting Japanese Katakana variants from a large corpus. Katakana words, which are mainly used for loanwords, cause problems in information retrieval and related tasks, because transliteration produces several spellings of the same word and all of them can be orthographically acceptable. Previous work manually defined Katakana rewrite rules for generating variants, such as ベ (be) and ヴェ (ve) being interchangeable, and also manually defined the weight of each edit operation used to transform one string into another in order to detect these variants. However, these manual approaches have not been able to keep up with the ever-increasing number of loanwords and their variants. In the method proposed in this paper, the weight of each edit operation is assigned mechanically based on Web data. In experiments, it performed almost as well as a method with manually determined weights. The advantages of our method are therefore that 1) it requires no expertise in linguistics to determine the weight of each operation, and 2) it can keep up with new Katakana loanwords simply by collecting text data from the Web and acquiring new edit-operation weights automatically. It also achieved 98.6% recall and 86.3% precision in the task of extracting Katakana variant pairs from 38 years' worth of Japanese newspaper articles.
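As an illustration of the kind of weighted edit distance the abstract refers to, the following is a minimal sketch and not the paper's implementation. The function name `weighted_edit_distance`, the per-character weight table, and the numeric weights are all assumptions made for this example; the abstract does not state how the weights are derived from Web data or what values they take.

```python
from typing import Dict, Tuple

# Maps (source_char, target_char) to a cost. An empty string stands for
# "no character", so ("", "ェ") is an insertion and ("ェ", "") a deletion.
Weights = Dict[Tuple[str, str], float]


def weighted_edit_distance(src: str, dst: str, weights: Weights,
                           default: float = 1.0) -> float:
    """Edit distance where each operation has its own cost.

    Operations missing from `weights` fall back to `default`.
    """
    def cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        return weights.get((a, b), default)

    rows, cols = len(src) + 1, len(dst) + 1
    dist = [[0.0] * cols for _ in range(rows)]

    # First column: delete every character of src.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + cost(src[i - 1], "")
    # First row: insert every character of dst.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + cost("", dst[j - 1])

    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + cost(src[i - 1], ""),               # deletion
                dist[i][j - 1] + cost("", dst[j - 1]),               # insertion
                dist[i - 1][j - 1] + cost(src[i - 1], dst[j - 1]),   # substitution or match
            )
    return dist[rows - 1][cols - 1]


# Hypothetical weights: cheap operations for a be/ve style alternation.
example_weights: Weights = {("ベ", "ヴ"): 0.2, ("", "ェ"): 0.1}

unit = weighted_edit_distance("ベネチア", "ヴェネチア", {})
learned = weighted_edit_distance("ベネチア", "ヴェネチア", example_weights)
```

With unit costs the two spellings of "Venice" are two operations apart, while the hypothetical weights bring them much closer together, so a pair like this could fall under a variant-detection threshold. Choosing that threshold, and estimating the weights themselves, is outside what this sketch covers.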