Automatic construction of Japanese KATAKANA variant list from large corpus

Authors:
Takeshi Masuyama;Satoshi Sekine;Hiroshi Nakagawa
Affiliations:
University of Tokyo, Hongo, Bunkyo, Tokyo, Japan;New York University, Broadway, New York, NY;University of Tokyo, Hongo, Bunkyo, Tokyo, Japan
Venue:
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Year:
2004

Abstract

This paper presents a method for constructing a list of Japanese KATAKANA variants from a large corpus. The method is useful for information retrieval, information extraction, question answering, and similar tasks, because KATAKANA words are commonly used for loan words, and transliteration produces several spelling variations of the same word. The method consists of three steps. In step 1, the system collects KATAKANA words from a large corpus. In step 2, it collects candidate pairs of KATAKANA variants from the collected words using a spelling similarity based on edit distance. In step 3, it selects variant pairs from the candidate pairs using a semantic similarity, calculated with a vector space model of the context of each KATAKANA word. We conducted experiments using 38 years of Japanese newspaper articles and constructed a Japanese KATAKANA variant list with 97.4% recall and 89.1% precision. Estimating from this precision, our system can extract 178,569 variant pairs from the corpus.
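The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain Levenshtein distance for the spelling similarity, a bag of the non-KATAKANA characters in the same sentence as the context vector, and cosine similarity for the semantic comparison. The function names and both thresholds (`max_distance`, `min_similarity`) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Assumption: a "KATAKANA word" is a maximal run of characters from the
# Unicode Katakana block (U+30A1..U+30FA) plus the prolonged sound mark.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def collect_contexts(sentences):
    """Step 1: collect KATAKANA words with a bag-of-characters context.

    Assumption: the context of a word is every non-KATAKANA character in
    the sentences where it occurs (a stand-in for a real context model).
    """
    contexts = {}
    for sentence in sentences:
        rest = KATAKANA_RUN.sub("", sentence)
        for word in KATAKANA_RUN.findall(sentence):
            contexts.setdefault(word, Counter()).update(rest)
    return contexts


def edit_distance(a, b):
    """Step 2 helper: unweighted Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cosine(u, v):
    """Step 3 helper: cosine similarity of two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def find_variant_pairs(sentences, max_distance=1, min_similarity=0.5):
    """Keep pairs that are close in spelling and similar in context."""
    contexts = collect_contexts(sentences)
    pairs = []
    for a, b in combinations(sorted(contexts), 2):
        if (edit_distance(a, b) <= max_distance
                and cosine(contexts[a], contexts[b]) >= min_similarity):
            pairs.append((a, b))
    return pairs


corpus = [
    "スパゲッティを食べる",  # "eat spaghetti", one spelling
    "スパゲティを食べる",    # "eat spaghetti", variant spelling
    "バスに乗る",            # "ride the bus"
    "パスを出す",            # "make a pass"
]
print(find_variant_pairs(corpus))
```

In this toy corpus, バス and パス differ by one character but share no context, so the context check in step 3 is what separates them from a true variant pair. Comparing all word pairs is quadratic in the vocabulary size, so a corpus on the scale described above would need some form of candidate pruning or indexing, which this sketch leaves out.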