This paper presents a method for constructing a list of Japanese KATAKANA variants from a large corpus. The method is useful for information retrieval, information extraction, question answering, and similar tasks, because KATAKANA words are mostly used for loanwords, and transliteration gives rise to several spelling variants of the same word. Our method consists of three steps. In step 1, our system collects KATAKANA words from a large corpus. In step 2, it collects candidate pairs of KATAKANA variants from the collected words using a spelling similarity based on edit distance. In step 3, it selects variant pairs from the candidate pairs using a semantic similarity, computed with a vector space model of the contexts in which each KATAKANA word appears. We conducted experiments using 38 years of Japanese newspaper articles and constructed a Japanese KATAKANA variant list, achieving 97.4% recall and 89.1% precision. Based on this precision, we estimate that our system can extract 178,569 variant pairs from the corpus.
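The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the corpus is already tokenized into sentences, uses plain Levenshtein distance as a stand-in for the paper's spelling similarity, and uses raw co-occurrence counts in a small window as the context vector. The function names and the parameters `max_dist`, `min_sim`, and `window` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

# Katakana letters plus the long-vowel mark (U+30FC).
KATAKANA = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def edit_distance(a, b):
    """Plain Levenshtein distance (assumption: the paper's measure may differ)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def find_variants(sentences, max_dist=1, min_sim=0.5, window=2):
    # Step 1: collect katakana words and the words that surround them.
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if KATAKANA.fullmatch(tok):
                lo, hi = max(0, i - window), i + window + 1
                contexts[tok].update(tokens[lo:i] + tokens[i + 1:hi])

    # Step 2: candidate pairs whose spellings are close in edit distance.
    candidates = [(a, b) for a, b in combinations(sorted(contexts), 2)
                  if edit_distance(a, b) <= max_dist]

    # Step 3: keep only candidates that also occur in similar contexts.
    return [(a, b) for a, b in candidates
            if cosine(contexts[a], contexts[b]) >= min_sim]


# Toy, pre-tokenized corpus (hypothetical data).
sentences = [
    ["新しい", "コンピュータ", "を", "買う"],
    ["新しい", "コンピューター", "を", "買う"],
    ["甘い", "ケーキ", "を", "食べる"],
    ["古い", "ケース", "に", "入れる"],
]
pairs = find_variants(sentences)
print(pairs)
```

In this toy corpus, ケーキ ("cake") and ケース ("case") differ by a single character and so pass the spelling filter in step 2, but they share no context words and are rejected in step 3. This is the reason the method combines the two similarities rather than relying on edit distance alone.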