Finding ideographic representations of Japanese names written in Latin script via language identification and corpus validation

Authors:
Yan Qu; Gregory Grefenstette
Affiliations:
Clairvoyance Corporation, Pittsburgh, PA; LIC2M/LIST/CEA, Fontenay-aux-Roses, France
Venue:
ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Year:
2004

Abstract

Multilingual applications frequently involve dealing with proper names, but names are often missing from bilingual lexicons. This problem is exacerbated for applications involving translation between Latin-scripted languages and Asian languages such as Chinese, Japanese and Korean (CJK), where simple string copying is not a solution. We present a novel approach for generating the ideographic representations of a CJK name written in a Latin script. The proposed approach involves first identifying the origin of the name, and then back-transliterating the name to all possible Chinese characters using language-specific mappings. To reduce the massive number of candidates to be processed, we apply a three-tier filtering process: first through a set of attested bigrams, then through a set of attested terms, and lastly through the WWW for a final validation. We illustrate the approach with English-to-Japanese back-transliteration. Against test sets of Japanese given names and surnames, we have achieved average precisions of 73% and 90%, respectively.
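
The pipeline described in the abstract (candidate generation followed by three filtering tiers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the romaji-to-kanji table, the attested bigram and term sets, and the `WEB_HITS` dictionary standing in for a web lookup are all tiny hypothetical toy data, and the function names are invented for illustration. The sketch also assumes the name has already been identified as Japanese, so the language identification step is omitted.

```python
from itertools import product

# Hypothetical toy mapping from romanized units to candidate kanji.
# A real system would use a much larger language-specific table.
ROMAJI_TO_KANJI = {
    "yama": ["山"],
    "naka": ["中", "仲"],
    "mura": ["村"],
    "da": ["田", "打"],
    "ta": ["田", "太"],
}

# Hypothetical stand-ins for the three validation resources.
ATTESTED_BIGRAMS = {"山田", "中田", "仲田", "中村"}   # tier 1
ATTESTED_TERMS = {"山田", "中田", "中村"}             # tier 2
WEB_HITS = {"山田": 1000, "中田": 500, "中村": 900}   # tier 3 (fake web counts)


def segmentations(name, units):
    """Return every way to split `name` into a sequence of known units."""
    if not name:
        return [[]]
    results = []
    for end in range(1, len(name) + 1):
        head = name[:end]
        if head in units:
            for tail in segmentations(name[end:], units):
                results.append([head] + tail)
    return results


def candidates(name):
    """Back-transliterate `name` to all kanji strings the mapping allows."""
    out = set()
    for seg in segmentations(name.lower(), ROMAJI_TO_KANJI):
        for combo in product(*(ROMAJI_TO_KANJI[unit] for unit in seg)):
            out.add("".join(combo))
    return out


def bigrams_ok(cand):
    """True if every adjacent character pair in `cand` is attested."""
    return all(cand[i:i + 2] in ATTESTED_BIGRAMS for i in range(len(cand) - 1))


def back_transliterate(name, min_hits=1):
    """Apply the three filtering tiers and rank survivors by fake web hits."""
    tier1 = [c for c in candidates(name) if bigrams_ok(c)]
    tier2 = [c for c in tier1 if c in ATTESTED_TERMS]
    tier3 = [c for c in tier2 if WEB_HITS.get(c, 0) >= min_hits]
    return sorted(tier3, key=lambda c: WEB_HITS[c], reverse=True)


print(back_transliterate("Yamada"))
print(back_transliterate("Nakata"))
```

In this toy setup, "nakata" expands to four candidate strings, the bigram tier discards two of them, and the term tier discards one more, which mirrors the idea of applying cheap filters before the expensive web validation.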