Mining the Web for Transliteration Lexicons: Joint-Validation Approach

Authors:
Jong-Hoon Oh;Hitoshi Isahara
Affiliations:
National Institute of Information and Communications Technology (NICT), Japan;National Institute of Information and Communications Technology (NICT), Japan
Venue:
WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Year:
2006

Abstract

The Web is the largest available data collection and reflects language use in daily life. With the advent of new technology and the flood of information on the Web, it has become common to coin new terms for new concepts and to render these terms in languages written in non-Latin scripts by "transliteration", that is, "translation by sound". Cross-language natural language processing applications, such as machine translation and cross-language information retrieval, usually need a translation dictionary, which affects the quality of these applications. However, transliterations are usually not registered in translation dictionaries. To address this problem, we present a transliteration lexicon acquisition model that mines the Web for transliteration lexicons. We describe techniques for recognizing transliteration pair candidates on the Web by comparing phonetic similarity, and for finding the correct transliteration pairs through joint validation. The techniques were evaluated against manually constructed transliteration lexicons. Our experiments showed that the techniques effectively found transliteration lexicons on the Web.
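
The abstract does not say how phonetic similarity is computed, so the following is only a minimal sketch of the general idea of filtering transliteration pair candidates by sound-alike score. It assumes that both the source term and each candidate have already been romanized, and it swaps in a normalized edit distance as a stand-in for the paper's actual similarity measure. The function names, the threshold value, and the romanized strings are all hypothetical, and the joint-validation step is not modeled here.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phonetic_similarity(source: str, candidate: str) -> float:
    """Similarity in [0, 1]; an assumed proxy, not the paper's measure."""
    a, b = source.lower(), candidate.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def filter_candidates(source: str, candidates: List[str],
                      threshold: float = 0.5) -> List[str]:
    """Keep candidates whose similarity to the source meets the threshold."""
    return [c for c in candidates
            if phonetic_similarity(source, c) >= threshold]


if __name__ == "__main__":
    # Hypothetical romanized candidates found near the term "data".
    print(filter_candidates("data", ["deita", "konpyuta"]))
```

A character-level edit distance is used here only because it is simple and self-contained; a real system would more plausibly compare phoneme sequences or learned grapheme-to-phoneme mappings, and would then validate the surviving pairs further.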