This paper presents models for automatic transliteration of proper names between languages that use different alphabets. The models extend our earlier work on the automatic discovery of patterns of etymological sound change, which is based on the Minimum Description Length Principle. We extend the pairwise alignment models with prediction algorithms that produce transliterated names. We present results on 13 parallel corpora covering 7 languages, including English, Russian, and Farsi, extracted from Wikipedia headlines. The transliteration corpora are released for public use. The models achieve up to 88% word-level accuracy and up to 99% symbol-level F-score. We discuss the results from several perspectives and analyze how corpus size, language pair, type of name (person, location), and noise in the data affect performance.
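The two evaluation measures named in the abstract can be illustrated with a small sketch. The abstract does not define them, so the code below assumes definitions in the style of the NEWS shared tasks on transliteration: word-level accuracy is the fraction of predicted names that match the reference exactly, and symbol-level F-score is derived from the longest common subsequence (LCS) between the prediction and the reference. The function names and the example strings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def symbol_f_score(candidate: str, reference: str) -> float:
    """Assumed LCS-based F-score: harmonic mean of LCS/|candidate| and LCS/|reference|."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def word_accuracy(candidates: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of candidates that match their reference exactly."""
    pairs = list(zip(candidates, references))
    if not pairs:
        return 0.0
    return sum(c == r for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical predictions and references, for illustration only.
    predicted = ["london", "moskva"]
    gold = ["london", "moskwa"]
    print(word_accuracy(predicted, gold))
    print(symbol_f_score("moskva", "moskwa"))
```

Under these assumed definitions a prediction that misses a single symbol counts as a full error at the word level but loses only a little at the symbol level, which is consistent with the symbol-level figure reported above being much higher than the word-level one.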