MDL-based models for transliteration generation

Authors:
Javad Nouri; Lidia Pivovarova; Roman Yangarber
Affiliations:
Department of Computer Science, University of Helsinki, Finland; Department of Computer Science, University of Helsinki, Finland, and St. Petersburg State University, Russia; Department of Computer Science, University of Helsinki, Finland
Venue:
SLSP '13: Proceedings of the First International Conference on Statistical Language and Speech Processing
Year:
2013


Abstract

This paper presents models for automatic transliteration of proper names between languages that use different alphabets. The models extend our earlier work on automatic discovery of patterns of etymological sound change, which is based on the Minimum Description Length (MDL) principle. We extend the pairwise alignment models with prediction algorithms that generate transliterated names. We present results on 13 parallel corpora covering 7 languages, including English, Russian, and Farsi, extracted from Wikipedia headlines. The transliteration corpora are released for public use. The models achieve up to 88% word-level accuracy and up to 99% symbol-level F-score. We discuss the results from several perspectives and analyze how corpus size, language pair, type of name (person, location), and noise in the data affect performance.
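
The two evaluation measures named in the abstract can be illustrated with a minimal sketch. The abstract does not define them, so the definitions here are assumptions: word-level accuracy is taken as the fraction of names transliterated exactly, and symbol-level F-score is computed from the longest common subsequence (LCS) between the predicted and reference strings, a formulation commonly used in transliteration evaluation. The function names and the sample pairs are hypothetical and are not taken from the paper or its corpora.

```python
from typing import List, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def symbol_f_score(predicted: str, reference: str) -> float:
    """Assumed symbol-level F-score: harmonic mean of LCS-based precision and recall."""
    if not predicted or not reference:
        # Two empty strings match perfectly; one empty string matches nothing.
        return 1.0 if predicted == reference else 0.0
    lcs = lcs_length(predicted, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(predicted)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def word_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Assumed word-level accuracy: share of predictions identical to the reference."""
    if not pairs:
        return 0.0
    return sum(pred == ref for pred, ref in pairs) / len(pairs)


def mean_symbol_f_score(pairs: List[Tuple[str, str]]) -> float:
    """Average of the per-name symbol-level F-scores."""
    if not pairs:
        return 0.0
    return sum(symbol_f_score(pred, ref) for pred, ref in pairs) / len(pairs)


# Hypothetical (predicted, reference) pairs, for illustration only.
sample_pairs = [("helsinki", "helsinki"), ("moskva", "moskwa")]
print(word_accuracy(sample_pairs))
print(mean_symbol_f_score(sample_pairs))
```

Under these assumed definitions, a single wrong symbol makes the whole name count as an error at the word level while costing only a small fraction of the symbol-level score, which is one plausible reason the symbol-level figure reported above is much higher than the word-level one.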