MDL-based models for transliteration generation

  • Authors:
  • Javad Nouri; Lidia Pivovarova; Roman Yangarber

  • Affiliations:
  • Department of Computer Science, University of Helsinki, Finland; Department of Computer Science, University of Helsinki, Finland, and St. Petersburg State University, Russia; Department of Computer Science, University of Helsinki, Finland

  • Venue:
  • SLSP'13: Proceedings of the First International Conference on Statistical Language and Speech Processing
  • Year:
  • 2013

Abstract

This paper presents models for automatic transliteration of proper names between languages that use different alphabets. The models extend our earlier work on automatic discovery of patterns of etymological sound change, which is based on the Minimum Description Length (MDL) Principle. We extend the models for pairwise alignment with prediction algorithms that generate transliterated names. We present results on 13 parallel corpora covering 7 languages, including English, Russian, and Farsi, extracted from Wikipedia headlines. The transliteration corpora are released for public use. The models achieve up to 88% word-level accuracy and up to 99% symbol-level F-score. We discuss the results from several perspectives and analyze how corpus size, the language pair, the type of name (person or location), and noise in the data affect performance.
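To make the two evaluation measures named in the abstract concrete, the following is a minimal sketch of how word-level accuracy and a symbol-level F-score might be computed. The abstract does not define the F-score, so this sketch assumes a common definition for transliteration evaluation based on the longest common subsequence (LCS) between the predicted and reference strings; the paper's own definition may differ. The function names `lcs_length`, `symbol_f_score`, and `evaluate` are hypothetical and do not come from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def symbol_f_score(predicted, reference):
    """LCS-based F-score for one name (an assumed definition, not the paper's)."""
    if not predicted or not reference:
        return 0.0
    lcs = lcs_length(predicted, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(predicted)  # share of predicted symbols that match
    recall = lcs / len(reference)     # share of reference symbols recovered
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs):
    """Return (word-level accuracy, mean symbol-level F-score).

    pairs is an iterable of (predicted, reference) strings.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no (predicted, reference) pairs to evaluate")
    word_accuracy = sum(p == r for p, r in pairs) / len(pairs)
    mean_f = sum(symbol_f_score(p, r) for p, r in pairs) / len(pairs)
    return word_accuracy, mean_f
```

Under this definition a prediction that differs from the reference by a single symbol counts as a full error for word-level accuracy but still earns a high symbol-level F-score, which is one reason the symbol-level figure can sit well above the word-level one.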