A statistical model for lost language decipherment

Authors:
Benjamin Snyder;Regina Barzilay;Kevin Knight
Affiliations:
Massachusetts Institute of Technology;Massachusetts Institute of Technology;University of Southern California
Venue:
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Year:
2010

Citing 9
Cited 9

The reconstruction engine: a computer implementation of the comparative method
Computational Linguistics - Special issue on computational phonology

Automatic identification of word translations from unrelated English and German corpora
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics

Inducing multilingual text analysis tools via robust projection across aligned corpora
HLT '01 Proceedings of the first international conference on Human language technology research

Identifying cognates by phonetic and semantic similarity
NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies

Learning a translation lexicon from monolingual corpora
ULA '02 Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9

Unsupervised models for morpheme segmentation and morphology learning
ACM Transactions on Speech and Language Processing (TSLP)

Unsupervised analysis for decipherment problems
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions

Cross-lingual propagation for morphological analysis
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2

Writing systems, transliteration and decipherment
NAACL-Tutorials '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

Entropy, the Indus script, and language: A reply to R. Sproat
Computational Linguistics

Deciphering foreign language
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1

Bayesian inference for Zodiac and other homophonic ciphers
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1

Unsupervised multilingual learning

Simple effective decipherment via combinatorial optimization
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Visualization of linguistic patterns and uncovering language history from multilingual resources
EACL 2012 Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Deciphering foreign language by combining language models and context vectors
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1

Name phylogeny: a generative model of string variation
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Software helps linguists reconstruct, decipher ancient languages
Communications of the ACM

Abstract

In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.
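
The two figures reported above (letters mapped correctly, and cognates recovered among words that have one) can be illustrated with a small scoring sketch. This is not the paper's non-parametric Bayesian model; it only assumes a simple one-to-one letter mapping and shows how such accuracies might be computed. The function names and the toy symbols (`A`, `B`, `C`, `x`, `y`, `z`) are hypothetical placeholders, not Ugaritic or Hebrew data.

```python
from typing import Dict, List, Optional


def apply_mapping(word: str, mapping: Dict[str, str]) -> Optional[str]:
    """Rewrite a lost-language word letter by letter.

    Returns None if any letter has no entry in the mapping.
    Assumes a one-to-one letter mapping (a simplification for illustration).
    """
    out = []
    for ch in word:
        if ch not in mapping:
            return None
        out.append(mapping[ch])
    return "".join(out)


def mapping_accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold letters whose predicted counterpart matches."""
    correct = sum(1 for k, v in gold.items() if predicted.get(k) == v)
    return correct / len(gold)


def cognate_accuracy(
    words: List[str],
    mapping: Dict[str, str],
    gold_cognates: Dict[str, str],
) -> float:
    """Accuracy over only those words that have a gold cognate."""
    scored = [w for w in words if w in gold_cognates]
    if not scored:
        return 0.0
    hits = sum(
        1 for w in scored if apply_mapping(w, mapping) == gold_cognates[w]
    )
    return hits / len(scored)


# Toy data with placeholder symbols (hypothetical, for illustration only).
gold_map = {"A": "x", "B": "y", "C": "z"}
pred_map = {"A": "x", "B": "y", "C": "y"}  # one letter mapped incorrectly
gold_cognates = {"AB": "xy", "BC": "yz"}   # "CA" has no known cognate
words = ["AB", "BC", "CA"]

letter_acc = mapping_accuracy(pred_map, gold_map)
word_acc = cognate_accuracy(words, pred_map, gold_cognates)
print(f"letter accuracy: {letter_acc:.2f}, cognate accuracy: {word_acc:.2f}")
```

Words without a known cognate are excluded from the denominator, which mirrors how the abstract states its 60% figure ("of the Ugaritic words which have cognates in Hebrew").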