Levenshtein distances fail to identify language relationships accurately

Authors:
Simon J. Greenhill
Affiliations:
The University of Auckland
Venue:
Computational Linguistics
Year:
2011

Citing 7
Cited 0

Learning String-Edit Distance

IEEE Transactions on Pattern Analysis and Machine Intelligence
Computational dialectology in Irish Gaelic

EACL '95 Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics
Improved reconstruction of protolanguage word forms

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Bayesian identification of cognates and correspondences

SigMorPhon '07 Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology
Evaluation of several phonetic similarity algorithms on the task of cognate identification

LD '06 Proceedings of the Workshop on Linguistic Distances
Evaluation of string distance algorithms for dialectology

LD '06 Proceedings of the Workshop on Linguistic Distances
Evaluating the pairwise string alignment of pronunciations

LaTeCH-SHELT&R '09 Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education

Quantified Score

Hi-index	0.00

Visualization

Abstract

The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of time. Standardizing the orthography increases the performance, but only to a maximum of 65% accuracy within language subgroups. The accuracy of the Levenshtein classification decreases rapidly with phylogenetic distance, failing to discriminate homology and chance similarity across distantly related languages.This poor performance suggests the need for more linguistically nuanced methods for automated language classification tasks.