Brown, P. F., et al. (1992). "An estimate of an upper bound for the entropy of English." Computational Linguistics.
Farach, M., et al. (1995). "On the entropy of DNA: algorithms and measurements based on memory and rapid convergence." In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
The idea of "familial relationships" among languages is well established and accepted, although some controversies persist in a few specific instances. By painstakingly recording and identifying regularities and similarities, and comparing these to the historical record, linguists have been able to produce a general "family tree" incorporating most natural languages. We suggest here that much of this tree can be automatically determined by a complementary technique of distributional analysis. Recent work by Farach et al. (1995) and Juola (1997) suggests that Kullback-Leibler divergence (or cross-entropy) can be meaningfully measured from small samples, in some cases as small as only 20 or so words. Using these techniques, we define and measure a distance function between translations of a small corpus (c. 70 words/sample) covering much of the accepted Indo-European family, and reconstruct a relationship tree by hierarchical cluster analysis. The resulting tree shows remarkable similarity to the accepted Indo-European family; we read this as evidence both for the power of this measurement technique and for the validity of this kind of mechanical similarity judgement in the identification of typological relationships. Furthermore, this technique is in theory sensitive to different sorts of relationships than the more common word-list based methods, and may help illuminate them from a different direction.
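The pipeline described above — estimate a symmetrised cross-entropy distance between each pair of translated samples, then apply hierarchical clustering — can be sketched as follows. This is a toy illustration, not the estimator used in the paper (which follows the match-length methods of Farach et al., 1995): here the cross-entropies are estimated from add-one-smoothed character-trigram models, and the four one-sentence "corpora" (Genesis 1:1 in four languages) are stand-ins for the c. 70-word samples.

```python
import math
from collections import Counter

def trigram_model(text, vocab_size):
    """Add-one-smoothed character-trigram probability model for `text`."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    def prob(gram):
        return (counts[gram] + 1) / (total + vocab_size)
    return prob

def cross_entropy(sample, model):
    """Per-trigram cross-entropy (bits) of `sample` under `model`."""
    grams = [sample[i:i + 3] for i in range(len(sample) - 2)]
    return -sum(math.log2(model(g)) for g in grams) / len(grams)

def divergence(a, b, vocab_size):
    """Symmetrised Kullback-Leibler divergence estimate between samples."""
    pa = trigram_model(a, vocab_size)
    pb = trigram_model(b, vocab_size)
    return ((cross_entropy(a, pb) - cross_entropy(a, pa))
            + (cross_entropy(b, pa) - cross_entropy(b, pb)))

# Toy parallel "corpus": Genesis 1:1, lowercased.
samples = {
    "English": "in the beginning god created the heaven and the earth",
    "German":  "am anfang schuf gott himmel und erde",
    "Dutch":   "in het begin schiep god de hemel en de aarde",
    "French":  "au commencement dieu crea le ciel et la terre",
}

# Size of the trigram event space, from the pooled character inventory.
V = len(set("".join(samples.values()))) ** 3

dist = {(x, y): divergence(samples[x], samples[y], V)
        for x in samples for y in samples}

# Agglomerative (average-linkage) clustering into a nested-tuple tree.
clusters = [(name, [name]) for name in samples]
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
        key=lambda ij: sum(dist[(x, y)]
                           for x in clusters[ij[0]][1]
                           for y in clusters[ij[1]][1])
                       / (len(clusters[ij[0]][1]) * len(clusters[ij[1]][1])))
    merged = ((clusters[i][0], clusters[j][0]),
              clusters[i][1] + clusters[j][1])
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

tree = clusters[0][0]
print(tree)
```

Because the divergence is symmetrised, the distance matrix is symmetric with a zero diagonal, which is what the clustering step requires; on realistic sample sizes the nesting of the resulting tuple tree is what gets compared against the accepted Indo-European family tree.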