Investigating Esperanto's statistical proportions relative to other languages using neural networks and Zipf's law

Authors:
Bill Manaris; Luca Pellicoro; George Pothering; Harland Hodges
Affiliations:
Computer Science Department, College of Charleston, Charleston, SC; Computer Science Department, College of Charleston, Charleston, SC; Computer Science Department, College of Charleston, Charleston, SC; Management and Marketing Department, College of Charleston, Charleston, SC
Venue:
AIA'06: Proceedings of the 24th IASTED International Conference on Artificial Intelligence and Applications
Year:
2006

Abstract

Esperanto is a constructed language that was intended to be an easy-to-learn lingua franca. Zipf's law models the statistical proportions of various phenomena in human ecology, including natural languages. Given Esperanto's artificial origins, one wonders how "natural" it appears, relative to natural languages, in the context of Zipf's law. To explore this question, we collected a total of 283 books in six languages: English, French, German, Italian, Spanish, and Esperanto. We applied Zipf-based metrics to our corpus to extract, for each book, distributions of words, word distances, word bigrams, word trigrams, and word lengths. Statistical analyses show that Esperanto's statistical proportions are similar to those of the other languages. We then trained artificial neural networks (ANNs) to classify books according to language. The ANNs achieved high accuracy rates (86.3% to 98.6%). Subsequent analysis identified German as having the most distinctive proportions, followed by Esperanto, Italian, Spanish, English, and French. Analysis of misclassified patterns shows that Esperanto's statistical proportions most resemble those of German and Spanish, and least resemble those of French and Italian.
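
The abstract does not define its Zipf-based metrics, so the following is only a minimal sketch of one common way to build such a metric: count events (words or word bigrams), rank them by frequency, and fit a straight line to log(frequency) against log(rank). A slope near -1 is the classic Zipfian proportion. The function names (words, bigrams, rank_frequency, zipf_slope) and the choice of slope and R^2 as summary features are assumptions made for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def words(text):
    """Split text into lowercase words (runs of Unicode letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def rank_frequency(events):
    """Return event counts sorted in descending order (rank 1 first)."""
    return sorted(Counter(events).values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope and R^2 of log(frequency) against log(rank).

    `frequencies` must already be sorted by rank (highest count first).
    """
    n = len(frequencies)
    if n < 2:
        raise ValueError("need at least two distinct events")
    xs = [math.log(rank) for rank in range(1, n + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    # A flat distribution has zero variance in y; treat the fit as exact.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r2
```

Under these assumptions, a per-book feature vector could be assembled by calling zipf_slope(rank_frequency(words(text))) for words and zipf_slope(rank_frequency(bigrams(words(text)))) for bigrams, and then passed to a classifier. How the paper actually encodes its five distributions for the ANNs is not stated in the abstract.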