Analysis of EU languages through text compression

Authors:
Kimmo Kettunen;Markus Sadeniemi;Tiina Lindh-Knuutila;Timo Honkela
Affiliations:
Department of Information Studies, University of Tampere, Finland;Laboratory of Computer and Information Science, Helsinki University of Technology, Finland;Laboratory of Computer and Information Science, Helsinki University of Technology, Finland;Laboratory of Computer and Information Science, Helsinki University of Technology, Finland
Venue:
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Year:
2006

Abstract

In this article, we study the differences between European languages using statistical, unsupervised methods. The analysis is conducted at several levels of language: lexical, morphological and syntactic. Our premise is that the difficulty of translation can be perceived as differences or similarities between languages at these levels. The analyses are based on the concept of Kolmogorov complexity, which is used to compare language structure at the syntactic and morphological levels. The way languages convey information at these levels is taken as a measure of similarity or dissimilarity between them, and the results are compared to classical linguistic classification. The results will serve as a tool in developing machine translation systems, e.g., in the following way: if the source language conveys more information at the morphological level and the target language more at the syntactic level, the (machine) translator must be able to transfer the information from one level to another.
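
Kolmogorov complexity is not computable, so in practice it is commonly approximated by the size of a text after compression with an ordinary compressor. The sketch below illustrates that general idea with the normalized compression distance (NCD). It is an assumption made for illustration only: the abstract does not state which compressor or distance measure the authors used, and the function names `compressed_size` and `ncd` and the choice of `zlib` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate the Kolmogorov complexity of data by its zlib-compressed length."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the texts share most of their structure;
    values near 1 suggest they share very little.
    Assumption: zlib stands in for whatever compressor a real study would use.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Toy inputs, not real corpus data.
    text_a = b"the quick brown fox jumps over the lazy dog " * 20
    text_b = bytes(range(256)) * 4
    print("NCD(a, a):", ncd(text_a, text_a))
    print("NCD(a, b):", ncd(text_a, text_b))
```

Applied to parallel texts in two languages, a measure of this kind yields a pairwise distance matrix that could then be clustered and compared with linguistic groupings. Real compressors only approximate the ideal measure, so the values depend on the compressor and on text length.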