Text segmentation by language using minimum description length

Authors:
Hiroshi Yamaguchi; Kumiko Tanaka-Ishii
Affiliations:
University of Tokyo; Kyushu University
Venue:
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Year:
2012

Abstract

This paper addresses the problem of dividing a given multilingual document into monolingual segments and identifying the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. It is formulated as finding the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Experiments on texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages, demonstrate the potential of this approach.
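
To make the formulation concrete, the following is a minimal sketch of segmentation by description length with dynamic programming. It is not the authors' implementation. The abstract does not say how code lengths are estimated, so this sketch assumes an add-one smoothed character unigram model per language as a stand-in, and it assumes a fixed per-segment overhead (bits for a boundary position plus a language label). The names `char_costs`, `segment`, `samples`, and `segment_cost` are hypothetical.

```python
import math
from collections import Counter


def char_costs(samples, text):
    """Code length in bits of each character under each language.

    Assumption: an add-one smoothed character unigram model stands in
    for whatever per-language code length estimator is actually used.
    """
    alphabet = set(text).union(*samples.values())
    costs = {}
    for lang, sample in samples.items():
        counts = Counter(sample)
        total = len(sample) + len(alphabet)
        costs[lang] = {c: -math.log2((counts[c] + 1) / total) for c in alphabet}
    return costs


def segment(text, samples, segment_cost=None):
    """Return [(start, end, lang), ...] minimizing the total description length.

    Total length = sum of per-character code lengths
                 + segment_cost * number of segments.
    """
    langs = list(samples)
    costs = char_costs(samples, text)
    n = len(text)
    if n == 0:
        return []
    if segment_cost is None:
        # Assumed overhead: bits for a boundary position plus a language label.
        segment_cost = math.log2(max(n, 2)) + math.log2(max(len(langs), 2))

    # score[l] = cheapest encoding of text[:i + 1] whose last character is in l.
    score = {l: segment_cost + costs[l][text[0]] for l in langs}
    back = [{l: None for l in langs}]  # back[i][l] = language used at i - 1
    for i in range(1, n):
        prev_best = min(langs, key=score.get)
        new_score, ptr = {}, {}
        for l in langs:
            stay = score[l]                          # extend the current segment
            switch = score[prev_best] + segment_cost  # open a new segment
            if stay <= switch:
                new_score[l], ptr[l] = stay + costs[l][text[i]], l
            else:
                new_score[l], ptr[l] = switch + costs[l][text[i]], prev_best
        score = new_score
        back.append(ptr)

    # Trace back the cheapest labeling, then merge runs into segments.
    labels = [None] * n
    lang = min(langs, key=score.get)
    for i in range(n - 1, -1, -1):
        labels[i] = lang
        lang = back[i][lang]
    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments


if __name__ == "__main__":
    samples = {
        "en": "the quick brown fox jumps over the lazy dog and then the other dog",
        "ja": "これはにほんごのぶんしょうです" * 2,
    }
    for start, end, lang in segment("the dog and the fox" + "これはにほんごです", samples):
        print(start, end, lang)
```

The per-segment overhead is what keeps the procedure from switching language at every character: a boundary is added only when the bits it saves on the following text exceed the bits needed to describe the boundary itself. With a stronger per-language model in place of the unigram stand-in, the same recurrence applies unchanged.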