Reducing the Loss of Information through Annealing Text Distortion

Authors:
Ana Granados;Manuel Cebrian;David Camacho;Francisco de Borja Rodriguez
Affiliations:
Universidad Autonoma de Madrid, Madrid;Massachusetts Institute of Technology, Cambridge;Universidad Autonoma de Madrid, Madrid;Universidad Autonoma de Madrid, Madrid
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2011

Citing 0
Cited 4

Using virtual worlds for behaviour clustering-based analysis
Proceedings of the 2010 ACM workshop on Surreal media and virtual cloning

Relevance of contextual information in compression-based text clustering
IDEAL'10 Proceedings of the 11th international conference on Intelligent data engineering and automated learning

Is the contextual information relevant in text clustering by compression?
Expert Systems with Applications: An International Journal

Analysis and study on text representation to improve the accuracy of the normalized compression distance
AI Communications

Abstract

Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by experimentally evaluating how several kinds of information distortion affect compression-based text clustering. We show that progressively removing words, in such a way that the complexity of a document is slowly reduced, improves the accuracy of compression-based text clustering. In fact, we show that clustering of the nondistorted texts can be improved upon by means of annealing text distortion. The experimental results are consistent across different data sets and across compression algorithms belonging to the most important compression families: Lempel-Ziv, statistical, and block-sorting.
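To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the compression distance is the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It also assumes one possible distortion: the n most frequent words of a document are overwritten with asterisks so that text length is preserved. The function names `ncd` and `distort`, the choice of in-document frequency as the removal criterion, and the use of `zlib` (Lempel-Ziv family) and `bz2` (block-sorting family) from the Python standard library are all illustrative assumptions; no statistical compressor is included because the standard library does not provide one.

```python
import bz2
import re
import zlib
from collections import Counter


def compressed_size(data: bytes, compressor=zlib.compress) -> int:
    """Length in bytes of the compressed representation of data."""
    return len(compressor(data))


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distort(text: str, n_words: int) -> str:
    """Overwrite the n_words most frequent words with asterisks.

    Assumption for illustration: frequency is counted inside the text
    itself, and each masked word keeps its original length.
    """
    words = re.findall(r"[a-z]+", text.lower())
    frequent = {w for w, _ in Counter(words).most_common(n_words)}

    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in frequent else word

    return re.sub(r"[A-Za-z]+", mask, text)


if __name__ == "__main__":
    # Two toy documents (made up for this sketch).
    a = "the cat sat on the mat and the dog sat on the rug " * 20
    b = "stock prices fell sharply as the market reacted to the news " * 20
    # Increase the distortion step by step and watch the distance change.
    for n in (0, 1, 3):
        da, db = distort(a, n).encode(), distort(b, n).encode()
        print(n, ncd(da, db, zlib.compress), ncd(da, db, bz2.compress))
```

In a clustering experiment, a pairwise matrix of such distances would be computed at each distortion level and passed to a clustering algorithm, so that accuracy can be compared against the nondistorted baseline (`n_words = 0`).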