An introduction to Kolmogorov complexity and its applications (2nd ed.)
IEEE Transactions on Information Theory
The Normalized Compression Distance Is Resistant to Noise
IEEE Transactions on Information Theory
Relevance of contextual information in compression-based text clustering
IDEAL'10 Proceedings of the 11th international conference on Intelligent data engineering and automated learning
Is the contextual information relevant in text clustering by compression?
Expert Systems with Applications: An International Journal
In this paper we apply different information distortion techniques to a set of classical books written in English. We study the impact of these distortions on the Kolmogorov complexity of the books and on the clustering by compression technique, which is based on the Normalized Compression Distance (NCD). We show how to decrease the complexity of the books by introducing several kinds of modifications. We use a clustering error measure to quantify how well the information contained in each book is preserved. We find experimentally that the best way to keep the clustering error from growing is to modify the most frequent words. We explain these information distortions in detail and compare them with other kinds of modifications, such as random word distortions and infrequent word distortions. Finally, we present some phenomenological explanations of the empirical results.
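As a rough illustration of the quantities named in the abstract, NCD is commonly defined as NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C(.) is the length of the compressed input. The sketch below is not the authors' implementation. It assumes zlib as a stand-in compressor, and it assumes that a distortion replaces every character of a selected word with an asterisk; the abstract does not specify either choice. The names compressed_size, ncd and distort_most_frequent are hypothetical.

```python
import re
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    # Compressed length as a computable proxy for Kolmogorov complexity.
    # zlib is an assumption; any real-world compressor could be used instead.
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distort_most_frequent(text: str, k: int, filler: str = "*") -> str:
    # Hypothetical distortion: mask the k most frequent words (case-insensitive)
    # with a filler character, keeping word lengths and all other text intact.
    words = re.findall(r"[A-Za-z]+", text.lower())
    targets = {w for w, _ in Counter(words).most_common(k)}

    def mask(match: re.Match) -> str:
        word = match.group(0)
        return filler * len(word) if word.lower() in targets else word

    return re.sub(r"[A-Za-z]+", mask, text)
```

Under these assumptions, one could distort each book with distort_most_frequent, compute the pairwise ncd matrix over the distorted texts, and feed that matrix to a clustering method to compare the result against the clustering of the undistorted books.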