Is the contextual information relevant in text clustering by compression?
Expert Systems with Applications: An International Journal
In this paper we take a step towards understanding compression distances by analyzing the relevance of contextual information in compression-based text clustering. To do so, we explore two kinds of word removal: one that preserves part of the contextual information despite the removal, and one that does not. We show that removing words while preserving the contextual information helps compression-based text clustering and improves its accuracy, whereas removing words in a way that loses that contextual information makes the clustering results worse.
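The setting described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the compression distance is taken to be the normalized compression distance (NCD), the compressor is zlib, and "maintaining contextual information" is modelled by replacing each removed word with a run of asterisks of the same length, as opposed to deleting it outright. The names `ncd`, `remove_words`, and `keep_context` are hypothetical and chosen only for this illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def remove_words(text: str, targets: set, keep_context: bool) -> str:
    """Remove every whitespace-separated word found in `targets`.

    keep_context=True  -> assumption: each removed word becomes asterisks
                          of the same length, so its position and length
                          remain visible to the compressor.
    keep_context=False -> the word is dropped entirely.
    Matching is case-insensitive; punctuation attached to a word is not
    stripped, so "the," would not match "the".
    """
    kept = []
    for word in text.split():
        if word.lower() in targets:
            if keep_context:
                kept.append("*" * len(word))
        else:
            kept.append(word)
    return " ".join(kept)
```

A pairwise NCD matrix over the distorted documents could then be fed to any standard clustering routine to compare the two removal strategies; which words to remove, and which clustering algorithm to use, are left open here because the abstract does not specify them.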