A new method for clustering heterogeneous data: clustering by compression

Authors:
Dorin Carstoiu;Alexandra Cernian;Valentin Sgarciu;Adriana Olteanu
Affiliations:
Automatic Control and Computer Science Faculty, University Politehnica of Bucharest, Romania (all authors)
Venue:
WSEAS Transactions on Computers
Year:
2009


Abstract

We increasingly have to deal with large quantities of unstructured data produced by many different sources. For example, clustering web pages is essential for returning structured information in response to user queries. In this paper, we evaluate a new clustering technique, clustering by compression, on heterogeneous data sets. The procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), which is computed from the lengths of compressed data files, taken singly and in pairwise concatenation. Compression algorithms make it possible to define a similarity measure based on the amount of information two objects share, while clustering methods group similar data without any prior knowledge.
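
As an illustration of how the NCD can be computed from compressed lengths, the following is a minimal sketch rather than the authors' implementation. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes. The choice of zlib as the compressor and the helper names `compressed_len`, `ncd`, and `distance_matrix` are assumptions made for this example; the abstract does not say which compressor the paper uses.

```python
import zlib
from typing import List, Sequence


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two non-empty byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: Sequence[bytes]) -> List[List[float]]:
    """Pairwise NCD values, usable as input to any distance-based clustering."""
    n = len(items)
    return [[ncd(items[i], items[j]) for j in range(n)] for i in range(n)]
```

With a real compressor the values are only approximately in the range 0 to 1, and NCD(x, x) is small but usually not exactly zero. The resulting matrix could be handed to a hierarchical clustering routine; which clustering method the paper applies is not stated in the abstract.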