A new method for clustering heterogeneous data: clustering by compression

  • Authors:
  • Dorin Carstoiu; Alexandra Cernian; Valentin Sgarciu; Adriana Olteanu

  • Affiliations:
  • Automatic Control and Computer Science Faculty, University Politehnica of Bucharest, Romania (all four authors)

  • Venue:
  • WSEAS Transactions on Computers
  • Year:
  • 2009

Abstract

Large quantities of unstructured data are now produced by many different sources. For example, clustering web pages is essential for returning structured information in response to user queries. In this paper, we test a new clustering technique, clustering by compression, on heterogeneous data sets. The procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files, taken both singly and in pairwise concatenation. Compression algorithms make it possible to define a similarity measure based on the degree of information that two files share, whereas clustering methods group similar data without any prior knowledge.
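
As an illustration of how a distance can be computed from compressed lengths, the NCD is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes and xy is the concatenation of the two inputs. The sketch below is a minimal example under stated assumptions, not the authors' implementation: zlib stands in for whichever compressor the paper uses, and the names `compressed_size`, `ncd`, `distance_matrix`, `text_a`, `text_b` and `noise` are hypothetical, chosen only for this example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values; a clustering method would take this as input."""
    return [[ncd(a, b) for b in items] for a in items]


# Hypothetical sample inputs: two related texts and one block of
# pseudo-random bytes that shares no structure with them.
text_a = b"the quick brown fox jumps over the lazy dog " * 50
text_b = b"the quick brown fox leaps over the lazy dog " * 50
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(2000))

if __name__ == "__main__":
    for row in distance_matrix([text_a, text_b, noise]):
        print(["%.3f" % value for value in row])
```

Inputs that share information should give a value closer to 0, and unrelated inputs a value closer to 1. With a real compressor the values are only approximate, and NCD(x, y) need not equal NCD(y, x) exactly. The resulting matrix can be passed to any clustering method that accepts pairwise distances.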