Summarising data by clustering items

  • Authors: Michael Mampaey; Jilles Vreeken

  • Affiliations: Department of Mathematics and Computer Science, Universiteit Antwerpen (both authors)

  • Venue: ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part II
  • Year: 2010

Abstract

For a book, the title and abstract provide a good first impression of what to expect from it. For a database, getting a first impression is not so straightforward. While low-order statistics provide only limited insight, mining the data quickly provides too much detail. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality summaries for binary data. Our method builds a summary by grouping items that strongly correlate, and uses the Minimum Description Length principle to identify the best grouping, without requiring a distance measure between items. Besides offering a practical overview of which attributes interact most strongly, these summaries are also easily queried surrogates for the data. Experiments show that our method discovers high-quality results: correlated attributes are correctly grouped and the supports of frequent itemsets are closely approximated.
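The idea of grouping correlated items by description length can be illustrated with a minimal sketch. This is not the authors' encoding or search procedure: the cost function, the per-combination model cost, and the greedy pairwise merging below are simplifying assumptions, and the names `cluster_cost` and `greedy_cluster` are hypothetical. Each cluster of items is charged the bits needed to encode its observed value combinations plus a crude penalty for storing their distribution, and two clusters are merged only when that lowers the total.

```python
import math
from collections import Counter
from itertools import combinations


def cluster_cost(rows, cluster):
    """Approximate bits to encode the columns in `cluster` (assumed cost model)."""
    n = len(rows)
    counts = Counter(tuple(row[i] for i in cluster) for row in rows)
    # Data cost: n times the empirical entropy of the value combinations.
    data_bits = -sum(c * math.log2(c / n) for c in counts.values())
    # Model cost (assumption): each observed combination costs one bit per
    # item to spell out, plus log2(n + 1) bits to store its count.
    model_bits = len(counts) * (len(cluster) + math.log2(n + 1))
    return data_bits + model_bits


def greedy_cluster(rows):
    """Start from singleton clusters; repeatedly apply the most beneficial merge."""
    clustering = [(i,) for i in range(len(rows[0]))]
    while len(clustering) > 1:
        best = None
        for a, b in combinations(clustering, 2):
            merged = tuple(sorted(a + b))
            gain = (cluster_cost(rows, a) + cluster_cost(rows, b)
                    - cluster_cost(rows, merged))
            if best is None or gain > best[0]:
                best = (gain, a, b, merged)
        if best[0] <= 0:  # no merge shortens the description: stop
            break
        _, a, b, merged = best
        clustering = [c for c in clustering if c not in (a, b)] + [merged]
    return sorted(clustering)


# Toy binary data: items 0 and 1 are identical, item 2 is independent of them.
rows = [(i % 2, i % 2, (i // 2) % 2) for i in range(64)]
print(greedy_cluster(rows))
```

On this toy data the two identical items end up in one cluster and the independent item stays alone. No distance measure between items is needed: the only criterion is whether a merge shortens the total encoded length.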