Summarising data by clustering items

Authors:
Michael Mampaey; Jilles Vreeken
Affiliations:
Department of Mathematics and Computer Science, Universiteit Antwerpen; Department of Mathematics and Computer Science, Universiteit Antwerpen
Venue:
ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part II
Year:
2010

Abstract

For a book, the title and abstract provide a good first impression of what to expect from it. For a database, getting a first impression is not so straightforward. While low-order statistics only provide limited insight, mining the data quickly provides too much detail. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality summaries for binary data. Our method builds a summary by grouping items that strongly correlate, and uses the Minimum Description Length principle to identify the best grouping, without requiring a distance measure between items. Besides offering a practical overview of which attributes interact most strongly, these summaries are also easily queried surrogates for the data. Experiments show that our method discovers high-quality results: correlated attributes are correctly grouped and the supports of frequent itemsets are closely approximated.
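
To make the idea of grouping items by description length more concrete, the following is a minimal Python sketch. It is not the authors' algorithm or encoding: the cost function (empirical entropy of each group plus a crude per-entry model cost), the greedy bottom-up merging, and all function names (`cluster_cost`, `greedy_grouping`, `estimate_support`) are assumptions chosen purely for illustration. The sketch merges groups of binary items as long as a merge shortens the total encoding, and then uses the resulting groups as a surrogate to estimate itemset supports, treating different groups as independent.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cluster_cost(rows, cluster):
    """Bits to encode the columns in `cluster` (illustrative cost, an assumption).

    Data cost: rows encoded with the empirical joint distribution of the group.
    Model cost: each observed value combination costs len(cluster) bits for the
    pattern plus log2(n + 1) bits for its count.
    """
    n = len(rows)
    counts = Counter(tuple(row[i] for i in cluster) for row in rows)
    data_bits = -sum(c * log2(c / n) for c in counts.values())
    model_bits = len(counts) * (len(cluster) + log2(n + 1))
    return data_bits + model_bits


def greedy_grouping(rows):
    """Start from singleton groups and merge while the total cost decreases."""
    clusters = [(i,) for i in range(len(rows[0]))]
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(clusters, 2):
            merged = tuple(sorted(a + b))
            gain = (cluster_cost(rows, a) + cluster_cost(rows, b)
                    - cluster_cost(rows, merged))
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge shortens the description
            break
        a, b = best_pair
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return sorted(clusters)


def estimate_support(rows, clusters, itemset):
    """Estimate the relative support of `itemset` from per-group frequencies,
    assuming items in different groups are independent."""
    n = len(rows)
    estimate = 1.0
    for cluster in clusters:
        wanted = [i for i in cluster if i in itemset]
        if wanted:
            hits = sum(all(row[i] == 1 for i in wanted) for row in rows)
            estimate *= hits / n
    return estimate


# Toy data: items 0 and 1 always co-occur, item 2 is independent of both.
rows = ([(0, 0, 0)] * 10 + [(0, 0, 1)] * 10
        + [(1, 1, 0)] * 10 + [(1, 1, 1)] * 10)
clusters = greedy_grouping(rows)
print(clusters)                                  # [(0, 1), (2,)]
print(estimate_support(rows, clusters, {0, 1}))  # 0.5
print(estimate_support(rows, clusters, {0, 2}))  # 0.25
```

On this toy data the correlated items 0 and 1 end up in one group, so the support of {0, 1} is estimated as 0.5, whereas treating every item as independent would give 0.25. No distance measure between items is needed; only the change in total encoded length decides whether two groups are merged.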