Summarizing categorical data by clustering attributes

Authors:
Michael Mampaey; Jilles Vreeken
Affiliations:
Advanced Database Research and Modelling, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium (both authors)
Venue:
Data Mining and Knowledge Discovery
Year:
2013

Abstract

For a book, its title and abstract provide a good first impression of what to expect from it. For a database, obtaining a good first impression is typically not so straightforward. While low-order statistics only provide very limited insight, downright mining the data rapidly provides too much detail for such a quick glance. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality descriptive summaries of binary and categorical data. Our approach builds a summary by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering, without requiring a distance measure between attributes. Besides providing a practical overview of which attributes interact most strongly, these summaries can also be used as surrogates for the data, and can easily be queried. Extensive experimentation shows that our method discovers high-quality results: correlated attributes are correctly grouped, which is verified both objectively and subjectively. As an example of using the models as surrogates for the data, we show that we can quickly and accurately query the estimated supports of frequent generalized itemsets.
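To make the idea in the abstract concrete, the following is a minimal Python sketch of MDL-style attribute clustering. It is not the authors' algorithm or encoding: the greedy bottom-up merge strategy, the cost formula in `cluster_cost` (empirical entropy of each cluster plus an assumed `log2(n + 1)` bits per observed value combination), and the function names `cluster_cost`, `cluster_attributes`, and `estimate_support` are all illustrative assumptions. It shows only the general principle: attributes are grouped when describing them jointly is cheaper than describing them separately, and the resulting clusters can then answer support queries in place of the data.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cluster_cost(rows, attrs):
    """Bits to encode the columns in `attrs`: data cost plus a crude model cost."""
    n = len(rows)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    # Data cost: n times the empirical entropy of the joint distribution.
    data_bits = -sum(c * log2(c / n) for c in counts.values())
    # Assumed simplification: storing the count of each observed value
    # combination costs log2(n + 1) bits. The paper uses its own encoding.
    model_bits = len(counts) * log2(n + 1)
    return data_bits + model_bits


def cluster_attributes(rows):
    """Greedily merge attribute clusters while the total cost decreases."""
    clusters = [(a,) for a in range(len(rows[0]))]
    costs = {c: cluster_cost(rows, c) for c in clusters}
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for c1, c2 in combinations(clusters, 2):
            merged = tuple(sorted(c1 + c2))
            if merged not in costs:
                costs[merged] = cluster_cost(rows, merged)
            gain = costs[c1] + costs[c2] - costs[merged]
            if gain > best_gain:
                best_gain, best_pair = gain, (c1, c2)
        if best_pair is None:
            break  # no merge shortens the description; stop
        c1, c2 = best_pair
        clusters = [c for c in clusters if c not in best_pair]
        clusters.append(tuple(sorted(c1 + c2)))
    return sorted(clusters)


def estimate_support(rows, clusters, query):
    """Estimate the fraction of rows matching `query` (attribute -> value),
    treating different clusters as independent of each other."""
    n = len(rows)
    estimate = 1.0
    for cluster in clusters:
        fixed = [a for a in cluster if a in query]
        if fixed:
            matches = sum(all(row[a] == query[a] for a in fixed) for row in rows)
            estimate *= matches / n
    return estimate


# Toy data: attributes 0 and 1 are identical, attribute 2 is independent.
rows = [(i % 2, i % 2, (i // 2) % 3) for i in range(300)]
clusters = cluster_attributes(rows)
print(clusters)  # [(0, 1), (2,)]
print(estimate_support(rows, clusters, {0: 0, 2: 1}))
```

In this sketch no distance measure between attributes is needed: the only criterion is whether a merge shortens the total description. For simplicity `estimate_support` rescans the rows; a real summary would store the per-cluster distributions so that queries never touch the data.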