Finding low-entropy sets and trees from binary data

Authors:
Hannes Heikinheimo;Eino Hinkkanen;Heikki Mannila;Taneli Mielikäinen;Jouni K. Seppänen
Affiliations:
Helsinki University of Technology;University of Helsinki;University of Helsinki;University of Helsinki and Nokia Research Center;Helsinki University of Technology
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

The discovery of subsets with special properties from binary data has been one of the key themes in pattern discovery. Pattern classes such as frequent itemsets stress the co-occurrence of the value 1 in the data. While this choice makes sense in the context of sparse binary data, it disregards potentially interesting subsets of attributes that have some other type of dependency structure. We consider the problem of finding all subsets of attributes that have low complexity. The complexity is measured by either the entropy of the projection of the data on the subset, or the entropy of the data for the subset when modeled using a Bayesian tree, with downward or upward pointing edges. We show that the entropy measure on sets has a monotonicity property, and thus a levelwise approach can find all low-entropy itemsets. We also show that the tree-based measures are bounded above by the entropy of the corresponding itemset, allowing similar algorithms to be used for finding low-entropy trees. We describe algorithms for finding all subsets satisfying an entropy condition. We give an extensive empirical evaluation of the performance of the methods both on synthetic and on real data. We also discuss the search for high-entropy subsets and the computation of the Vapnik-Chervonenkis dimension of the data.
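The levelwise idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `projection_entropy` and `low_entropy_sets`, the row-of-tuples data layout, and the threshold parameter `max_entropy` are assumptions made for illustration. The sketch relies only on the standard fact that the empirical entropy of a projection cannot decrease when attributes are added, so every subset of a low-entropy set is itself low-entropy and candidates can be pruned Apriori-style. The tree-based measures are not covered.

```python
from collections import Counter
from itertools import combinations
from math import log2


def projection_entropy(rows, attrs):
    """Empirical entropy (in bits) of the rows projected onto the attribute indices in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def low_entropy_sets(rows, max_entropy):
    """Levelwise search for all attribute sets whose projection entropy is <= max_entropy.

    Returns a dict mapping each qualifying attribute set (a sorted tuple of
    column indices) to its entropy. Hypothetical helper, for illustration only.
    """
    n_attrs = len(rows[0])
    result = {}

    # Level 1: single attributes.
    level = []
    for a in range(n_attrs):
        h = projection_entropy(rows, (a,))
        if h <= max_entropy:
            result[(a,)] = h
            level.append((a,))

    # Level k+1: extend sets from level k, keeping a candidate only if all of
    # its k-subsets were low-entropy (monotonicity makes this pruning safe).
    while level:
        current = set(level)
        candidates = set()
        for s in level:
            for a in range(s[-1] + 1, n_attrs):
                cand = s + (a,)
                if all(sub in current for sub in combinations(cand, len(s))):
                    candidates.add(cand)
        level = []
        for cand in sorted(candidates):
            h = projection_entropy(rows, cand)
            if h <= max_entropy:
                result[cand] = h
                level.append(cand)
    return result


# Toy data: columns 0 and 1 are identical, column 2 is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
found = low_entropy_sets(rows, max_entropy=1.5)
for attrs, h in sorted(found.items()):
    print(attrs, round(h, 3))
```

On the toy data, the pair of identical columns has the same entropy as either column alone, so it survives the threshold, while any set that mixes in the independent column is pruned.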