Finding Good Itemsets by Packing Data

Authors: Nikolaj Tatti; Jilles Vreeken
Venue: ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Year: 2008

Citing: 0
Cited by: 7

CloseViz: visualizing useful patterns. Proceedings of the ACM SIGKDD Workshop on Useful Patterns.
Margin-closed frequent sequential pattern mining. Proceedings of the ACM SIGKDD Workshop on Useful Patterns.
Probably the best itemsets. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining.
Making pattern mining useful. ACM SIGKDD Explorations Newsletter.
Krimp: mining itemsets that compress. Data Mining and Knowledge Discovery.
Model order selection for boolean matrix factorization. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining.
Fast and reliable anomaly detection in categorical data. Proceedings of the 21st ACM international conference on Information and knowledge management.

Abstract

The problem of selecting small groups of itemsets that represent the data well has recently gained a lot of attention. We approach the problem by searching for itemsets that compress the data efficiently. As a compression technique we use decision trees combined with a refined version of MDL. More formally, assuming that the items are ordered, we create for each item a decision tree that may depend only on the preceding items. This approach allows us to find complex interactions between the attributes, not just co-occurrences of 1s. Further, we present a link between itemsets and decision trees and use this link to extract itemsets from the decision trees. We present two algorithms. The first is a simple greedy approach that builds a family of itemsets directly from the data. The second, given a collection of candidate itemsets, selects a small subset of them. Our experiments show that both approaches yield compact, high-quality descriptions of the data.
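
The idea of coding each item with a tree over the preceding items can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes depth-one trees only, uses plain empirical entropy as the code length instead of the refined MDL score the abstract mentions, and ignores the cost of encoding the tree itself. The names entropy_bits, split_cost, and best_stump are hypothetical.

```python
import math


def entropy_bits(column):
    """Empirical code length, in bits, of a binary column."""
    n = len(column)
    ones = sum(column)
    bits = 0.0
    for count in (ones, n - ones):
        if count:  # a value that never occurs costs nothing
            bits -= count * math.log2(count / n)
    return bits


def split_cost(data, target, parent):
    """Bits needed for column `target` when the rows are first
    partitioned by the value of an earlier column `parent`."""
    assert parent < target, "a tree may only test preceding items"
    total = 0.0
    for value in (0, 1):
        part = [row[target] for row in data if row[parent] == value]
        if part:
            total += entropy_bits(part)
    return total


def best_stump(data, target):
    """Return (bits, parent) for the cheapest depth-one tree.
    parent is None when no split beats coding the column alone."""
    column = [row[target] for row in data]
    best = (entropy_bits(column), None)
    for parent in range(target):
        cost = split_cost(data, target, parent)
        if cost < best[0]:
            best = (cost, parent)
    return best


# Toy data: item 1 is always the negation of item 0,
# and item 2 is independent of both.
data = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)] * 2

for item in range(3):
    print(item, best_stump(data, item))
```

In the toy data, item 1 is fully determined by item 0 even though the two never co-occur as 1s, so splitting on item 0 reduces its cost to zero bits. Item 2 gains nothing from any split and is coded on its own. A full MDL score would also charge for the tree, which is what keeps the model from splitting on noise.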