Krimp: mining itemsets that compress

Authors:
Jilles Vreeken; Matthijs van Leeuwen; Arno Siebes
Affiliations:
Algorithmic Data Analysis, Department of Information and Computing Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands and ADReM, Department of Mathematics and Computer Sc ...;Algorithmic Data Analysis, Department of Information and Computing Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands;Algorithmic Data Analysis, Department of Information and Computing Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands
Venue:
Data Mining and Knowledge Discovery
Year:
2011

Abstract

One of the major problems in pattern mining is the explosion in the number of results. Tight constraints reveal only common knowledge, while loose constraints return an overwhelming number of patterns. The cause is that large groups of patterns describe essentially the same set of transactions. In this paper we approach this problem using the MDL principle: the best set of patterns is the set that compresses the database best. For this task we introduce the Krimp algorithm. Experimental evaluation shows that typically only hundreds of itemsets are returned, a dramatic reduction of up to seven orders of magnitude in the number of frequent itemsets. These selections, called code tables, are of high quality. This is shown with compression ratios, swap-randomisation, and the accuracies of the code table-based Krimp classifier, all obtained on a wide range of datasets. Further, we extensively evaluate the heuristic choices made in the design of the algorithm.
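
To make the MDL idea in the abstract concrete, the following is a minimal Python sketch of how a candidate set of itemsets could be scored by the number of bits needed to describe the database with it. The abstract gives no encoding details, so everything here is an assumption for illustration: the function names `cover` and `total_size`, the greedy non-overlapping cover, the ordering of the code table (longer itemsets first, then higher support), and the use of Shannon-optimal code lengths derived from usage counts. It is not the Krimp implementation and performs no candidate search; it only compares two fixed code tables.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets,
    trying the code table entries in the order given (assumed scheme)."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_size(database, patterns):
    """Bits for the encoded database plus bits for the code table itself."""
    def support(s):
        return sum(1 for t in database if s <= t)

    # Singletons are always present so that every transaction can be covered.
    items = sorted({i for t in database for i in t})
    singletons = [frozenset([i]) for i in items]
    ordered = sorted(
        (frozenset(p) for p in patterns),
        key=lambda s: (-len(s), -support(s), sorted(s)),
    )
    code_table = ordered + singletons

    # Usage: how often each itemset appears in the cover of the database.
    usage = Counter()
    for t in database:
        usage.update(cover(t, code_table))
    total_usage = sum(usage.values())
    code_len = {s: -log2(u / total_usage) for s, u in usage.items()}

    # Baseline per-item code lengths, used to spell out itemsets in the table.
    item_count = Counter(i for t in database for i in t)
    n_items = sum(item_count.values())
    item_len = {i: -log2(c / n_items) for i, c in item_count.items()}

    db_bits = sum(u * code_len[s] for s, u in usage.items())
    # Only itemsets that are actually used contribute to the table size.
    ct_bits = sum(code_len[s] + sum(item_len[i] for i in s) for s in usage)
    return db_bits + ct_bits


# Toy database: 'a' and 'b' always occur together.
db = [{"a", "b"}] * 8 + [{"c"}] * 2
without_pattern = total_size(db, [])
with_pattern = total_size(db, [{"a", "b"}])
print(with_pattern < without_pattern)
```

On this toy database, adding the itemset {a, b} lowers the total size, so under this simplified scoring it would be preferred over the singleton-only table. Under the same assumptions, an itemset that is rarely usable in the cover saves few bits in the database while still costing bits in the table, so it would tend to be rejected.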