Memory-efficient frequent-itemset mining

Authors:
Benjamin Schlegel;Rainer Gemulla;Wolfgang Lehner
Affiliations:
Technische Universität Dresden, Dresden, Germany;Max-Planck-Institut, Saarbrücken, Germany;Technische Universität Dresden, Dresden, Germany
Venue:
Proceedings of the 14th International Conference on Extending Database Technology
Year:
2011



Abstract

Efficient discovery of frequent itemsets in large datasets is a key component of many data mining tasks. In-core algorithms, which operate entirely in main memory and avoid expensive disk accesses, and in particular the prefix-tree-based algorithm FP-growth, are generally among the most efficient of the available algorithms. Unfortunately, their excessive memory requirements render them inapplicable for large datasets with many distinct items and/or itemsets of high cardinality. To overcome this limitation, we propose two novel data structures, the CFP-tree and the CFP-array, which reduce memory consumption by about an order of magnitude. This allows us to process significantly larger datasets in main memory than previously possible. Our data structures are based on structural modifications of the prefix tree that increase compressibility, an optimized physical representation, lightweight compression techniques, and intelligent node ordering and indexing. Experiments with both real-world and synthetic datasets show the effectiveness of our approach.
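
For readers unfamiliar with the underlying structure, the following is a minimal sketch of a plain FP-growth-style prefix tree, plus variable-byte encoding as one common example of a lightweight compression technique. It is not the CFP-tree or CFP-array, whose layout the abstract does not specify. All names here (`Node`, `build_prefix_tree`, `count_nodes`, `varbyte_encode`) and the choice of variable-byte encoding are assumptions made for illustration.

```python
from collections import Counter


class Node:
    """One prefix-tree node: an item, its count, and children keyed by item."""
    __slots__ = ("item", "count", "children")

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_prefix_tree(transactions, min_support):
    """Build a prefix tree over the items that meet min_support (illustrative only)."""
    # Support of an item = number of transactions that contain it.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in freq.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        # Sort by descending support (ties broken by item) so that
        # transactions with common frequent items share a path prefix.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def varbyte_encode(n):
    """Variable-byte encode a non-negative int.

    Each byte carries 7 payload bits; the high bit marks continuation.
    Small values, such as most node counts, then fit in a single byte.
    """
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


if __name__ == "__main__":
    transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"],
                    ["b", "c"], ["a", "b", "c", "d"]]
    tree = build_prefix_tree(transactions, min_support=2)
    print("nodes in tree:", count_nodes(tree))
    print("encoded count:", varbyte_encode(tree.children["a"].count))
```

In this toy input, the twelve occurrences of frequent items collapse into six tree nodes because transactions share prefixes. Pointer-based nodes like these are what make an uncompressed tree memory-hungry, which is the problem the abstract sets out to address.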