Toward terabyte pattern mining: an architecture-conscious solution

Authors:
Gregory Buehrer;Srinivasan Parthasarathy;Shirish Tatikonda;Tahsin Kurc;Joel Saltz
Affiliations:
The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH
Venue:
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2007

Abstract

We present a strategy for mining frequent item sets from terabyte-scale data sets on cluster systems. The algorithm embraces the holistic notion of architecture-conscious data mining, taking into account the capabilities of the processor, the memory hierarchy and the available network interconnects. Optimizations have been designed for lowering communication costs using compressed data structures and a succinct encoding. Optimizations for improving cache, memory and I/O utilization using pruning and tiling techniques, and smart data placement strategies are also employed. We leverage the extended memory space and computational resources of a distributed message-passing cluster to design a scalable solution, where each node can extend its meta-structures beyond main memory by leveraging 64-bit architecture support. Our solution strategy is presented in the context of FPGrowth, a well-studied and rather efficient frequent pattern mining algorithm. Results demonstrate that the proposed strategy results in near-linear scaleup on up to 48 nodes.
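
For readers unfamiliar with FPGrowth, the prefix tree it builds can be sketched roughly as follows. This is a minimal single-machine illustration in Python, not the distributed, architecture-conscious implementation the abstract describes. The names `FPNode`, `build_fp_tree` and `count_nodes` are hypothetical, and items are assumed to be hashable and mutually comparable (for example, strings).

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, a count, and child links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(transactions, min_support):
    """Build a prefix tree over the frequent items of `transactions`.

    Returns (root, frequent), where `frequent` maps each frequent item
    to its support count.
    """
    # Pass 1: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None)
    # Pass 2: insert each transaction with its frequent items ordered by
    # descending support (ties broken by item), so that common prefixes
    # share nodes and the tree stays compact.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())
```

Calling `build_fp_tree([["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]], 2)` yields a tree in which all four transactions share the prefix node for `b`, which is the kind of compression that keeps the structure small relative to the raw data.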