Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As a dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions. We address this issue with PARMA, a parallel algorithm for the MapReduce framework that scales well with the number of transactions in the dataset while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost through a random-sampling approach to FIM: each machine mines a small random sample of the dataset, of size independent of the dataset size, and the results from the machines are then filtered and aggregated into a single output collection. This output is a close approximation of the collection of Frequent Itemsets (FIs) or Association Rules (ARs), together with their frequencies and confidence levels, and our analysis probabilistically guarantees that its quality stays within the user-specified accuracy and error-probability parameters. Both the sizes of the random samples and their number are independent of the size of the dataset; they depend only on the user-chosen accuracy and error-probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) achieving even higher accuracy and confidence than the analysis guarantees.
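The sample-mine-filter-aggregate pipeline the abstract describes can be sketched sequentially in Python. This is an illustrative toy, not PARMA's actual API or analysis: the exhaustive counter below stands in for a real FIM miner (e.g. FP-growth), the function names and the majority-vote filter are assumptions, and the sample size and threshold slack are arbitrary constants rather than values derived from PARMA's accuracy and error-probability parameters.

```python
import itertools
import random
from collections import Counter

def mine_frequent_itemsets(transactions, min_freq):
    """Toy miner: exhaustively count itemsets of size 1 and 2 in a small
    sample. Stand-in for a real FIM algorithm such as FP-growth."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            counts[(item,)] += 1
        for pair in itertools.combinations(items, 2):
            counts[pair] += 1
    return {iset: c / n for iset, c in counts.items() if c / n >= min_freq}

def sample_mine_aggregate(dataset, sample_size, min_freq, epsilon,
                          num_samples=3, seed=0):
    """Hypothetical sketch of the pipeline: each 'machine' mines one random
    sample (the map phase); itemsets reported frequent by a majority of
    samples are kept, with their estimated frequencies averaged (the
    filter/aggregate phase)."""
    rng = random.Random(seed)
    per_sample = []
    for _ in range(num_samples):
        sample = [rng.choice(dataset) for _ in range(sample_size)]
        # Mine each sample at a slightly lowered threshold so that
        # borderline-frequent itemsets are not lost to sampling noise.
        per_sample.append(mine_frequent_itemsets(sample, min_freq - epsilon))
    votes = Counter(iset for result in per_sample for iset in result)
    needed = num_samples // 2 + 1
    return {
        iset: sum(r[iset] for r in per_sample if iset in r) / votes[iset]
        for iset, c in votes.items() if c >= needed
    }
```

Note that each sample's size is fixed by `sample_size`, not by `len(dataset)`, mirroring the paper's point that the per-machine work does not grow with the number of transactions.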