ML-DS: a novel deterministic sampling algorithm for association rules mining

Authors:
Samir A. Mohamed Elsayed;Sanguthevar Rajasekaran;Reda A. Ammar
Affiliations:
Computer Science Department, University of Connecticut;Computer Science Department, University of Connecticut;Computer Science Department, University of Connecticut
Venue:
ICDM'12 Proceedings of the 12th Industrial conference on Advances in Data Mining: applications and theoretical aspects
Year:
2012

Abstract

Due to the explosive growth of data in every aspect of our lives, data mining algorithms often suffer from scalability issues. One effective way to tackle this problem is to employ sampling techniques. This paper introduces ML-DS, a novel deterministic sampling algorithm for mining association rules in large datasets. Unlike most algorithms in the literature, which rely on randomness in sampling, our algorithm is fully deterministic. Sampling proceeds in stages, and the sample in each stage is half the size of the sample in the previous stage. In any given stage, the data is partitioned into disjoint groups of equal size. A distance measure is used to determine the importance of each group in identifying accurate association rules. The groups are then sorted by this measure, and only the best 50% of the groups move on to the next stage. We perform as many stages of sampling as needed to produce a sample of the desired target size. The resulting sample is then used to identify association rules. Empirical results show that our approach outperforms simple randomized sampling in accuracy and is competitive with state-of-the-art sampling algorithms in terms of both time and accuracy.
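
The staged halving procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance measure, so this sketch assumes one: the Euclidean distance between a group's single-item supports and those of the full dataset, with smaller distances ranked as better. The names `halving_sample`, `item_frequencies`, `target_size`, and `group_size` are hypothetical, as is the choice to drop a trailing remainder that does not fill a whole group. It should not be read as the authors' implementation.

```python
from collections import Counter
from math import sqrt


def item_frequencies(transactions, items):
    """Relative frequency (support) of each single item in a list of transactions."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return [counts[i] / n for i in items]


def halving_sample(transactions, target_size, group_size):
    """Shrink `transactions` deterministically, keeping the best half of the groups per stage."""
    items = sorted({item for t in transactions for item in t})
    # Assumed reference point: single-item supports in the full dataset.
    reference = item_frequencies(transactions, items)

    def distance(group):
        # Assumed measure: Euclidean distance between group and dataset supports.
        freqs = item_frequencies(group, items)
        return sqrt(sum((f - r) ** 2 for f, r in zip(freqs, reference)))

    current = list(transactions)
    # Stop once the target is reached or fewer than two full groups remain.
    while len(current) > target_size and len(current) >= 2 * group_size:
        # Partition into disjoint groups of equal size (a trailing remainder is dropped).
        groups = [current[i:i + group_size]
                  for i in range(0, len(current) - group_size + 1, group_size)]
        # sorted() is stable, so ties are broken by original order and no randomness is involved.
        ranked = sorted(groups, key=distance)
        kept = ranked[:len(groups) // 2]  # only the best 50% of the groups survive
        current = [t for g in kept for t in g]
    return current
```

For example, with eight transactions, `group_size=2`, and `target_size=2`, the sketch runs two stages (8 to 4 to 2 transactions), and repeated calls return the same sample because no step depends on a random source.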