An efficient algorithm for sequential random sampling
ACM Transactions on Mathematical Software (TOMS)
Mining association rules between sets of items in large databases
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Dynamic itemset counting and implication rules for market basket data
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Efficiently mining long patterns from databases
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Mining frequent patterns without candidate generation
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Turbo-charging vertical mining of large databases
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Depth first generation of long patterns
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding authorities and hubs from link structures on the World Wide Web
Proceedings of the 10th international conference on World Wide Web
Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set
EDBT '98 Proceedings of the 6th International Conference on Extending Database Technology: Advances in Database Technology
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Sampling Large Databases for Association Rules
VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
A new two-phase sampling based algorithm for discovering association rules
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Evaluation of Sampling for Data Mining of Association Rules
Evaluation of Sampling for Data Mining of Association Rules
Efficient data reduction with EASE
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Hi-index | 12.05 |
Classical data mining algorithms require expensive passes over the entire database to generate frequent items and hence to generate association rules. With the increase in the size of database, it is becoming very difficult to handle large amount of data for computation. One of the solutions to this problem is to generate sample from the database that acts as representative of the entire database for finding association rules in such a way that the distance of the sample from the complete database is minimal. Choosing correct sample that could represent data is not an easy task. Many algorithms have been proposed in the past. Some of them are computationally fast while others give better accuracy. In this paper, we present an algorithm for generating a sample from the database that can replace the entire database for generating association rules and is aimed at keeping a balance between accuracy and speed. The algorithm that is proposed takes into account the average number of small, medium and large 1-itemset in the database and average weight of the transactions to define threshold condition for the transactions. Set of transactions that satisfy the threshold condition is chosen as the representative for the entire database. The effectiveness of the proposed algorithm has been tested over several runs of database generated by IBM synthetic data generator. A vivid comparative performance evaluation of the proposed technique with the existing sampling techniques for comparing the accuracy and speed has also been carried out.