In data mining, sampling has often been suggested as an effective tool for reducing the size of the dataset operated on, at some cost to accuracy. However, this loss of accuracy is often difficult to measure and characterize, since the exact nature of the learning curve (accuracy vs. sample size) is parameter and data dependent; that is, we do not know a priori what sample size is needed to achieve a desired accuracy on a particular dataset for a particular set of parameters. In this article we propose the use of progressive sampling to determine the required sample size for association rule mining. We first show that a naive application of progressive sampling is not very efficient for association rule mining. We then present a refinement based on equivalence classes that seems to work extremely well in practice and is able to converge to the desired sample size very quickly and very accurately. An additional novelty of our approach is the definition of a support-sensitive, interactive measure of accuracy across progressive samples.
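To make the idea concrete, the following is a minimal sketch of the naive form of progressive sampling described above, not the equivalence-class refinement or the support-sensitive accuracy measure proposed in the article. Everything in it is an assumption made for illustration: the function names, the geometric sample schedule (`start`, `factor`), the brute-force itemset counter limited to pairs, and the use of Jaccard similarity between the frequent itemsets of successive samples as a stand-in for accuracy.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force count of itemsets up to max_size; return the frequent ones."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    threshold = min_support * len(transactions)
    return {itemset for itemset, c in counts.items() if c >= threshold}


def similarity(a, b):
    """Jaccard similarity of two sets of itemsets (illustrative measure only)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_sample_size(transactions, min_support, start=100, factor=2,
                            target=0.95, seed=0):
    """Grow the sample geometrically until successive results agree.

    Returns the smaller of the two sample sizes whose frequent itemsets
    reached the target similarity, or the full dataset size if they never do.
    """
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)  # a prefix of a shuffled list is a random sample

    n = min(start, len(shuffled))
    previous = frequent_itemsets(shuffled[:n], min_support)
    while n < len(shuffled):
        next_n = min(n * factor, len(shuffled))
        current = frequent_itemsets(shuffled[:next_n], min_support)
        if similarity(previous, current) >= target:
            return n  # the learning curve appears to have flattened at n
        n, previous = next_n, current
    return n
```

Each step of this loop re-mines a sample from scratch, so the total cost grows with every sample tried. That is one plausible reading of why the article calls the naive approach inefficient for association rule mining.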