Efficient data reduction with EASE
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
A variety of mining and analysis problems, ranging from association-rule discovery to contingency-table analysis to the materialization of certain approximate datacubes, involve extracting knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where each transaction is a fixed-length vector of counts. Classical algorithms for count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective way to deal with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is designed specifically for count-data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose appropriately defined "distance" from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by repeated halving. Experiments in the context of both association-rule mining and classical χ² contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.
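The repeated-halving idea described above can be illustrated with a small sketch. This is not the EASE implementation from the paper: it swaps the epsilon-approximation machinery for a simple greedy sign assignment, treats each transaction as a set of items rather than a vector of counts, and measures "distance" as the largest gap in single-item frequencies. All function names (`item_frequencies`, `max_frequency_gap`, `halve`, `reduce_by_halving`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def item_frequencies(transactions):
    """Fraction of transactions that contain each item."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}


def max_frequency_gap(sample, reference):
    """Assumed 'distance': largest absolute gap in single-item frequency."""
    fs, fr = item_frequencies(sample), item_frequencies(reference)
    return max(abs(fs.get(i, 0.0) - fr.get(i, 0.0)) for i in set(fs) | set(fr))


def halve(transactions):
    """One greedy halving pass (a stand-in, not the paper's method).

    Each transaction gets a sign of +1 (keep) or -1 (drop). disc[item] tracks
    kept-minus-dropped occurrences of the item so far, and the sign is chosen
    to pull the summed discrepancy of the transaction's items toward zero.
    """
    disc = Counter()
    kept = []
    keep_on_tie = True  # alternate on ties so that about half survive
    for t in transactions:
        items = set(t)
        pull = sum(disc[i] for i in items)
        if pull < 0:
            sign = 1
        elif pull > 0:
            sign = -1
        else:
            sign = 1 if keep_on_tie else -1
            keep_on_tie = not keep_on_tie
        for i in items:
            disc[i] += sign
        if sign == 1:
            kept.append(t)
    return kept


def reduce_by_halving(transactions, target_size):
    """Halve repeatedly until the sample holds at most target_size rows."""
    sample = list(transactions)
    while len(sample) > target_size and len(sample) > 1:
        halved = halve(sample)
        if not halved or len(halved) == len(sample):
            break  # defensive guard: stop if a pass makes no progress
        sample = halved
    return sample


if __name__ == "__main__":
    data = [("a", "b"), ("a",), ("b", "c"), ("c",)] * 50
    sample = reduce_by_halving(data, target_size=50)
    print(len(sample), max_frequency_gap(sample, data))
```

In this sketch the initial random sample would be passed in as `transactions`, and `max_frequency_gap` gives a rough way to compare the reduced sample against the full data. The greedy rule offers no formal guarantee of the kind that epsilon-approximation methods provide. It only shows how a deterministic halving pass can keep item frequencies roughly balanced.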