Efficient data reduction with EASE

  • Authors:
  • Hervé Brönnimann; Bin Chen; Manoranjan Dash; Peter Haas; Peter Scheuermann

  • Affiliations:
  • Comp & Info Sci, Polytechnic Univ., Brooklyn, NY; Exelixis Inc., San Francisco, CA; Elect & Comp Engg, Northwestern Univ., Evanston, IL; IBM Almaden, San Jose, CA; Elect & Comp Engg, Northwestern Univ., Evanston, IL

  • Venue:
  • Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year:
  • 2003

Abstract

A variety of mining and analysis problems, ranging from association-rule discovery to contingency-table analysis to materialization of certain approximate datacubes, involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective way to deal with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is designed specifically for count-data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" (appropriately defined) from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses ε-approximation methods to obtain the final subsample by a process of repeated halving. Experiments in both association-rule mining and classical χ² contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.
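
To make the repeated-halving idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes transactions are rows of a NumPy count matrix and uses a simple greedy pairwise coloring as a stand-in for the paper's ε-approximation construction; the function names (halve, ease_subsample), the pairing order, and the L1 frequency distance reported at the end are illustrative assumptions, not definitions taken from the paper.

    import numpy as np

    def halve(sample: np.ndarray) -> np.ndarray:
        """One halving step: keep n/2 of n transactions (rows) so that the
        kept half's column sums stay close to half of the full column sums.
        Greedy signed-discrepancy pairing; a stand-in for the paper's
        epsilon-approximation construction, not the authors' exact method."""
        n = sample.shape[0]
        if n % 2 == 1:                  # drop one row so the rows pair up evenly
            sample = sample[:-1]
            n -= 1
        d = np.zeros(sample.shape[1])   # running signed discrepancy vector
        keep = []
        for i in range(0, n, 2):
            a, b = sample[i], sample[i + 1]
            diff = a - b
            # keep whichever member of the pair pulls the discrepancy toward zero
            if np.linalg.norm(d + diff) <= np.linalg.norm(d - diff):
                d += diff
                keep.append(i)
            else:
                d -= diff
                keep.append(i + 1)
        return sample[keep]

    def ease_subsample(sample: np.ndarray, target_size: int) -> np.ndarray:
        """Repeatedly halve the initial random sample until it is no larger
        than target_size."""
        while sample.shape[0] > target_size and sample.shape[0] >= 2:
            sample = halve(sample)
        return sample

    # toy usage: 1000 transactions over 20 items, reduced to ~125
    rng = np.random.default_rng(0)
    data = (rng.random((1000, 20)) < 0.1).astype(float)
    sub = ease_subsample(data, 125)
    full_freq = data.mean(axis=0)
    sub_freq = sub.mean(axis=0)
    print("L1 distance between item frequencies:", np.abs(full_freq - sub_freq).sum())

On toy data like this, the deterministically halved subsample's per-item frequencies typically track the full dataset more closely than a uniform random subsample of the same size, which is the intuition behind preferring ε-approximation halving over plain random sampling.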