Mining association rules between sets of items in large databases
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Dynamic itemset counting and implication rules for market basket data
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Size-estimation framework with applications to transitive closure and reachability
Journal of Computer and System Sciences
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Using association rules for product assortment decisions: a case study
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Beyond Market Baskets: Generalizing Association Rules to Dependence Rules
Data Mining and Knowledge Discovery
Constraint-Based Rule Mining in Large, Dense Databases
Data Mining and Knowledge Discovery
Finding Interesting Associations without Support Pruning
IEEE Transactions on Knowledge and Data Engineering
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Selecting the right interestingness measure for association patterns
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
DualMiner: a dual-pruning algorithm for itemsets with constraints
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient Mining of Constrained Correlated Sets
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm
UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
Correlation search in graph databases
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Correlated pattern mining in quantitative databases
ACM Transactions on Database Systems (TODS)
Scaling up top-K cosine similarity search
Data & Knowledge Engineering
Distributed threshold querying of general functions by a difference of monotonic representation
Proceedings of the VLDB Endowment
A log-linear approach to mining significant graph-relational patterns
Data & Knowledge Engineering
Speeding up correlation search for binary data
Pattern Recognition Letters
We consider the problem of finding highly correlated pairs in a large data set: given a threshold that is not too small, we wish to report all pairs of items (or binary attributes) whose Pearson correlation coefficients exceed the threshold. Correlation analysis is an important step in many statistical and knowledge-discovery tasks. Normally, the number of highly correlated pairs is quite small compared to the total number of pairs, so identifying them naively, by computing the correlation coefficients of all pairs, is wasteful. With massive data sets, where the total number of pairs may exceed main-memory capacity, the computational cost of the naive method is prohibitive.

In their KDD '04 paper [15], Hui Xiong et al. address this problem with the TAPER algorithm, which makes two passes over the data set: the first pass generates a set of candidate pairs, and the second pass computes the correlation coefficients of those candidates directly. The efficiency of the algorithm depends greatly on the selectivity (pruning power) of its candidate-generation stage.

In this work, we adopt the general framework of TAPER but propose a different candidate-generation method. For a pair of items, TAPER's candidate generation considers only the supports (frequencies) of the individual items. Our method also takes the support of the pair into account, yet without explicitly counting it: we give a simple randomized algorithm whose false-negative probability is negligible. The space and time complexities of candidate generation in our algorithm are asymptotically the same as TAPER's. Experiments on synthetic and real data show that our algorithm produces a greatly reduced candidate set, which can be several orders of magnitude smaller than the one generated by TAPER. Consequently, our algorithm uses much less memory and can be faster; the smaller memory footprint is critical for dealing with massive data.
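To make the pruning idea concrete, here is a minimal Python sketch (all function names are ours). It shows the phi coefficient computed from supports, TAPER's upper bound from item supports alone, and, as an assumption the abstract does not spell out, a min-hash estimate of pair support in the spirit of the size-estimation framework cited above:

```python
import math

def phi(supp_a, supp_b, supp_ab):
    """Pearson (phi) correlation of two binary attributes, from the
    supports of each item and of the pair (fractions of transactions)."""
    num = supp_ab - supp_a * supp_b
    den = math.sqrt(supp_a * (1 - supp_a) * supp_b * (1 - supp_b))
    return num / den

def taper_upper_bound(supp_a, supp_b):
    """TAPER's upper bound on phi, computed from the two item supports
    alone (Xiong et al., KDD '04); valid for supports strictly in (0, 1)."""
    lo, hi = min(supp_a, supp_b), max(supp_a, supp_b)
    return math.sqrt((lo / hi) * ((1 - hi) / (1 - lo)))

def taper_candidates(supports, threshold):
    """Keep a pair only if its upper bound clears the threshold; pruned
    pairs never have their correlation computed exactly."""
    items = sorted(supports)
    return [
        (a, b)
        for i, a in enumerate(items)
        for b in items[i + 1:]
        if taper_upper_bound(supports[a], supports[b]) >= threshold
    ]

# --- Hypothetical randomized pair-support estimate (our assumption). ---
def minhash_signature(tids, seeds):
    """One min-hash value per seed for a set of transaction ids."""
    return [min(hash((seed, t)) for t in tids) for seed in seeds]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching minima estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def estimated_pair_support(supp_a, supp_b, jaccard):
    """Invert J = supp(AB) / (supp(A) + supp(B) - supp(AB))."""
    return jaccard * (supp_a + supp_b) / (1 + jaccard)
```

With an estimated pair support in hand, a candidate generator can plug it into `phi` and keep only pairs whose estimated correlation clears the threshold, which is where the extra pruning power over a support-only bound would come from; the exact estimator used in the paper may differ from this sketch.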