Mining association rules between sets of items in large databases
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Dynamic itemset counting and implication rules for market basket data
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Size-estimation framework with applications to transitive closure and reachability
Journal of Computer and System Sciences
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Using association rules for product assortment decisions: a case study
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Beyond Market Baskets: Generalizing Association Rules to Dependence Rules
Data Mining and Knowledge Discovery
Constraint-Based Rule Mining in Large, Dense Databases
Data Mining and Knowledge Discovery
Finding Interesting Associations without Support Pruning
IEEE Transactions on Knowledge and Data Engineering
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Selecting the right interestingness measure for association patterns
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
DualMiner: a dual-pruning algorithm for itemsets with constraints
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient Mining of Constrained Correlated Sets
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm
UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
Correlation search in graph databases
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Correlated pattern mining in quantitative databases
ACM Transactions on Database Systems (TODS)
Scaling up top-K cosine similarity search
Data & Knowledge Engineering
Distributed threshold querying of general functions by a difference of monotonic representation
Proceedings of the VLDB Endowment
A log-linear approach to mining significant graph-relational patterns
Data & Knowledge Engineering
Speeding up correlation search for binary data
Pattern Recognition Letters
We consider the problem of finding highly correlated pairs in a large data set: given a threshold that is not too small, we wish to report all pairs of items (or binary attributes) whose Pearson correlation coefficients exceed the threshold. Correlation analysis is an important step in many statistical and knowledge-discovery tasks. Normally, the number of highly correlated pairs is quite small compared to the total number of pairs, so identifying them naively, by computing the correlation coefficients of all pairs, is wasteful. With massive data sets, where the total number of pairs may exceed main-memory capacity, the computational cost of the naive method is prohibitive.

In their KDD '04 paper [15], Hui Xiong et al. address this problem with the TAPER algorithm, which makes two passes over the data set: the first pass generates a set of candidate pairs, and the second pass computes the correlation coefficients of those candidates directly. The efficiency of the algorithm depends greatly on the selectivity (pruning power) of its candidate-generation stage.

In this work, we adopt the general framework of TAPER but propose a different candidate-generation method. For a pair of items, TAPER's candidate generation considers only the supports (frequencies) of the individual items. Our method also takes the support of the pair into account, yet without explicitly counting it: we give a simple randomized algorithm whose false-negative probability is negligible. The space and time complexities of candidate generation in our algorithm are asymptotically the same as TAPER's. Experiments on synthetic and real data show that our algorithm produces a greatly reduced candidate set, which can be several orders of magnitude smaller than the one generated by TAPER. Consequently, our algorithm uses much less memory and can be faster; the smaller memory footprint is critical for dealing with massive data.
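To make the pruning idea concrete, here is a minimal Python sketch (all function names are ours). It shows the phi coefficient computed from supports, TAPER's upper bound from item supports alone, and, as an assumption the abstract does not spell out, a min-hash estimate of pair support in the spirit of the size-estimation framework cited above:

```python
import math

def phi(supp_a, supp_b, supp_ab):
    """Pearson (phi) correlation of two binary attributes, from the
    supports of each item and of the pair (fractions of transactions)."""
    num = supp_ab - supp_a * supp_b
    den = math.sqrt(supp_a * (1 - supp_a) * supp_b * (1 - supp_b))
    return num / den

def taper_upper_bound(supp_a, supp_b):
    """TAPER's upper bound on phi, computed from the two item supports
    alone (Xiong et al., KDD '04); valid for supports strictly in (0, 1)."""
    lo, hi = min(supp_a, supp_b), max(supp_a, supp_b)
    return math.sqrt((lo / hi) * ((1 - hi) / (1 - lo)))

def taper_candidates(supports, threshold):
    """Keep a pair only if its upper bound clears the threshold; pruned
    pairs never have their correlation computed exactly."""
    items = sorted(supports)
    return [
        (a, b)
        for i, a in enumerate(items)
        for b in items[i + 1:]
        if taper_upper_bound(supports[a], supports[b]) >= threshold
    ]

# --- Hypothetical randomized pair-support estimate (our assumption). ---
def minhash_signature(tids, seeds):
    """One min-hash value per seed for a set of transaction ids."""
    return [min(hash((seed, t)) for t in tids) for seed in seeds]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching minima estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def estimated_pair_support(supp_a, supp_b, jaccard):
    """Invert J = supp(AB) / (supp(A) + supp(B) - supp(AB))."""
    return jaccard * (supp_a + supp_b) / (1 + jaccard)
```

With an estimated pair support in hand, a candidate generator can plug it into `phi` and keep only pairs whose estimated correlation clears the threshold, which is where the extra pruning power over a support-only bound would come from; the exact estimator used in the paper may differ from this sketch.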