TAPER: A Two-Step Approach for All-Strong-Pairs Correlation Query in Large Databases

Authors:
Hui Xiong;Shashi Shekhar;Pang-Ning Tan;Vipin Kumar
Affiliations:
IEEE;IEEE;IEEE;IEEE
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2006

Abstract

Given a user-specified minimum correlation threshold θ and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs whose correlations are above the threshold θ. When the numbers of items and transactions are large, however, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. To this end, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties that allow many item pairs to be pruned without even computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that, in data sets with Zipf-like or linear rank-support distributions, the computational savings from pruning either are independent of the number of items or improve as it increases. Experimental results on synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlations and the uncentered Pearson's correlation coefficient.
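
The filter-and-refine idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the upper bound, so the sketch assumes the bound that follows from supp(AB) ≤ min(supp(A), supp(B)): for supp(A) ≥ supp(B), φ ≤ sqrt(supp(B)/supp(A)) · sqrt((1 − supp(A))/(1 − supp(B))). The function names `phi`, `phi_upper_bound`, and `strong_pairs`, the use of transaction-ID sets, and the choice of `>=` for the threshold test are all illustrative assumptions.

```python
from math import sqrt

def phi(n, n_a, n_b, n_ab):
    """Pearson correlation (phi coefficient) of two binary items, from counts."""
    return (n * n_ab - n_a * n_b) / sqrt(n_a * (n - n_a) * n_b * (n - n_b))

def phi_upper_bound(n, n_a, n_b):
    """Support-only bound on phi, obtained by setting n_ab = min(n_a, n_b)."""
    hi, lo = max(n_a, n_b), min(n_a, n_b)
    return sqrt(lo * (n - hi) / (hi * (n - lo)))

def strong_pairs(transactions, theta):
    """Return (a, b, phi) for every item pair with phi >= theta (assumed test)."""
    n = len(transactions)
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    # phi is undefined for an item present in every transaction, so skip those.
    # Sort by decreasing support so the bound shrinks along each inner scan.
    order = sorted((i for i in tidsets if len(tidsets[i]) < n),
                   key=lambda i: (-len(tidsets[i]), i))
    result = []
    for idx, a in enumerate(order):
        n_a = len(tidsets[a])
        for b in order[idx + 1:]:
            n_b = len(tidsets[b])
            # Filter step: uses only the two supports, no co-occurrence count.
            if phi_upper_bound(n, n_a, n_b) < theta:
                # For fixed a, the bound only decreases as supp(b) decreases,
                # so every remaining b can be pruned without computing its bound.
                break
            # Refine step: compute the exact correlation for surviving pairs.
            r = phi(n, n_a, n_b, len(tidsets[a] & tidsets[b]))
            if r >= theta:
                result.append((a, b, r))
    return result
```

Calling `strong_pairs(transactions, theta)` on a list of item sets returns the surviving pairs with their exact correlations. The early `break` is where this sketch uses a monotone property of the bound: with items sorted by decreasing support, one failed bound for a fixed first item rules out all later partners.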