Finding the most interesting correlations in a database: how hard can it be?

Authors: Christopher Jermaine
Affiliations: Computer and Information Sciences and Engineering Department, University of Florida, Gainesville, FL
Venue: Information Systems
Year: 2005



Abstract

This paper addresses some of the foundational issues associated with discovering the best few correlations from a database. Specifically, we consider the computational complexity of various definitions of the "top-k correlation problem," where the goal is to discover the few sets of events whose co-occurrence exhibits the smallest degree of independence. Our results show that many rigorous definitions of correlation lead to intractable and strongly inapproximable problems. Proof of this inapproximability is significant, since similar problems studied by the computer science theory community have resisted such analysis. One goal of the paper (and of future research) is to develop alternative correlation metrics that both allow efficient search and produce results that satisfy users.
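
To make the "top-k correlation problem" concrete, the following is a minimal brute-force sketch, not the paper's method. It assumes one particular correlation measure, lift (observed joint frequency divided by the frequency expected under independence), and ranks by highest lift only, which is a simplification of "smallest degree of independence." The names `lift`, `top_k_correlations`, `max_size`, and the toy `baskets` data are hypothetical and chosen for illustration.

```python
from itertools import combinations
from math import prod


def lift(transactions, itemset):
    """Observed joint frequency over the frequency expected under independence."""
    n = len(transactions)
    joint = sum(1 for t in transactions if itemset <= t) / n
    expected = prod(
        sum(1 for t in transactions if item in t) / n for item in itemset
    )
    return joint / expected if expected > 0 else 0.0


def top_k_correlations(transactions, k, max_size=3):
    """Score every itemset of size 2..max_size and return the k highest-lift ones.

    Assumption: lift stands in for the correlation metric; the abstract does
    not fix a specific one.
    """
    items = sorted(set().union(*transactions))
    scored = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            scored.append((lift(transactions, frozenset(combo)), combo))
    # Highest lift first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Hypothetical toy database of market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"eggs"},
]

best = top_k_correlations(baskets, k=2, max_size=2)
for score, combo in best:
    print(combo, round(score, 3))
```

The sketch enumerates every candidate itemset up to `max_size`, so the number of sets it scores grows combinatorially with the number of items and the allowed set size. That exhaustive search is what an efficient algorithm for the problem would need to avoid.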