A statistical significance testing approach to mining the most informative set of patterns

Authors:
Jefrey Lijffijt;Panagiotis Papapetrou;Kai Puolamäki
Affiliations:
Department of Information and Computer Science, Aalto University, Aalto, Finland 00076;Department of Information and Computer Science, Aalto University, Aalto, Finland 00076 and Department of Computer Science and Information Systems, Birkbeck, University of London Malet street, Lond ...;Department of Information and Computer Science, Aalto University, Aalto, Finland 00076 and Finnish Institute of Occupational Health, Topeliuksenkatu, Helsinki, Finland FI-00025
Venue:
Data Mining and Knowledge Discovery
Year:
2014

Abstract

Hypothesis testing with constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains the most about the data in terms of a global p-value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that no efficient algorithm with a finite approximation ratio exists. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data and to formulate data mining problems in terms of statistical significance.
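
The greedy strategy mentioned in the abstract can be illustrated with a generic selection loop. The following is only a minimal sketch under stated assumptions, not the authors' algorithm: the names `greedy_select`, `score`, `toy_score`, and `TOY_COVERAGE` are hypothetical, and the toy objective (fraction of items covered) merely stands in for a quantity derived from the global p-value under a null model constrained by the selected patterns, which the abstract does not specify.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Set


def greedy_select(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    max_size: int,
) -> List[str]:
    """Greedily add the candidate that most improves score(selected).

    `score` is an assumed, user-supplied objective to maximize; in the
    abstract's setting it would be derived from the global p-value.
    """
    selected: List[str] = []
    remaining = list(candidates)
    current = score(frozenset(selected))
    while remaining and len(selected) < max_size:
        # Pick the candidate whose addition yields the highest score.
        best = max(remaining, key=lambda c: score(frozenset(selected + [c])))
        best_score = score(frozenset(selected + [best]))
        if best_score <= current:
            break  # no remaining candidate improves the objective
        selected.append(best)
        remaining.remove(best)
        current = best_score
    return selected


# Toy stand-in objective (an assumption for illustration only): each
# hypothetical pattern "explains" some of five data items, and the score
# is the fraction of items explained by the selected patterns.
TOY_COVERAGE: Dict[str, Set[int]] = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}


def toy_score(selected: FrozenSet[str]) -> float:
    covered: Set[int] = set()
    for name in selected:
        covered |= TOY_COVERAGE[name]
    return len(covered) / 5


chosen = greedy_select(list(TOY_COVERAGE), toy_score, max_size=2)
```

The loop stops early when no candidate improves the objective, so the returned set can be smaller than `max_size`. Any real use would replace `toy_score` with a significance computation appropriate to the data and null model.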