Multiple hypothesis testing in pattern discovery

Authors:
Sami Hanhijärvi
Affiliations:
Department of Information and Computer Science, Aalto University, Finland
Venue:
DS'11 Proceedings of the 14th international conference on Discovery science
Year:
2011

References:
Detecting Group Differences: Mining Contrast Sets. Data Mining and Knowledge Discovery.
On the discovery of significant statistical quantitative rules. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.
An Efficient Algorithm for Discovering Frequent Subgraphs. IEEE Transactions on Knowledge and Data Engineering.
Discovering significant rules. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.
Discovering Significant Patterns. Machine Learning.
Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data (TKDD).
Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Machine Learning.
Tell me something I don't know: randomization strategies for iterative data mining. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.
Randomization methods for assessing data analysis results on real-valued matrices. Statistical Analysis and Data Mining.

Abstract

The problem of multiple hypothesis testing arises when more than one hypothesis is tested simultaneously for statistical significance. This situation is very common in data mining applications. For instance, simultaneously assessing the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I errors). Our contribution in this paper is to extend the multiple hypothesis testing framework so that it can be used in a generic data mining setting. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive). We show the power of our solution on real data.
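
To make the notion of FWER control concrete, the following is a minimal sketch of the classical Holm-Bonferroni step-down procedure, a standard textbook way to keep the FWER at or below a chosen level alpha. It is not the method proposed in the paper, which the abstract does not describe in detail. The function name `holm_reject`, the default alpha = 0.05, and the example p-values are all assumptions made for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure (illustrative sketch).

    Returns a list of booleans, in the input order, that is True
    where the corresponding null hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared to alpha / (m - rank).
        if p_values[i] > alpha / (m - rank):
            break  # step-down: stop at the first p-value that fails
        reject[i] = True
    return reject


# Hypothetical p-values, e.g. one per mined itemset.
p = [0.001, 0.04, 0.03, 0.005]
print(holm_reject(p))  # [True, False, False, True]
```

In this example the two smallest p-values pass their thresholds (0.0125 and about 0.0167), while 0.03 exceeds 0.025, so the procedure stops there and retains the remaining hypotheses. Procedures of this kind assume the set of hypotheses is fixed in advance, which is one reason a generic data mining setting may call for a different approach.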