A statistical significance testing approach to mining the most informative set of patterns

  • Authors:
  • Jefrey Lijffijt; Panagiotis Papapetrou; Kai Puolamäki

  • Affiliations:
  • Department of Information and Computer Science, Aalto University, Aalto, Finland 00076; Department of Information and Computer Science, Aalto University, Aalto, Finland 00076 and Department of Computer Science and Information Systems, Birkbeck, University of London Malet street, Lond ...; Department of Information and Computer Science, Aalto University, Aalto, Finland 00076 and Finnish Institute of Occupational Health, Topeliuksenkatu, Helsinki, Finland FI-00025

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2014

Abstract

Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data and to formulate data mining problems in terms of statistical significance.
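
The abstract mentions a greedy algorithm but gives no details, so the following is only a minimal sketch of generic greedy forward selection, not the authors' procedure. Everything in it is an assumption made for illustration: the function name `greedy_select`, the toy `coverage` score (the fraction of data rows matched by the chosen patterns, standing in for the paper's global p value objective, which is not reproduced here), and the example patterns A to D.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_select(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    k: int,
) -> List[str]:
    """Pick up to k candidates, each time adding the one that raises `score` most.

    `score` is a hypothetical set-valued objective supplied by the caller;
    selection stops early when no remaining candidate improves it.
    """
    selected: List[str] = []
    remaining = list(candidates)
    current = score(frozenset())
    for _ in range(k):
        best, best_score = None, current
        for c in remaining:
            s = score(frozenset(selected) | {c})
            if s > best_score:  # strict improvement; ties keep the earlier candidate
                best, best_score = c, s
        if best is None:  # nothing improves the objective any further
            break
        selected.append(best)
        remaining.remove(best)
        current = best_score
    return selected


# Toy stand-in objective (an assumption, NOT the paper's global p value):
# each hypothetical pattern "explains" a set of data rows.
covers = {"A": {0, 1, 2, 3}, "B": {2, 3, 4}, "C": {4, 5}, "D": {0, 1}}
n_rows = 6


def coverage(chosen: FrozenSet[str]) -> float:
    """Fraction of rows explained by at least one chosen pattern."""
    explained = set()
    for name in chosen:
        explained |= covers[name]
    return len(explained) / n_rows


result = greedy_select(list(covers), coverage, k=2)
print(result)
```

In this toy setting the first round picks A (four of six rows), and the second round picks C because it explains the two rows A leaves open. To approximate the setting described in the abstract, `coverage` would have to be replaced by an objective derived from a constrained null model, which this sketch does not attempt.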