Tell me something I don't know: randomization strategies for iterative data mining

Authors:
Sami Hanhijärvi;Markus Ojala;Niko Vuokko;Kai Puolamäki;Nikolaj Tatti;Heikki Mannila
Affiliations:
Helsinki University of Technology, Espoo, Finland;Helsinki University of Technology, Espoo, Finland;Helsinki University of Technology, Espoo, Finland;Helsinki University of Technology, Espoo, Finland;Helsinki University of Technology, Espoo, Finland;Helsinki University of Technology, Espoo, Finland
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Abstract

There is a wide variety of data mining methods available, and in exploratory data analysis it is generally useful to apply many different methods to the same dataset. This, however, raises the question of whether the results found by one method reflect the phenomenon shown by the results of another method, or whether they depict in some sense unrelated properties of the data. For example, clustering can indicate a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, the correlations may actually be determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining: at each step of the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics of the data (e.g., cluster centers or co-occurrence counts), the randomization methods sample data sets whose values of the given statistics are similar to those of the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
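To make the sampling idea concrete, the following is a minimal sketch of Metropolis sampling with local swaps on a 0-1 data matrix. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the energy function, so the choice here (absolute difference between the co-occurrence count of one itemset in the sample and in the original data) is an assumption, as are the names `itemset_support`, `metropolis_swap`, `weight`, and `steps`. Each local swap replaces a 2x2 submatrix of the form [[1, 0], [0, 1]] with [[0, 1], [1, 0]], which keeps all row and column sums unchanged.

```python
import math
import random


def itemset_support(data, itemset):
    """Count the rows in which every column of `itemset` is 1."""
    return sum(1 for row in data if all(row[c] for c in itemset))


def metropolis_swap(data, itemset, steps=10000, weight=2.0, seed=0):
    """Sample a 0-1 matrix with the same margins as `data` whose support
    for `itemset` stays close to the original value.

    Assumption (not from the source): the energy of a sample is
    |support(sample) - support(data)|, and larger `weight` keeps the
    sampled statistic closer to the original.
    """
    rng = random.Random(seed)
    target = itemset_support(data, itemset)
    current = [row[:] for row in data]  # work on a copy
    n_rows, n_cols = len(current), len(current[0])
    energy = 0  # the chain starts at the original data, so the difference is 0

    for _ in range(steps):
        i, k = rng.randrange(n_rows), rng.randrange(n_rows)
        j, l = rng.randrange(n_cols), rng.randrange(n_cols)
        # A swap is possible only if the selected 2x2 submatrix is [[1, 0], [0, 1]].
        if not (current[i][j] == 1 and current[k][l] == 1
                and current[i][l] == 0 and current[k][j] == 0):
            continue
        # Apply the swap tentatively; row and column sums are unchanged.
        current[i][j] = current[k][l] = 0
        current[i][l] = current[k][j] = 1
        new_energy = abs(itemset_support(current, itemset) - target)
        # Metropolis rule: always accept improvements, otherwise accept
        # with probability exp(-weight * increase in energy).
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-weight * delta):
            energy = new_energy
        else:
            # Reject: undo the swap.
            current[i][j] = current[k][l] = 1
            current[i][l] = current[k][j] = 0
    return current
```

In an iterative setting, one would draw many such samples, run the next mining method on each, and compare its result on the original data against the distribution obtained from the samples. Recomputing the support from scratch at every step is done here for clarity only; an incremental update over the two affected rows would be cheaper.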