Assessing data mining results via swap randomization

Authors:
Aristides Gionis;Heikki Mannila;Taneli Mielikäinen;Panayiotis Tsaparas
Affiliations:
Yahoo! Research, Barcelona, Spain;University of Helsinki and Helsinki University of Technology, Helsinki, Finland;Nokia Research Center, Palo Alto, CA;Search Labs, Microsoft Research, Mountain View, CA
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2007



Abstract

The problem of assessing the significance of data mining results on high-dimensional 0–1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.
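
The swap idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`swap_randomize`, `pair_support`, `empirical_p_value`), the choice of statistic (co-occurrence count of two columns), the default number of attempted swaps, and the add-one p-value estimate are all assumptions made for illustration. A swap picks two 1-entries at positions (i, j) and (k, l); if the entries at (i, l) and (k, j) are both 0, the 1s are moved there, which leaves every row sum and column sum unchanged.

```python
import random


def swap_randomize(data, num_attempts, rng):
    """Return a copy of a 0-1 matrix after `num_attempts` attempted swaps.

    Row and column margins of the copy equal those of `data`.
    """
    rows = [row[:] for row in data]
    # Positions of all 1-entries; kept in sync with `rows` as swaps happen.
    ones = [(i, j) for i, row in enumerate(rows)
            for j, v in enumerate(row) if v == 1]
    if len(ones) < 2:
        return rows
    for _ in range(num_attempts):
        a, b = rng.sample(range(len(ones)), 2)
        (i, j), (k, l) = ones[a], ones[b]
        # The swap is possible only if the two "opposite corners" are 0.
        # This check also rejects pairs in the same row or same column.
        if rows[i][l] == 0 and rows[k][j] == 0:
            rows[i][j] = rows[k][l] = 0
            rows[i][l] = rows[k][j] = 1
            ones[a], ones[b] = (i, l), (k, j)
    return rows


def pair_support(data, a=0, b=1):
    """Illustrative statistic: number of rows with a 1 in both columns."""
    return sum(1 for row in data if row[a] and row[b])


def empirical_p_value(data, statistic, num_samples=200,
                      num_attempts=None, seed=0):
    """Fraction of randomized datasets whose statistic is at least as
    large as the observed one (with add-one smoothing)."""
    rng = random.Random(seed)
    if num_attempts is None:
        # Heuristic assumption: a small multiple of the number of 1s.
        num_attempts = 10 * sum(map(sum, data))
    observed = statistic(data)
    at_least = sum(
        statistic(swap_randomize(data, num_attempts, rng)) >= observed
        for _ in range(num_samples)
    )
    return (at_least + 1) / (num_samples + 1)


if __name__ == "__main__":
    toy = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 0, 0, 1],
    ]
    print("observed support:", pair_support(toy))
    print("empirical p-value:", empirical_p_value(toy, pair_support))
```

A small p-value here would suggest that the observed co-occurrence is not explained by the margins alone, while a large one would suggest it is expected given the row and column sums. How many attempted swaps are needed for the chain to mix well is a separate question that this sketch does not address.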