Assessing data mining results via swap randomization

Authors:
Aristides Gionis;Heikki Mannila;Taneli Mielikäinen;Panayiotis Tsaparas
Affiliations:
University of Helsinki & Helsinki University of Technology;University of Helsinki & Helsinki University of Technology;University of Helsinki;University of Helsinki & Helsinki University of Technology
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

The problem of assessing the significance of data mining results on high-dimensional 0-1 data sets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done with, e.g., chi-square tests or many other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are more difficult to apply to sets of patterns or other complex results of data mining. In this paper, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances, and comparing them against the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and rankings. To generate random datasets with given margins, we use variations of a Markov chain approach, which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is a random artifact, while for other datasets the discovered structure conveys meaningful information.
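
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify which Markov chain variants the paper uses; the version below is one plain variant in which a failed swap attempt simply leaves the matrix unchanged. The function names (`swap_randomize`, `pair_support`, `empirical_p_value`), the number of swap attempts, the number of samples, and the add-one form of the empirical p-value are all assumptions made for illustration.

```python
import random


def swap_randomize(matrix, num_attempts, rng):
    """Return a randomized copy of a 0-1 matrix with unchanged row and column sums.

    Each attempt picks two 1-entries (i, j) and (k, l). If entries (i, l) and
    (k, j) are both 0, the two 1s are moved there; this keeps every row sum and
    every column sum intact. Otherwise the attempt changes nothing.
    """
    data = [list(row) for row in matrix]  # work on a copy
    ones = [(i, j) for i, row in enumerate(data) for j, v in enumerate(row) if v == 1]
    if len(ones) < 2:
        return data
    for _ in range(num_attempts):
        a = rng.randrange(len(ones))
        b = rng.randrange(len(ones))
        (i, j), (k, l) = ones[a], ones[b]
        # The check below fails automatically when i == k or j == l.
        if data[i][l] == 0 and data[k][j] == 0:
            data[i][j] = data[k][l] = 0
            data[i][l] = data[k][j] = 1
            ones[a], ones[b] = (i, l), (k, j)
    return data


def pair_support(matrix, a=0, b=1):
    """Hypothetical statistic: number of rows containing both column a and column b."""
    return sum(1 for row in matrix if row[a] and row[b])


def empirical_p_value(matrix, statistic, num_samples=200, num_attempts=500, seed=0):
    """Fraction of randomized datasets whose statistic is at least the observed one.

    Uses the add-one convention (count + 1) / (num_samples + 1), an assumption
    of this sketch rather than something stated in the abstract.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)
    at_least = 0
    for _ in range(num_samples):
        sample = swap_randomize(matrix, num_attempts, rng)
        if statistic(sample) >= observed:
            at_least += 1
    return (at_least + 1) / (num_samples + 1)


if __name__ == "__main__":
    toy = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
    ]
    print("observed support of columns {0, 1}:", pair_support(toy))
    print("empirical p-value:", empirical_p_value(toy, pair_support))
```

A small p-value suggests that the observed statistic is unlikely to arise from the row and column margins alone, while a large one suggests the pattern may be an artifact of those margins. How many swap attempts are needed for the chain to mix well is not addressed by this sketch.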