Assessing data mining results via swap randomization

  • Authors:
  • Aristides Gionis;Heikki Mannila;Taneli Mielikäinen;Panayiotis Tsaparas

  • Affiliations:
  • Yahoo! Research, Barcelona, Spain;University of Helsinki and Helsinki University of Technology, Helsinki, Finland;Nokia Research Center, Palo Alto, CA;Search Labs, Microsoft Research, Mountain View, CA

  • Venue:
  • ACM Transactions on Knowledge Discovery from Data (TKDD)
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.