Assessing Data Mining Results on Matrices with Randomization

  • Authors:
  • Markus Ojala

  • Affiliations:
  • -

  • Venue:
  • ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Randomization is a general technique for evaluating the significance of data analysis results. In randomization-based significance testing, a result is considered to be interesting if it is unlikely to obtain as good result on random data sharing some basic properties with the original data. Recently, the randomization approach has been applied to assess data mining results on binary matrices and limited types of real-valued matrices. In these works, the row and column value distributions are approximately preserved in randomization. However, the previous approaches suffer from various technical and practical shortcomings. In this paper, we give solutions to these problems and introduce a new practical algorithm for randomizing various types of matrices while preserving the row and column value distributions more accurately. We propose a new approach for randomizing matrices containing features measured in different scales. Compared to previous work, our approach can be applied to assess data mining results on different types of real-life matrices containing dissimilar features, nominal values, non-Gaussian value distributions, missing values and sparse structure. We provide an easily usable implementation that does not need problematic manual tuning as theoretically justified parameter values are given. We perform extensive experiments on various real-life datasets showing that our approach produces reasonable results on practically all types of matrices while being easy and fast to use.