Randomization methods for assessing data analysis results on real-valued matrices

Authors:
Markus Ojala;Niko Vuokko;Aleksi Kallio;Niina Haiminen;Heikki Mannila
Affiliations:
HIIT, Department of Information and Computer Science, Helsinki University of Technology, Finland;HIIT, Department of Information and Computer Science, Helsinki University of Technology, Finland;CSC - IT Center for Science Ltd, Finland;HIIT, Department of Computer Science, University of Helsinki, Finland;HIIT, Department of Information and Computer Science, Helsinki University of Technology, Finland and HIIT, Department of Computer Science, University of Helsinki, Finland
Venue:
Statistical Analysis and Data Mining
Year:
2009

Abstract

Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the same measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that have the same row and column distributions of values as the original dataset. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column statistics or were due to some more interesting phenomena in the data. We study the problem of generating such randomized datasets. We describe methods based on local transformations and Metropolis sampling, and we evaluate their performance on both real and generated data. We also show how the methods can be applied to a real data analysis scenario on DNA microarray data. The results indicate that the methods work efficiently and are usable in significance testing of data mining results on real-valued matrices.

Copyright © 2009 Wiley Periodicals, Inc., A Wiley Company
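
The comparison of a measure on the original data against the same measure on randomized samples can be illustrated with a minimal Python sketch. It is not the paper's method: it swaps in a much simpler null model that shuffles values within each column, which preserves column distributions exactly but not row distributions, whereas the abstract describes local transformations with Metropolis sampling that maintain both. The function names, the row-sum-variance statistic, and the toy matrix are all hypothetical choices made for illustration.

```python
import random


def permute_within_columns(matrix, rng):
    """Return a copy of `matrix` with each column shuffled independently.

    Simplified stand-in null model (an assumption, not the paper's sampler):
    column value distributions are kept exactly, row distributions are not.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        rng.shuffle(col)
        columns.append(col)
    # Reassemble the shuffled columns into row-major form.
    return [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


def row_sum_variance(matrix):
    """Hypothetical measure of interest: variance of the row sums."""
    sums = [sum(row) for row in matrix]
    mean = sum(sums) / len(sums)
    return sum((s - mean) ** 2 for s in sums) / len(sums)


def empirical_p_value(matrix, statistic, n_samples=999, seed=0):
    """One-sided empirical p-value of `statistic` under the null model.

    Counts randomized samples whose statistic is at least as large as the
    observed one; the +1 terms count the original data as one sample.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)
    at_least_as_large = sum(
        1
        for _ in range(n_samples)
        if statistic(permute_within_columns(matrix, rng)) >= observed
    )
    return (at_least_as_large + 1) / (n_samples + 1)


# Toy matrix with strong row structure: every value in row i is close to i.
toy = [[i + 0.1 * j for j in range(5)] for i in range(8)]
p = empirical_p_value(toy, row_sum_variance)
print(f"empirical p-value: {p:.3f}")
```

A small p-value here says only that the row structure of the toy matrix is not explained by its column distributions. Testing against row and column distributions jointly, as the abstract proposes, would require replacing `permute_within_columns` with a sampler that maintains both.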