Comparing apples and oranges: measuring differences between data mining results

Authors:
Nikolaj Tatti;Jilles Vreeken
Affiliations:
Advanced Database Research and Modeling, Universiteit Antwerpen;Advanced Database Research and Modeling, Universiteit Antwerpen
Venue:
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part III
Year:
2011

Abstract

Deciding whether the results of two different mining algorithms provide significantly different information is an important open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or to decide which mining approach will likely provide the most novel insight, it is essential that we can tell how different the information is that two results provide. In this paper we take a first step towards comparing exploratory results on binary data. We propose to meaningfully convert results into sets of noisy tiles, and to compare these sets by Maximum Entropy modelling and Kullback-Leibler divergence. The measure we construct this way is flexible, and allows us to naturally include background knowledge, such that differences in results can be measured from the perspective of what a user already knows. Furthermore, adding to its interpretability, it coincides with Jaccard dissimilarity when we only consider exact tiles. Our approach provides a means to study and tell differences between results of different data mining methods. As an application, we show that it can also be used to identify which parts of results best redescribe other results. Experimental evaluation shows that our measure gives meaningful results, correctly identifies methods that are similar in nature, and automatically provides sound redescriptions of results.
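To make the exact-tile special case concrete, the following is a minimal sketch of a Jaccard dissimilarity between two results on binary data. It is an illustration under stated assumptions, not the measure defined in the paper: it assumes each result is a list of exact tiles, that a tile is a hypothetical `(rows, columns)` pair, and that the sets being compared are the cells covered by each result. The function names `covered_cells` and `jaccard_dissimilarity` are made up for this example, and the Maximum Entropy and Kullback-Leibler machinery used for noisy tiles is not shown.

```python
from itertools import product


def covered_cells(tiles):
    """Return the set of (row, column) cells covered by a list of tiles.

    Assumption for illustration: each tile is a (rows, columns) pair of
    iterables, and an exact tile covers every cell in rows x columns.
    """
    cells = set()
    for rows, cols in tiles:
        cells.update(product(rows, cols))
    return cells


def jaccard_dissimilarity(tiles_a, tiles_b):
    """Jaccard dissimilarity between the cells covered by two tile sets."""
    a = covered_cells(tiles_a)
    b = covered_cells(tiles_b)
    union = a | b
    if not union:
        # Two empty results carry the same (empty) information.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two toy results, each consisting of a single 2x2 exact tile.
result_a = [({0, 1}, {0, 1})]
result_b = [({1, 2}, {1, 2})]

# The tiles share only cell (1, 1); together they cover 7 cells.
distance = jaccard_dissimilarity(result_a, result_b)
```

In this toy case the two results overlap in one of seven covered cells, so the dissimilarity is 1 - 1/7. Identical results give 0 and results with no shared cells give 1, which matches the intuition that the measure reports how much of the information in one result is absent from the other.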