Characterising the difference

  • Authors: Jilles Vreeken, Matthijs van Leeuwen, Arno Siebes
  • Affiliations: Universiteit Utrecht (all three authors)
  • Venue: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year: 2007

Abstract

Characterising the differences between two databases is a frequently occurring problem in data mining. Detecting change over time is a prime example; comparing databases from two branches is another. The key problem is to discover the patterns that describe the difference. Emerging patterns provide only a partial answer to this question. In previous work, we showed that the data distribution can be captured in a pattern-based model using compression [12]. Here, we extend this approach to define a generic dissimilarity measure on databases. Moreover, we show that this approach can identify those patterns that characterise the differences between two distributions. Experimental results show that our method provides a well-founded way to independently measure database dissimilarity and allows for thorough inspection of the actual differences. This illustrates the use of our approach in real-world data mining.
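
The idea of a compression-based dissimilarity measure can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not give the formula, and the pattern-based model of [12] is replaced here by a much simpler stand-in, namely Laplace-smoothed single-item frequencies with Shannon code lengths. The function names (`build_model`, `encoded_size`, `dissimilarity`) and the particular form of the score (the larger of the two relative increases in encoded size when a database is encoded with the other database's model) are assumptions made for illustration only.

```python
import math
from collections import Counter


def build_model(db, alphabet):
    """Laplace-smoothed item probabilities over a shared alphabet.

    Assumption: a stand-in for a pattern-based compression model.
    """
    counts = Counter(item for transaction in db for item in transaction)
    total = sum(counts.values()) + len(alphabet)
    return {item: (counts[item] + 1) / total for item in alphabet}


def encoded_size(db, model):
    """Shannon code length, in bits, of every item occurrence in db."""
    return sum(-math.log2(model[item]) for transaction in db for item in transaction)


def relative_increase(cross, own):
    """Extra bits needed with the foreign model, relative to the own model."""
    return (cross - own) / own if own > 0 else 0.0


def dissimilarity(db_x, db_y):
    """Hypothetical score: the larger of the two relative size increases."""
    alphabet = {item for transaction in db_x + db_y for item in transaction}
    model_x = build_model(db_x, alphabet)
    model_y = build_model(db_y, alphabet)
    own_x = encoded_size(db_x, model_x)
    own_y = encoded_size(db_y, model_y)
    cross_x = encoded_size(db_x, model_y)  # db_x encoded with db_y's model
    cross_y = encoded_size(db_y, model_x)  # db_y encoded with db_x's model
    return max(relative_increase(cross_x, own_x),
               relative_increase(cross_y, own_y))


# Toy transaction databases (lists of item sets), invented for this example.
groceries = [{"bread", "milk"}, {"bread", "butter"}, {"milk"}]
snacks = [{"beer", "chips"}, {"beer"}, {"chips", "salsa"}]

print(dissimilarity(groceries, groceries))  # identical data: no extra bits
print(dissimilarity(groceries, snacks))     # disjoint items: clearly larger
```

In this sketch, a database encoded with its own model serves as the baseline, so a score near zero suggests that the two models describe the data about equally well. Inspecting which items (or, in a pattern-based model, which patterns) account for the extra bits would be the analogue of characterising the difference.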