Distances between Data Sets Based on Summary Statistics

Authors:
Nikolaj Tatti
Venue:
The Journal of Machine Learning Research
Year:
2007

Abstract

The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from them. Our initial definition of the distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and that it has several intuitive properties. We further show that it is the unique Mahalanobis distance satisfying certain assumptions. Finally, we demonstrate that for binary data sets the distance can be represented naturally by certain parity functions and evaluated in linear time. Our empirical tests with real-world data show that the distance works well.
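
To make the binary case concrete, the following is a minimal sketch rather than the paper's exact definition. It assumes that the summary statistics are parity frequencies (the fraction of rows in which a chosen set of columns contains an odd number of ones) and that the covariance in the Mahalanobis form is taken under the uniform distribution over binary vectors. Under that assumption, distinct parity functions are uncorrelated with variance 1/4, so the distance reduces to twice the Euclidean distance between the frequency vectors. The names `parity_frequencies` and `parity_distance` and the toy data are hypothetical.

```python
from math import sqrt


def parity_frequencies(data, itemsets):
    """For each itemset, return the fraction of rows in which the
    selected columns contain an odd number of ones."""
    n = len(data)
    return [
        sum(sum(row[i] for i in itemset) % 2 for row in data) / n
        for itemset in itemsets
    ]


def parity_distance(data1, data2, itemsets):
    """Illustrative distance between two binary data sets.

    Assumption: the covariance of the parity functions under the uniform
    distribution is 0.25 * I, so the Mahalanobis form collapses to
    2 * (Euclidean distance between the parity frequency vectors).
    """
    theta1 = parity_frequencies(data1, itemsets)
    theta2 = parity_frequencies(data2, itemsets)
    return 2.0 * sqrt(sum((a - b) ** 2 for a, b in zip(theta1, theta2)))


# Toy data: both data sets have identical column margins,
# but the two columns are coupled differently.
D1 = [[0, 0], [1, 1]]
D2 = [[0, 1], [1, 0]]

singletons = [(0,), (1,)]
with_pair = [(0,), (1,), (0, 1)]

print(parity_distance(D1, D2, singletons))  # margins alone cannot tell them apart
print(parity_distance(D1, D2, with_pair))   # the pairwise parity separates them
```

Each data set is scanned once per itemset, so for a fixed collection of itemsets the cost grows linearly with the number of rows, in line with the linear-time evaluation mentioned in the abstract.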