Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization
Information Sciences: an International Journal
Previous work in the document clustering literature has shown that the Minkowski-p distance metrics are unsuitable for clustering very high-dimensional document data. This unsuitability is usually attributed to the "compression" of distances that the Minkowski-p metrics produce on high-dimensional data. Experimental work on distance compression has generally used the performance of clustering algorithms on the distances produced by the different metrics as a proxy for the quality of the distance representations those metrics create. To separate the effect of the distances from the behavior of the clustering algorithms, we instead test the homogeneity of the latent classes with respect to item neighborhoods, rather than the homogeneity of clustering solutions with respect to latent classes. We derive the theoretical relationships between the cosine, correlation, and Euclidean metrics, and posit that part of the performance advantage of the cosine and correlation metrics over the Minkowski-p metrics is due to their inbuilt normalization. This normalization effect decreases with increasing dimensionality, while the distance compression effect increases with it. For document datasets with dimensionality up to 20,000, the normalization effect dominates the distance compression effect. We propose a methodology for measuring the relative sizes of the normalization and distance compression effects.
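The two effects the abstract describes can be illustrated with a small NumPy sketch on synthetic uniform data (not the paper's document corpora, and only an illustration of the general phenomena, not the paper's methodology): as dimensionality grows, the ratio of the farthest to the nearest Euclidean distance from a query point shrinks toward 1 (distance compression/concentration), while on L2-normalized vectors squared Euclidean distance reduces to an affine function of cosine similarity, which is the "inbuilt normalization" linking the cosine, correlation, and Euclidean metrics.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Distance compression: with rising dimensionality, Euclidean distances
# from a query to random points become relatively less spread out, so the
# max/min distance ratio drifts toward 1 and neighborhoods lose contrast.
for d in (10, 100, 10_000):
    points = rng.random((500, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:>6}: max/min distance ratio = {dists.max() / dists.min():.2f}")

# (2) Metric relationships: on L2-normalized vectors,
#     ||x - y||^2 = 2 * (1 - cos(x, y)),
# and correlation similarity is cosine similarity on mean-centered vectors.
x, y = rng.random(50), rng.random(50)
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
cos_sim = xn @ yn
assert np.isclose(np.sum((xn - yn) ** 2), 2 * (1 - cos_sim))

xc, yc = x - x.mean(), y - y.mean()
corr = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
assert np.isclose(corr, np.corrcoef(x, y)[0, 1])
```

Because cosine and correlation carry this normalization for free, any comparison against raw Minkowski-p distances mixes the normalization benefit with the compression penalty, which is the confound the proposed methodology aims to measure separately.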