Effective distance functions in high-dimensional data spaces are very important to solutions of many data mining problems. Recent research has shown that if the Pearson variation of the distance distribution converges to zero as dimensionality increases, the distance function becomes unstable (or meaningless) in high-dimensional space, even for the commonly used $L_p$ metrics in Euclidean space. This result has spawned many studies along the same lines. However, the necessary condition for the instability of a distance function, which is required for function design, remains unknown. In this paper, we prove that several important conditions are in fact equivalent to instability. Based on these theoretical results, we derive effective and valid indices for testing the stability of a distance function. In addition, the theoretical analysis suggests that the unstable phenomenon is rooted in the variation of the distance distribution. To demonstrate the theoretical results, we design a meaningful distance function, called the Shrinkage-Divergence Proximity (SDP), based on a given distance function. Empirical results show that SDP significantly outperforms other measures in terms of stability in high-dimensional data space and is thus more suitable for distance-based clustering applications.
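The concentration phenomenon described above can be observed with a minimal numerical sketch. The following illustration (not the paper's own experiment; the function name and data generation are our assumptions) estimates the Pearson variation, i.e., the ratio of standard deviation to mean, of Euclidean distances from a random query to uniformly distributed points, and shows how it shrinks as dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson_variation(dim, n_points=2000):
    """Estimate std/mean of Euclidean distances from a random
    query point to n_points uniform random points in [0,1]^dim."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.std() / dists.mean()

# As dim grows, the variation tends toward zero, so nearest and
# farthest neighbors become nearly indistinguishable.
for dim in (2, 10, 100, 1000):
    print(dim, pearson_variation(dim))
```

Under these assumptions, the printed ratio drops by roughly an order of magnitude between 2 and 1000 dimensions, which is exactly the convergence-to-zero condition the paper links to instability.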