On the distance concentration awareness of certain data reduction techniques

Authors:
Ata Kabán
Affiliations:
School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
Venue:
Pattern Recognition
Year:
2011

Abstract

We present a first investigation into a recently raised concern: whether existing data analysis techniques remain suitable given the counter-intuitive properties of high-dimensional data spaces, such as the phenomenon of distance concentration. Under the structural assumption of a generic linear model with a latent variable and additive unstructured noise, we find that dimension reduction that explicitly guards against distance concentration recovers the well-known techniques of Fisher's linear discriminant analysis, Fisher's discriminant ratio and a variant of projection pursuit. Extending the analysis to regression uncovers a close link to sure independence screening, a recently proposed technique for variable selection in ultra-high dimensional feature spaces. Hence, these techniques may be seen as distance concentration aware, even though they were not explicitly designed to have this property. Throughout our analysis, we make no assumptions about the distributions of the variables involved beyond the dependency structure implied by this linear model.
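
As an illustration of the distance concentration phenomenon the abstract refers to, the following minimal sketch estimates the relative variance of the Euclidean norm, Var(||x||) / E(||x||)^2, for points of increasing dimension. It is not taken from the paper: the i.i.d. uniform data, the sample size, the chosen dimensions and the helper name `relative_variance` are all assumptions made for illustration, and the paper's latent linear model is not simulated here. A relative variance that shrinks towards zero as the dimension grows is one common way of describing distance concentration.

```python
import numpy as np


def relative_variance(dim, n_points=2000, seed=0):
    """Estimate Var(||x||) / E(||x||)^2 for i.i.d. uniform points in [0, 1]^dim.

    Hypothetical helper for illustration only; uniform data is an assumption
    of this sketch, not the model analysed in the paper.
    """
    rng = np.random.default_rng(seed)
    x = rng.random((n_points, dim))      # n_points samples, each of length dim
    norms = np.linalg.norm(x, axis=1)    # Euclidean distance of each point to the origin
    return norms.var() / norms.mean() ** 2


dims = [2, 10, 100, 1000]
rel_vars = [relative_variance(d) for d in dims]

for d, rv in zip(dims, rel_vars):
    print(f"dim={d:5d}  relative variance={rv:.5f}")
```

With unstructured data of this kind the printed values should decrease as the dimension increases, meaning that distances become relatively harder to tell apart. The abstract's point is that data with latent structure need not behave this way, and that reduction techniques guarding against concentration exploit that structure.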