Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization

Authors:
Stephen L. France;J. Douglas Carroll;Hui Xiong
Affiliations:
Lubar School of Business, UW - Milwaukee, 3202 N. Maryland Avenue, Milwaukee, WI 53201-0742, United States;Rutgers Business School, 1 Washington Park, 1 Washington Street, Newark, NJ 07102, United States;Rutgers Business School, 1 Washington Park, 1 Washington Street, Newark, NJ 07102, United States
Venue:
Information Sciences: an International Journal
Year:
2012

Abstract

Previous work has shown that the Minkowski-p distance metrics are unsuitable for clustering very high-dimensional document data. We extend this work. We frame statistical theory on the relationships between the Euclidean, cosine, and correlation distance metrics in terms of item neighborhoods. We discuss the differences between the cosine and correlation distance metrics and illustrate our discussion with an example from collaborative filtering. We introduce a family of normalized Minkowski metrics and test their use on both document data and synthetic data generated from the uniform distribution. We describe a range of criteria for testing neighborhood homogeneity relative to underlying latent classes. We discuss how these criteria are explicitly and implicitly linked to classification performance. By testing both normalized and non-normalized Minkowski-p metrics for multiple values of p, we separate distance compression effects from normalization effects. For multi-class classification problems, we believe that distance compression on high-dimensional data aids classification and data analysis. For document data, we find that the cosine (and normalized Euclidean), correlation, and proportioned city block metrics give strong neighborhood recovery. The proportioned city block metric gives particularly good results for nearest neighbor recovery and should be used when applying document data analysis techniques for which nearest neighborhood recovery is important. For data generated from the uniform distribution, neighborhood recovery improves as the value of p increases.