Previous work has shown that Minkowski-p distance metrics are unsuitable for clustering very high dimensional document data. We extend this work. We frame statistical theory on the relationships between the Euclidean, cosine, and correlation distance metrics in terms of item neighborhoods. We discuss the differences between the cosine and correlation distance metrics and illustrate them with an example from collaborative filtering. We introduce a family of normalized Minkowski metrics and test them on both document data and synthetic data generated from the uniform distribution. We describe a range of criteria for testing neighborhood homogeneity relative to underlying latent classes, and we discuss how these criteria are explicitly and implicitly linked to classification performance. By testing both normalized and non-normalized Minkowski-p metrics for multiple values of p, we separate distance compression effects from normalization effects. For multi-class classification problems, we believe that distance compression on high dimensional data aids classification and data analysis. For document data, we find that the cosine (and normalized Euclidean), correlation, and proportioned city block metrics give strong neighborhood recovery. The proportioned city block metric gives particularly good results for nearest neighbor recovery and should be used with document data analysis techniques for which nearest neighborhood recovery is important. For data generated from the uniform distribution, neighborhood recovery improves as the value of p increases.
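The metrics named in the abstract can be illustrated with a minimal sketch. It assumes that "normalized" means scaling each vector to unit p-norm before taking the Minkowski-p distance, so that p = 1 would correspond to a proportioned city block metric; this is an assumption for illustration, not the paper's stated definition. The helper `knn_label_agreement` is likewise a hypothetical, simplified stand-in for the neighborhood homogeneity criteria. All function names are illustrative.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski-p distance between two 1-D arrays."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def normalized_minkowski(x, y, p):
    """Minkowski-p distance after scaling each vector to unit p-norm.

    Assumption: one plausible reading of a 'normalized Minkowski' metric;
    the paper's definition may differ. Zero vectors are not handled.
    """
    xn = x / np.linalg.norm(x, ord=p)
    yn = y / np.linalg.norm(y, ord=p)
    return minkowski(xn, yn, p)

def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y."""
    return 1.0 - float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def correlation_distance(x, y):
    """Cosine distance computed on mean-centered vectors."""
    return cosine_distance(x - x.mean(), y - y.mean())

def knn_label_agreement(X, labels, k, dist):
    """Mean fraction of each item's k nearest neighbors that share its label.

    Hypothetical homogeneity score, using a brute-force O(n^2) search.
    """
    n = len(X)
    total = 0.0
    for i in range(n):
        # Distance from item i to every other item; exclude i itself.
        d = np.array([dist(X[i], X[j]) if j != i else np.inf for j in range(n)])
        nearest = np.argsort(d)[:k]
        total += np.mean(labels[nearest] == labels[i])
    return total / n

# Two users whose ratings differ only by a constant offset:
# correlation distance ignores the offset, cosine distance does not.
ratings_a = np.array([1.0, 2.0, 3.0])
ratings_b = ratings_a + 5.0
offset_cosine = cosine_distance(ratings_a, ratings_b)
offset_correlation = correlation_distance(ratings_a, ratings_b)
```

For unit-length vectors the squared Euclidean distance equals twice the cosine distance, so `normalized_minkowski(x, y, 2)` and `cosine_distance(x, y)` rank neighbors identically, which is consistent with the abstract grouping the cosine and normalized Euclidean metrics together. The offset example loosely mirrors a collaborative filtering setting in which users rate on different baselines.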