Many machine learning and data mining algorithms rely crucially on similarity metrics. Cosine similarity, which computes the inner product of two normalized feature vectors, is one of the most widely used similarity measures. However, in many practical tasks such as text categorization and document clustering, cosine similarity implicitly assumes that the input space is orthogonal, an assumption that usually does not hold because of synonymy and polysemy. Algorithms such as Latent Semantic Indexing (LSI) address this problem by projecting the original data into an orthogonal space, but LSI suffers from high computational cost and data sparseness, which increase computation time and storage requirements on large-scale realistic data. In this paper, we propose a novel and effective similarity metric for non-orthogonal input spaces. The basic idea of the proposed metric is that the similarity of features should affect the similarity of objects, and vice versa. We then propose a novel iterative algorithm for computing similarity in such non-orthogonal spaces. Experimental results on a synthetic data set, real MSN search click-through logs, and the 20 Newsgroups (20NG) data set show that our algorithm outperforms traditional cosine similarity and is superior to LSI.
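The mutual-reinforcement idea described in the abstract, where document similarity is propagated through feature similarity and vice versa, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the specific update rules, the diagonal rescaling, and the function names (`iterative_similarity`, `_normalize`) are assumptions introduced here. Note that when the feature-similarity matrix starts as the identity (i.e., an orthogonal feature space), the first document-similarity update reduces to ordinary cosine similarity.

```python
import numpy as np

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against all-zero rows
    N = M / norms
    return N @ N.T

def _normalize(S):
    """Rescale a similarity matrix so every self-similarity is 1."""
    d = np.sqrt(np.clip(np.diag(S), 1e-12, None))
    return S / np.outer(d, d)

def iterative_similarity(A, n_iter=5):
    """Illustrative mutual-reinforcement similarity on a document-term
    matrix A (rows = documents, columns = features).

    Starts from an orthogonal (identity) feature-similarity matrix and
    alternates two updates:
      - documents are similar if they contain similar features;
      - features are similar if they occur in similar documents.
    The update rules here are a hypothetical sketch of the idea, not
    the algorithm from the paper."""
    n_docs, n_feats = A.shape
    S_f = np.eye(n_feats)            # feature-feature similarity
    for _ in range(n_iter):
        S_d = _normalize(A @ S_f @ A.T)    # document-document similarity
        S_f = _normalize(A.T @ S_d @ A)    # propagate back to features
    return S_d, S_f
```

With `n_iter=1` and the identity initialization, `S_d` equals `cosine_sim(A)`; further iterations let co-occurring features (e.g., synonyms that never appear in the same document but appear in similar documents) contribute to document similarity.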