Many machine learning and data mining algorithms rely crucially on similarity metrics. Cosine similarity, which computes the inner product of two normalized feature vectors, is one of the most widely used similarity measures. However, in many practical tasks such as text categorization and document clustering, cosine similarity implicitly assumes that the input space is orthogonal, an assumption that usually does not hold because of synonymy and polysemy. Algorithms such as Latent Semantic Indexing (LSI) address this problem by projecting the original data into an orthogonal space, but LSI suffers from high computational cost and data sparseness, which increase computation time and storage requirements on large-scale realistic data. In this paper, we propose a novel and effective similarity metric for non-orthogonal input spaces. The basic idea of the proposed metric is that the similarity of features should affect the similarity of objects, and vice versa. We then propose a novel iterative algorithm for computing similarity in such non-orthogonal spaces. Experimental results on a synthetic data set, real MSN search click-through logs, and the 20 Newsgroups (20NG) data set show that our algorithm outperforms traditional cosine similarity and is superior to LSI.
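The mutual-reinforcement idea described in the abstract, where document similarity is propagated through feature similarity and vice versa, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the specific update rules, the diagonal rescaling, and the function names (`iterative_similarity`, `_normalize`) are assumptions introduced here. Note that when the feature-similarity matrix starts as the identity (i.e., an orthogonal feature space), the first document-similarity update reduces to ordinary cosine similarity.

```python
import numpy as np

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against all-zero rows
    N = M / norms
    return N @ N.T

def _normalize(S):
    """Rescale a similarity matrix so every self-similarity is 1."""
    d = np.sqrt(np.clip(np.diag(S), 1e-12, None))
    return S / np.outer(d, d)

def iterative_similarity(A, n_iter=5):
    """Illustrative mutual-reinforcement similarity on a document-term
    matrix A (rows = documents, columns = features).

    Starts from an orthogonal (identity) feature-similarity matrix and
    alternates two updates:
      - documents are similar if they contain similar features;
      - features are similar if they occur in similar documents.
    The update rules here are a hypothetical sketch of the idea, not
    the algorithm from the paper."""
    n_docs, n_feats = A.shape
    S_f = np.eye(n_feats)            # feature-feature similarity
    for _ in range(n_iter):
        S_d = _normalize(A @ S_f @ A.T)    # document-document similarity
        S_f = _normalize(A.T @ S_d @ A)    # propagate back to features
    return S_d, S_f
```

With `n_iter=1` and the identity initialization, `S_d` equals `cosine_sim(A)`; further iterations let co-occurring features (e.g., synonyms that never appear in the same document but appear in similar documents) contribute to document similarity.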