Using Non-Zero Dimensions for the Cosine and Tanimoto Similarity Search Among Real Valued Vectors

Authors:
Marzena Kryszkiewicz
Affiliations:
Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland. mkr@ii.pw.edu.pl
Venue:
Fundamenta Informaticae - To Andrzej Skowron on His 70th Birthday
Year:
2013

Abstract

The cosine and Tanimoto similarity measures are widely used in chemical informatics, bioinformatics, information retrieval, and text and web mining, as well as in very large databases, to search for sufficiently similar vectors. For large, sparse, high-dimensional data sets such as text or Web data, inverted indices are typically used to identify candidate vectors that may be sufficiently similar to a given vector. In this article, we offer new theoretical results on how knowledge of the non-zero dimensions of real-valued vectors can be used to reduce the number of candidates for vectors that are sufficiently similar, with respect to the cosine and Tanimoto measures, to a given one. We illustrate and discuss the usefulness of our findings on a sample collection of documents represented by a few thousand real-valued vectors with more than ten thousand dimensions.
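
To make the setting in the abstract concrete, the following is a minimal Python sketch of the two similarity measures on sparse real-valued vectors, together with the standard inverted-index step that collects candidates sharing at least one non-zero dimension with the query. It does not reproduce the article's new pruning results. The dict-based vector representation, the function names, and the sample data are all assumptions made for illustration, and the vectors are assumed to be non-zero.

```python
import math
from collections import defaultdict

# Assumption: a sparse vector is a dict {dimension: non-zero value}.

def dot(u, v):
    """Dot product over the dimensions that are non-zero in both vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(x * v[d] for d, x in u.items() if d in v)

def cosine(u, v):
    """Cosine similarity: u.v / (|u| * |v|). Assumes non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def tanimoto(u, v):
    """Tanimoto similarity: u.v / (u.u + v.v - u.v). Assumes non-zero vectors."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)

def build_inverted_index(vectors):
    """Map each dimension to the ids of the vectors that are non-zero on it."""
    index = defaultdict(set)
    for vid, vec in vectors.items():
        for d in vec:
            index[d].add(vid)
    return index

def candidates(index, query):
    """Ids of vectors sharing at least one non-zero dimension with the query."""
    found = set()
    for d in query:
        found |= index.get(d, set())
    return found

def search(vectors, index, query, threshold, sim):
    """Verify each candidate against a positive similarity threshold."""
    return sorted(
        vid for vid in candidates(index, query)
        if sim(vectors[vid], query) >= threshold
    )

# Hypothetical sample data.
vectors = {
    "a": {0: 1.0, 2: 2.0},
    "b": {1: 3.0},
    "c": {2: 1.0, 3: 1.0},
}
index = build_inverted_index(vectors)
query = {2: 1.0}

print(sorted(candidates(index, query)))                 # "b" is never examined
print(search(vectors, index, query, 0.8, cosine))
print(search(vectors, index, query, 0.5, tanimoto))
```

A vector that shares no non-zero dimension with the query has a dot product of zero with it, so both similarities are zero and the vector can be skipped for any positive threshold. That is the baseline candidate set that the article's results aim to shrink further.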