Using Non-Zero Dimensions for the Cosine and Tanimoto Similarity Search Among Real Valued Vectors

  • Authors:
  • Marzena Kryszkiewicz

  • Affiliations:
  • Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland. mkr@ii.pw.edu.pl

  • Venue:
  • Fundamenta Informaticae - To Andrzej Skowron on His 70th Birthday
  • Year:
  • 2013

Abstract

The cosine and Tanimoto similarity measures are typically applied in chemical informatics, bioinformatics, information retrieval, and text and web mining, as well as in very large databases, to search for vectors sufficiently similar to a given one. For large, sparse, high-dimensional data sets such as text or web data, inverted indices are typically used to identify candidates for vectors sufficiently similar to a given vector. In this article, we offer new theoretical results on how knowledge of the non-zero dimensions of real-valued vectors can be used to reduce the number of candidates for vectors that are sufficiently similar, in terms of cosine or Tanimoto similarity, to a given one. We illustrate and discuss the usefulness of our findings on a sample collection of documents represented by a few thousand real-valued vectors with more than ten thousand dimensions.
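
For orientation, the baseline setting described in the abstract can be sketched as follows. This is a minimal illustration, not the method of the article: it uses the standard definitions cosine(u, v) = u·v / (|u| |v|) and Tanimoto(u, v) = u·v / (|u|² + |v|² − u·v), and a plain inverted index over non-zero dimensions. The sparse-vector representation (a dict from dimension to value) and all function names are assumptions made for this example. The article's new candidate-reduction bounds are not stated in the abstract and are therefore not reproduced here.

```python
import math
from collections import defaultdict


def dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(x * v[d] for d, x in u.items() if d in v)


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def tanimoto(u, v):
    """Standard Tanimoto similarity for real-valued vectors."""
    uv = dot(u, v)
    denom = dot(u, u) + dot(v, v) - uv
    return uv / denom if denom != 0.0 else 0.0


def build_inverted_index(vectors):
    """Map each dimension to the ids of the vectors that are non-zero on it."""
    index = defaultdict(set)
    for vec_id, vec in vectors.items():
        for d, x in vec.items():
            if x != 0:
                index[d].add(vec_id)
    return index


def candidates(index, query):
    """Ids of vectors sharing at least one non-zero dimension with the query.

    Any vector outside this set has a zero dot product with the query, so its
    cosine and Tanimoto similarity to the query is 0.
    """
    found = set()
    for d, x in query.items():
        if x != 0:
            found |= index.get(d, set())
    return found


def search(vectors, index, query, sim, eps):
    """Ids of candidates whose similarity to the query is at least eps (eps > 0)."""
    return sorted(
        vec_id for vec_id in candidates(index, query)
        if sim(vectors[vec_id], query) >= eps
    )
```

With a positive threshold, only vectors returned by `candidates` need an exact similarity computation; the article's contribution concerns shrinking that candidate set further using knowledge of the non-zero dimensions.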