The cosine and Tanimoto similarity measures are widely used in chemical informatics, bioinformatics, information retrieval, text and web mining, and very large databases to search for vectors that are sufficiently similar to a given one. For large, sparse, high-dimensional data sets such as text or web data, inverted indices are typically used to identify candidates for vectors sufficiently similar to a given vector. In this article, we offer new theoretical results on how knowledge of the non-zero dimensions of real-valued vectors can be used to reduce the number of candidates for vectors that are sufficiently similar to a given vector with respect to the cosine or Tanimoto measure. We illustrate and discuss the usefulness of our findings on a sample collection of documents represented by a few thousand real-valued vectors with more than ten thousand dimensions.
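As a point of reference, the baseline that the abstract describes (an inverted index over non-zero dimensions used to collect candidates, followed by an exact similarity check) might be sketched as below. This is a minimal illustration, not the article's method: it does not implement the new candidate-reduction results. It assumes sparse vectors stored as `dict` objects mapping a dimension to a non-zero weight, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[d] for d, w in u.items() if d in v)


def cosine(u, v):
    """Cosine similarity; assumes neither vector is the zero vector."""
    return dot(u, v) / sqrt(dot(u, u) * dot(v, v))


def tanimoto(u, v):
    """Tanimoto similarity: u.v / (|u|^2 + |v|^2 - u.v); assumes non-zero vectors."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


def build_inverted_index(vectors):
    """Map each dimension to the ids of the vectors that are non-zero on it."""
    index = defaultdict(set)
    for vid, vec in vectors.items():
        for d in vec:
            index[d].add(vid)
    return index


def candidates(index, query):
    """Ids of vectors sharing at least one non-zero dimension with the query.

    A vector with no shared dimension has dot product 0 with the query, so for
    any threshold eps > 0 it cannot qualify and is safely skipped.
    """
    found = set()
    for d in query:
        found |= index.get(d, set())
    return found


def similar(vectors, index, query, sim, eps):
    """Verify each candidate exactly and keep those with sim >= eps."""
    return sorted(
        vid for vid in candidates(index, query) if sim(query, vectors[vid]) >= eps
    )


# Toy data (hypothetical): three sparse vectors over four dimensions.
vectors = {
    "a": {0: 1.0, 1: 2.0},
    "b": {1: 2.0, 2: 1.0},
    "c": {3: 5.0},
}
index = build_inverted_index(vectors)
query = {0: 1.0, 1: 2.0}
```

With this toy data, `candidates(index, query)` never touches vector `"c"`, because it shares no non-zero dimension with the query. The exact check in `similar` is still run on every remaining candidate; shrinking that candidate set further is the kind of saving the article's results target.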