Efficient estimation for high similarities using odd sketches

Authors:
Michael Mitzenmacher; Rasmus Pagh; Ninh Pham
Affiliations:
Harvard University, Cambridge, MA, USA; IT University of Copenhagen, Copenhagen, Denmark; IT University of Copenhagen, Copenhagen, Denmark
Venue:
Proceedings of the 23rd international conference on World wide web
Year:
2014



Abstract

Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, which is relevant, for example, when comparing TF-IDF vectors. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this space efficiency and demonstrate the effectiveness of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
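
The exclusive-or property described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a sketch is an n-bit array in which each set element flips one hashed bit position, so that each bit records the parity of the elements hashed to it. The names `bit_index`, `odd_sketch`, and `estimate_sym_diff`, the choice of BLAKE2b as the hash, and the sketch size `n = 512` are all hypothetical. The estimator formula is likewise an assumption, derived from a standard parity-of-balls-in-bins argument rather than taken from the abstract.

```python
import hashlib
import math


def bit_index(item: str, n: int) -> int:
    """Map an item to one of n bit positions (hash choice is an assumption)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n


def odd_sketch(items, n: int) -> int:
    """Return an n-bit sketch as a Python int; bit i is the parity of the
    number of distinct items hashed to position i."""
    sketch = 0
    for item in set(items):
        sketch ^= 1 << bit_index(item, n)
    return sketch


def estimate_sym_diff(s1: int, s2: int, n: int) -> float:
    """Estimate the size of the symmetric difference from two sketches.

    Assumed estimator: if m items are thrown into n bins, a bin has odd
    parity with probability about (1 - exp(-2m/n)) / 2, which is inverted
    here for m given the observed number z of set bits in the XOR.
    """
    z = bin(s1 ^ s2).count("1")
    if 2 * z >= n:
        return math.inf  # sketch is saturated; the estimate is unreliable
    return -n / 2 * math.log(1 - 2 * z / n)


# Two highly similar sets: they differ in exactly 40 elements.
a = {f"doc-{i}" for i in range(1000)}
b = {f"doc-{i}" for i in range(20, 1020)}
n = 512

# XOR of the two sketches equals the sketch of the symmetric difference.
xor_property_holds = (odd_sketch(a, n) ^ odd_sketch(b, n)) == odd_sketch(a ^ b, n)
estimate = estimate_sym_diff(odd_sketch(a, n), odd_sketch(b, n), n)
```

Elements shared by both sets flip the same bit in both sketches and cancel in the exclusive-or, so only the elements that differ contribute. This is why the sketch size can be chosen relative to the expected difference between the sets rather than to their total size.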