Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, which is relevant, for example, for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this space efficiency and demonstrate it in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
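The exclusive-or property can be illustrated with a minimal sketch in Python. This is a simplified illustration, not the paper's full estimator. It rests on several assumptions: the sketch is built directly on the raw sets, the bit position comes from SHA-256, the sketch width is 1024 bits, and the names `odd_sketch` and `estimate_size` are hypothetical. Each element flips one hashed bit, so a bit ends up set exactly when an odd number of elements hash to it. The size estimate uses the approximation $-\frac{n}{2}\ln(1 - 2z/n)$ for a sketch of $n$ bits with $z$ ones.

```python
import hashlib
import math


def odd_sketch(items, n_bits=1024):
    """Return an integer bitmask of width n_bits; each item flips one hashed bit."""
    sketch = 0
    for item in items:
        # Assumption: SHA-256 of the item's string form stands in for a random hash.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        sketch ^= 1 << position  # flip the bit, so only parity is recorded
    return sketch


def estimate_size(sketch, n_bits=1024):
    """Estimate how many items were inserted from the number of set bits."""
    ones = bin(sketch).count("1")
    if 2 * ones >= n_bits:
        return math.inf  # sketch is saturated; the estimate is unreliable
    return -n_bits / 2 * math.log(1 - 2 * ones / n_bits)


# Two highly similar sets: they differ in 40 elements out of 1020 in the union.
a = set(range(0, 1000))
b = set(range(20, 1020))

xor_sketch = odd_sketch(a) ^ odd_sketch(b)

# Elements common to both sets flip the same bit twice and cancel out.
same_as_symmetric_difference = xor_sketch == odd_sketch(a ^ b)

estimated_diff = estimate_size(xor_sketch)

# Assumption: the set sizes are known, so Jaccard follows from the difference size.
estimated_jaccard = (len(a) + len(b) - estimated_diff) / (len(a) + len(b) + estimated_diff)
```

Because shared elements cancel, the combined sketch depends only on the symmetric difference, so a small sketch is enough when the sets are highly similar.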