The power of two min-hashes for similarity search among hierarchical data objects

Authors:
Sreenivas Gollapudi;Rina Panigrahy
Affiliations:
Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA
Venue:
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2008

Abstract

We propose sketching algorithms for computing similarities between hierarchical data objects. Specifically, we consider data objects represented as leaf-labeled trees, in which a set of elements at the leaves is organized into a hierarchy. Such representations are richer alternatives to a set. For example, a document can be represented as a hierarchy of sets in which chapters, sections, and paragraphs form different levels of the hierarchy, which is richer than viewing the document simply as a set of words. We measure the distance between two trees using the best possible superimposition, that is, the one that minimizes the number of mismatched leaf labels. Our distance measure is equivalent to an Earth Mover's Distance measure: leaf-labeled trees of height one can be viewed as sets, and the measure extends recursively to trees of larger height by viewing them as sets of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sensitive hashing (LSH), where the probability that two sketches match is high when the two trees are similar and low when they are far apart under the given distance measure. Specifically, we compute sketches of such trees by propagating min-hash computations up the tree. Furthermore, we show that propagating one min-hash results in poor sketch properties, while propagating two min-hashes results in good sketches.
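
The idea of propagating min-hashes up a tree can be illustrated with a minimal sketch. The abstract does not spell out the construction, so everything below is an assumption made for illustration rather than the authors' algorithm: trees are unweighted and encoded as nested Python lists with string labels at the leaves, the helper names `h`, `sketch`, and `match_rate` are hypothetical, BLAKE2b stands in for a random hash function, and each internal node simply keeps the `k` smallest hash values computed from its children's sketches (`k=1` for one min-hash, `k=2` for two).

```python
import hashlib


def h(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of a string under a given seed (stand-in for a random hash)."""
    data = f"{seed}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(node, seed: int, k: int = 2) -> tuple:
    """Return up to k smallest hash values propagated up from this subtree.

    A leaf is a str label; an internal node is a list of child nodes.
    Child order is ignored, so a node behaves like a set of its children.
    """
    if isinstance(node, str):
        return (h(node, seed),)
    child_values = []
    for child in node:
        child_sketch = sketch(child, seed, k)
        # Re-hash the child's whole sketch so that the child counts as a
        # single element of the parent's set (illustrative assumption).
        child_values.append(h(repr(child_sketch), seed))
    return tuple(sorted(set(child_values))[:k])


def match_rate(t1, t2, k: int = 2, trials: int = 500) -> float:
    """Fraction of seeds under which the two root sketches are identical."""
    hits = sum(sketch(t1, s, k) == sketch(t2, s, k) for s in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Two small "documents": a list of sections, each a list of words.
    doc_a = [["tree", "hash"], ["sketch", "set"]]
    doc_b = [["tree", "hash"], ["sketch", "bag"]]
    print("k=1 match rate:", match_rate(doc_a, doc_b, k=1))
    print("k=2 match rate:", match_rate(doc_a, doc_b, k=2))
```

For a tree of height one and `k=1`, this reduces to the classic min-hash on sets, where the probability that two sketches match equals the Jaccard similarity of the two label sets. How the match probability behaves on deeper trees, and why one propagated min-hash is insufficient while two suffice, is the subject of the paper's analysis and is not reproduced by this toy example.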