Efficient Similarity Search for Tree-Structured Data

Authors:
Guoliang Li;Xuhui Liu;Jianhua Feng;Lizhu Zhou
Affiliations:
Department of Computer Science and Technology, Tsinghua University, Beijing, China 100084;Department of Computer Science and Technology, Tsinghua University, Beijing, China 100084;Department of Computer Science and Technology, Tsinghua University, Beijing, China 100084;Department of Computer Science and Technology, Tsinghua University, Beijing, China 100084
Venue:
SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Year:
2008

Citing 22
Cited 0

Simple fast algorithms for the editing distance between trees and related problems

SIAM Journal on Computing
Approximate string-matching with q-grams and maximal matches

Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School
Optimal multi-step k-nearest neighbor search

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
The Tree-to-Tree Correction Problem

Journal of the ACM (JACM)
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Efficient and tumble similar set retrieval

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Approximate XML joins

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Efficient Index Structures for String Databases

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Computing the Edit-Distance between Unrooted Ordered Trees

ESA '98 Proceedings of the 6th Annual European Symposium on Algorithms
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Similarity evaluation on tree-structured data

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximate matching of hierarchical data using pq-grams

VLDB '05 Proceedings of the 31st international conference on Very large data bases
n-gram/2L: a space and time efficient two-level n-gram inverted index structure

VLDB '05 Proceedings of the 31st international conference on Very large data bases
A survey on tree edit distance and related problems

Theoretical Computer Science
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
VGRAM: improving performance of approximate queries on string collections using variable-length grams

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient Merging and Filtering Algorithms for Approximate String Searches

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Fast Indexes and Algorithms for Set Similarity Selection Queries

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

Tree-structured data are becoming ubiquitous nowadays and manipulating them based on similarity is essential for many applications. Although similarity search on textual data has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the similarity between trees, especially for large numbers of tress. In this paper, we propose to transform tree-structured data into strings with a one-to-one mapping. We prove that the edit distance of the corresponding strings forms a bound for the similarity measures between trees, including tree edit distance, largest common subtrees and smallest common super-trees. Based on the theoretical analysis, we can employ any existing algorithm of approximate string search for effective similarity search on trees. Moreover, we embed the bound into a filter-and-refine framework for facilitating similarity search on tree-structured data. The experimental results show that our algorithm achieves high performance and outperforms state-of-the-art methods significantly. Our method is especially suitable for accelerating similarity query processing on large numbers of trees in massive datasets.