Similarity join on XML based on k-generation set distance

Authors:
Yue Wang;Hongzhi Wang;Yang Wang;Hong Gao
Affiliations:
The School of Computer Science and Technology, Harbin Institute of Technology, China (all authors)
Venue:
WAIM'11 Proceedings of the 2011 international conference on Web-Age Information Management
Year:
2011

Abstract

Similarity join is widely applied because data items representing the same real-world object may differ due to differing conventions. Traditional methods, however, are inefficient, so a method that offers both high efficiency and high join quality is needed. In this paper, we propose two new edit operations (reversing and mapping), together with related algorithms for similarity join based on a newly defined measure. In our method, computing the tree edit distance is replaced by computing the k-generation set distance between trees, which greatly simplifies the join process. The time complexity of our method is O(n²), where n is the tree size. We demonstrate that our method has advantages over existing ones and that it scales to large data sets.