Approximate top-k structural similarity search over XML documents

Authors:
Tao Xie;Chaofeng Sha;Xiaoling Wang;Aoying Zhou
Affiliations:
Department of Computer Science and Engineering, Fudan University, Shanghai, China;Department of Computer Science and Engineering, Fudan University, Shanghai, China;Department of Computer Science and Engineering, Fudan University, Shanghai, China;Department of Computer Science and Engineering, Fudan University, Shanghai, China
Venue:
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Year:
2006

Abstract

With the growth of XML applications such as digital libraries, XML publish/subscribe systems, and other XML repositories, top-k structural similarity search over XML documents is attracting increasing attention. Previous work measures the similarity of two XML documents by the edit distance between their XML trees. Because computing edit distances is time consuming, recent work has used structural summaries to speed up the computation. However, most existing tree edit distance algorithms ignore the fact that nodes in a tree may differ in significance, and they inappropriately assume the same edit operation costs for all nodes in an XML document tree. This paper addresses the problem by proposing a summary structure that makes the tree-based edit distance more rational, together with a novel weighting scheme that reflects that some nodes are more important than others with respect to structural similarity. We introduce a new cost model for computing structural distance that takes node weights into account. Compared with earlier techniques, our approach answers top-k queries approximately but efficiently. We verify the approach through a series of experiments, and the results show that using weighted structural summaries for top-k queries is efficient and practical.
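
The abstract does not spell out the summary structure, the weighting scheme, or the cost model, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a crude structural summary (the set of distinct root-to-node label paths, so repeated siblings collapse into one entry), a hypothetical weight that halves with each level of depth, and a weighted symmetric-difference distance in place of a true tree edit distance. All function names and the `decay` parameter are assumptions introduced here.

```python
import heapq
import xml.etree.ElementTree as ET


def summary_paths(xml_text):
    """Return the set of distinct root-to-node label paths.

    Repeated siblings with the same label collapse into one path,
    which is the sense in which this acts as a structural summary.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def path_weight(path, decay=0.5):
    """Hypothetical weight: nodes nearer the root count more (root -> 1.0)."""
    return decay ** (len(path) - 1)


def weighted_distance(paths_a, paths_b):
    """Weighted symmetric difference of two summaries, normalised to [0, 1]."""
    total = sum(path_weight(p) for p in paths_a | paths_b)
    differing = sum(path_weight(p) for p in paths_a ^ paths_b)
    return differing / total


def top_k(query_xml, documents, k):
    """Return the k (distance, doc_id) pairs closest to the query document."""
    query = summary_paths(query_xml)
    scored = (
        (weighted_distance(query, summary_paths(text)), doc_id)
        for doc_id, text in documents.items()
    )
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    docs = {
        "a": "<lib><book><title/></book></lib>",
        "b": "<lib><book><title/><author/></book></lib>",
        "c": "<shop><item/></shop>",
    }
    for distance, doc_id in top_k("<lib><book><title/></book></lib>", docs, 2):
        print(doc_id, distance)
```

Under these assumptions, a difference near the root (a different root label, say) moves the distance far more than a missing leaf, which is the kind of behaviour a node-weighting scheme is meant to capture. Because the summaries are much smaller than the full trees, ranking on them is cheap, at the price of returning only an approximate top-k answer.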