An approach for XML similarity join using tree serialization

Authors:
Lianzi Wen;Toshiyuki Amagasa;Hiroyuki Kitagawa
Affiliations:
Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki, Japan;Department of Computer Science, Graduate School of Systems and Information Engineering and Center for Computational Sciences, University of Tsukuba, Tsukuba, Ibaraki, Japan;Department of Computer Science, Graduate School of Systems and Information Engineering and Center for Computational Sciences, University of Tsukuba, Tsukuba, Ibaraki, Japan
Venue:
DASFAA'08 Proceedings of the 13th international conference on Database systems for advanced applications
Year:
2008

Abstract

This paper proposes a scheme for similarity join over XML data, based on serializing XML data and then performing similarity matching over subsequences of XML nodes. With the rapid spread of XML, large volumes of electronic data are now marked up in XML. As a consequence, a growing amount of XML data represents similar content with dissimilar structures. Similarity join has been used to extract as much information as possible from such heterogeneous data. Our proposed similarity join for XML data can be summarized as follows: 1) we serialize XML data as XML node sequences; 2) we extract semantically or structurally coherent subsequences; 3) we filter out dissimilar subsequences using textual information; and 4) we extract pairs of subsequences as the final result by checking structural similarity. This process is costly to execute. To make it scale to large document sets, we use a Bloom filter to speed up the text similarity computation. We demonstrate the feasibility of the proposed scheme through experiments.
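
The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the serialization format, how coherent subsequences are chosen, or the similarity measures, so every function name, threshold, and parameter below is a hypothetical choice made for illustration. The sketch assumes preorder serialization, one subsequence per depth-1 subtree, a Jaccard-style estimate over Bloom filter bits for text similarity, and a comparison of depth sequences as a stand-in for structural similarity.

```python
import hashlib
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def serialize(xml_text):
    """Step 1: serialize a document as a preorder list of (depth, tag, text)."""
    nodes = []

    def visit(node, depth):
        nodes.append((depth, node.tag, (node.text or "").strip()))
        for child in node:
            visit(child, depth + 1)

    visit(ET.fromstring(xml_text), 0)
    return nodes


def subsequences(nodes, depth=1):
    """Step 2 (assumption): one subsequence per subtree rooted at `depth`."""
    groups = []
    for item in nodes:
        if item[0] == depth:
            groups.append([item])
        elif item[0] > depth and groups:
            groups[-1].append(item)
    return groups


def bloom(words, m=256, k=3):
    """Build an m-bit Bloom filter (stored as an int) using k salted hashes."""
    bits = 0
    for word in words:
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


def words_of(sub):
    """Collect the lowercased words found in the text of a subsequence."""
    return {w.lower() for _, _, text in sub for w in text.split()}


def bloom_similarity(fa, fb):
    """Step 3: estimate text overlap as shared set bits over all set bits."""
    union = bin(fa | fb).count("1")
    return bin(fa & fb).count("1") / union if union else 0.0


def structural_similarity(a, b):
    """Step 4 (stand-in measure): compare the depth sequences of two subsequences."""
    return SequenceMatcher(None, [d for d, _, _ in a], [d for d, _, _ in b]).ratio()


def similarity_join(xml_a, xml_b, text_threshold=0.5, struct_threshold=0.5):
    """Return index pairs (i, j) of subsequences that pass both checks."""
    subs_a = subsequences(serialize(xml_a))
    subs_b = subsequences(serialize(xml_b))
    filters_a = [bloom(words_of(s)) for s in subs_a]
    filters_b = [bloom(words_of(s)) for s in subs_b]
    pairs = []
    for i, a in enumerate(subs_a):
        for j, b in enumerate(subs_b):
            # Cheap bitwise text filter first; structure is checked only for survivors.
            if bloom_similarity(filters_a[i], filters_b[j]) < text_threshold:
                continue
            if structural_similarity(a, b) >= struct_threshold:
                pairs.append((i, j))
    return pairs


doc_a = ("<lib><book><title>XML Joins</title><author>Wen</author></book>"
         "<book><title>Graph Theory</title><author>Euler</author></book></lib>")
doc_b = "<catalog><item><name>XML Joins</name><by>Wen</by></item></catalog>"
pairs = similarity_join(doc_a, doc_b)
```

The two sample documents use different tag names for the same content, which is the situation the abstract describes. Building each Bloom filter once per subsequence means the pairwise text comparison reduces to bitwise operations on fixed-size integers, which is where the speed-up in this sketch comes from. Because Bloom filters admit false positives, the text filter can only overestimate overlap.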