An approach for XML similarity join using tree serialization

  • Authors:
  • Lianzi Wen; Toshiyuki Amagasa; Hiroyuki Kitagawa

  • Affiliations:
  • Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki, Japan; Department of Computer Science, Graduate School of Systems and Information Engineering and Center for Computational Sciences, University of Tsukuba, Tsukuba, Ibaraki, Japan; Department of Computer Science, Graduate School of Systems and Information Engineering and Center for Computational Sciences, University of Tsukuba, Tsukuba, Ibaraki, Japan

  • Venue:
  • DASFAA'08: Proceedings of the 13th International Conference on Database Systems for Advanced Applications
  • Year:
  • 2008

Abstract

This paper proposes a scheme for similarity join over XML data, based on serializing the XML data and then performing similarity matching over subsequences of XML nodes. With the rapid spread of XML, large volumes of electronic data are now marked up in XML. As a consequence, a growing amount of XML data represents similar content with dissimilar structures. Similarity join has been used to extract as much information as possible from such heterogeneous data. The proposed similarity join for XML data can be summarized as follows: 1) we serialize XML data as XML node sequences; 2) we extract semantically/structurally coherent subsequences; 3) we filter out dissimilar subsequences using textual information; and 4) we extract pairs of subsequences as the final result by checking structural similarity. This process is costly to execute. To make it scalable to large document sets, we use a Bloom filter to speed up the text similarity computation. We show the feasibility of the proposed scheme through experiments.
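
The four steps and the Bloom-filter speed-up described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the serialization format, the rule for cutting subsequences, the similarity measures, or the thresholds, so everything below is an assumption made for illustration. In particular, the preorder (depth, tag, text) serialization, the cut at depth 1, the word-containment text score, the tag-set Jaccard structural score, and all function names and threshold values are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def serialize(xml_text):
    """Step 1 (assumed format): preorder list of (depth, tag, text) tuples."""
    seq = []

    def visit(node, depth):
        seq.append((depth, node.tag, (node.text or "").strip()))
        for child in node:
            visit(child, depth + 1)

    visit(ET.fromstring(xml_text), 0)
    return seq


def subsequences(seq, depth=1):
    """Step 2 (assumed rule): each node at `depth` opens a subsequence
    that collects that node and all of its descendants."""
    groups = []
    for item in seq:
        if item[0] == depth:
            groups.append([item])
        elif item[0] > depth and groups:
            groups[-1].append(item)
    return groups


class BloomFilter:
    """Small Bloom filter over words; it can report false positives
    but never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, word):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for p in self._positions(word):
            self.bits |= 1 << p

    def __contains__(self, word):
        return all((self.bits >> p) & 1 for p in self._positions(word))


def words(sub):
    """Lower-cased set of words in the text of a subsequence."""
    return {w.lower() for _, _, text in sub for w in text.split()}


def text_similarity(sub_a, sub_b):
    """Step 3 (assumed measure): approximate share of sub_b's words that
    also occur in sub_a, tested against a Bloom filter built from sub_a."""
    bloom = BloomFilter()
    for w in words(sub_a):
        bloom.add(w)
    wb = words(sub_b)
    return sum(w in bloom for w in wb) / len(wb) if wb else 0.0


def structural_similarity(sub_a, sub_b):
    """Step 4 (assumed measure): Jaccard overlap of the two tag sets."""
    ta = {tag for _, tag, _ in sub_a}
    tb = {tag for _, tag, _ in sub_b}
    return len(ta & tb) / len(ta | tb)


def similarity_join(doc_a, doc_b, text_threshold=0.5, struct_threshold=0.15):
    """Return index pairs (i, j) of subsequences that pass both checks."""
    subs_a = subsequences(serialize(doc_a))
    subs_b = subsequences(serialize(doc_b))
    result = []
    for i, a in enumerate(subs_a):
        for j, b in enumerate(subs_b):
            if text_similarity(a, b) < text_threshold:
                continue  # cheap textual filter runs first
            if structural_similarity(a, b) >= struct_threshold:
                result.append((i, j))
    return result


DOC_A = ("<lib><book><title>XML similarity join</title><author>Wen</author></book>"
         "<book><title>Graph databases</title><author>Smith</author></book></lib>")
DOC_B = ("<catalog><entry><title>XML similarity join</title>"
         "<year>2008</year></entry></catalog>")

if __name__ == "__main__":
    print(similarity_join(DOC_A, DOC_B))
```

In this sketch the Bloom filter stands in for an exact word-set lookup. Because a Bloom filter never yields false negatives, the textual filter cannot wrongly discard a pair whose true word overlap meets the threshold; it can only let a few extra pairs through to the structural check.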