LAX: an efficient approximate XML join based on clustered leaf nodes for XML data integration

Authors:
Wenxin Liang;Haruo Yokota
Affiliations:
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan;Global Scientific Information and Computer Center, Tokyo Institute of Technology, Tokyo, Japan
Venue:
BNCOD'05 Proceedings of the 22nd British National Conference on Databases: Enterprise, Skills and Innovation
Year:
2005

Abstract

More and more data are published and exchanged as XML on the Internet. However, different XML data sources may contain the same data in different structures. An efficient method is therefore required to integrate such sources so that users can conveniently access more complete and useful information. The tree edit distance is regarded as an effective metric for evaluating the structural similarity of XML documents. However, its computational cost is extremely high, and the traditional wisdom in join algorithms cannot easily be applied to it. In this paper, we propose LAX (Leaf-clustering based Approximate XML join algorithm), in which the two XML document trees are clustered into subtrees representing independent items, and the similarity between them is determined by calculating a similarity degree based on the leaf nodes of each pair of subtrees. We also propose an effective algorithm for clustering the XML documents for LAX. We show that the traditional wisdom in join algorithms is easy to apply to LAX and that the join result contains the complete information of the two documents. We then conduct experiments that compare LAX with the tree edit distance and evaluate its performance on both synthetic and real data sets. Our experimental results show that LAX is more efficient and more effective than the tree edit distance for measuring the approximate similarity between XML documents.
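
The abstract does not give the clustering algorithm or the exact similarity formula, so the following is only a minimal sketch of the general idea of comparing item subtrees by their leaf nodes. It makes several assumptions that are not taken from the paper: each child of the document root is treated as one item subtree, a leaf is represented as a (tag, text) pair, the similarity degree is the number of shared leaves divided by the smaller leaf count, and the names `leaf_values`, `similarity_degree`, `approximate_join`, and the `threshold` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def leaf_values(subtree):
    """Collect (tag, text) pairs for every leaf element under subtree."""
    return {
        (el.tag, (el.text or "").strip())
        for el in subtree.iter()
        if len(el) == 0  # an element with no children is a leaf
    }


def similarity_degree(a, b):
    """Assumed measure: shared leaves divided by the smaller leaf count."""
    leaves_a, leaves_b = leaf_values(a), leaf_values(b)
    if not leaves_a or not leaves_b:
        return 0.0
    return len(leaves_a & leaves_b) / min(len(leaves_a), len(leaves_b))


def approximate_join(doc_a, doc_b, threshold=0.5):
    """Pair item subtrees whose similarity degree reaches the threshold.

    Assumption: each child of the root is one item subtree. The paper
    proposes its own clustering algorithm, which is not reproduced here.
    """
    matches = []
    for i, sub_a in enumerate(doc_a):
        for j, sub_b in enumerate(doc_b):
            score = similarity_degree(sub_a, sub_b)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Two sources holding the same book under different structures.
DOC_A = ET.fromstring(
    "<lib>"
    "<book><title>XML Joins</title><year>2005</year></book>"
    "<book><title>Tree Edit</title><year>1989</year></book>"
    "</lib>"
)
DOC_B = ET.fromstring(
    "<catalog>"
    "<item><info><title>XML Joins</title></info>"
    "<year>2005</year><price>30</price></item>"
    "</catalog>"
)

RESULT = approximate_join(DOC_A, DOC_B)
print(RESULT)
```

In this toy input the first book in `DOC_A` and the item in `DOC_B` are paired even though the title sits at different depths, because only leaf nodes are compared. Any join strategy can then operate on the per-subtree scores rather than on whole-tree edit distances.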