More and more data are published and exchanged as XML on the Internet, and different XML data sources may contain the same data under different structures. An efficient method is therefore needed to integrate such sources so that users can conveniently access more complete and useful information. The tree edit distance is regarded as an effective metric for evaluating the structural similarity of XML documents. However, it is extremely expensive to compute, and the traditional wisdom in join algorithms cannot easily be applied to it. In this paper, we propose LAX (Leaf-clustering based Approximate XML join algorithm), in which the two XML document trees are clustered into subtrees representing independent items, and the similarity between them is determined by calculating a similarity degree based on the leaf nodes of each pair of subtrees. We also propose an effective algorithm for clustering an XML document for LAX. We show that the traditional wisdom in join algorithms is easy to apply to LAX and that the join result contains the complete information of the two documents. We then compare LAX with the tree edit distance experimentally and evaluate its performance on both synthetic and real data sets. Our experimental results show that LAX is more efficient and more effective than the tree edit distance for measuring the approximate similarity between XML documents.
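The abstract does not give the clustering rule or the similarity formula, so the following is only a rough sketch of the general idea of comparing subtrees by their leaf nodes, not the LAX algorithm itself. It assumes that each child of the document root is one independent item, and that the similarity degree of two subtrees is the number of shared (tag, text) leaf pairs divided by the size of the larger leaf set. The function names, the threshold, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: same data, different structure.
DOC_A = (
    "<lib>"
    "<book><title>XML</title><year>2001</year></book>"
    "<book><title>SQL</title><year>1999</year></book>"
    "</lib>"
)
DOC_B = (
    "<catalog>"
    "<item><info><title>XML</title></info><year>2001</year><price>30</price></item>"
    "</catalog>"
)


def leaf_pairs(subtree):
    """Collect (tag, text) pairs for every leaf element under subtree."""
    return {
        (node.tag, (node.text or "").strip())
        for node in subtree.iter()
        if len(node) == 0  # an element with no children is a leaf
    }


def similarity_degree(sub_a, sub_b):
    """Assumed measure (not from the paper): shared leaves / larger leaf set."""
    leaves_a, leaves_b = leaf_pairs(sub_a), leaf_pairs(sub_b)
    if not leaves_a or not leaves_b:
        return 0.0
    return len(leaves_a & leaves_b) / max(len(leaves_a), len(leaves_b))


def approximate_join(doc_a, doc_b, threshold=0.5):
    """Pair up subtrees whose similarity degree reaches the threshold.

    Assumed clustering (not from the paper): each child of the root
    element is treated as one independent item.
    """
    root_a, root_b = ET.fromstring(doc_a), ET.fromstring(doc_b)
    matches = []
    for i, sub_a in enumerate(root_a):
        for j, sub_b in enumerate(root_b):
            degree = similarity_degree(sub_a, sub_b)
            if degree >= threshold:
                matches.append((i, j, degree))
    return matches


result = approximate_join(DOC_A, DOC_B)
```

In this toy input, the first `book` and the `item` share two of three leaves even though `title` sits at a different depth, which is the kind of structural difference that a leaf-based comparison is meant to tolerate. The nested loop is a plain pairwise comparison and is used here only for brevity.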