A novel XML document structure comparison framework based on sub-tree commonalities and label semantics

Authors:
Joe Tekli; Richard Chbeir
Affiliations:
ICMC Computer Science and Statistics Institute, University of Sao Paulo, 13566-590 Sao Carlos, SP, Brazil; LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France
Venue:
Web Semantics: Science, Services and Agents on the World Wide Web
Year:
2012

Abstract

XML similarity evaluation has become a central issue in the database and information communities, with applications ranging from document clustering and version control to data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them rely on techniques for computing the edit distance between tree structures, since XML documents are commonly modeled as Ordered Labeled Trees. Yet, a thorough investigation of current approaches led us to identify several similarity aspects, namely sub-tree related structural and semantic similarities, that are not sufficiently addressed when comparing XML documents. In this paper, we provide an integrated and fine-grained comparison framework to deal with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar sub-trees), and to allow the end-user to adjust the comparison process according to her requirements. Our framework consists of four main modules for (i) discovering the structural commonalities between sub-trees, (ii) identifying sub-tree semantic resemblances, (iii) computing tree-based edit operation costs, and (iv) computing tree edit distance. Experimental results demonstrate higher comparison accuracy with respect to alternative methods, while timing experiments reflect the impact of semantic similarity on overall system performance.
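
To make the notion of edit distance between Ordered Labeled Trees concrete, the following is a minimal sketch of the baseline idea only. It is not the framework described in the abstract: it assumes unit costs for node insertion, deletion, and relabeling, compares element tags only, and ignores sub-tree commonalities and label semantics. The function names (`to_tree`, `forest_distance`, `tree_edit_distance`) are hypothetical, and the recursion is the textbook rightmost-root forest decomposition with memoization, which suits small documents rather than large collections.

```python
from functools import lru_cache
import xml.etree.ElementTree as ET


def to_tree(elem):
    """Convert an XML element into a hashable (label, children) tuple."""
    return (elem.tag, tuple(to_tree(child) for child in elem))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees).

    Assumed unit costs: insert = delete = 1, relabel = 1 if tags differ.
    """
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children become roots.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        # Delete the rightmost root of f; its children become roots.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1

    (label_f, kids_f), (label_g, kids_g) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + kids_f, g) + 1
    insert = forest_distance(f, g[:-1] + kids_g) + 1
    # Match the two rightmost roots: compare their child forests and
    # the remaining left-hand forests independently.
    match = (
        forest_distance(kids_f, kids_g)
        + forest_distance(f[:-1], g[:-1])
        + (0 if label_f == label_g else 1)
    )
    return min(delete, insert, match)


def tree_edit_distance(xml_a, xml_b):
    """Unit-cost edit distance between the element trees of two XML strings."""
    a = to_tree(ET.fromstring(xml_a))
    b = to_tree(ET.fromstring(xml_b))
    return forest_distance((a,), (b,))


if __name__ == "__main__":
    print(tree_edit_distance("<a><b/><c/></a>", "<a><b/><d/></a>"))  # 1
    print(tree_edit_distance("<a><b><c/></b></a>", "<a><c/></a>"))   # 1
```

In this sketch, a semantic measure over labels could in principle replace the fixed relabel cost of 1, and sub-tree level costs could replace the per-node insert and delete costs; how the paper actually defines those costs is not stated in the abstract.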