Finding Syntactic Similarities Between XML Documents

Authors:
Davood Rafiei;Daniel L. Moise;Dabo Sun
Affiliations:
University of Alberta, Canada;University of Alberta, Canada;University of Alberta, Canada
Venue:
DEXA '06 Proceedings of the 17th International Conference on Database and Expert Systems Applications
Year:
2006

Cited 11

Propagation-vectors for trees (PVT): concise yet effective summaries for hierarchical data and trees
Proceedings of the 2008 ACM workshop on Large-Scale distributed systems for information retrieval

GAIML: A new language for verbal and graphical interaction in chatbots
Mobile Information Systems - Information Assurance and Advanced Human-Computer Interfaces

Towards language-independent web genre detection
Proceedings of the 18th international conference on World wide web

Reducing metadata complexity for faster table summarization
Proceedings of the 13th International Conference on Extending Database Technology

Structural and semantic aspects of similarity of Document Type Definitions and XML schemas
Information Sciences: an International Journal

GRAMS3: an efficient framework for XML structural similarity search
DASFAA'10 Proceedings of the 15th international conference on Database systems for advanced applications

XML data clustering: An overview
ACM Computing Surveys (CSUR)

A novel XML document structure comparison framework based-on sub-tree commonalities and label semantics
Web Semantics: Science, Services and Agents on the World Wide Web

Clustering XML documents by structure
ADBIS'09 Proceedings of the 13th East European conference on Advances in Databases and Information Systems

Mining frequent association tag sequences for clustering XML documents
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications

Survey: An overview on XML similarity: Background, current trends and future directions
Computer Science Review

Abstract

Detecting structural similarities between XML documents has been the subject of several recent studies, and the proposed algorithms mostly use the tree edit distance between the corresponding trees of the XML documents. However, evaluating a tree edit distance is computationally expensive and does not easily scale up to large collections. We show in this paper that a tree edit distance computation is often unnecessary and can be avoided. In particular, we propose a concise structural summary of XML documents and show that a comparison based on this summary is both fast and effective. Our experimental evaluation shows that this method does an excellent job of grouping documents generated by the same DTD, outperforming some of the previously proposed solutions based on a tree comparison. Furthermore, the time complexity of the algorithm is linear in the size of the structural description.
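
To make the idea of comparing structural summaries instead of full trees concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the summary defined in the paper: the summary used here (the set of distinct root-to-element tag paths) and the Jaccard measure are stand-ins, and the names `path_summary` and `summary_similarity` are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_summary(xml_text):
    """Return the set of distinct root-to-element tag paths in a document.

    Assumption: this path set is an illustrative stand-in for a structural
    summary, not the summary proposed in the paper.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:  # iterative traversal, each element visited once
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def summary_similarity(a, b):
    """Jaccard similarity of two path summaries (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two documents with the same element structure and one unrelated document.
doc1 = "<book><title>A</title><author><name>X</name></author></book>"
doc2 = "<book><title>B</title><title>C</title><author><name>Y</name></author></book>"
doc3 = "<order><item><sku>1</sku></item></order>"

s1, s2, s3 = (path_summary(d) for d in (doc1, doc2, doc3))
print(summary_similarity(s1, s2))  # same path set, repeated elements collapse
print(summary_similarity(s1, s3))  # no shared paths
```

Because repeated elements collapse into a single path, the summary can be much smaller than the document, and the comparison only touches the two summaries rather than aligning the full trees as an edit distance would.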