Structural similarity between XML documents and DTDs

Authors:
Patrick K. L. Ng;Vincent T. Y. Ng
Affiliations:
Department of Computing, the Hong Kong Polytechnic University, Hong Kong;Department of Computing, the Hong Kong Polytechnic University, Hong Kong
Venue:
ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
Year:
2003

Abstract

The use of XML documents on the Internet continues to grow. A need has arisen to analyze XML documents from heterogeneous sources, where documents may conform to different DTDs. In this paper, we propose a measure of the structural similarity between XML documents and DTDs that is natural to understand and fast to calculate. The measure is defined as a weighted sum of the local measures of document elements, with a weighting scheme based on their subtree sizes. The local measure of an element is defined as the edit distance between the element and its declaration in the DTD, where the declaration is viewed as a regular expression. Based on this definition, we propose an algorithm for calculating the edit distance between a string and a regular expression, adapted from an algorithm used for the regular expression matching problem. The advantages of the measure are its natural definition and linear complexity.
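
To make the local measure concrete, the following is a minimal sketch of an edit distance between a sequence of child element names and a DTD content model treated as a regular expression. It is not the algorithm proposed in the paper, which the abstract does not spell out: it uses a plain shortest-path search over (position, NFA state) pairs and makes no attempt at the linear complexity claimed above. The tuple-based content-model encoding, the function names `build_nfa` and `edit_distance`, and the unit costs for insertion, deletion, and substitution are all assumptions made for illustration.

```python
import heapq
from itertools import count


def build_nfa(node):
    """Thompson-style epsilon-NFA for a content model (hypothetical encoding).

    A model is a nested tuple: ("sym", name), ("seq", m1, m2, ...),
    ("alt", m1, m2, ...), ("star", m), ("plus", m) or ("opt", m).
    Returns (start, accept, eps, sym).
    """
    ids = count()
    eps = {}  # state -> states reachable by an epsilon move
    sym = {}  # state -> list of (element name, next state)

    def new_state():
        s = next(ids)
        eps[s], sym[s] = [], []
        return s

    def build(n):
        kind = n[0]
        start, accept = new_state(), new_state()
        if kind == "sym":
            sym[start].append((n[1], accept))
        elif kind == "seq":
            prev = start
            for child in n[1:]:
                s, a = build(child)
                eps[prev].append(s)
                prev = a
            eps[prev].append(accept)
        elif kind == "alt":
            for child in n[1:]:
                s, a = build(child)
                eps[start].append(s)
                eps[a].append(accept)
        elif kind in ("star", "opt", "plus"):
            s, a = build(n[1])
            eps[start].append(s)
            eps[a].append(accept)
            if kind in ("star", "opt"):
                eps[start].append(accept)  # zero occurrences allowed
            if kind in ("star", "plus"):
                eps[a].append(s)           # repetition allowed
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return start, accept

    start, accept = build(node)
    return start, accept, eps, sym


def edit_distance(children, model):
    """Minimum number of unit-cost edits turning `children` into a
    sequence accepted by `model` (assumed cost model, see lead-in)."""
    start, accept, eps, sym = build_nfa(model)
    n = len(children)
    best = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, q = heapq.heappop(heap)
        if d > best.get((i, q), float("inf")):
            continue  # stale heap entry
        if i == n and q == accept:
            return d
        moves = [(d, i, r) for r in eps[q]]          # free epsilon moves
        for label, r in sym[q]:
            moves.append((d + 1, i, r))              # insert a missing element
            if i < n:                                # match or substitute
                moves.append((d + (children[i] != label), i + 1, r))
        if i < n:
            moves.append((d + 1, i + 1, q))          # delete an extra child
        for nd, ni, nq in moves:
            if nd < best.get((ni, nq), float("inf")):
                best[(ni, nq)] = nd
                heapq.heappush(heap, (nd, ni, nq))
    raise AssertionError("accept state is always reachable via insertions")


# Content model for <!ELEMENT book (title, author+, chapter*)>
BOOK = ("seq", ("sym", "title"), ("plus", ("sym", "author")),
        ("star", ("sym", "chapter")))
```

With this encoding, `edit_distance(["title", "chapter"], BOOK)` counts the single missing `author` element, while a child sequence that already conforms to the declaration yields zero. How such local distances are normalized and combined into the subtree-size-weighted sum is defined in the paper itself and is not reproduced here.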