Similarity computation for XML documents by XML element sequence patterns

  • Authors:
  • Haiwei Zhang;Xiaojie Yuan;Na Yang;Zhongqi Liu

  • Affiliations:
  • Department of Computer Science and Technology, Nankai University, Tianjin, China PR;Department of Computer Science and Technology, Nankai University, Tianjin, China PR;Department of Computer Science and Technology, Nankai University, Tianjin, China PR;Department of Computer Science and Technology, Nankai University, Tianjin, China PR

  • Venue:
  • APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
  • Year:
  • 2008

Abstract

Measuring the similarity between XML documents is a fundamental task in finding clusters in XML document collections. In this paper, an XML document is modeled as an XML Element Sequence Pattern (XESP), which can be extracted using less time and space than other models such as the tree model and the frequent paths model. Similarity between XML documents is then measured based on their XESPs. Frequent paths pattern models ignore hierarchical information, and tree models ignore semantic information; to address these deficiencies, both the semantics of the elements and the hierarchical structure of the document are taken into account when computing the similarity between XML documents by XESPs. Experimental results show that perfect clustering is obtained when a proper threshold is chosen for the similarity computed by XESPs.