Similarity computation for XML documents by XML element sequence patterns

Authors:
Haiwei Zhang;Xiaojie Yuan;Na Yang;Zhongqi Liu
Affiliations:
Department of Computer Science and Technology, Nankai University, Tianjin, China PR;Department of Computer Science and Technology, Nankai University, Tianjin, China PR;Department of Computer Science and Technology, Nankai University, Tianjin, China PR;Department of Computer Science and Technology, Nankai University, Tianjin, China PR
Venue:
APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
Year:
2008

Abstract

Measuring the similarity between XML documents is a fundamental task in clustering collections of XML documents. In this paper, an XML document is modeled as an XML Element Sequence Pattern (XESP), which can be extracted using less time and space than other models such as the tree model and the frequent-paths model. Similarity between XML documents is then measured on the basis of their XESPs. Frequent-path pattern models ignore hierarchical information and tree models ignore semantic information; to address these deficiencies, both the semantics of the elements and the hierarchical structure of the document are taken into account when computing similarity by XESPs. Experimental results show that perfect clustering is obtained when a proper threshold is chosen for the similarity computed by XESPs.
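
The abstract does not define the XESP model or its similarity formula, so the following is only an illustrative sketch of the general idea of comparing documents by their element sequences. It assumes a sequence of (tag, depth) pairs taken in document order and uses Python's standard difflib matching ratio as a stand-in similarity measure. The function names element_sequence and sequence_similarity are hypothetical, and exact tag matching replaces the element semantics that the paper takes into account.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def element_sequence(xml_text):
    """Return (tag, depth) pairs in document order (pre-order traversal).

    Assumption: this is a simplified stand-in for an element sequence
    pattern, not the XESP definition used in the paper.
    """
    root = ET.fromstring(xml_text)
    seq = []

    def walk(node, depth):
        # Record the element name together with its nesting level so that
        # some hierarchical information survives the flattening.
        seq.append((node.tag, depth))
        for child in node:
            walk(child, depth + 1)

    walk(root, 1)
    return seq


def sequence_similarity(xml_a, xml_b):
    """Similarity in [0, 1] between two documents' element sequences.

    Assumption: difflib's ratio (2 * matches / total length) is used here
    purely for illustration; it is not the measure proposed in the paper.
    """
    a = element_sequence(xml_a)
    b = element_sequence(xml_b)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    doc1 = "<book><title/><author/></book>"
    doc2 = "<book><title/><price/></book>"
    print(sequence_similarity(doc1, doc2))
```

Under this sketch, two documents would be placed in the same cluster when their similarity exceeds a chosen threshold; the threshold value itself is left open here, as the abstract does not state one.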