Structure and content similarity for clustering XML documents

Authors:
Lijun Zhang;Zhanhuai Li;Qun Chen;Ning Li
Affiliations:
School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China
Venue:
WAIM'10 Proceedings of the 2010 international conference on Web-age information management
Year:
2010

Abstract

XML has been used extensively in many information-retrieval applications. As an important data mining technique, clustering has been used to analyze XML data. The key issue in XML clustering is how to measure the similarity between XML documents. Traditional document clustering methods measure similarity using content information alone and ignore the structural information contained in XML documents. In this paper, we propose a model called the Structure and Content Vector Model (SCVM) to represent both the structure and the content information in XML documents. Based on this model, we define a similarity measure that can be used to cluster XML documents. Our experimental results show that the proposed model and similarity measure are effective in identifying similar documents when the structural information contained in XML documents is meaningful. This method can be used to improve precision and efficiency in XML information retrieval.
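
The abstract does not give the definition of SCVM, so the following is only an illustrative sketch of the general idea of combining structure and content in one vector, not the authors' model. It assumes (hypothetically) that each vector dimension is a pair of a root-to-element tag path and a term found in that element's text, and that similarity is plain cosine similarity. The function names `path_term_vector` and `cosine` are made up for this example.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_term_vector(xml_text):
    """Count (tag path, term) pairs for one XML document.

    Assumption for illustration: a dimension is the pair of an element's
    root-to-element tag path and a lowercase token from its own text.
    Tail text after child elements is ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    vector = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        for term in (node.text or "").lower().split():
            vector[(path, term)] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two documents with the same words under the same element,
# and a third with the same words under a different element.
doc_a = "<paper><title>xml clustering</title></paper>"
doc_b = "<paper><title>xml clustering</title></paper>"
doc_c = "<paper><abstract>xml clustering</abstract></paper>"

sim_same_structure = cosine(path_term_vector(doc_a), path_term_vector(doc_b))
sim_diff_structure = cosine(path_term_vector(doc_a), path_term_vector(doc_c))
```

A content-only bag-of-words comparison would treat all three documents as identical, whereas this sketch separates `doc_c` from the other two because its terms occur under a different path. A practical measure would likely need softer matching between related paths and some term weighting; both are left out here.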