Structure and content similarity for clustering XML documents

  • Authors:
  • Lijun Zhang;Zhanhuai Li;Qun Chen;Ning Li

  • Affiliations:
  • School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China;School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China

  • Venue:
  • WAIM'10 Proceedings of the 2010 international conference on Web-age information management
  • Year:
  • 2010

Abstract

XML is used extensively in many applications related to information retrieval. As an important data mining technique, clustering has been applied to the analysis of XML data. The key issue in XML clustering is how to measure the similarity between XML documents. Traditional document clustering methods measure document similarity using content information alone and ignore the structural information contained in XML documents. In this paper, we propose a model called the Structure and Content Vector Model (SCVM) to represent both the structure and the content information in XML documents. Based on this model, we define a similarity measure that can be used to cluster XML documents. Our experimental results show that the proposed model and similarity measure are effective in identifying similar documents when the structure information contained in XML documents is meaningful. The method can be used to improve the precision and efficiency of XML information retrieval.
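The abstract does not give the definition of SCVM, so the following is only an illustrative sketch of the general idea of combining structure and content in one vector, not the authors' model. It assumes, purely for illustration, that each feature is an (element path, term) pair weighted by raw count and that similarity is the cosine of two such vectors. The function names `path_term_vector` and `cosine` are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_term_vector(xml_text):
    """Count (element path, term) pairs for one XML document.

    Assumption for illustration: a feature is the root-to-element tag path
    paired with a lowercased whitespace-separated term from that element's
    leading text. Tail text and attributes are ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        for term in (node.text or "").lower().split():
            counts[(path, term)] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    book = path_term_vector("<book><title>xml clustering</title></book>")
    article = path_term_vector("<article><title>xml clustering</title></article>")
    # Same words, different element paths: no shared features in this sketch.
    print(cosine(book, article))
```

Under these assumptions, two documents with identical text but different element paths share no features, which is one simple way structure can influence a similarity score. A content-only bag-of-words model would treat the same two documents as identical.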