Clustering XML Documents Using Closed Frequent Subtrees: A Structural Similarity Approach

  • Authors:
  • Sangeetha Kutty; Tien Tran; Richi Nayak; Yuefeng Li

  • Affiliations:
  • Faculty of Information Technology, Queensland University of Technology, Brisbane, Australia (all authors)

  • Venue:
  • Focused Access to XML Documents
  • Year:
  • 2008

Abstract

This paper presents an experimental study conducted on the INEX 2007 Document Mining Challenge corpus using a frequent subtree-based incremental clustering approach. Closed frequent subtrees are first generated from the structural information of the XML documents. A matrix is then built that represents the distribution of these closed frequent subtrees across the documents, and this matrix is used to cluster the XML documents progressively. Despite the large number of documents in the INEX 2007 Wikipedia dataset, the proposed frequent subtree-based incremental clustering approach clustered the documents successfully.
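
The pipeline outlined in the abstract (a subtree-by-document matrix followed by progressive clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the closed frequent subtrees have already been mined and are identified by hypothetical labels such as "t1", and it swaps in a simple centroid-based scheme with cosine similarity and an arbitrary threshold of 0.5, none of which is stated in the abstract.

```python
from math import sqrt


def build_matrix(doc_subtrees, subtree_ids):
    """Map each document to a binary row: 1 if the subtree occurs in it, else 0."""
    return {
        doc: [1 if s in present else 0 for s in subtree_ids]
        for doc, present in doc_subtrees.items()
    }


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def incremental_cluster(matrix, threshold=0.5):
    """Process documents one at a time (assumed scheme, not the paper's).

    Each document joins the most similar existing cluster if the similarity
    to that cluster's centroid reaches the threshold; otherwise it starts a
    new cluster.
    """
    clusters = []  # each cluster keeps its members and a running column sum
    for doc, row in matrix.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            size = len(cluster["members"])
            centroid = [v / size for v in cluster["sum"]]
            sim = cosine(row, centroid)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [doc], "sum": list(row)})
        else:
            best["members"].append(doc)
            best["sum"] = [s + v for s, v in zip(best["sum"], row)]
    return [c["members"] for c in clusters]


# Hypothetical input: which closed frequent subtrees occur in which document.
doc_subtrees = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t2", "t3"},
    "d3": {"t4"},
    "d4": {"t4", "t5"},
}
subtree_ids = ["t1", "t2", "t3", "t4", "t5"]

matrix = build_matrix(doc_subtrees, subtree_ids)
print(incremental_cluster(matrix))
```

Because each document is compared only against cluster centroids rather than against every other document, a scheme of this kind needs a single pass over the matrix, which is one plausible reason an incremental approach suits a large collection.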