Clustering XML Documents Using Closed Frequent Subtrees: A Structural Similarity Approach

Authors:
Sangeetha Kutty;Tien Tran;Richi Nayak;Yuefeng Li
Affiliations:
Faculty of Information Technology, Queensland University of Technology, Brisbane, Australia (all authors)
Venue:
Focused Access to XML Documents
Year:
2008

Abstract

This paper presents an experimental study on the INEX 2007 Document Mining Challenge corpus using a frequent subtree-based incremental clustering approach. Closed frequent subtrees are first generated from the structural information of the XML documents. A matrix representing the distribution of these closed frequent subtrees across the documents is then built and used to cluster the XML documents progressively. Despite the large number of documents in the INEX 2007 Wikipedia dataset, the proposed approach clustered the documents successfully.
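
The pipeline described in the abstract (subtree-by-document matrix, then progressive clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the closed frequent subtrees have already been mined and are given as hypothetical path-like strings, and it assumes a binary occurrence matrix, cosine similarity, and a fixed similarity threshold, none of which are specified in the abstract. The function names `build_matrix` and `incremental_cluster` are also invented for illustration.

```python
from math import sqrt


def build_matrix(doc_subtrees, subtrees):
    """Binary document-by-subtree matrix (assumption: 1 = subtree occurs in the document)."""
    return [[1 if s in present else 0 for s in subtrees] for present in doc_subtrees]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def incremental_cluster(matrix, threshold=0.5):
    """Assign each row to the most similar existing cluster, or open a new one.

    Documents are processed one at a time in input order. The threshold is an
    assumed parameter, not a value taken from the paper.
    """
    cluster_sums = []  # running sum of member rows; same direction as the centroid
    labels = []
    for row in matrix:
        best, best_sim = None, threshold
        for idx, total in enumerate(cluster_sums):
            sim = cosine(row, total)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # No cluster is similar enough: start a new cluster with this document.
            cluster_sums.append(list(row))
            labels.append(len(cluster_sums) - 1)
        else:
            # Fold the document into the chosen cluster's running sum.
            cluster_sums[best] = [t + x for t, x in zip(cluster_sums[best], row)]
            labels.append(best)
    return labels


# Hypothetical closed frequent subtrees and the subtrees found in four documents.
subtrees = ["article/title", "article/body/section", "article/body/table", "article/template"]
doc_subtrees = [
    {"article/title", "article/body/section"},
    {"article/title", "article/body/section", "article/body/table"},
    {"article/template"},
    {"article/body/table", "article/template"},
]

matrix = build_matrix(doc_subtrees, subtrees)
labels = incremental_cluster(matrix, threshold=0.5)
print(labels)
```

With this toy input the first two documents share a cluster and the last two share another, so the printed labels are `[0, 0, 1, 1]`. Because each document is compared only against the current clusters rather than against every other document, the cost grows with the number of clusters, which is one plausible reason an incremental scheme suits a large corpus.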