On Effective XML Clustering by Path Commonality: An Efficient and Scalable Algorithm

Authors:
Gianni Costa;Riccardo Ortale
Affiliations:
-;-
Venue:
ICTAI '12 Proceedings of the 2012 IEEE 24th International Conference on Tools with Artificial Intelligence - Volume 01
Year:
2012

Citing 0
Cited 2

X-Class: Associative Classification of XML Documents by Structure

ACM Transactions on Information Systems (TOIS)
Hierarchical clustering of XML documents focused on structural components

Data & Knowledge Engineering

Abstract

XML clustering by structure is, in its most general form, the process of partitioning a corpus of XML documents into disjoint clusters, such that intra-cluster structural homogeneity is high and inter-cluster structural homogeneity is low. In this paper, we propose an algorithm that implements a partitioning strategy in which root-to-leaf paths are used to separate the XML documents. Paths are discriminatory substructures and, thus, the effectiveness of our algorithm is accordingly high. Moreover, a suitable encoding is adopted for representing and testing the occurrence of the individual paths within each XML document independently of the length of such paths. Not only does this expedite clustering, but it also makes our algorithm scalable to large-scale corpora of XML documents. A comparative evaluation over several standard (real-world and synthetic) XML corpora reveals that our algorithm outperforms all of its competitors in efficiency and scalability, while being as effective as the top-notch competitors. One especially appealing property of the proposed algorithm is that it achieves these levels of performance by automatically establishing a natural number of clusters to be discovered in the underlying XML corpus.
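
The abstract does not spell out the encoding or the partitioning criterion, so the following Python sketch only illustrates the general idea and is not the authors' algorithm. It assumes that a document can be summarized by its set of distinct root-to-leaf tag paths, and that each distinct path can be mapped to an integer identifier so that testing whether a path occurs in a document no longer depends on the path's length. The function names (`root_to_leaf_paths`, `encode_corpus`, `split_by_path`) and the `/`-joined path notation are hypothetical choices made for this illustration.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return the set of distinct root-to-leaf tag paths of one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, prefix = stack.pop()
        children = list(node)
        if not children:
            # A node without child elements is a leaf: record the full path.
            paths.add("/".join(prefix))
        for child in children:
            stack.append((child, prefix + (child.tag,)))
    return paths


def encode_corpus(docs):
    """Assign an integer id to every distinct path (assumed encoding).

    Returns the path-to-id dictionary and, for each document, the
    frozenset of ids of the paths it contains.
    """
    path_ids = {}
    encoded = []
    for text in docs:
        ids = frozenset(
            path_ids.setdefault(path, len(path_ids))
            for path in root_to_leaf_paths(text)
        )
        encoded.append(ids)
    return path_ids, encoded


def split_by_path(encoded, path_id):
    """Separate document indices by presence or absence of one path id."""
    with_path = [i for i, ids in enumerate(encoded) if path_id in ids]
    without_path = [i for i, ids in enumerate(encoded) if path_id not in ids]
    return with_path, without_path
```

Once the corpus is encoded, each occurrence test in `split_by_path` is a set lookup on an integer, whatever the length of the original path. How the paper chooses the separating paths and decides when to stop splitting is not stated in the abstract and is left out of this sketch.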