On Effective XML Clustering by Path Commonality: An Efficient and Scalable Algorithm

  • Authors:
  • Gianni Costa;Riccardo Ortale

  • Venue:
  • ICTAI '12 Proceedings of the 2012 IEEE 24th International Conference on Tools with Artificial Intelligence - Volume 01
  • Year:
  • 2012

Abstract

XML clustering by structure is, in its most general form, the process of partitioning a corpus of XML documents into disjoint clusters such that intra-cluster structural homogeneity is high and inter-cluster structural homogeneity is low. In this paper, we propose an algorithm that implements a partitioning strategy in which root-to-leaf paths are used to separate the XML documents. Because paths are discriminatory substructures, the effectiveness of our algorithm is accordingly high. Moreover, a suitable encoding is adopted for representing the individual paths and testing their occurrence within each XML document independently of path length. This not only expedites clustering but also makes our algorithm scalable to large-scale corpora of XML documents. A comparative evaluation over several standard (real-world and synthetic) XML corpora reveals that our algorithm outperforms all of its competitors in efficiency and scalability, while being as effective as the best-performing competitors. One especially appealing property of the proposed algorithm is that it achieves these levels of performance by automatically establishing a natural number of clusters to be discovered in the underlying XML corpus.
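
The abstract does not spell out the encoding or the partitioning criterion, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each distinct root-to-leaf tag path is mapped once to an integer identifier, so that later occurrence tests are set lookups on integers whose cost does not depend on path length. The function names (`root_to_leaf_paths`, `encode_corpus`, `split_by_path`) and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return the set of distinct root-to-leaf tag paths of one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            # A node without child elements is a leaf: record its full path.
            paths.add(path)
        for child in children:
            stack.append((child, path + (child.tag,)))
    return paths


def encode_corpus(docs):
    """Assign an integer id to every distinct path (assumed encoding, not the
    paper's) and represent each document as the set of ids it contains."""
    path_ids = {}
    encoded = []
    for xml_text in docs:
        ids = set()
        for path in sorted(root_to_leaf_paths(xml_text)):
            ids.add(path_ids.setdefault(path, len(path_ids)))
        encoded.append(ids)
    return path_ids, encoded


def split_by_path(encoded, path_id):
    """Separate document indices by presence or absence of one encoded path."""
    with_path = [i for i, ids in enumerate(encoded) if path_id in ids]
    without_path = [i for i, ids in enumerate(encoded) if path_id not in ids]
    return with_path, without_path


# Hypothetical three-document corpus used only for illustration.
docs = [
    "<a><b><c/></b><d/></a>",
    "<a><b><c/></b></a>",
    "<a><e/></a>",
]
path_ids, encoded = encode_corpus(docs)
with_path, without_path = split_by_path(encoded, path_ids[("a", "b", "c")])
```

Here the path `a/b/c` separates the first two documents from the third. How the paper chooses which paths to split on, and how it settles on the number of clusters, is not stated in the abstract and is deliberately left out of this sketch.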