A tree-based approach to clustering XML documents by structure

  • Authors:
  • Gianni Costa; Giuseppe Manco; Riccardo Ortale; Andrea Tagarelli

  • Affiliations:
  • ICAR-CNR - Institute of Italian National Research Council, Via Pietro Bucci 41c, 87036 Rende (CS), Italy; ICAR-CNR - Institute of Italian National Research Council, Via Pietro Bucci 41c, 87036 Rende (CS), Italy; DEIS, University of Calabria, Via Pietro Bucci 41c, 87036 Rende (CS), Italy; DEIS, University of Calabria, Via Pietro Bucci 41c, 87036 Rende (CS), Italy

  • Venue:
  • PKDD '04 Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases
  • Year:
  • 2004

Abstract

We propose a novel methodology for clustering XML documents on the basis of their structural similarities. The idea is to equip each cluster with an XML cluster representative, i.e., an XML document that subsumes the most typical structural features of a set of XML documents. Clustering is essentially accomplished by comparing cluster representatives and updating the representatives as soon as new clusters are detected. We present an algorithm for computing an XML representative, based on suitable techniques for identifying significant node matchings and for reliably merging and pruning XML trees. An experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
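
The general idea of representative-based clustering can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the node-matching, merging, or pruning techniques, so the sketch substitutes simpler stand-ins, all of which are assumptions. Each document is reduced to its set of root-to-node tag paths, similarity is the Jaccard coefficient of two path sets, and a representative is obtained by merging the path sets of a cluster and pruning paths whose support falls below a threshold. The names `tag_paths`, `jaccard`, `representative`, `cluster`, `min_support`, and `threshold` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def jaccard(a, b):
    """Jaccard similarity of two path sets (assumed stand-in for tree matching)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def representative(members, min_support=0.5):
    """Merge the members' path sets, then prune paths below min_support."""
    counts = Counter(path for member in members for path in member)
    return {p for p, c in counts.items() if c / len(members) >= min_support}


def cluster(docs, threshold=0.5):
    """Agglomerative clustering that compares cluster representatives only."""
    clusters = [[tag_paths(d)] for d in docs]   # path sets per cluster
    ids = [[i] for i in range(len(docs))]       # document indices per cluster
    reps = [representative(c) for c in clusters]
    while len(clusters) > 1:
        # Pick the pair of clusters whose representatives are most similar.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: jaccard(reps[pair[0]], reps[pair[1]]),
        )
        if jaccard(reps[i], reps[j]) < threshold:
            break
        # Merge cluster j into cluster i and recompute the representative.
        clusters[i] += clusters[j]
        ids[i] += ids[j]
        reps[i] = representative(clusters[i])
        del clusters[j], ids[j], reps[j]        # i < j, so index i stays valid
    return ids, reps


if __name__ == "__main__":
    docs = [
        "<book><title/><author/></book>",
        "<book><title/><author/><year/></book>",
        "<order><item/><price/></order>",
        "<order><item/></order>",
    ]
    ids, reps = cluster(docs)
    print(ids)
```

In this simplification, pruning by support keeps the representative a valid tree shape, because any document that contains a path also contains all of its prefixes. A faithful implementation would instead operate on the XML trees themselves, as the abstract describes.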