Collaborative clustering of XML documents

  • Authors:
  • Sergio Greco; Francesco Gullo; Giovanni Ponti; Andrea Tagarelli

  • Affiliations:
  • Dept. of Electronics, Computer and Systems Sciences (DEIS), University of Calabria, Via P. Bucci, 41C, 87036 Arcavacata di Rende (CS), Italy (all authors)

  • Venue:
  • Journal of Computer and System Sciences
  • Year:
  • 2011

Abstract

Clustering is widely used to organize large collections of XML documents into groups that are coherent with respect to structure and/or content features. The growing availability of distributed XML sources and the variety of high-demand environments raise the need for clustering approaches that can exploit distributed processing techniques. Nevertheless, existing methods for clustering XML documents are designed to work in a centralized way. In this paper, we address the problem of clustering XML documents in a collaborative distributed framework. XML documents are first decomposed into semantically cohesive subtrees, which are then modeled as transactional data that embed both XML structure and content information. The proposed clustering framework employs a centroid-based partitional clustering method developed for a peer-to-peer network. Each peer in the network computes a local clustering solution over its own data and exchanges its cluster representatives with other peers. The exchanged representatives are then used to compute, collaboratively, the representatives of the global clustering solution. We evaluated the effectiveness and efficiency of our approach on real XML document collections while varying the number of peers. Results show major runtime advantages over the corresponding centralized clustering setting, while clustering solutions remain accurate when the number of nodes in the network is moderately low. Moreover, a comparative evaluation against a distributed non-collaborative clustering method shows that the collaborative nature of our approach is a valuable feature in distributed clustering.
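
The exchange of cluster representatives described in the abstract can be pictured with a small sketch. This is not the authors' algorithm: it assumes transactions are plain item sets, uses Jaccard similarity, and builds each representative from the items found in a strict majority of a cluster. All function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between two item sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def assign(transactions, reps):
    """Local step: put each transaction in the cluster of its most similar representative."""
    clusters = [[] for _ in reps]
    for t in transactions:
        best = max(range(len(reps)), key=lambda j: jaccard(t, reps[j]))
        clusters[best].append(t)
    return clusters

def local_representative(cluster):
    """Items occurring in a strict majority of the cluster's transactions (assumed rule)."""
    counts = Counter(item for t in cluster for item in t)
    return frozenset(i for i, f in counts.items() if 2 * f > len(cluster))

def global_representative(local_reps, sizes, fallback):
    """Merge the peers' local representatives, weighting each by its cluster size."""
    total = sum(sizes)
    if total == 0:
        return fallback  # no peer populated this cluster: keep the previous representative
    weight = Counter()
    for rep, n in zip(local_reps, sizes):
        for item in rep:
            weight[item] += n
    return frozenset(i for i, w in weight.items() if 2 * w > total)

def collaborative_clustering(peers, init_reps, rounds=10):
    """Alternate local clustering on each peer with a merge of the exchanged representatives."""
    reps = list(init_reps)
    for _ in range(rounds):
        local = [assign(data, reps) for data in peers]  # one clustering per peer
        new_reps = []
        for j in range(len(reps)):
            exchanged = [local_representative(c[j]) for c in local]
            sizes = [len(c[j]) for c in local]
            new_reps.append(global_representative(exchanged, sizes, reps[j]))
        if new_reps == reps:  # representatives stopped changing
            break
        reps = new_reps
    return reps, [assign(data, reps) for data in peers]

if __name__ == "__main__":
    # Two toy peers, each holding its own transactions (sets of hypothetical items).
    peers = [
        [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}],
        [{"a", "b"}, {"x", "y", "z"}, {"x", "y"}],
    ]
    reps, clusters = collaborative_clustering(peers, [frozenset({"a"}), frozenset({"x"})])
    print(reps)
```

In this sketch only representatives and cluster sizes cross peer boundaries, never the transactions themselves, which mirrors the collaborative idea of sharing summaries rather than raw data.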