X-Class: Associative Classification of XML Documents by Structure
ACM Transactions on Information Systems (TOIS)
Hierarchical clustering of XML documents focused on structural components
Data & Knowledge Engineering
Hi-index | 0.00 |
XML clustering by structure is, in its most general form, the process of partitioning a corpus of XML documents into disjoint clusters, such that intra-cluster structural homogeneity is high and inter-cluster structural homogeneity is low. In this paper, we propose an algorithm that implements a partitioning strategy, in which root-to-leaf paths are used to separate the XML documents. Paths are discriminatory substructures and, thus, the effectiveness of our algorithm is accordingly high. Moreover, a suitable encoding is adopted for representing and testing the occurrence of the individual paths within each XML document independently of the length of such paths. Not only this expedites clustering, but it also makes our algorithm scalable to process large-scale corpora of XML documents. A comparative evaluation over several standard (real-word and synthetic) XML corpora reveals that our algorithm outperforms all of its competitors in efficiency and scalability, while being as effective as the top-notch competitors. One especially appealing property of the proposed algorithm is that it achieves these levels of performance by automatically establishing a natural number of clusters to be discovered in the underlying XML corpus.