Combining structure and content similarities for XML document clustering

Authors:
Tien Tran;Richi Nayak;Peter Bruza
Affiliations:
Queensland University of Technology, Brisbane QLD, Australia;Queensland University of Technology, Brisbane QLD, Australia;Queensland University of Technology, Brisbane QLD, Australia
Venue:
AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87
Year:
2008

Abstract

This paper proposes a clustering approach that uses both the content and the structure of XML documents to determine the similarity among them. On the assumption that the role and importance of content and structure depend on the use and purpose of a dataset, the content and the structure of the documents are handled by two different similarity measures. The similarity values produced by these two measures are then combined with weightings to give the overall document similarity. The effect of structure similarity and content similarity on the clustering solution is thoroughly analysed. The experiments show that, in a homogeneous environment, where all documents are derived from one structural definition, clustering text-centric XML documents on content-only information produces a better solution. In a heterogeneous environment, where documents are derived from two or more structural definitions, clustering text-centric XML documents produces a better result when the structure and content similarities of the documents are combined with different strengths.
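
The weighted combination described in the abstract can be illustrated with a minimal sketch. The abstract does not name the two similarity measures or the weighting scheme, so everything below is an assumption made for illustration: the structure measure is a Jaccard overlap of root-to-element tag paths, the content measure is a cosine over raw term counts, and the two are combined as a convex sum controlled by a hypothetical `structure_weight` parameter. It is not the authors' method.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(root):
    """Collect root-to-element tag paths such as 'article/body'."""
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def structure_similarity(root_a, root_b):
    """Jaccard overlap of tag paths (assumed stand-in structure measure)."""
    a, b = tag_paths(root_a), tag_paths(root_b)
    return len(a & b) / len(a | b)  # union always contains the root tags


def term_counts(root):
    """Count lower-cased, whitespace-separated terms over all text nodes."""
    return Counter(" ".join(root.itertext()).lower().split())


def content_similarity(root_a, root_b):
    """Cosine similarity of term counts (assumed stand-in content measure)."""
    a, b = term_counts(root_a), term_counts(root_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_similarity(xml_a, xml_b, structure_weight=0.5):
    """Weighted sum of the two measures; the weights are assumed to sum to 1."""
    if not 0.0 <= structure_weight <= 1.0:
        raise ValueError("structure_weight must lie in [0, 1]")
    root_a, root_b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    s = structure_similarity(root_a, root_b)
    c = content_similarity(root_a, root_b)
    return structure_weight * s + (1.0 - structure_weight) * c
```

Setting `structure_weight=0.0` gives a content-only comparison, which corresponds to the homogeneous setting in the abstract, while intermediate values mix the two signals as in the heterogeneous setting. The resulting pairwise scores could then be passed to any clustering algorithm that accepts a similarity matrix.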