XML Clustering by Principal Component Analysis

Authors:
Jianghui Liu;Jason T. L. Wang;Wynne Hsu;Katherine G. Herbert
Affiliations:
New Jersey Institute of Technology;New Jersey Institute of Technology;National University of Singapore;Montclair State University
Venue:
ICTAI '04 Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence
Year:
2004

Citing 0
Cited 10

Multilevel Conditional Fuzzy C-Means Clustering of XML Documents
PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases

XML documents clustering based on representative path
ICCOMP'09 Proceedings of the WSEAS 13th international conference on Computers

Semantics-guided clustering of heterogeneous XML schemas
Journal on data semantics IX

A weighted common structure based clustering technique for XML documents
Journal of Systems and Software

XML clustering by bit vector
ICCOMP'10 Proceedings of the 14th WSEAS international conference on Computers: part of the 14th WSEAS CSCC multiconference - Volume I

A complete path representation method with a modified inverted index for efficient retrieval of XML documents
WSEAS Transactions on Computers

A flexible structured-based representation for XML document mining
INEX'05 Proceedings of the 4th international conference on Initiative for the Evaluation of XML Retrieval

XML document clustering by independent component analysis
KDXD'06 Proceedings of the First international conference on Knowledge Discovery from XML Documents

Combining structure and content similarities for XML document clustering
AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87

Fractal self-similarity measurements based clustering technique for SOAP Web messages
Journal of Parallel and Distributed Computing

Abstract

XML is increasingly important in data exchange and information management. Considerable effort has been spent on developing efficient techniques for storing, querying, indexing, and accessing XML documents. In this paper we propose a new approach to clustering XML data. In contrast to previous work, which focused on documents defined by different DTDs, the proposed method works for documents with the same DTD. Our approach is to extract features from documents, modeled as ordered labeled trees, and transform the documents into vectors in a high-dimensional Euclidean space based on the occurrences of the features in the documents. We then reduce the dimensionality of the vectors by principal component analysis (PCA) and cluster the vectors in the reduced space. PCA makes it possible to identify vectors with co-occurring features, thereby enhancing the accuracy of the clustering. Experimental results based on documents obtained from Wisconsin's XML data bank show the effectiveness and good performance of the proposed techniques.
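
The pipeline outlined in the abstract (tree features, occurrence vectors, PCA, clustering) can be illustrated with a minimal Python sketch. The abstract does not say which tree features or which clustering algorithm the paper uses, so everything here is an assumption for illustration: root-to-node tag paths stand in for the features, a plain k-means with farthest-point initialization stands in for the clustering step, and the helper names (path_features, vectorize, pca_reduce, kmeans) and the toy documents are hypothetical.

```python
import xml.etree.ElementTree as ET
import numpy as np

def path_features(xml_text):
    """Count root-to-node tag paths (assumed stand-in for the paper's features)."""
    counts = {}
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] = counts.get(path, 0) + 1
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return counts

def vectorize(docs):
    """Map each document to a vector of feature occurrence counts."""
    feats = [path_features(d) for d in docs]
    vocab = sorted({p for f in feats for p in f})
    X = np.array([[f.get(p, 0) for p in vocab] for f in feats], dtype=float)
    return X, vocab

def pca_reduce(X, k):
    """Project mean-centered rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(Z, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization (assumed choice)."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(Z[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        new = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Toy documents sharing one (hypothetical) DTD but differing in structure.
docs = [
    "<movie><title/><cast><actor/><actor/></cast></movie>",
    "<movie><title/><cast><actor/><actor/><actor/></cast></movie>",
    "<movie><title/><review><rating/></review></movie>",
    "<movie><title/><review><rating/><rating/></review></movie>",
]

X, vocab = vectorize(docs)   # occurrence vectors in feature space
Z = pca_reduce(X, 2)         # reduced-dimensional representation
labels = kmeans(Z, 2)        # cluster assignment per document
print(labels)
```

In this toy setup the cast-heavy and review-heavy documents differ mainly along one direction of feature space, which is the kind of co-occurrence structure PCA is meant to expose before clustering.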