XML is increasingly important in data exchange and information management. Considerable effort has gone into developing efficient techniques for storing, querying, indexing, and accessing XML documents. In this paper we propose a new approach to clustering XML data. In contrast to previous work, which focused on documents defined by different DTDs, the proposed method works for documents with the same DTD. Our approach extracts features from documents, modeled as ordered labeled trees, and transforms the documents into vectors in a high-dimensional Euclidean space based on the occurrences of the features in the documents. We then reduce the dimensionality of the vectors by principal component analysis (PCA) and cluster the vectors in the reduced-dimensional space. PCA makes it possible to identify vectors with co-occurring features, thereby enhancing the accuracy of the clustering. Experimental results on documents obtained from Wisconsin's XML data bank show the effectiveness and good performance of the proposed techniques.
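
The pipeline described in the abstract (feature extraction, occurrence vectors, PCA, clustering) can be illustrated with a minimal sketch. Several parts are assumptions made for illustration and are not taken from the paper: parent/child tag pairs stand in for the tree features, k-means stands in for the unspecified clustering step, and all function names (`edge_features`, `to_vectors`, `pca`, `kmeans`) and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def edge_features(xml_text):
    """Count parent/child tag pairs (an assumed stand-in for the paper's features)."""
    counts = Counter()
    for parent in ET.fromstring(xml_text).iter():
        for child in parent:
            counts[(parent.tag, child.tag)] += 1
    return counts

def to_vectors(docs):
    """Map each document to a vector of feature occurrence counts."""
    counted = [edge_features(d) for d in docs]
    vocab = sorted(set().union(*counted))
    return [[float(c[f]) for f in vocab] for c in counted], vocab

def pca(vectors, k, iters=200):
    """Project centered vectors onto the top-k principal axes (power iteration)."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    X = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(x[a] * x[b] for x in X) / n for b in range(d)] for a in range(d)]
    axes = []
    for _ in range(k):
        w = [1.0 / (j + 1) for j in range(d)]  # fixed, non-uniform start vector
        for _ in range(iters):
            w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            w = [x / norm for x in w]
        lam = sum(w[a] * sum(cov[a][b] * w[b] for b in range(d)) for a in range(d))
        axes.append(w)
        # Deflate so the next pass converges to the next principal axis.
        cov = [[cov[a][b] - lam * w[a] * w[b] for b in range(d)] for a in range(d)]
    return [[sum(x[j] * ax[j] for j in range(d)) for ax in axes] for x in X]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical documents that share one schema but differ in structure.
docs = [
    "<bib><book><title/><author/><author/></book></bib>",
    "<bib><book><title/><author/></book><book><title/><author/></book></bib>",
    "<bib><article><title/><journal/></article></bib>",
    "<bib><article><title/><journal/></article>"
    "<article><title/><journal/></article></bib>",
]
vectors, vocab = to_vectors(docs)
reduced = pca(vectors, k=2)
labels = kmeans(reduced, k=2)
print(labels)
```

On this toy input the two book-style documents should receive one label and the two article-style documents the other, because their co-occurring features dominate the first principal axis. A real implementation would likely use richer tree features and a numerical library for the eigendecomposition.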