XML document clustering by independent component analysis

Authors:
Tong Wang;Da-Xin Liu;Xuan-Zuo Lin
Affiliations:
Department of Computer Science and Technology, Harbin Engineering University, China;Department of Computer Science and Technology, Harbin Engineering University, China;Northeast Agriculture University, Harbin, China
Venue:
KDXD'06 Proceedings of the First international conference on Knowledge Discovery from XML Documents
Year:
2006

Abstract

Clustering XML documents runs into the high-dimensionality problem. Independent Component Analysis (ICA) can reduce dimensionality while also uncovering the latent variables underlying XML structures, which improves clustering quality. This paper proposes a novel strategy for clustering XML documents based on ICA. Each document is first represented in the Vector Space Model (VSM) using the D_path features extracted from its XML tree. ICA is then applied to reduce the dimensionality of the document vectors. Finally, the document vectors are clustered in the reduced Euclidean space spanned by the independent components. The experiments show that ICA can enhance clustering accuracy with stable performance.
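
The first step of the pipeline, turning XML trees into path-based vectors, can be illustrated with a minimal sketch. The abstract does not define D_path, so this example assumes a feature is simply the tag path from the root to each node, weighted by how often it occurs. The function names `root_to_node_paths` and `build_vsm` and the sample documents are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_node_paths(xml_text):
    """Count the tag paths (e.g. 'book/author') that occur in one document.

    Assumption: a path feature is the sequence of tags from the root to a node.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        counts[path] += 1
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return counts


def build_vsm(documents):
    """Map each document to a path-frequency vector over a shared vocabulary."""
    per_doc = [root_to_node_paths(doc) for doc in documents]
    vocabulary = sorted(set().union(*per_doc))
    vectors = [[counts[path] for path in vocabulary] for counts in per_doc]
    return vocabulary, vectors


# Hypothetical sample documents: two "book" records and one "article" record.
docs = [
    "<book><title/><author/><author/></book>",
    "<book><title/><author/></book>",
    "<article><title/><journal/></article>",
]
vocabulary, vectors = build_vsm(docs)
```

The resulting document-by-path matrix is what a dimensionality reduction step would consume. One plausible continuation, not confirmed by the abstract, is to pass it to an off-the-shelf ICA implementation such as scikit-learn's `FastICA` and then cluster the reduced vectors with k-means.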