Subspace clustering of text documents with feature weighting k-means algorithm

Authors:
Liping Jing;Michael K. Ng;Jun Xu;Joshua Zhexue Huang
Affiliations:
Department of Mathematics, The University of Hong Kong, Hong Kong, China;Department of Mathematics, The University of Hong Kong, Hong Kong, China;E-Business Technology Institute, The University of Hong Kong, Hong Kong, China;E-Business Technology Institute, The University of Hong Kong, Hong Kong, China
Venue:
PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Year:
2005

Abstract

This paper presents a new method for clustering large and complex text data. The method is based on a new subspace clustering algorithm that automatically calculates feature weights during the k-means clustering process. When clustering sparse text data, the feature weights are used to discover clusters in subspaces of the document vector space and to identify the keywords that represent the semantics of each cluster. We present a modification of the published algorithm that solves the sparsity problem arising in text clustering. Experimental results on real-world text data show that the new method outperforms the Standard KMeans and Bisection-KMeans algorithms while maintaining the efficiency of the k-means clustering process.
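
The abstract gives no formulas, so the following is only a minimal sketch of how a feature-weighting k-means loop of this kind might look, not the authors' implementation. It assumes one weight per (cluster, feature) pair, a weight exponent `beta` greater than 1, and a small constant `sigma` added to every squared difference so that features with zero dispersion in a sparse cluster do not cause a division by zero. The function name, the parameters `beta` and `sigma`, and the exact update rule are all illustrative assumptions.

```python
import numpy as np


def feature_weighting_kmeans(X, k, beta=2.0, sigma=1e-3, n_iter=20, seed=0):
    """Illustrative k-means variant with per-cluster feature weights.

    X is a dense (n_documents, n_features) array. beta (> 1) and sigma (> 0)
    are assumed hyperparameters, not values taken from the abstract.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Start from k distinct documents as centers and from uniform weights.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    weights = np.full((k, m), 1.0 / m)
    labels = np.zeros(n, dtype=int)

    for _ in range(n_iter):
        # Assignment step: weighted squared distance to every center.
        diff2 = (X[:, None, :] - centers[None, :, :]) ** 2 + sigma
        dist = (diff2 * weights[None, :, :] ** beta).sum(axis=2)
        labels = dist.argmin(axis=1)

        for l in range(k):
            members = X[labels == l]
            if len(members) == 0:
                continue  # keep the previous center and weights
            # Center update: plain mean of the cluster members.
            centers[l] = members.mean(axis=0)
            # Weight update: features with a small within-cluster dispersion
            # get a large weight; sigma keeps every dispersion positive.
            disp = ((members - centers[l]) ** 2 + sigma).sum(axis=0)
            inv = disp ** (-1.0 / (beta - 1.0))
            weights[l] = inv / inv.sum()

    return labels, centers, weights
```

Under these assumptions, the features with the largest weights in a row of `weights` are the candidate keywords for that cluster, for example `np.argsort(weights[l])[::-1][:10]` for cluster `l`. A real text collection would normally be stored as a sparse matrix, which this dense sketch does not handle.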