Features for unsupervised document classification

Authors:
S. H. Srinivasan
Affiliations:
Satyam Computer Services Ltd., Bangalore, India
Venue:
COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Year:
2002

References:
Using latent semantic analysis to improve access to textual information. CHI '88 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Automatic text processing: the transformation, analysis, and retrieval of information by computer.
Matrices, Vector Spaces, and Information Retrieval. SIAM Review.
Partitioning-based clustering for Web document categorization. Decision Support Systems - Special issue on WITS '97.
Concept decompositions for large sparse text data using clustering. Machine Learning.
Bipartite graph partitioning and data clustering. Proceedings of the tenth international conference on Information and knowledge management.
Machine learning in automated text categorization. ACM Computing Surveys (CSUR).
Normalized Cuts and Image Segmentation. CVPR '97 Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97).
On clusterings-good, bad and spectral. FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science.
Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks.

Cited by:
Question classification with support vector machines and error correcting codes. NAACL-Short '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003--short papers - Volume 2.
Nonsmooth Nonnegative Matrix Factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence.
Feature diversity in cluster ensembles for robust document clustering. SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.

Abstract

Unsupervised document classification is an important problem in practical text mining since training data is seldom available. In this paper we study the problem of term selection and the performance of various features for unsupervised text classification. The features studied are: principal components, independent components, and non-negative components. The clustering algorithm used is based on bipartite graph partitioning (Zha et al., 2001). The evaluation is performed using the newsgroups corpus.
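
The clustering step named in the abstract can be illustrated with a minimal sketch. It assumes the common spectral relaxation of bipartite graph partitioning, in which the second singular vectors of a degree-normalized term-document matrix are thresholded at zero. This is not necessarily the exact procedure used in the paper or in Zha et al. (2001). The function name `bipartite_spectral_bipartition` and the toy matrix are hypothetical, chosen only for illustration, and the sketch splits the data into just two groups.

```python
import numpy as np


def bipartite_spectral_bipartition(A):
    """Split terms (rows) and documents (columns) of a nonnegative
    term-document matrix into two groups each.

    Assumption: uses the second singular vectors of the degree-normalized
    matrix, thresholded at zero (a standard spectral relaxation).
    """
    A = np.asarray(A, dtype=float)
    term_deg = A.sum(axis=1)  # total weight attached to each term
    doc_deg = A.sum(axis=0)   # total weight attached to each document
    if np.any(term_deg == 0) or np.any(doc_deg == 0):
        raise ValueError("each term and document needs a nonzero entry")

    d1 = 1.0 / np.sqrt(term_deg)
    d2 = 1.0 / np.sqrt(doc_deg)
    A_n = d1[:, None] * A * d2[None, :]  # D1^{-1/2} A D2^{-1/2}

    # Singular values come back in descending order. The first pair is
    # the trivial one (singular value 1); the second carries the split.
    U, s, Vt = np.linalg.svd(A_n, full_matrices=False)
    term_vec = d1 * U[:, 1]
    doc_vec = d2 * Vt[1, :]

    # The sign of a singular pair is arbitrary, so fix it such that the
    # first document always lands in group 0.
    if doc_vec[0] > 0:
        term_vec, doc_vec = -term_vec, -doc_vec

    term_labels = (term_vec > 0).astype(int)
    doc_labels = (doc_vec > 0).astype(int)
    return term_labels, doc_labels


# Hypothetical toy data: 6 terms x 5 documents, two topical blocks
# joined by two weak cross-links so that the graph stays connected.
A = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 2, 1],
    [0, 0, 2, 3, 2],
    [0, 1, 1, 2, 3],
])

term_labels, doc_labels = bipartite_spectral_bipartition(A)
print("term groups:    ", term_labels)
print("document groups:", doc_labels)
```

In a setting like the one the abstract describes, the rows of `A` would be the selected features (for example principal, independent, or non-negative components) rather than raw term counts. The normalized matrix must stay nonnegative for the degree normalization to make sense, so signed features would need a different treatment.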