Non-negative matrix factorization based text mining: feature extraction and classification

Authors:
P. C. Barman;Nadeem Iqbal;Soo-Young Lee
Affiliations:
Department of BioSystems, Korea Advanced Institute of Science and Technology, Brain Science Research Center and Computational NeuroSystems Lab, Daejeon, Republic of Korea (all three authors)
Venue:
ICONIP'06 Proceedings of the 13th international conference on Neural Information Processing - Volume Part II
Year:
2006

Abstract

Unlabeled document and text collections keep growing, and mining such data sets is a challenging task. Using the simple word-document frequency matrix as the feature space makes the mining process more complex. Text documents are often represented as high-dimensional sparse vectors, with a few thousand dimensions and a sparsity of about 95 to 99%, which significantly affects both the efficiency and the results of the mining process. In this paper, we propose a two-stage Non-negative Matrix Factorization (NMF). In the first stage, we extract uncorrelated basis probabilistic document feature vectors by significantly reducing the dimension of the word-document frequency feature vectors from a few thousand to a few hundred. The second stage performs clustering or classification. With the proposed approach, we observed clustering or classification accuracy of more than 98.5%. The dimension reduction and classification performance were evaluated on the Classic3 dataset.
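
The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rule, the ranks, or any normalization, so everything below is an assumption made for illustration: the standard multiplicative updates for squared error, a toy 9-word by 6-document count matrix, hypothetical function names (`nmf`, `two_stage_nmf`), and cluster assignment by the largest coefficient in the second-stage factor. It uses plain Python lists to stay dependency-free and is not the authors' implementation.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_b] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative V (m x n) as W (m x rank) times H (rank x n).

    Assumption: multiplicative updates for the squared-error objective;
    the abstract does not say which NMF variant the paper uses.
    """
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

def two_stage_nmf(V, n_features, n_clusters):
    """Stage 1 reduces the word dimension; stage 2 clusters the documents."""
    # Stage 1: words x docs  ->  (words x n_features) and (n_features x docs)
    W1, H1 = nmf(V, n_features, seed=0)
    # Stage 2: factor the reduced document features with rank = number of clusters
    W2, H2 = nmf(H1, n_clusters, seed=1)
    # Assumed assignment rule: each document goes to its largest stage-2 coefficient
    n_docs = len(H2[0])
    labels = [max(range(n_clusters), key=lambda k: H2[k][j]) for j in range(n_docs)]
    return W1, H1, W2, H2, labels

# Toy word-document count matrix: rows are words, columns are documents.
# Documents 0-1, 2-3, and 4-5 use three mostly disjoint vocabularies.
V = [
    [3, 2, 0, 0, 0, 0],
    [2, 3, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 1],
    [0, 0, 3, 2, 0, 0],
    [0, 0, 2, 3, 0, 0],
    [0, 1, 2, 1, 0, 0],
    [0, 0, 0, 0, 3, 2],
    [0, 0, 0, 0, 2, 3],
    [0, 0, 0, 1, 1, 2],
]

W1, H1, W2, H2, labels = two_stage_nmf(V, n_features=4, n_clusters=3)
print("reduced feature dimension:", len(H1), "for", len(H1[0]), "documents")
print("cluster label per document:", labels)
```

Here the first stage shrinks each document from 9 word counts to 4 non-negative features, mirroring on a small scale the reduction from a few thousand to a few hundred dimensions. Because NMF depends on its random initialization, the labels can vary with the seeds and should be read as arbitrary cluster indices.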