Non-negative matrix factorization based text mining: feature extraction and classification

  • Authors:
  • P. C. Barman;Nadeem Iqbal;Soo-Young Lee

  • Affiliations:
  • Department of BioSystems, Korea Advanced Institute of Science and Technology, Brain Science Research Center and Computational NeuroSystems Lab, Daejeon, Republic of Korea (all three authors)

  • Venue:
  • ICONIP'06 Proceedings of the 13th international conference on Neural Information Processing - Volume Part II
  • Year:
  • 2006

Abstract

Unlabeled document and text collections keep growing, and mining such data sets is a challenging task. Using the simple word-document frequency matrix as the feature space makes the mining process more complex. Text documents are often represented as high-dimensional sparse vectors, with a few thousand dimensions and a sparsity of about 95 to 99%, which significantly affects the efficiency and the results of the mining process. In this paper, we propose a two-stage Non-negative Matrix Factorization (NMF): in the first stage, we try to extract uncorrelated basis probabilistic document feature vectors by significantly reducing the dimension of the word-document frequency feature vectors from a few thousand to a few hundred; the second stage performs clustering or classification. With the proposed approach, a clustering or classification accuracy of more than 98.5% has been observed. The dimension reduction and classification performance were evaluated on the Classic3 dataset.
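
The two-stage idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption, since the abstract does not give the update rules or settings: the sketch uses the standard multiplicative updates for the Frobenius-norm objective, a toy 6-word by 6-document count matrix, hypothetical function names (`nmf`, `two_stage_nmf`), hypothetical ranks (3 features, 2 clusters), and an argmax rule for assigning cluster labels. It does not reproduce the paper's probabilistic normalization or its Classic3 results.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate non-negative V (m x n) as W @ H with W (m x rank), H (rank x n).

    Uses multiplicative updates for the Frobenius-norm objective
    (an assumed choice; the abstract does not name the update rule).
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def two_stage_nmf(V, n_features, n_clusters, n_iter=200, seed=0):
    """Stage 1 reduces the word dimension; stage 2 groups documents.

    Hypothetical helper: the argmax label rule is an illustrative assumption.
    """
    # Stage 1: words x documents -> n_features x documents.
    _, H1 = nmf(V, n_features, n_iter, seed)
    # Stage 2: factor the reduced features again into n_clusters components.
    _, H2 = nmf(H1, n_clusters, n_iter, seed + 1)
    # Assign each document to the component with the largest coefficient.
    labels = H2.argmax(axis=0)
    return H1, labels


# Toy word-document frequency matrix (rows: words, columns: documents).
V = np.array(
    [
        [3, 2, 4, 0, 0, 0],
        [2, 3, 3, 0, 0, 0],
        [4, 1, 2, 0, 0, 1],
        [0, 0, 0, 5, 3, 2],
        [0, 0, 1, 2, 4, 3],
        [0, 0, 0, 3, 2, 4],
    ],
    dtype=float,
)

H1, labels = two_stage_nmf(V, n_features=3, n_clusters=2)
print("reduced feature matrix shape:", H1.shape)
print("document cluster labels:", labels)
```

On a real collection the first-stage rank would be a few hundred rather than 3, and the input would normally be stored as a sparse matrix, given the 95 to 99% sparsity mentioned above.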