Fast dimension reduction for document classification based on imprecise spectrum analysis

  • Authors: Hu Guan; Jingyu Zhou; Bin Xiao; Minyi Guo; Tao Yang

  • Affiliations: Department of Computer Science, SJTU, Dongchuan 800, Shanghai, PR China; Department of Computer Science, SJTU, Dongchuan 800, Shanghai, PR China; Department of Computing, Hong Kong Polytechnic University, Hong Kong; Department of Computer Science, SJTU, Dongchuan 800, Shanghai, PR China; Department of Computer Science, University of California at Santa Barbara, USA

  • Venue: Information Sciences: an International Journal
  • Year: 2013

Quantified Score

Hi-index 0.07

Abstract

Latent Semantic Indexing (LSI) with Singular Value Decomposition (SVD) is an effective dimension reduction method for document classification and other information analysis tasks. The computational overhead of SVD is known to be a bottleneck in dealing with large data sets, and faster dimension reduction with competitive accuracy is desired in such a setting. This paper presents Imprecise Spectrum Analysis (ISA) to carry out fast dimension reduction for document classification. ISA follows the one-sided Jacobi method for computing SVD and simplifies its intensive orthogonality computation. It uses a representative matrix composed of top-k column vectors derived from the original feature vector space and reduces the dimension of a feature vector by computing its product with this representative matrix. The paper provides an analysis to show the approximation error and the rationale behind such a dimension reduction method. To further improve classification accuracy, this paper also presents a feature selection method in building the initial feature matrix and augments the representative matrix by including centroid vectors. Our extensive experimental results show that ISA is fast in handling large term-document feature matrices while delivering better or competitive classification accuracy for the tested benchmarks compared to LSI with SVD.
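
The projection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how ISA simplifies the orthogonality computation, so the code below is not the authors' algorithm: it uses the textbook one-sided (Hestenes) Jacobi rotations and simply caps the number of sweeps as a stand-in for an "imprecise" computation. The function names (`one_sided_jacobi`, `representative_matrix`, `reduce_vector`), the `sweeps` parameter, and the choice to rank columns by Euclidean norm are all assumptions made for illustration. The feature selection step and the centroid-vector augmentation mentioned in the abstract are not shown.

```python
import math


def one_sided_jacobi(A, sweeps=10, tol=1e-12):
    """Textbook one-sided Jacobi: rotate column pairs of A toward mutual
    orthogonality. A is a list of rows (m x n); returns a list of columns.
    A small `sweeps` value leaves the columns only approximately orthogonal
    (an assumed stand-in for ISA's simplification, not the paper's method)."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # this pair is already (numerically) orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    xp, xq = cols[p][i], cols[q][i]
                    cols[p][i] = c * xp - s * xq
                    cols[q][i] = s * xp + c * xq
        if not rotated:
            break  # a full sweep needed no rotation: converged
    return cols


def representative_matrix(A, k, sweeps=10):
    """Keep the k rotated columns with the largest norms, normalized to unit
    length. Returned as k rows of length m (hypothetical selection rule)."""
    cols = one_sided_jacobi(A, sweeps)
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    order = sorted(range(len(cols)), key=lambda j: norms[j], reverse=True)[:k]
    return [[x / norms[j] for x in cols[j]] for j in order if norms[j] > 0.0]


def reduce_vector(R, d):
    """Map an m-dimensional feature vector d to len(R) dimensions by taking
    its inner product with each representative vector."""
    return [sum(r_i * d_i for r_i, d_i in zip(row, d)) for row in R]


# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
A = [
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
]
R = representative_matrix(A, k=2)
reduced = reduce_vector(R, [1.0, 0.0, 1.0, 0.0])  # 4-dim vector -> 2-dim
```

When the sweeps run to convergence, the norms of the rotated columns are the singular values of the input matrix and the normalized columns are its left singular vectors, so the sketch reduces to an LSI-style projection. Lowering `sweeps` trades orthogonality of the representative vectors for less computation.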