Theme word subspace method for text document categorization

  • Authors:
  • Zhou Xiaofei;Guo Li;Tan Jianlong;Jiang Wenhan

  • Affiliations:
  • Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China;Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China;Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China;First Research Institute of Ministry of Public Security, Beijing, China

  • Venue:
  • DM-IKM '12 Proceedings of the Data Mining and Intelligent Knowledge Management Workshop
  • Year:
  • 2012

Abstract

In this paper, we present a text document categorization method called Theme Word Subspace (TWS) learning, which uses theme words to jointly express class-semantic information for document classification. For each class corpus, the words with high probability in the topic structure are first extracted as theme words; these words then serve as theme elements that span a class subspace, jointly representing the semantics and distribution of the class. To categorize a text document, it is assigned to the nearest subspace, that is, the class whose theme words best represent the test document. In TWS, the L1 and L2 norms are used separately to measure the distance from a test document to each subspace. In experiments on a large Chinese text corpus, the proposed TWS learning methods exhibit comparable performance for text document categorization.
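
The nearest-subspace decision rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how theme words are turned into vectors or how topics are extracted, so the sketch assumes each theme word is already given as a numeric vector and covers only the L2 case, taking the distance to be the norm of the residual after orthogonal projection onto the span of a class's theme-word vectors. All names (`orthonormal_basis`, `l2_distance_to_subspace`, `classify`, `toy_classes`) and the toy data are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are already in the span
            basis.append([wi / norm for wi in w])
    return basis


def l2_distance_to_subspace(doc, basis):
    """L2 norm of the residual after projecting doc onto span(basis)."""
    residual = list(doc)
    for b in basis:
        c = dot(doc, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return math.sqrt(dot(residual, residual))


def classify(doc, class_theme_vectors):
    """Assign doc to the class whose theme-word subspace is nearest in L2."""
    bases = {
        label: orthonormal_basis(vectors)
        for label, vectors in class_theme_vectors.items()
    }
    return min(bases, key=lambda label: l2_distance_to_subspace(doc, bases[label]))


# Hypothetical toy data: each class is described by a few theme-word vectors.
toy_classes = {
    "sports": [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],
    "finance": [[0.0, 0.0, 1.0]],
}

predicted = classify([2.0, 1.0, 0.5], toy_classes)
```

In this toy setup the test vector lies close to the plane spanned by the "sports" theme-word vectors, so that class is selected. An L1 variant would replace the projection step with an L1-norm residual minimization, which generally requires a linear-programming or iterative solver and is omitted here.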