Automatic word clustering for text categorization using global information

  • Authors:
  • Chen Wenliang; Chang Xingzhi; Wang Huizhen; Zhu Jingbo; Yao Tianshun

  • Affiliations:
  • Natural Language Processing Lab, Northeastern University, Shenyang, China (all authors)

  • Venue:
  • AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
  • Year:
  • 2004

Abstract

High dimensionality of the feature space and a shortage of training documents are the crucial obstacles for text categorization. To overcome these obstacles, this paper presents a cluster-based text categorization system that uses class distributional clustering of words. We propose a new clustering model that considers global information over all the clusters. The model can be understood as balancing all the clusters according to the number of words in them. It groups words into clusters based on the distribution of class labels associated with each word. Using these learned clusters as features, we develop a cluster-based classifier. Experimental results show that the proposed method performs better than three other text classifiers. The proposed model also outperforms a model that considers only the information of the two related clusters. In particular, it maintains good performance when the number of features is small and the training corpus is small.
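
The abstract does not give the formula for the global model, so the following is only a minimal sketch of the general idea of class distributional clustering, not the authors' method. It illustrates the pairwise variant that the abstract contrasts against, where a merge is scored from the two clusters involved alone. The weighted Jensen-Shannon divergence used as the merge cost, the greedy agglomerative loop, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def class_distributions(docs):
    """Estimate P(label | word) and word frequency from (tokens, label) pairs."""
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for tok in tokens:
            counts[tok][label] += 1
    totals = {w: sum(c.values()) for w, c in counts.items()}
    dists = {w: {lab: n / totals[w] for lab, n in c.items()}
             for w, c in counts.items()}
    return dists, totals


def kl(p, q):
    """KL divergence D(p || q); q must be non-zero wherever p is."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def merge_cost(p, wp, q, wq):
    """Weighted Jensen-Shannon divergence between two clusters (assumed cost).

    Returns the cost and the class distribution of the merged cluster.
    """
    a, b = wp / (wp + wq), wq / (wp + wq)
    labels = set(p) | set(q)
    mixed = {k: a * p.get(k, 0.0) + b * q.get(k, 0.0) for k in labels}
    return a * kl(p, mixed) + b * kl(q, mixed), mixed


def cluster_words(docs, k):
    """Greedily merge the cheapest pair of clusters until k clusters remain."""
    dists, totals = class_distributions(docs)
    # Each cluster is (words, class distribution, total word count).
    clusters = [([w], dists[w], totals[w]) for w in sorted(dists)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost, mixed = merge_cost(clusters[i][1], clusters[i][2],
                                         clusters[j][1], clusters[j][2])
                if best is None or cost < best[0]:
                    best = (cost, i, j, mixed)
        _, i, j, mixed = best
        merged = (clusters[i][0] + clusters[j][0], mixed,
                  clusters[i][2] + clusters[j][2])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(words) for words, _, _ in clusters]


# Toy corpus (made up for illustration): two classes, three words each.
toy_docs = [
    (["goal", "match"], "sports"),
    (["goal", "team"], "sports"),
    (["stock", "market"], "finance"),
    (["stock", "bank"], "finance"),
]
word_clusters = cluster_words(toy_docs, 2)
```

On this toy corpus, words that occur only in one class share the same class distribution, so they merge at near-zero cost and end up in the same cluster. A classifier would then count cluster occurrences instead of individual words, which is what reduces the feature space.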