Distributional word clusters vs. words for text categorization

  • Authors: Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, Yoad Winter

  • Affiliations:
      Ron Bekkerman: Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel
      Ran El-Yaniv: Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel
      Naftali Tishby: School of Computer Science and Engineering and Center for Neural Computation, The Hebrew University, Jerusalem 91904, Israel
      Yoad Winter: Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel

  • Venue: The Journal of Machine Learning Research
  • Year: 2003

Abstract

We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
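
To make the word-cluster representation concrete, here is a minimal Python sketch. It is not the authors' procedure: the corpus is invented toy data, the function names are hypothetical, and the clustering step is a simple greedy agglomerative stand-in that merges the two word groups whose class distributions p(c|w) are closest under a weighted Jensen-Shannon divergence. It only illustrates the general idea of grouping words by their distribution over categories and then counting clusters instead of words.

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus (invented for illustration; not from the paper).
DOCS = [
    ("hockey goal puck team", "sport"),
    ("team goal score puck", "sport"),
    ("hockey score team", "sport"),
    ("disk driver memory cpu", "tech"),
    ("cpu memory disk", "tech"),
    ("driver cpu disk memory", "tech"),
]


def word_class_stats(docs):
    """Estimate p(w) and p(c|w) from raw word/label co-occurrence counts."""
    joint = defaultdict(Counter)  # joint[word][label] = count
    for text, label in docs:
        for w in text.split():
            joint[w][label] += 1
    total = sum(sum(c.values()) for c in joint.values())
    p_w = {w: sum(c.values()) / total for w, c in joint.items()}
    p_c_given_w = {
        w: {label: n / sum(c.values()) for label, n in c.items()}
        for w, c in joint.items()
    }
    return p_w, p_c_given_w


def kl(p, q):
    """Kullback-Leibler divergence; q must be positive wherever p is."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def merge_cost(mass_a, dist_a, mass_b, dist_b):
    """Weighted Jensen-Shannon divergence scaled by the merged mass."""
    mass = mass_a + mass_b
    wa, wb = mass_a / mass, mass_b / mass
    keys = set(dist_a) | set(dist_b)
    mix = {k: wa * dist_a.get(k, 0.0) + wb * dist_b.get(k, 0.0) for k in keys}
    cost = mass * (wa * kl(dist_a, mix) + wb * kl(dist_b, mix))
    return cost, mix


def cluster_words(p_w, p_c_given_w, k):
    """Greedily merge the cheapest pair of word groups until k remain."""
    clusters = [({w}, p_w[w], p_c_given_w[w]) for w in sorted(p_w)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost, mix = merge_cost(clusters[i][1], clusters[i][2],
                                       clusters[j][1], clusters[j][2])
                if best is None or cost < best[0]:
                    best = (cost, i, j, mix)
        _, i, j, mix = best
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] + clusters[j][1], mix)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return {w: idx for idx, (words, _, _) in enumerate(clusters) for w in words}


def cluster_features(text, word_to_cluster, k):
    """Represent a document by cluster counts instead of word counts."""
    vec = [0] * k
    for w in text.split():
        if w in word_to_cluster:  # words unseen in training are ignored
            vec[word_to_cluster[w]] += 1
    return vec


p_w, p_c_given_w = word_class_stats(DOCS)
word_to_cluster = cluster_words(p_w, p_c_given_w, k=2)
print(cluster_features("hockey puck cpu", word_to_cluster, 2))
```

On this toy corpus a nine-word vocabulary collapses to a two-dimensional document vector. Vectors of this kind could then be passed to any SVM implementation in place of bag-of-words counts, which is the comparison the abstract describes.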