Text classification based on the bias of word frequency over categories

  • Authors: Makoto Suzuki

  • Affiliations: Department of Information Science, Shonan Institute of Technology, Fujisawa, Kanagawa, Japan

  • Venue: AIA'06: Proceedings of the 24th IASTED International Conference on Artificial Intelligence and Applications
  • Year: 2006

Abstract

In automatic text classification, such as sorting newspaper articles into predefined categories like politics and sports, the crucial step is selecting appropriate keywords. Traditional classification methods based on the vector space model emphasize frequent words, so low-frequency words tend to be disregarded. However, low-frequency words are often effective for classification. For instance, technical terms appear in specific categories, so their overall frequencies are generally low even though they are effective keywords. In this paper, we propose two text classification methods, the NDF method and the accumulation method, both based on the bias of the word frequency distribution over categories. Our experiments show that the accumulation method outperforms a traditional method based on the vector space model.
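
The abstract does not define the NDF method or the accumulation method, so the following Python sketch illustrates only the underlying observation: a rare word can be strongly biased toward one category while a frequent word is spread across all of them. It is not the paper's method. The function name `category_bias`, the share-of-occurrences score, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def category_bias(docs):
    """Return {word: {category: share of that word's occurrences}}.

    `docs` is a list of (category, tokens) pairs. The score is a simple
    illustrative measure of bias, not the NDF or accumulation method.
    """
    counts = defaultdict(Counter)  # word -> Counter mapping category -> occurrences
    for category, tokens in docs:
        for tok in tokens:
            counts[tok][category] += 1

    bias = {}
    for word, per_cat in counts.items():
        total = sum(per_cat.values())
        bias[word] = {cat: n / total for cat, n in per_cat.items()}
    return bias


# Hypothetical toy corpus with two categories.
docs = [
    ("politics", "the minister said the vote was close".split()),
    ("politics", "the vote on the bill".split()),
    ("sports", "the striker said the offside call was wrong".split()),
    ("sports", "the match was close".split()),
]

bias = category_bias(docs)
print(bias["offside"])  # occurs once, entirely in one category
print(bias["the"])      # occurs often, split across both categories
```

In this toy corpus, "offside" occurs only once but falls entirely in the sports category, whereas "the" is the most frequent word and is split almost evenly between the two. A weighting scheme driven by raw frequency would favor "the"; a bias-based score favors "offside".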