Effects of Term Distributions on Binary Classification

  • Authors:
  • Verayuth Lertnattee;Thanaruk Theeramunkong

  • Venue:
  • IEICE - Transactions on Information and Systems
  • Year:
  • 2007

Abstract

Text classification is an important tool for supporting decision making. Recently, in addition to term frequency and inverse document frequency, term distributions have been shown to improve classification accuracy in multi-class classification. This paper investigates the performance of these term distributions on binary classification using a centroid-based approach. In such a one-against-the-rest setting, there are only two classes: the positive (focused) class and the negative class. To improve performance, a so-called hierarchical EM method is applied to cluster the negative class, which is usually much larger and more diverse than the positive one, into several homogeneous groups. Experimental results on two collections of web pages, Drug Information (DI) and WebKB, show the merits of term distributions and clustering for binary classification. The performance of the proposed method is also investigated on the Thai Herbal collection, whose texts are written in Thai.
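
The idea of centroid-based binary classification with a clustered negative class can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes plain TF-IDF weighting (the term-distribution factors studied in the paper are not reproduced), cosine similarity, and a negative class that has already been split into groups by some clustering step (the paper uses hierarchical EM, which is omitted here). All function names, group labels, and toy documents are hypothetical.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)


def to_vector(tokens, idf):
    """Build a unit-length TF-IDF vector, skipping terms unseen in training."""
    counts = Counter(tokens)
    return normalize({t: c * idf[t] for t, c in counts.items() if t in idf})


def centroid(vectors):
    """Sum the member vectors and normalize the result."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return normalize(total)


def train(groups):
    """groups maps a label to a list of tokenized documents.

    Assumption: 'pos' is the focused class and every other label is one
    pre-computed cluster of the negative class.
    """
    docs = [doc for group in groups.values() for doc in group]
    df = Counter(term for doc in docs for term in set(doc))
    # Adding 1.0 keeps terms that occur in every document from vanishing.
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in df}
    centroids = {
        label: centroid([to_vector(doc, idf) for doc in group])
        for label, group in groups.items()
    }
    return centroids, idf


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def classify(vec, centroids):
    """Return the label of the most similar centroid."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def is_positive(tokens, centroids, idf):
    """Binary decision: positive only if the nearest centroid is 'pos'."""
    return classify(to_vector(tokens, idf), centroids) == "pos"


# Toy data: one positive class and two negative clusters (all hypothetical).
groups = {
    "pos": [["aspirin", "dose", "tablet"], ["dose", "tablet", "side", "effect"]],
    "neg_sports": [["football", "goal", "match"], ["match", "team", "goal"]],
    "neg_finance": [["stock", "market", "price"], ["market", "bank", "price"]],
}
centroids, idf = train(groups)

print(is_positive(["tablet", "dose"], centroids, idf))
print(is_positive(["stock", "price", "market"], centroids, idf))
```

Representing the negative class by several centroids rather than one means a document only has to resemble one homogeneous negative group to be rejected, which is the motivation the abstract gives for clustering a large, diverse negative class.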