Effect of term distributions on centroid-based text categorization

  • Authors:
  • Verayuth Lertnattee;Thanaruk Theeramunkong

  • Affiliations:
  • Information Technology Program, Sirindhorn International Institute of Technology, Bangkadi Campus, 131 Moo 5 Tiwanont Road, Bangkadi, Muang, Pathumthani 12000, Thailand;Information Technology Program, Sirindhorn International Institute of Technology, Bangkadi Campus, 131 Moo 5 Tiwanont Road, Bangkadi, Muang, Pathumthani 12000, Thailand

  • Venue:
  • Information Sciences—Informatics and Computer Science: An International Journal - Special issue: Informatics and computer science intelligent systems applications
  • Year:
  • 2004

Abstract

Most traditional text categorization approaches use term frequency (tf) and inverse document frequency (idf) to represent the importance of words and/or terms in classifying a text document. This paper describes an approach that applies term distributions, in addition to tf and idf, to improve the performance of centroid-based text categorization. Three types of term distributions, called inter-class, intra-class and in-collection distributions, are introduced. These distributions help increase classification accuracy by exploiting information about (1) term distribution among classes, (2) term distribution within a class and (3) term distribution in the whole collection of training data. In addition, this paper investigates how these term distributions contribute to the weight of each term in documents, e.g., whether a high term distribution of a word promotes or demotes the importance or classification power of that word. To this end, several centroid-based classifiers are constructed with different term weightings. Using various data sets, their performance is investigated and compared to a standard centroid-based classifier (TFIDF) and a centroid-based classifier modified with information gain. We also compare them to two well-known methods: k-NN and naïve Bayes. In addition to a unigram model of document representation, a bigram model is also explored. Finally, the effectiveness of term distributions in improving classification accuracy is explored with regard to the training set size and the number of classes.
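For orientation, the baseline that the abstract calls the standard centroid-based classifier can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes tf * log(N / df) weighting, unit-length document vectors, normalized summed centroids, and cosine similarity, and the function names (`tfidf_vectors`, `train_centroids`, `classify`) are hypothetical. The inter-class, intra-class and in-collection term distributions are not implemented here because the abstract does not give their formulas.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Weight each tokenized document by tf * log(N / df) (assumed idf form)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def normalize(vec):
    """Scale a sparse vector to unit length; all-zero vectors are returned as-is."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(docs, labels):
    """Build one centroid per class by summing normalized document vectors."""
    vectors, idf = tfidf_vectors(docs)
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in normalize(vec).items():
            sums[label][term] += weight
    centroids = {label: normalize(total) for label, total in sums.items()}
    return centroids, idf


def classify(doc, centroids, idf):
    """Assign the class whose centroid has the highest cosine similarity."""
    counts = Counter(doc)
    # Terms unseen in training have no idf and are ignored.
    vec = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})

    def cosine(centroid):
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy data, invented purely for illustration.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

centroids, idf = train_centroids(docs, labels)
print(classify(["goal", "team", "win"], centroids, idf))
```

A term-distribution variant in the spirit of the abstract would presumably change only the per-term weight computed in `tfidf_vectors`, leaving the centroid construction and cosine comparison unchanged.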