Class normalization in centroid-based text categorization

  • Authors:
  • Verayuth Lertnattee;Thanaruk Theeramunkong

  • Affiliations:
  • Faculty of Pharmacy, Silpakorn University, Sanamchandra Campus, Muang, Nakorn Pathom 73000, Thailand;Information Technology Program, Sirindhorn International, Institute of Technology, 131 Moo 5 Tiwanont Rd, Bangkadi, Muang, Pathum Thani 12000, Thailand

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2006

Quantified Score

Hi-index 0.07

Visualization

Abstract

Centroid-based categorization is one of the most popular algorithms in text classification. In this approach, normalization is an important factor to improve performance of a centroid-based classifier when documents in text collection have quite different sizes and/or the numbers of documents in classes are unbalanced. In the past, most researchers applied document normalization, e.g., document-length normalization, while some consider a simple kind of class normalization, so-called class-length normalization, to solve the unbalancedness problem. However, there is no intensive work that clarifies how these normalizations affect classification performance and whether there are any other useful normalizations. The purpose of this paper is three folds; (1) to investigate the effectiveness of document- and class-length normalizations on several data sets, (2) to evaluate a number of commonly used normalization functions and (3) to introduce a new type of class normalization, called term-length normalization, which exploits term distribution among documents in the class. The experimental results show that a classifier with weight-merge-normalize approach (class-length normalization) performs better than one with weight-normalize-merge approach (document-length normalization) for the data sets with unbalanced numbers of documents in classes, and is quite competitive for those with balanced numbers of documents. For normalization functions, the normalization based on term weighting performs better than the others on average. For term-length normalization, it is useful for improving classification accuracy. The combination of term- and class-length normalizations outperforms pure class-length normalization and pure term-length normalization as well as unnormalization with the gaps of 4.29%, 11.50%, 30.09%, respectively.