A comparative study on text representation schemes in text categorization

  • Authors:
  • Fengxi Song; Shuhai Liu; Jingyu Yang

  • Affiliations:
  • Nanjing University of Science and Technology, Department of Computer Science, China (all authors)

  • Venue:
  • Pattern Analysis & Applications
  • Year:
  • 2005

Abstract

It is well known that the classification effectiveness of a text categorization system does not depend on the learning algorithm alone; text representation factors also play a part. This paper examines how the effectiveness of text classifiers is linked to five text representation factors: “stop words removal”, “word stemming”, “indexing”, “weighting”, and “normalization”. Statistical analyses of the experimental results show that performing “normalization” always improves the effectiveness of text classifiers significantly. The effects of the other factors are not as great as expected. Contrary to common expectation, a simple binary indexing method can sometimes be helpful for text categorization.
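
To make the five factors concrete, here is a minimal Python sketch that exposes each one as a switch. The abstract does not specify the schemes the paper compares, so everything here is an assumption chosen for illustration: the tiny stop list, the toy suffix stripper `crude_stem` (a stand-in for a real stemmer), term frequency versus binary indexing, a smoothed IDF weight, and L2 length normalization. The function name `represent` and its parameters are hypothetical.

```python
import math
from collections import Counter

# Illustrative stop list (assumption); not the list used in the paper.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}


def crude_stem(word):
    """Toy suffix stripper (assumption) standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def represent(docs, remove_stop=True, stem=True, binary=False,
              idf_weight=True, normalize=True):
    """Turn raw strings into sparse {term: weight} vectors.

    Each keyword argument toggles one of the five representation factors.
    """
    tokenized = []
    for doc in docs:
        tokens = doc.lower().split()
        if remove_stop:                      # factor 1: stop words removal
            tokens = [t for t in tokens if t not in STOP_WORDS]
        if stem:                             # factor 2: word stemming
            tokens = [crude_stem(t) for t in tokens]
        tokenized.append(tokens)

    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, tf in Counter(tokens).items():
            # factor 3: indexing (binary presence vs. raw term frequency)
            value = 1.0 if binary else float(tf)
            if idf_weight:                   # factor 4: weighting (IDF here)
                value *= math.log(n_docs / doc_freq[term]) + 1.0
            vec[term] = value
        if normalize:                        # factor 5: L2 normalization
            norm = math.sqrt(sum(v * v for v in vec.values()))
            if norm > 0:
                vec = {t: v / norm for t, v in vec.items()}
        vectors.append(vec)
    return vectors
```

Calling `represent(docs, binary=True, idf_weight=False)` gives a normalized binary representation, while the defaults give a normalized TF-IDF one, so the same corpus can be encoded under different factor combinations and passed to any classifier for comparison.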