A comparative study on text representation schemes in text categorization

Authors:
Fengxi Song;Shuhai Liu;Jingyu Yang
Affiliations:
Nanjing University of Science and Technology, Department of Computer Science, China;Nanjing University of Science and Technology, Department of Computer Science, China;Nanjing University of Science and Technology, Department of Computer Science, China
Venue:
Pattern Analysis & Applications
Year:
2005

Citing 0
Cited 5

A study of local and global thresholding techniques in text categorization

AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61
A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Knowledge-Based Systems
Phoneme Based Representation for Vietnamese Web Page Classification

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
An empirical study on various text classifiers

Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
The impact of preprocessing on text classification

Information Processing and Management: an International Journal

Abstract

It is well known that the classification effectiveness of a text categorization system depends on more than the learning algorithm; text representation factors are also at work. This paper examines how the effectiveness of text classifiers is linked to five text representation factors: “stop words removal”, “word stemming”, “indexing”, “weighting”, and “normalization”. Statistical analyses of the experimental results show that performing “normalization” always improves the effectiveness of text classifiers significantly. The effects of the other factors are not as great as expected. Contrary to common sense, a simple binary indexing method can sometimes be helpful for text categorization.
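
To make three of the factors named in the abstract concrete, the following is a minimal sketch of stop-word removal, indexing (binary versus term frequency), and normalization. It is an illustration only, not the authors' experimental setup: the stop list, the function names `tokenize` and `represent`, and the choice of L2 (cosine) normalization are assumptions made here, and stemming and term weighting are left out to keep the example short.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop list (an assumption for illustration).
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def tokenize(text, remove_stop_words=True):
    """Lowercase, split on whitespace, and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def represent(text, indexing="tf", normalize=True, remove_stop_words=True):
    """Turn a document into a sparse term -> value vector.

    indexing="binary": 1.0 if the term occurs, regardless of how often.
    indexing="tf":     raw term frequency.
    normalize=True:    scale the vector to unit L2 length (assumed variant).
    """
    counts = Counter(tokenize(text, remove_stop_words))
    if indexing == "binary":
        vec = {term: 1.0 for term in counts}
    elif indexing == "tf":
        vec = {term: float(c) for term, c in counts.items()}
    else:
        raise ValueError(f"unknown indexing scheme: {indexing}")

    if normalize:
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {term: v / norm for term, v in vec.items()}
    return vec


if __name__ == "__main__":
    doc = "the cat sat in the cat box"
    print(represent(doc, indexing="binary", normalize=False))
    print(represent(doc, indexing="tf", normalize=True))
```

With binary indexing every surviving term gets the same value, while term-frequency indexing lets repeated terms dominate; normalization then removes the effect of document length, so long and short documents become comparable.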