The impact of preprocessing on text classification

Authors:
Alper Kursat Uysal; Serkan Gunal
Venue:
Information Processing and Management: an International Journal
Year:
2014

Abstract

Preprocessing is one of the key components of a typical text classification framework. This paper extensively examines the impact of preprocessing on text classification with respect to classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two domains, namely e-mail and news, and in two languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, the possible interactions among these tasks, and the dependence of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide a significant improvement in classification accuracy, depending on the domain and language studied.
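The idea of evaluating "all possible combinations" of preprocessing tasks can be illustrated with a minimal Python sketch. The abstract does not name the tasks, so the three on/off tasks here (`lowercase`, `remove_stopwords`, `stem`), the tiny stop-word set, and the crude `naive_stem` helper are all hypothetical stand-ins, and vocabulary size is used only as a rough proxy for feature dimension. This is not the authors' experimental setup.

```python
from itertools import product

# Hypothetical task names; the abstract does not list the tasks it evaluates.
TASKS = ("lowercase", "remove_stopwords", "stem")

# Tiny illustrative stop-word set, not a real lexicon.
STOPWORDS = {"the", "a", "is", "of", "on"}


def naive_stem(token):
    """Crude suffix stripper, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase, remove_stopwords, stem):
    """Whitespace-tokenize the text, then apply each enabled task in order."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


def all_settings():
    """Return every on/off combination of the tasks (2 ** len(TASKS) settings)."""
    return [
        dict(zip(TASKS, flags))
        for flags in product((False, True), repeat=len(TASKS))
    ]


def vocabulary_sizes(docs):
    """Map each combination (tuple of enabled task names) to its vocabulary size."""
    sizes = {}
    for setting in all_settings():
        vocab = set()
        for doc in docs:
            vocab.update(preprocess(doc, **setting))
        enabled = tuple(name for name in TASKS if setting[name])
        sizes[enabled] = len(vocab)
    return sizes
```

In a fuller experiment, each setting would feed a feature selector and a classifier, and accuracy would be compared across settings for each domain and language rather than vocabulary size alone.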