Text Sampling and Re-sampling for Imbalanced Authorship Identification Cases

Authors:
Efstathios Stamatatos
Affiliations:
Dept. of Information and Communication Systems Eng., University of the Aegean, 83200, Karlovassi, Greece, email: stamatatos@aegean.gr
Venue:
Proceedings of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence, August 29 -- September 1, 2006, Riva del Garda, Italy
Year:
2006

Abstract

Authorship identification can be seen as a single-label multi-class text categorization problem. Very often, extremely few training texts are available for at least some of the candidate authors. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into sub-samples according to the size of the class. Hence, minority classes can be segmented into many short samples and majority classes into fewer, longer samples. Moreover, we explore text re-sampling in order to construct a training set according to a desirable distribution over the classes. Essentially, text re-sampling can be viewed as providing new synthetic data that increase the training size of a class. Based on a corpus of newswire stories in English, we present authorship identification experiments on various multi-class imbalanced cases.
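
The segmentation and re-sampling ideas in the abstract can be illustrated with a minimal Python sketch. It is one simple reading of those ideas and not necessarily the scheme used in the paper: the function names (`segment`, `resample`, `balanced_training_set`), the token-list representation of texts, and the choice of giving every author the same number of samples are all assumptions made for illustration.

```python
import random


def segment(tokens, n_chunks):
    """Split a token list into n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(tokens)))
    base, extra = divmod(len(tokens), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks absorb the remainder, one token each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(tokens[start:end])
        start = end
    return chunks


def resample(tokens, chunk_len, n_samples, rng):
    """Draw n_samples contiguous windows at random offsets.

    Windows may overlap, so this can produce more samples than
    non-overlapping segmentation would allow (hypothetical scheme).
    """
    chunk_len = min(chunk_len, len(tokens))
    max_start = len(tokens) - chunk_len
    starts = [rng.randint(0, max_start) for _ in range(n_samples)]
    return [tokens[s:s + chunk_len] for s in starts]


def balanced_training_set(texts_by_author, samples_per_author):
    """Give every author the same number of samples.

    Sample length then tracks class size: authors with little text
    get short samples, authors with a lot of text get long ones.
    """
    return {
        author: segment(tokens, samples_per_author)
        for author, tokens in texts_by_author.items()
    }


if __name__ == "__main__":
    # Toy corpus: each author's training texts concatenated into one token list.
    corpus = {
        "minority_author": ["tok"] * 100,
        "majority_author": ["tok"] * 1000,
    }
    balanced = balanced_training_set(corpus, samples_per_author=10)
    for author, samples in balanced.items():
        print(author, len(samples), "samples of length", len(samples[0]))

    # Overlapping windows enlarge the minority class synthetically.
    extra = resample(corpus["minority_author"], 20, 50, random.Random(0))
    print(len(extra), "re-sampled windows of length", len(extra[0]))
```

In this sketch the class distribution of the training set is controlled by the number of samples requested per author, while `resample` shows how overlapping windows can enlarge a class beyond what plain segmentation provides.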