Author identification: Using text sampling to handle the class imbalance problem

Authors:
Efstathios Stamatatos
Affiliations:
Department of Information and Communication Systems Engineering, University of the Aegean, Karlovassi, Samos 83200, Greece
Venue:
Information Processing and Management: an International Journal
Year:
2008


Abstract

Authorship analysis of electronic texts assists digital forensics and anti-terror investigations. Author identification can be seen as a single-label, multi-class text categorization problem. Very often, there are extremely few training texts for at least some of the candidate authors, or the length of the available training texts varies significantly across candidate authors. Moreover, in this task the distribution of training texts over the classes usually bears no similarity to that of the test texts; that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to the size of the class, thus producing a fairer classification model. Hence, minority classes can be segmented into many short samples and majority classes into fewer and longer samples. We explore text sampling methods in order to construct a training set according to a desired distribution over the classes. Essentially, text sampling provides new synthetic data that artificially increase the training size of a class. Based on two text corpora in two languages, namely newswire stories in English and newspaper reportage in Arabic, we present a series of authorship identification experiments on various multi-class imbalanced cases that reveal the properties of the presented methods.
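
The segmentation idea in the abstract can be illustrated with a minimal sketch. It assumes one simple variant that the abstract does not spell out: every author receives the same number of training samples, so an author with little text gets short samples and an author with much text gets long ones. The function names `segment_text` and `balance_by_segmentation`, the whitespace tokenization, and the `samples_per_author` parameter are hypothetical choices made for illustration, not the paper's implementation.

```python
from typing import Dict, List


def segment_text(words: List[str], n_samples: int) -> List[List[str]]:
    """Split a word list into n_samples contiguous, nearly equal chunks.

    Hypothetical helper: the chunk count is capped at the number of
    words, and any remainder is spread over the first chunks.
    """
    if not words:
        return []
    n_samples = max(1, min(n_samples, len(words)))
    base, extra = divmod(len(words), n_samples)
    chunks, start = [], 0
    for i in range(n_samples):
        end = start + base + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks


def balance_by_segmentation(
    texts_by_author: Dict[str, List[str]], samples_per_author: int
) -> Dict[str, List[str]]:
    """Give each author the same number of training samples.

    Assumption for this sketch: all training texts of an author are
    concatenated and tokenized on whitespace, so sample length is
    proportional to the amount of text available for that author.
    """
    balanced = {}
    for author, texts in texts_by_author.items():
        words = " ".join(texts).split()
        chunks = segment_text(words, samples_per_author)
        balanced[author] = [" ".join(chunk) for chunk in chunks]
    return balanced
```

With 100 words for one author and 10 for another, asking for 5 samples per author yields 20-word samples for the first and 2-word samples for the second, so both classes contribute equally many training instances.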