Authorship attribution and verification with many authors and limited data

Authors:
Kim Luyckx; Walter Daelemans
Affiliations:
University of Antwerp, Antwerp, Belgium; University of Antwerp, Antwerp, Belgium
Venue:
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Year:
2008

Abstract

Most studies in statistical or machine-learning-based authorship attribution focus on two or only a few authors. This leads to an overestimation of the importance of the features that are extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use amounts of training data that are unrealistic for the situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in those situations. A more realistic interpretation of the task is as an authorship verification problem, which we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, the effect of many authors on feature selection and learning, and we show the robustness of a memory-based learning approach to authorship attribution and verification with many authors and limited training data, compared to eager learning methods such as SVMs and maximum entropy learning.
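
The verification setup described in the abstract (one target author as the positive class, texts from all other authors pooled as negative examples, and a memory-based learner that classifies by nearest stored examples) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the character-trigram features, the cosine similarity, the toy corpus, and the function names `char_ngrams`, `build_verification_set`, and `verify` are all hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Relative frequencies of character n-grams (an assumed feature set)."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}


def cosine(a, b):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_verification_set(corpus, target):
    """Label the target author's texts True; pool all other authors as False."""
    return [(char_ngrams(text), author == target) for author, text in corpus]


def verify(training, text, k=1):
    """Memory-based decision: majority vote among the k most similar examples."""
    query = char_ngrams(text)
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return sum(votes) * 2 > len(votes)


# Toy corpus (hypothetical data): author "A" is the verification target.
corpus = [
    ("A", "the cat sat on the mat and the cat slept"),
    ("A", "the cat ran to the mat and sat"),
    ("B", "quantum flux oscillates rapidly within vacuum"),
    ("C", "stock prices jumped sharply following quarterly earnings"),
]
training = build_verification_set(corpus, "A")
accepted = verify(training, "the cat sat and slept on the mat")
rejected = verify(training, "quantum vacuum flux oscillates")
```

In this sketch the learner stores every training example and defers all generalization to query time, which is what distinguishes memory-based (lazy) learning from eager learners such as SVMs. With many pooled negative authors the classes become heavily imbalanced, so a realistic setup would need to account for that when choosing `k`.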