Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation

Authors:
Conrad Sanderson; Simon Guenter
Affiliations:
Australian National University, Canberra, Australia and National ICT Australia, Australia; Australian National University, Canberra, Australia and National ICT Australia, Australia
Venue:
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Year:
2006

Abstract

We present an investigation of recently proposed character and word sequence kernels for the task of authorship attribution based on relatively short texts. Performance is compared with two corresponding probabilistic approaches based on Markov chains. Several configurations of the sequence kernels are studied on a relatively large dataset (50 authors), where each author covered several topics. Utilising Moffat smoothing, the two probabilistic approaches obtain similar performance, which in turn is comparable to that of character sequence kernels and is better than that of word sequence kernels. The results further suggest that when using a realistic setup that takes into account the case of texts that were not written by any of the hypothesised authors, the amount of training material has more influence on discrimination performance than the amount of test material. Moreover, we show that the recently proposed author unmasking approach is less useful when dealing with short texts.
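As a rough illustration of the character-level Markov chain idea mentioned in the abstract, the sketch below trains one character transition model per author and attributes a text to the author whose model gives it the highest average log-probability. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: it uses simple additive smoothing in place of Moffat smoothing, the function names (`train_char_markov`, `log_likelihood`, `attribute`) are hypothetical, and the training strings are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train_char_markov(text, order=2):
    """Count, for each context of `order` characters, which character follows it."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts


def log_likelihood(text, counts, alphabet_size, order=2, alpha=1.0):
    """Average per-character log-probability of `text` under one author's model.

    Assumption: additive (Laplace-style) smoothing with constant `alpha`,
    used here as a simple stand-in for the Moffat smoothing named in the abstract.
    """
    total = 0.0
    n = 0
    for i in range(order, len(text)):
        ctx = counts.get(text[i - order:i], Counter())
        numerator = ctx[text[i]] + alpha
        denominator = sum(ctx.values()) + alpha * alphabet_size
        total += math.log(numerator / denominator)
        n += 1
    return total / max(n, 1)


def attribute(text, models, alphabet_size, order=2):
    """Return the author whose model scores `text` highest."""
    return max(
        models,
        key=lambda author: log_likelihood(text, models[author], alphabet_size, order),
    )


# Toy data (invented for illustration only).
training = {"A": "abababababab", "B": "xyzxyzxyzxyz"}
alphabet_size = len(set("".join(training.values())))
models = {author: train_char_markov(text) for author, text in training.items()}
predicted = attribute("ababab", models, alphabet_size)
```

Averaging the log-probability over characters keeps scores comparable across test texts of different lengths. Handling texts written by none of the candidate authors, as the abstract discusses, would additionally need a rejection threshold on the score, which this sketch omits.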