Using relative entropy for authorship attribution

Authors:
Ying Zhao;Justin Zobel;Phil Vines
Affiliations:
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia;School of Computer Science and Information Technology, RMIT University, Melbourne, Australia;School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Venue:
AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology
Year:
2006


Abstract

Authorship attribution is the task of deciding who wrote a particular document. Several attribution approaches have been proposed in recent research, but none is particularly satisfactory: some are ad hoc, and most have shortcomings in scalability, effectiveness, and efficiency. In this paper, we propose a principled approach, motivated by information theory, that identifies authors based on elements of writing style. We use the Kullback-Leibler divergence, a measure of how different two distributions are, and explore several approaches to tokenizing documents to extract style markers. We examine the performance of our approach on several data collections. We find that it is as effective as the best existing attribution methods for two-class attribution and superior for multi-class attribution. It also has lower computational cost and is cheaper to train. Finally, our results suggest that this approach is a promising alternative for other categorization problems.
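
As an illustration of the general idea only, the following minimal sketch attributes a document to the author whose token distribution has the smallest Kullback-Leibler divergence from the document's distribution. The abstract does not specify the tokenization, the smoothing, or the direction of the divergence, so everything here is an assumption: the function names (token_distribution, kl_divergence, attribute), the additive smoothing with parameter alpha, the small fixed vocabulary of style markers, and the choice of KL(document || author) are all hypothetical and may differ from what the paper actually uses.

```python
import math
from collections import Counter


def token_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed relative frequencies over a fixed vocabulary.

    Assumption: additive (add-alpha) smoothing, chosen only so that no
    probability is zero; the paper's smoothing method may differ.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) = sum over w of p(w) * log(p(w) / q(w)), in nats."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def attribute(document_tokens, author_tokens, vocabulary):
    """Return the author with the smallest divergence, plus all scores."""
    doc = token_distribution(document_tokens, vocabulary)
    scores = {
        author: kl_divergence(doc, token_distribution(tokens, vocabulary))
        for author, tokens in author_tokens.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical style markers and toy training text, for illustration only.
    vocabulary = {"the", "of", "and", "to"}
    author_tokens = {
        "A": "the of the the of the".split(),
        "B": "and to and and to to and".split(),
    }
    best, scores = attribute("the of the the".split(), author_tokens, vocabulary)
    print(best, scores)
```

Because each author is represented by a single distribution, adding an author in this sketch means computing one more distribution and one more divergence per document, rather than retraining a classifier.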