Forensic Authorship Attribution Using Compression Distances to Prototypes

Authors:
Maarten Lambers; Cor J. Veenman
Affiliations:
Digital Technology & Biometrics Department, Netherlands Forensic Institute, The Hague, The Netherlands; Intelligent Systems Lab, University of Amsterdam, Amsterdam, The Netherlands and Digital Technology & Biometrics Department, Netherlands Forensic Institute, The Hague, The Netherlands
Venue:
IWCF '09 Proceedings of the 3rd International Workshop on Computational Forensics
Year:
2009

Abstract

In several situations, authors prefer to hide their identity. In forensic applications, examples include extortion and threats sent in emails and forum messages. Such messages are easily manipulated, so metadata referring to names and addresses is unreliable at best. In this paper, we propose a method to identify the authors of short informal messages based solely on the text content. The method uses compression distances between texts as features. Using these features, a supervised classifier is learned on a training set with known authors. For the experiments, we prepared a dataset from Dutch newsgroup texts. We compared our proposed method with several state-of-the-art methods on the identification of messages from up to 50 authors. Our method clearly outperformed the other methods: in 65% of the cases the author was correctly identified, and in 88% of the cases the true author was in the top 5 of the produced ranked list.
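The idea of using compression distances to a set of texts as features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance, the compressor, the prototype selection, or the classifier. The sketch assumes the normalized compression distance (NCD) with zlib as the compressor, uses the training texts themselves as prototypes, and substitutes a simple nearest-centroid rule for the supervised classifier. The names `compressed_size`, `ncd`, `features`, and `rank_authors` are hypothetical.

```python
import zlib
from collections import defaultdict


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (assumed distance, not confirmed by the abstract)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def features(text: str, prototypes: list[str]) -> list[float]:
    """Represent a text by its compression distances to each prototype text."""
    return [ncd(text, p) for p in prototypes]


def rank_authors(query: str, train: list[tuple[str, str]], prototypes: list[str]) -> list[str]:
    """Rank authors for a query text, closest first.

    A nearest-centroid rule in the distance-feature space stands in for
    the supervised classifier; it is an illustrative choice only.
    """
    sums: dict[str, list[float]] = {}
    counts: dict[str, int] = defaultdict(int)
    for author, text in train:
        vec = features(text, prototypes)
        if author in sums:
            sums[author] = [s + v for s, v in zip(sums[author], vec)]
        else:
            sums[author] = vec
        counts[author] += 1
    # Mean feature vector per author.
    centroids = {a: [s / counts[a] for s in vec] for a, vec in sums.items()}
    q = features(query, prototypes)

    def squared_distance(author: str) -> float:
        return sum((u - v) ** 2 for u, v in zip(q, centroids[author]))

    return sorted(centroids, key=squared_distance)


if __name__ == "__main__":
    # Toy, made-up messages purely for illustration.
    train = [
        ("a", "ik denk dat het morgen weer gaat regenen in den haag, zoals altijd eigenlijk"),
        ("a", "zoals altijd eigenlijk gaat het morgen weer regenen, ik denk dat het zo is"),
        ("b", "LOL!!! check deze link ff, echt te gek man, supervet!!!"),
        ("b", "echt te gek man!!! supervet, check ff deze link LOL"),
    ]
    prototypes = [text for _, text in train]
    print(rank_authors("ik denk dat het morgen weer gaat regenen, zoals altijd", train, prototypes))
```

Returning a ranked list rather than a single label mirrors the evaluation in the abstract, where both the top-ranked author and the top 5 are reported. On very short messages, compressor overhead makes the distances noisy, so a sketch like this only indicates the shape of the pipeline.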