Authorship attribution based on a probabilistic topic model

Authors:
Jacques Savoy
Affiliations:
Computer Science Department, University of Neuchâtel, Rue Emile Argand 11, 2000 Neuchâtel, Switzerland
Venue:
Information Processing and Management: an International Journal
Year:
2013

Abstract

This paper describes, evaluates and compares the use of latent Dirichlet allocation (LDA) as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topics, with each topic specifying a distribution over words. Using author profiles (the aggregation of all texts written by the same writer), we suggest computing the distance between each profile and a disputed text to determine its possible writer. This distance is based on the difference between the two topic distributions. To evaluate different attribution schemes, we carried out an experiment based on 5408 newspaper articles (Glasgow Herald) written by 20 distinct authors. To complement this experiment, we used 4326 articles extracted from the Italian newspaper La Stampa and written by 20 journalists. This research demonstrates that the LDA-based classification scheme tends to outperform the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms. Compared to the Kullback-Leibler divergence, the LDA-based scheme can provide better effectiveness when considering a larger number of terms.
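
The attribution step described in the abstract, comparing the topic distribution of a disputed text with that of each author profile and choosing the closest profile, can be illustrated with a minimal sketch. The abstract does not state which distance is used, so the L1 distance (sum of absolute differences) is an assumption made here for illustration only. The function names `l1_distance` and `attribute`, the author labels, and all numeric values are hypothetical. The topic distributions are assumed to have been inferred already by an LDA implementation.

```python
from typing import Dict, List


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two topic distributions.

    Assumption: the abstract only says the distance is based on the
    difference between two topic distributions; L1 is one simple choice.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(disputed: List[float], profiles: Dict[str, List[float]]) -> str:
    """Return the author whose profile is closest to the disputed text."""
    return min(profiles, key=lambda author: l1_distance(disputed, profiles[author]))


# Hypothetical topic distributions over three topics (illustrative values only).
profiles = {
    "author_a": [0.70, 0.20, 0.10],
    "author_b": [0.10, 0.30, 0.60],
}
disputed_text = [0.60, 0.30, 0.10]

closest = attribute(disputed_text, profiles)
print(closest)
```

In this toy setting the disputed text is assigned to the profile with the smallest distance. Any other measure of difference between probability distributions could be substituted for `l1_distance` without changing the structure of the procedure.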