This paper describes and evaluates latent Dirichlet allocation (LDA) as an approach to authorship attribution, and compares it with other attribution schemes. With this generative probabilistic topic model, each document is modeled as a mixture of topics, and each topic specifies a distribution over words. Each author is represented by a profile (the aggregation of all texts written by that author), and we suggest computing the distance between each profile and a disputed text to determine its probable writer. This distance is based on the difference between the two topic distributions. To evaluate different attribution schemes, we carried out an experiment based on 5408 newspaper articles (Glasgow Herald) written by 20 distinct authors. To complement this experiment, we used 4326 articles extracted from the Italian newspaper La Stampa and written by 20 journalists. This research demonstrates that the LDA-based classification scheme tends to outperform the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms. Compared to the Kullback-Leibler divergence, the LDA-based scheme can provide better effectiveness when considering a larger number of terms.
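The attribution step described above can be illustrated with a minimal sketch. It assumes that topic distributions have already been inferred by an LDA model for each author profile and for the disputed text. The abstract does not give the exact distance formula, so the L1 distance (sum of absolute differences) used here is an assumption, and the names `l1_distance`, `attribute`, `profiles` and `disputed`, as well as the numeric values, are hypothetical.

```python
from typing import Dict, List, Tuple


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two topic distributions.

    Assumption: the abstract only says the distance is "based on the
    difference between the two topic distributions"; L1 is one simple choice.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(disputed: List[float],
              profiles: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank candidate authors by distance to the disputed text, closest first."""
    ranked = [(author, l1_distance(disputed, theta))
              for author, theta in profiles.items()]
    return sorted(ranked, key=lambda pair: pair[1])


# Hypothetical topic distributions over 4 topics (each sums to 1).
# In practice these would come from an LDA model fitted on the corpus.
profiles = {
    "author_a": [0.70, 0.10, 0.10, 0.10],
    "author_b": [0.10, 0.60, 0.20, 0.10],
    "author_c": [0.25, 0.25, 0.25, 0.25],
}
disputed = [0.60, 0.20, 0.10, 0.10]

ranking = attribute(disputed, profiles)
best_author = ranking[0][0]  # candidate whose profile is closest
print(ranking)
```

The disputed text is assigned to the author whose profile yields the smallest distance. Swapping `l1_distance` for another measure on probability distributions leaves the ranking logic unchanged.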