Language independent authorship attribution using character level language models

Authors:
Fuchun Peng; Dale Schuurmans; Shaojun Wang; Vlado Keselj
Affiliations:
University of Waterloo, Canada; University of Waterloo, Canada; University of Waterloo, Canada; Dalhousie University, Canada
Venue:
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Year:
2003

Abstract

We present a method for computer-assisted authorship attribution based on character-level n-gram language models. Our approach rests on simple information-theoretic principles and achieves improved performance across a variety of languages without requiring extensive pre-processing or feature selection. To demonstrate the effectiveness and language independence of our approach, we present experimental results on Greek, English, and Chinese data. We show that our approach achieves state-of-the-art performance in each of these cases. In particular, we obtain an 18% accuracy improvement over the best published results for a Greek data set, while using a far simpler technique than previous investigations.