N-Gram feature selection for authorship identification

Authors:
John Houvardas;Efstathios Stamatatos
Affiliations:
Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Greece;Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Greece
Venue:
AIMSA'06 Proceedings of the 12th international conference on Artificial Intelligence: methodology, Systems, and Applications
Year:
2006

Citing 13
Cited 14

Word association norms, mutual information, and lexicography

Computational Linguistics
Discrimination of authorship using visualization

Information Processing and Management: an International Journal
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Mining e-mail content for author identification forensics

ACM SIGMOD Record
Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units

EPIA '99 Proceedings of the 9th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
A repetition based measure for verification of text collections and for text categorization

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Automatic text categorization in terms of genre and author

Computational Linguistics
RCV1: A New Benchmark Collection for Text Categorization Research

The Journal of Machine Learning Research
Language independent authorship attribution using character level language models

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Applying Authorship Analysis to Extremist-Group Web Forum Messages

IEEE Intelligent Systems
Linguistic profiling for author recognition and verification

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
On compression-based text classification

ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research

VGRAM: improving performance of approximate queries on string collections using variable-length grams

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Summarization system evaluation revisited: N-gram graphs

ACM Transactions on Speech and Language Processing (TSLP)
Tensor Space Models for Authorship Identification

SETN '08 Proceedings of the 5th Hellenic conference on Artificial Intelligence: Theories, Models and Applications
A survey of modern authorship attribution methods

Journal of the American Society for Information Science and Technology
Learning to recognize webpage genres

Information Processing and Management: an International Journal
Classifying Web Pages by Genre: An n-Gram Approach

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Particle Swarm Model Selection for Authorship Verification

CIARP '09 Proceedings of the 14th Iberoamerican Conference on Pattern Recognition: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
Improving gender classification of blog authors

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Local histograms of character N-grams for authorship attribution

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
A weighted profile intersection measure for profile-based authorship attribution

MICAI'11 Proceedings of the 10th Mexican international conference on Advances in Artificial Intelligence - Volume Part I
A new document author representation for authorship attribution

MCPR'12 Proceedings of the 4th Mexican conference on Pattern Recognition
Characterizing stylistic elements in syntactic structure

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Semi-random subspace method for writeprint identification

Neurocomputing
The use of orthogonal similarity relations in the prediction of authorship

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2

Quantified Score

Hi-index	0.00

Visualization

Abstract

Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to represent text for stylistic purposes since they are able to capture nuances in lexical, syntactical, and structural level. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work for selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors showing that an increase in performance can be achieved using simple text pre-processing.