Automatic authorship identification is a valuable tool for supporting criminal investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful way to represent text for stylistic purposes, since they capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few members in common. Moreover, we explore the significance of digits for distinguishing between authors and show that simple text pre-processing can increase performance.
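To make the representation concrete, the following is a minimal sketch of extracting character n-grams of several lengths from a text, with an optional digit-normalization step. It is an illustration only, not the selection method proposed in the paper: the function names, the length range of 3 to 5, and the choice to map every digit to a single placeholder symbol are all assumptions made for this example.

```python
import re
from collections import Counter


def normalize_digits(text, placeholder="@"):
    """Replace every digit with one placeholder symbol.

    Assumption: this is one plausible form of "simple text
    pre-processing" for digits; the abstract does not specify it.
    """
    return re.sub(r"\d", placeholder, text)


def char_ngrams(text, min_n=3, max_n=5):
    """Count all character n-grams with min_n <= n <= max_n.

    The length range is a hypothetical default, not a value
    taken from the paper.
    """
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    sample = "Profits rose 12% in 1999."
    profile = char_ngrams(normalize_digits(sample))
    print(profile.most_common(5))
```

A feature-selection step (for example information gain, or the variable-length selection the paper proposes) would then choose which of these candidate n-grams to keep as features for the classifier.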