Authorship Attribution Based on Specific Vocabulary

Authors: Jacques Savoy
Affiliations: University of Neuchatel
Venue: ACM Transactions on Information Systems (TOIS)
Year: 2012


Abstract

In this article we propose a technique for computing a standardized Z score that identifies the vocabulary specific to a text (or part thereof) relative to an entire corpus. Assuming that term occurrences follow a binomial distribution, we use this score to weight terms (words and punctuation symbols in the current study), so that the weights represent the lexical specificity of the underlying text. Finally, we define an author profile by averaging these text representations and combine the profiles with a distance measure to derive a simple and efficient authorship attribution scheme. To evaluate this algorithm, we conduct two experiments, the first based on 5,408 newspaper articles (Glasgow Herald) written in English by 20 distinct authors and the second on 4,326 newspaper articles (La Stampa) written in Italian by 20 distinct authors. In these experiments the suggested classification scheme tends to perform better than the Delta rule method based on the most frequent words, the chi-square distance based on word profiles and punctuation marks, the KLD scheme based on a predefined set of words, and the naïve Bayes approach.
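
The pipeline outlined in the abstract (Z-score weighting, averaged author profiles, nearest profile by distance) can be illustrated with a minimal Python sketch. Several points are assumptions, since the abstract does not spell them out: the Z score is taken to be the usual binomial standardization, (observed count - n*p) / sqrt(n*p*(1-p)), with p estimated from the whole corpus; Euclidean distance stands in for the unspecified distance measure; and the function names (z_scores, author_profile, distance, attribute) and the toy data are hypothetical.

```python
import math
from collections import Counter


def z_scores(doc_tokens, corpus_counts, corpus_size):
    """Z score of every corpus term in one document (assumed binomial form)."""
    n_doc = len(doc_tokens)
    doc_counts = Counter(doc_tokens)
    scores = {}
    for term, corpus_tf in corpus_counts.items():
        p = corpus_tf / corpus_size             # occurrence probability, estimated on the corpus
        expected = n_doc * p                    # binomial mean
        std = math.sqrt(n_doc * p * (1.0 - p))  # binomial standard deviation
        if std > 0:                             # skip degenerate cases (empty text, p == 1)
            scores[term] = (doc_counts[term] - expected) / std
    return scores


def author_profile(doc_scores):
    """Average the Z-score representations of one author's texts."""
    terms = set().union(*doc_scores)
    return {t: sum(d.get(t, 0.0) for d in doc_scores) / len(doc_scores)
            for t in terms}


def distance(a, b):
    """Euclidean distance between two Z-score vectors (an assumed choice)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def attribute(query_scores, profiles):
    """Return the author whose profile is closest to the query text."""
    return min(profiles, key=lambda author: distance(query_scores, profiles[author]))


# Toy data, invented for illustration only (tokens include punctuation symbols).
texts = {
    "A": [["the", "cat", ",", "the", "cat"], ["the", "cat", "sat"]],
    "B": [["a", "dog", "ran", "!"], ["a", "dog", "!", "a", "dog"]],
}
corpus_counts = Counter(tok for docs in texts.values() for doc in docs for tok in doc)
corpus_size = sum(corpus_counts.values())

profiles = {
    author: author_profile([z_scores(d, corpus_counts, corpus_size) for d in docs])
    for author, docs in texts.items()
}
query = z_scores(["the", "cat", "sat", ","], corpus_counts, corpus_size)
print(attribute(query, profiles))  # prints A
```

A positive score marks a term that is over-used in the text relative to the corpus, and a negative score marks an under-used one. In this sketch both signs contribute to the distance between a query text and each author profile.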