Linguistic correlates of style: authorship classification with deep linguistic analysis features

Authors:
Michael Gamon
Affiliations:
Microsoft Corp., One Microsoft Way, Redmond, WA
Venue:
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Year:
2004

Abstract

Authorship identification is a form of style classification, a sub-field of text categorization that deals with properties of the form of linguistic expression rather than the content of a text. Various feature sets and classification methods have been proposed in the literature that abstract away from the content of a text and focus on its stylistic properties. We demonstrate that in a realistically difficult authorship attribution scenario, deep linguistic analysis features such as context-free production frequencies and semantic relationship frequencies achieve significant error reduction over more commonly used "shallow" features such as function word frequencies and part-of-speech trigrams. Modern machine learning techniques such as support vector machines allow us to explore large feature vectors, combining these different feature sets to achieve high classification accuracy in style-based tasks.
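
To make the feature types named in the abstract concrete, here is a minimal Python sketch of how function word frequencies, part-of-speech trigram frequencies, and context-free production frequencies might be computed and merged into one feature vector. It is an illustration under stated assumptions, not the paper's implementation: the `FUNCTION_WORDS` list is a tiny hypothetical sample, the tags and parse tree are assumed to come from an external tagger and parser, the nested-tuple tree format and all function names are invented for this example, and semantic relationship features and the support vector machine itself are not shown.

```python
from collections import Counter

# Hypothetical, deliberately tiny function word list (an assumption for
# illustration; a real study would use a much longer list).
FUNCTION_WORDS = ("the", "of", "and", "a", "in", "to", "that", "it")


def relative_freqs(counts, prefix):
    """Turn raw counts into relative frequencies with prefixed feature names."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f"{prefix}:{key}": n / total for key, n in counts.items()}


def function_word_features(tokens):
    """Shallow feature: share of all tokens taken by each function word."""
    total = len(tokens)
    counts = Counter(t.lower() for t in tokens)
    return {f"fw:{w}": (counts[w] / total if total else 0.0)
            for w in FUNCTION_WORDS}


def pos_trigram_features(tags):
    """Shallow feature: relative frequency of each part-of-speech trigram."""
    trigrams = Counter(" ".join(tags[i:i + 3]) for i in range(len(tags) - 2))
    return relative_freqs(trigrams, "pos3")


def productions(tree):
    """Yield context-free productions such as 'S -> NP VP'.

    A tree is a tuple (label, child, ...); a word is a plain string.
    Lexical productions (tag -> word) are skipped so that only syntactic
    structure, not vocabulary, is counted.
    """
    label, *children = tree
    nonterminals = [c for c in children if isinstance(c, tuple)]
    if nonterminals:
        yield f"{label} -> {' '.join(c[0] for c in nonterminals)}"
        for child in nonterminals:
            yield from productions(child)


def production_features(trees):
    """Deep feature: relative frequency of each context-free production."""
    counts = Counter(p for tree in trees for p in productions(tree))
    return relative_freqs(counts, "cfg")


def combined_features(tokens, tags, trees):
    """Merge all feature sets into one sparse feature vector (a dict)."""
    features = {}
    features.update(function_word_features(tokens))
    features.update(pos_trigram_features(tags))
    features.update(production_features(trees))
    return features


# Toy input; tags and tree are written by hand here, standing in for the
# output of a tagger and a parser.
tokens = ["The", "dog", "chased", "the", "cat"]
tags = ["DT", "NN", "VBD", "DT", "NN"]
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "dog")),
        ("VP", ("VBD", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))))
vector = combined_features(tokens, tags, [tree])
```

Prefixing each feature name with its feature set (`fw:`, `pos3:`, `cfg:`) keeps the sets from colliding when merged, and makes it easy to train a classifier on any subset to compare shallow and deep features.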