Authorship classification: a discriminative syntactic tree mining approach

Authors:
Sangkyum Kim; Hyungsul Kim; Tim Weninger; Jiawei Han; Hyun Duk Kim
Affiliations:
UIUC, Urbana, IL, USA (all authors)
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Abstract

Dozens of studies have addressed automatic authorship classification, and many conclude that writing style is one of the best indicators of original authorship. Among the hundreds of features that have been developed, syntactic features best reflect an author's writing style. However, because extracting and computing syntactic features is computationally expensive, only simple variations of basic syntactic features, such as function words, POS (part-of-speech) tags, and rewrite rules, have been considered. In this paper, we propose a new feature set of k-embedded-edge subtree patterns that holds more syntactic information than previous feature sets. We also propose a novel approach for mining these patterns directly from a given set of syntactic trees, and we show that this approach reduces the computational burden of using complex syntactic structures as the feature set. Comprehensive experiments on real-world datasets demonstrate that our approach is reliable and more accurate than those of previous studies.
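
For readers unfamiliar with the baseline syntactic features named in the abstract, the following minimal sketch shows how POS tags and rewrite rules might be read off a single syntactic tree. It is an illustration only and is not the paper's mining method: the toy tree, the nested-tuple representation, and the function names `pos_tags` and `rewrite_rules` are all assumptions introduced here.

```python
from collections import Counter

# A toy constituency parse as nested tuples: (label, child, child, ...).
# Leaves are plain strings (the words). The tree is made up for illustration.
TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "author")),
        ("VP", ("VBZ", "writes"),
               ("NP", ("DT", "a"), ("NN", "letter"))))


def is_preterminal(tree):
    """True if the node's only child is a word, i.e. the label is a POS tag."""
    return len(tree) == 2 and isinstance(tree[1], str)


def pos_tags(tree):
    """Return the POS tags of the sentence, left to right."""
    if is_preterminal(tree):
        return [tree[0]]
    tags = []
    for child in tree[1:]:
        tags.extend(pos_tags(child))
    return tags


def rewrite_rules(tree):
    """Count rewrite rules 'parent -> child labels' over phrase-level nodes."""
    rules = Counter()
    if is_preterminal(tree):
        return rules  # skip lexical rules such as 'DT -> the'
    rhs = " ".join(child[0] for child in tree[1:])
    rules[f"{tree[0]} -> {rhs}"] += 1
    for child in tree[1:]:
        rules.update(rewrite_rules(child))  # Counter.update adds counts
    return rules


if __name__ == "__main__":
    print(pos_tags(TREE))
    print(rewrite_rules(TREE))
```

Each rewrite rule captures only one parent and its immediate children. The abstract's point is that subtree patterns spanning more of the tree carry more syntactic information than such single-level features.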