Authorship classification: a syntactic tree mining approach

Authors:
Sangkyum Kim;Hyungsul Kim;Tim Weninger;Jiawei Han
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of the ACM SIGKDD Workshop on Useful Patterns
Year:
2010

Abstract

Dozens of studies have addressed automatic authorship classification, and many of them conclude that writing style is one of the best indicators of original authorship. Among the hundreds of features that have been developed, syntactic features best reflect an author's writing style. However, because extracting and computing syntactic features is computationally expensive, only simple variations of basic syntactic features, such as function words and part-of-speech tags, have been considered. In this paper, we propose a novel approach to mining discriminative k-embedded-edge subtree patterns from a given set of syntactic trees, which reduces the computational burden of using complex syntactic structures as a feature set. This method is shown to increase classification accuracy. We also design a new kernel based on these features. Comprehensive experiments on real datasets of news articles and movie reviews demonstrate that our approach is reliable and more accurate than the methods of previous studies.
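
To make the idea of discriminative tree patterns concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a much simpler pattern class (single parent-child label edges rather than k-embedded-edge subtrees, which are not defined in this record) and a simple score (absolute difference in support between two authors). The function names, the nested-tuple tree encoding, and the toy parse trees are all hypothetical.

```python
from collections import Counter

# Assumed encoding: a tree is a nested tuple (label, child, child, ...);
# a leaf is a one-element tuple (label,).

def edge_patterns(tree):
    """Return the set of (parent_label, child_label) edges in one tree."""
    label, children = tree[0], tree[1:]
    edges = set()
    for child in children:
        edges.add((label, child[0]))
        edges |= edge_patterns(child)
    return edges

def support(trees):
    """Fraction of trees that contain each edge pattern at least once."""
    counts = Counter()
    for t in trees:
        counts.update(edge_patterns(t))
    return {p: c / len(trees) for p, c in counts.items()}

def discriminative_patterns(trees_a, trees_b, top=3):
    """Rank patterns by absolute support difference between two authors.

    This score is an illustrative assumption, not the measure used in the paper.
    """
    sa, sb = support(trees_a), support(trees_b)
    scores = {p: abs(sa.get(p, 0.0) - sb.get(p, 0.0)) for p in set(sa) | set(sb)}
    # Sort by descending score, breaking ties by pattern for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Toy, hand-made parse trees with POS-style labels (hypothetical data).
author_a = [
    ("S", ("NP", ("DT",), ("NN",)), ("VP", ("VBD",), ("NP", ("NN",)))),
    ("S", ("NP", ("DT",), ("NN",)), ("VP", ("VBD",))),
]
author_b = [
    ("S", ("NP", ("PRP",)),
          ("VP", ("VBD",), ("SBAR", ("IN",),
                            ("S", ("NP", ("PRP",)), ("VP", ("VBD",)))))),
    ("S", ("NP", ("PRP",)), ("VP", ("VBD",), ("NP", ("NN",)))),
]

ranked = discriminative_patterns(author_a, author_b)
for pattern, score in ranked:
    print(pattern, score)
```

In this toy data, the edges that separate the two authors most clearly are a noun phrase headed by a determiner versus one headed by a pronoun. Each top-ranked pattern could then serve as a binary feature for a standard classifier; larger subtree patterns would capture more structure at a higher mining cost, which is the trade-off the abstract addresses.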