Local histograms of character N-grams for authorship attribution

Authors:
Hugo Jair Escalante;Thamar Solorio;Manuel Montes-y-Gómez
Affiliations:
Universidad Autónoma de Nuevo León, San Nicolás de los Garza, NL, México;University of Alabama at Birmingham, Birmingham, AL;University of Alabama at Birmingham, Birmingham, AL
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Year:
2011

Abstract

This paper proposes the use of local histograms (LHs) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; built over words, they have been used successfully for text categorization and document visualization. In this work we explore the suitability of LHs over character-level n-grams for AA. We show that LHs are particularly helpful for AA because they provide information that uncovers, to some extent, the writing style of authors. Experimental results on AA data sets confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state-of-the-art approaches. We found that LHs are even more advantageous in challenging conditions, such as imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors in related tasks, such as authorship verification and plagiarism detection.
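To make the contrast between global and local histograms concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that a local histogram is obtained by weighting each character n-gram with a Gaussian kernel centered at one of k evenly spaced positions in the document; the abstract does not specify the kernel, the number of positions, or the n-gram length, so the function names and the parameters `n`, `k` and `sigma` are illustrative assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the sequence of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def global_histogram(grams):
    """Usual bag-of-n-grams: relative frequency over the whole document."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def local_histograms(grams, k=3, sigma=0.2):
    """One position-weighted histogram per kernel center (assumed Gaussian).

    Positions are normalized to (0, 1); the k centers are evenly spaced.
    Each n-gram occurrence contributes a weight that decays with its
    distance from the center, so sequential information is preserved.
    """
    length = len(grams)
    centers = [(j + 0.5) / k for j in range(k)]
    histograms = []
    for mu in centers:
        weights = Counter()
        for i, g in enumerate(grams):
            pos = (i + 0.5) / length
            weights[g] += math.exp(-0.5 * ((pos - mu) / sigma) ** 2)
        total = sum(weights.values())
        histograms.append({g: w / total for g, w in weights.items()})
    return histograms


if __name__ == "__main__":
    grams = char_ngrams("abcabcxyzxyz", n=3)
    print(global_histogram(grams))
    for hist in local_histograms(grams, k=2, sigma=0.1):
        # Show the three heaviest n-grams near each center.
        print(sorted(hist.items(), key=lambda kv: -kv[1])[:3])
```

In this toy string the global histogram gives "abc" and "xyz" the same frequency, while the two local histograms separate them by where they occur in the text. A document would then be described by its set of k local histograms rather than by a single vector.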