A comparative study of language models for book and author recognition

Authors: Özlem Uzuner; Boris Katz
Affiliation: Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA (both authors)
Venue: IJCNLP'05, Proceedings of the Second International Joint Conference on Natural Language Processing
Year: 2005



Abstract

Linguistic information can help improve the evaluation of similarity between documents; however, the kind of linguistic information to use depends on the task. In this paper, we show that distributions of syntactic structures capture the way works are written and correctly identify individual books more than 76% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 66%. However, testing the same features on authorship attribution shows that distributions of syntactic structures are less successful than function words on this task. Syntactic structures vary even among the works of the same author, whereas features such as function words are distributed more similarly among the works of an author and can capture authorship more effectively.
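
To make the function-word baseline mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tiny hypothetical function-word list, relative frequencies as features, and cosine similarity with a nearest-reference decision; the names `FUNCTION_WORDS`, `function_word_profile`, `cosine_similarity`, and `nearest_author` are all illustrative, and the paper's actual feature set and classifier are not specified here.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; the list used in the paper is not given in this record).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]


def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_author(sample, reference_texts):
    """Return the author whose reference text has the most similar profile.

    `reference_texts` maps an author name to one text known to be by that author.
    """
    profile = function_word_profile(sample)
    return max(
        reference_texts,
        key=lambda author: cosine_similarity(
            profile, function_word_profile(reference_texts[author])
        ),
    )
```

For example, `nearest_author(sample, {"Author A": text_a, "Author B": text_b})` returns the key whose text uses the listed function words in proportions closest to those of `sample`. A realistic experiment would use a much longer word list, several texts per author, and a trained classifier rather than a single nearest-reference comparison.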