Entropy-based authorship search in large document collections

Authors:
Ying Zhao;Justin Zobel
Affiliations:
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia;School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007

Abstract

The purpose of authorship search is to identify documents written by a particular author in large document collections. Standard search engines match documents to queries based on topic, and are not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory: documents are ranked by the relative entropy of style markers, an idea inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be accurately found within large collections. Although effectiveness does degrade as collection size increases, even with 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and is much more scalable than state-of-the-art approaches in terms of collection size and number of candidate authors.
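
To make the ranking idea concrete, the following is a minimal Python sketch of relative-entropy ranking over style markers. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the marker set, the smoothing method, or the direction of the divergence. The small function-word list `MARKERS`, the Dirichlet-style smoothing with parameter `mu`, and all function names here are hypothetical choices made for the example.

```python
import math
from collections import Counter

# Assumed style markers: a tiny set of function words (illustrative only).
MARKERS = ("the", "of", "and", "to", "a", "in", "that", "is")


def marker_counts(text):
    """Count style markers in a lowercased, whitespace-tokenised text."""
    return Counter(w for w in text.lower().split() if w in MARKERS)


def background_model(texts):
    """Add-one-smoothed marker distribution over the whole collection."""
    counts = Counter()
    for t in texts:
        counts.update(marker_counts(t))
    total = sum(counts.values()) + len(MARKERS)
    return {m: (counts[m] + 1) / total for m in MARKERS}


def smoothed_model(counts, background, mu):
    """Dirichlet-style smoothing toward the background (mu is assumed)."""
    total = sum(counts.values())
    return {m: (counts[m] + mu * background[m]) / (total + mu) for m in MARKERS}


def relative_entropy(p, q):
    """KL divergence D(p || q) over the marker set; q must be non-zero."""
    return sum(p[m] * math.log(p[m] / q[m]) for m in MARKERS if p[m] > 0)


def rank_by_author(author_texts, collection, mu=10.0):
    """Return document indices ordered by increasing divergence from the author model."""
    bg = background_model(collection)
    author_counts = Counter()
    for t in author_texts:
        author_counts.update(marker_counts(t))
    p = smoothed_model(author_counts, bg, mu)
    scored = [
        (relative_entropy(p, smoothed_model(marker_counts(d), bg, mu)), i)
        for i, d in enumerate(collection)
    ]
    return [i for _, i in sorted(scored)]
```

In this sketch a lower divergence means a closer stylistic match, so the documents at the front of the returned list are the candidates most likely to be by the author. Smoothing against the collection keeps every marker probability non-zero, so the logarithm is always defined.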