Measuring structural similarity of semistructured data based on information-theoretic approaches
The VLDB Journal — The International Journal on Very Large Data Bases
Most approaches to text classification rely on some measure of (dis)similarity between sequences of symbols. Information-theoretic measures have the advantage of making very few assumptions about the models assumed to have generated the sequences, and they have been the focus of recent interest. This paper addresses the use of the Ziv-Merhav method (ZMM) for estimating relative entropy (Kullback-Leibler divergence) between sequences of symbols as a tool for text classification. We describe an implementation of the ZMM based on a modified version of the Lempel-Ziv algorithm (LZ77). Experiments on synthetic Markov sequences show that the ZMM yields accurate estimates of the Kullback-Leibler divergence. Finally, we apply the method to a text classification problem (specifically, authorship attribution), where it outperforms a previously proposed, also information-theoretic, method.
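At its core, the Ziv-Merhav estimator combines two phrase counts: a sequential cross-parsing of one sequence with respect to the other, and a self-parsing of the first sequence. The sketch below is only an illustration of that idea, not the paper's modified-LZ77 implementation: it uses naive substring search, an LZ78-style incremental self-parsing, and the commonly used form of the phrase-count estimator; the function names and these choices are assumptions, not taken from the source.

```python
import math

def cross_parse_count(z, x):
    """Sequential cross-parsing of z with respect to x: repeatedly take the
    longest prefix of the remaining part of z that occurs as a substring of x
    (at least one symbol per phrase), and count the resulting phrases."""
    c, i, n = 0, 0, len(z)
    while i < n:
        l = 1
        while i + l <= n and z[i:i + l] in x:
            l += 1
        i += max(l - 1, 1)  # advance by the matched length (minimum 1)
        c += 1
    return c

def self_parse_count(z):
    """Incremental (LZ78-style) self-parsing of z: each new phrase is the
    shortest prefix of the remainder not seen as a phrase before."""
    phrases, c, i, n = set(), 0, 0, len(z)
    while i < n:
        l = 1
        while i + l <= n and z[i:i + l] in phrases:
            l += 1
        phrases.add(z[i:i + l])
        i += l
        c += 1
    return c

def zm_divergence(z, x):
    """Phrase-count estimate of the relative entropy D(z || x) in bits per
    symbol. For short sequences the estimate is noisy and may be negative."""
    n = len(z)
    c_cross = cross_parse_count(z, x)
    c_self = self_parse_count(z)
    return (c_cross * math.log2(n) - c_self * math.log2(c_self)) / n
```

For classification, one would compute `zm_divergence(text, reference)` against a reference corpus per class (e.g., per candidate author) and pick the class with the smallest estimated divergence; a sequence parses into far fewer cross-phrases against a statistically similar reference than against a dissimilar one.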