A Source Coding Approach to Classification by Vector Quantization and the Principle of Minimum Description Length

  • Authors: Jia Li
  • Affiliations: -
  • Venue: DCC '02 Proceedings of the Data Compression Conference
  • Year: 2002


Abstract

An algorithm for supervised classification using vector quantization and entropy coding is presented. The classification rule is formed from a set of training data $\{(X_i, Y_i)\}_{i=1}^{n}$, drawn independently from a joint distribution $P_{XY}$. By the principle of Minimum Description Length (MDL), a statistical model that approximates $P_{XY}$ ought to enable efficient coding of $X$ and $Y$. Conversely, a system that encodes $(X, Y)$ efficiently is expected to carry ample information about $P_{XY}$, information that can then be used to classify $X$, i.e., to predict the corresponding $Y$ from $X$. To encode both $X$ and $Y$, a two-stage vector quantizer is applied to $X$, and a Huffman code is formed for $Y$ conditioned on each quantized value of $X$. Optimizing the encoder is equivalent to designing a vector quantizer whose objective function reflects the joint penalty of quantization error and misclassification rate. This vector quantizer provides an estimate of the conditional distribution of $Y$ given $X$, which in turn yields an approximation to the Bayes classification rule. The algorithm, named Discriminant Vector Quantization (DVQ), is compared with Learning Vector Quantization (LVQ) and CART on a number of data sets; DVQ outperforms the other two on several of them. The relationship between DVQ, density estimation, and regression is also discussed.
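
To make the encoding scheme concrete, below is a minimal Python sketch of the idea the abstract describes: a codebook on $X$ is designed under a joint penalty of squared quantization error and the code length of $Y$ given the quantized cell. The function names (`dvq_fit`, `dvq_predict`), the Lloyd-style alternation, the smoothing, and the use of ideal code lengths $-\log_2 p(y \mid \text{cell})$ in place of an actual Huffman code are illustrative assumptions, not the paper's exact algorithm.

```python
# A hedged sketch of the DVQ idea: quantizer design under a joint penalty of
# quantization error on X and (ideal) code length of Y given the cell.
import numpy as np

def dvq_fit(X, Y, n_cells=8, lam=1.0, n_iter=20, seed=0):
    """Design a codebook whose objective mixes squared quantization error
    on X with the ideal code length of Y conditioned on the cell."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    classes = np.unique(Y)
    centroids = X[rng.choice(n, n_cells, replace=False)]
    # Start with a uniform conditional distribution of Y in every cell.
    cond = np.full((n_cells, classes.size), 1.0 / classes.size)
    y_idx = np.searchsorted(classes, Y)
    for _ in range(n_iter):
        # Assignment: joint penalty = distortion + lam * code length of Y.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        code_len = -np.log2(np.maximum(cond[:, y_idx].T, 1e-12))
        cells = np.argmin(dist + lam * code_len, axis=1)
        # Update: recompute centroids and per-cell conditionals of Y.
        for k in range(n_cells):
            mask = cells == k
            if mask.any():
                centroids[k] = X[mask].mean(axis=0)
                counts = np.array([(Y[mask] == c).sum() for c in classes])
                # Laplace smoothing keeps every code length finite.
                cond[k] = (counts + 1) / (counts.sum() + classes.size)
    return centroids, cond, classes

def dvq_predict(X, centroids, cond, classes):
    """Quantize X by nearest centroid, then predict the class with the
    largest estimated conditional probability in that cell (an
    approximation to the Bayes rule, as the abstract describes)."""
    X = np.asarray(X, dtype=float)
    cells = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    return classes[cond[cells].argmax(axis=1)]
```

With `lam = 0` this reduces to ordinary Lloyd-style codebook design on $X$ alone; increasing `lam` biases the cell boundaries toward class purity, mirroring the trade-off between quantization error and misclassification that the objective function in the abstract encodes.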