Using MDL for grammar induction

  • Authors:
  • Pieter Adriaans; Ceriel Jacobs

  • Affiliations:
  • Department of Computer Science, University of Amsterdam, Amsterdam, The Netherlands; Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

  • Venue:
  • ICGI'06 Proceedings of the 8th international conference on Grammatical Inference: algorithms and applications
  • Year:
  • 2006

Abstract

In this paper we study the application of the Minimum Description Length (MDL) principle, or two-part-code optimization, to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for the construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the randomness deficiency. This is a measure of how typical the data set is for the theory. It cannot be computed, but in many relevant cases it can be approximated. An optimal theory has minimal randomness deficiency. Using results from [4] and [2] we show that:

  • Shorter code does not necessarily lead to better theories. We prove that, in DFA induction, divergence of randomness deficiency and MDL code can occur already as a result of a single deterministic merge of two nodes.
  • Contrary to what is suggested by the results of [6], there is no fundamental difference between positive and negative data from an MDL perspective.
  • MDL is extremely sensitive to the correct calculation of code length: model code and data-to-model code.

These results show why the applications of MDL to grammar induction have so far been disappointing. We show how the theoretical results can be deployed to create an effective algorithm for DFA induction. However, since MDL is a global optimization criterion, we believe that MDL-based solutions will in many cases be less effective in problem domains where local optimization criteria can be easily calculated. The algorithms were tested on the Abbadingo problems ([10]). The code was written in Java, using the Satin ([17]) divide-and-conquer system, which runs on top of the Ibis ([18]) Grid programming environment.
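To make the two-part code concrete, the following is a minimal Python sketch of how a model code and a data-to-model code might be computed for a DFA. It is an illustration under stated assumptions, not the encoding used in the paper: the model code formula in `model_bits` is a deliberately naive guess, and the data-to-model code treats the sample as an arbitrary subset of the strings of length at most `max_len` that the DFA accepts, costing the base-2 logarithm of the number of such subsets. All function names, the `max_len` bound, and the two example automata are hypothetical.

```python
import math


def count_accepted(transitions, start, accepting, alphabet, max_len):
    """Count the strings of length <= max_len accepted by a DFA.

    `transitions` maps (state, symbol) -> state; missing entries reject.
    """
    # counts[s] = number of strings of the current length that end in state s
    counts = {start: 1}
    total = 1 if start in accepting else 0  # the empty string
    for _ in range(max_len):
        nxt = {}
        for state, n in counts.items():
            for sym in alphabet:
                target = transitions.get((state, sym))
                if target is not None:
                    nxt[target] = nxt.get(target, 0) + n
        counts = nxt
        total += sum(n for s, n in counts.items() if s in accepting)
    return total


def data_to_model_bits(n_accepted, n_sample):
    """Bits to single out a sample of n_sample strings among n_accepted.

    Assumption: every subset of that size is equally likely, so the cost
    is log2 of the binomial coefficient.
    """
    if n_sample > n_accepted:
        raise ValueError("sample cannot be larger than the accepted set")
    return math.log2(math.comb(n_accepted, n_sample))


def model_bits(n_states, alphabet_size):
    """Naive model code (an assumption, not the paper's formula).

    Each transition names one of n_states targets or 'missing', and each
    state carries one accept/reject bit.
    """
    return n_states * alphabet_size * math.log2(n_states + 1) + n_states


def two_part_bits(transitions, start, accepting, alphabet, n_states,
                  n_sample, max_len):
    """Total two-part code length: model code plus data-to-model code."""
    n_accepted = count_accepted(transitions, start, accepting, alphabet, max_len)
    return model_bits(n_states, len(alphabet)) + data_to_model_bits(
        n_accepted, n_sample)


ALPHABET = ("a", "b")

# Hypothetical example 1: strings with an even number of 'a' symbols.
EVEN_A = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}

# Hypothetical example 2: the one-state DFA that accepts every string.
UNIVERSAL = {(0, "a"): 0, (0, "b"): 0}

if __name__ == "__main__":
    for n_sample in (4, 7):
        even = two_part_bits(EVEN_A, 0, {0}, ALPHABET, 2, n_sample, 3)
        univ = two_part_bits(UNIVERSAL, 0, {0}, ALPHABET, 1, n_sample, 3)
        print(n_sample, round(even, 2), round(univ, 2))
```

With strings of length at most 3, the even-`a` automaton accepts 8 strings and the universal automaton accepts 15. Which of the two gets the shorter total code depends on the sample size and on the choice of model code, which is one way to see the sensitivity to code-length calculation that the abstract mentions.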