Using MDL for grammar induction

Authors:
Pieter Adriaans; Ceriel Jacobs
Affiliations:
Department of Computer Science, University of Amsterdam, Amsterdam, The Netherlands; Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Venue:
ICGI'06 Proceedings of the 8th international conference on Grammatical Inference: algorithms and applications
Year:
2006

Citing 8

The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM (JACM)
An introduction to Kolmogorov complexity and its applications (2nd ed.)
Machine Learning
Information Compression by Multiple Alignment, Unification and Search as a Unifying Principle in Computing and Cognition. Artificial Intelligence Review
Results of the Abbadingo One DFA Learning Competition and a New Evidence-Driven State Merging Algorithm. ICGI '98 Proceedings of the 4th International Colloquium on Grammatical Inference
The EMILE 4.1 Grammar Induction Toolbox. ICGI '02 Proceedings of the 6th International Colloquium on Grammatical Inference: Algorithms and Applications
Ibis: a flexible and efficient Java-based Grid programming environment: Research Articles. Concurrency and Computation: Practice & Experience - 2002 ACM Java Grande–ISCOPE Conference Part II
Kolmogorov's structure functions and model selection. IEEE Transactions on Information Theory

Cited 5

Learning as Data Compression. CiE '07 Proceedings of the 3rd conference on Computability in Europe: Computation and Logic in the Real World
Grid management support by means of collaborative learning agents. GMAC '09 Proceedings of the 6th international conference industry session on Grids meets autonomic computing
Satin: A high-level and efficient grid programming model. ACM Transactions on Programming Languages and Systems (TOPLAS)
Using grammar induction to model adaptive behavior of networks of collaborative agents. ICGI'10 Proceedings of the 10th international colloquium conference on Grammatical inference: theoretical results and applications
STAMINA: a competition to encourage the development and assessment of software model inference techniques. Empirical Software Engineering


Abstract

In this paper we study the application of the Minimum Description Length (MDL) principle (or two-part-code optimization) to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for the construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the randomness deficiency. This is a measure of how typical the data set is for the theory. It cannot be computed, but in many relevant cases it can be approximated. An optimal theory has minimal randomness deficiency. Using results from [4] and [2] we show that:

- Shorter code does not necessarily lead to better theories. We prove that, in DFA induction, randomness deficiency and MDL code can already diverge as a result of a single deterministic merge of two nodes.
- Contrary to what is suggested by the results of [6], there is no fundamental difference between positive and negative data from an MDL perspective.
- MDL is extremely sensitive to the correct calculation of code length: model code and data-to-model code.

These results show why the applications of MDL to grammar induction so far have been disappointing. We show how the theoretical results can be deployed to create an effective algorithm for DFA induction. However, we believe that, since MDL is a global optimization criterion, MDL-based solutions will in many cases be less effective in problem domains where local optimization criteria can be easily calculated. The algorithms were tested on the Abbadingo problems ([10]). The code was written in Java, using the Satin ([17]) divide-and-conquer system that runs on top of the Ibis ([18]) Grid programming environment.
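
The two code lengths named in the abstract, model code and data-to-model code, can be illustrated with a small Python sketch. It is not the authors' Java implementation, and every encoding choice in it is an assumption made for illustration: the model code charges log2(n + 1) bits per transition plus one accept bit per state, and the data-to-model code is the number of bits needed to index a sample of positive strings among all subsets of the same size drawn from the strings of bounded length that the DFA accepts. The function names and the two example automata are hypothetical. The sketch does not compute randomness deficiency, which the abstract notes is not computable.

```python
from math import comb, log2


def count_accepted(delta, start, accepting, alphabet, max_len):
    """Count the strings of length <= max_len that the DFA accepts."""
    # paths[q] = number of strings of the current length leading from start to q
    paths = {start: 1}
    total = 1 if start in accepting else 0  # the empty string
    for _ in range(max_len):
        nxt = {}
        for state, n in paths.items():
            for symbol in alphabet:
                target = delta.get((state, symbol))  # missing transition = reject
                if target is not None:
                    nxt[target] = nxt.get(target, 0) + n
        paths = nxt
        total += sum(n for state, n in paths.items() if state in accepting)
    return total


def model_code_bits(num_states, alphabet_size):
    """Assumed model code: one target (or 'missing') per transition, one accept bit per state."""
    return num_states * alphabet_size * log2(num_states + 1) + num_states


def data_to_model_bits(num_accepted, sample_size):
    """Assumed data-to-model code: index of the sample among all same-size subsets."""
    return log2(comb(num_accepted, sample_size))


def two_part_code_bits(delta, start, accepting, alphabet, num_states, max_len, sample_size):
    """Sum of the assumed model code and data-to-model code, in bits."""
    accepted = count_accepted(delta, start, accepting, alphabet, max_len)
    return model_code_bits(num_states, len(alphabet)) + data_to_model_bits(accepted, sample_size)


# Hypothetical example 1: strings over {a, b} with an even number of a's.
PARITY = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
# Hypothetical example 2: a one-state DFA that accepts every string.
UNIVERSAL = {(0, "a"): 0, (0, "b"): 0}

parity_bits = two_part_code_bits(PARITY, 0, {0}, "ab", 2, max_len=3, sample_size=4)
universal_bits = two_part_code_bits(UNIVERSAL, 0, {0}, "ab", 1, max_len=3, sample_size=4)
```

Under these assumptions the total depends on both terms at once: a larger automaton costs more model bits but can accept fewer strings, which shortens the data-to-model code. Changing either encoding changes which automaton gets the shorter total, which is one way to read the abstract's remark that MDL is sensitive to how code length is calculated.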