Towards High Speed Grammar Induction on Large Text Corpora

  • Authors:
  • Pieter W. Adriaans;Marten Trautwein;Marco Vervoort

  • Venue:
  • SOFSEM '00 Proceedings of the 27th Conference on Current Trends in Theory and Practice of Informatics
  • Year:
  • 2000

Abstract

In this paper we describe an efficient and scalable implementation of grammar induction based on the EMILE approach [2,3,4,5,6]. The current implementation, EMILE 4.1 [11], is one of the first efficient grammar induction algorithms that work on free text. Although EMILE 4.1 is far from perfect, it enables researchers to conduct empirical grammar induction research on various types of corpora. The EMILE approach is based on notions from categorial grammar (cf. [10]), which is known to generate the class of context-free languages. EMILE learns from positive examples only (cf. [1,7,9]). We describe the algorithms underlying the approach and report some practical results on small and large text collections. As shown in the articles mentioned above, in the limit EMILE learns the correct grammatical structure of a language from sentences of that language. The experiments we conducted show that, in practice, EMILE 4.1 is efficient and scalable. The current implementation learns a subclass of the shallow context-free languages. This subclass seems sufficiently rich to be of practical interest. In particular, EMILE seems to be a valuable tool for the syntactic and semantic analysis of large text corpora.