Generating grammars for SGML tagged texts lacking DTD

  • Authors:
  • H. Ahonen;H. Mannila;E. Nikunen

  • Affiliations:
  • Department of Computer Science, University of Helsinki P.O. Box 26 (Teollisuuskatu 23) FIN-00014 Helsinki, Finland;Department of Computer Science, University of Helsinki P.O. Box 26 (Teollisuuskatu 23) FIN-00014 Helsinki, Finland;Research Centre for Finnish Languages Sörnäisten rantatie 25, FIN-00500 Helsinki, Finland

  • Venue:
  • Mathematical and Computer Modelling: An International Journal
  • Year:
  • 1997

Abstract

We describe a technique for forming a context-free grammar for a document that has some kind of tagging (structural or typographical) but for which no concise description of the structure is available. The technique is based on ideas from machine learning. It first forms a set of finite-state automata that describe the document completely. These automata are then modified by considering certain context conditions; the modifications correspond to generalizing the underlying languages. Finally, the automata are converted into regular expressions, which are used to construct the grammar. An alternative representation, characteristic k-grams, is also introduced. Additionally, the paper describes some interactive operations necessary for generating a grammar for a large and complicated document.
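
As a rough illustration of the k-gram idea mentioned in the abstract, the Python sketch below collects, for every element, the sequences of child element names observed in a tagged text and summarizes them as a set of k-grams. It is a sketch under stated assumptions, not the construction from the paper. The function names (child_sequences, characteristic_kgrams, accepts), the boundary markers "^" and "$", and the acceptance rule (a sequence is accepted when every one of its k-grams was observed) are all hypothetical. The parser also assumes well-nested tags with explicit end tags, which real SGML does not guarantee.

```python
import re
from collections import defaultdict

# Matches start and end tags; attributes are skipped, not parsed.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")


def child_sequences(text):
    """Map each element name to the child-name sequences observed for it.

    Assumption: tags are well nested and every element has an end tag.
    """
    seqs = defaultdict(list)
    stack = []  # entries are (element name, list of child names so far)
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            if stack:
                stack[-1][1].append(name)
            stack.append((name, []))
        else:
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unbalanced end tag: {name}")
            _, children = stack.pop()
            seqs[name].append(tuple(children))
    return dict(seqs)


def kgrams(seq, k=2):
    """Return the set of k-grams of seq, padded with boundary markers."""
    padded = ("^",) * (k - 1) + tuple(seq) + ("$",)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def characteristic_kgrams(sequences, k=2):
    """Union of the k-grams of all observed sequences for one element."""
    grams = set()
    for seq in sequences:
        grams |= kgrams(seq, k)
    return grams


def accepts(grams, seq, k=2):
    """Accept seq if all of its k-grams were seen (a generalization)."""
    return kgrams(seq, k) <= grams


sample = (
    "<entry><head>a</head><sense>x</sense><sense>y</sense></entry>"
    "<entry><head>b</head><sense>z</sense></entry>"
)
observed = child_sequences(sample)
entry_grams = characteristic_kgrams(observed["entry"], k=2)
# Three senses never occur in the sample, yet the k-gram set admits them.
print(accepts(entry_grams, ("head", "sense", "sense", "sense")))
print(accepts(entry_grams, ("sense", "head")))
```

The point of the sketch is the generalization step: the k-gram set admits sequences that never occurred in the sample, such as an entry with three senses, while still rejecting orderings that were never observed. A smaller k generalizes more aggressively.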