Evaluating two methods for Treebank grammar compaction

  • Authors:
  • Alexander Krotov;Mark Hepple;Robert Gaizauskas;Yorick Wilks

  • Affiliations:
  • Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK/ alexk@dcs.shef.ac.uk, hepple@dcs.shef.ac.uk, robertg@dcs.shef.ac.uk, yorick@dcs. ...

  • Venue:
  • Natural Language Engineering
  • Year:
  • 1999

Abstract

Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar. In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
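
The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes trees are given as nested tuples over part-of-speech tags, and all function names (`extract_rules`, `threshold`, `can_derive`, `compact`) are hypothetical. Rule-parsing is read here, in its non-probabilistic form, as: a rule is redundant if its right-hand side can be parsed to its left-hand side using the remaining rules. The order in which rules are tested (longest right-hand side first) is also an assumption.

```python
from collections import Counter


def extract_rules(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree.

    A tree is (label, child, child, ...); a leaf is a plain string (a POS tag).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from extract_rules(child)


def threshold(counts, min_count):
    """Technique (i): keep only rules seen at least `min_count` times."""
    return {rule for rule, n in counts.items() if n >= min_count}


def can_derive(rules, goal, seq):
    """Return True if `seq` can be parsed to `goal` using `rules`.

    Uses a simple fixed-point chart of (symbol, start, end) facts, so it
    terminates even when the grammar contains unary cycles.
    """
    n = len(seq)
    chart = {(seq[i], i, i + 1) for i in range(n)}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for start in range(n):
                ends = {start}
                for sym in rhs:
                    # Advance over every chart entry for `sym` starting at a current end.
                    ends = {k for e in ends for (s, i, k) in chart if s == sym and i == e}
                    if not ends:
                        break
                for end in ends:
                    if (lhs, start, end) not in chart:
                        chart.add((lhs, start, end))
                        changed = True
    return (goal, 0, n) in chart


def compact(rules):
    """Technique (ii), non-probabilistic: drop rules derivable from the rest.

    Assumption: rules are tested longest right-hand side first.
    """
    kept = set(rules)
    for rule in sorted(rules, key=lambda r: (-len(r[1]), r)):
        lhs, rhs = rule
        if can_derive(kept - {rule}, lhs, rhs):
            kept.discard(rule)
    return kept


if __name__ == "__main__":
    tree = ("S", ("NP", "DT", "NN"), ("VP", "VBD", ("NP", "DT", "NN")))
    counts = Counter(extract_rules(tree))
    print(threshold(counts, 2))

    grammar = {
        ("NP", ("DT", "NN")),
        ("NP", ("NP", "PP")),
        ("NP", ("DT", "NN", "PP")),  # parsable via the two NP rules above
        ("PP", ("IN", "NP")),
    }
    print(sorted(compact(grammar)))
```

In this toy grammar the flat rule NP → DT NN PP is removed, because DT NN can be parsed as NP and NP PP as NP using the remaining rules. Whether removing such rules preserves parsing performance is the empirical question the paper evaluates; the sketch makes no claim about that.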