Compacting the Penn Treebank grammar

Authors:
Alexander Krotov; Mark Hepple; Robert Gaizauskas; Yorick Wilks
Affiliations:
Sheffield University, Sheffield, UK (all authors)
Venue:
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Year:
1998

Abstract

Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad-coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a complete grammar, if one exists at some limit. However, we offer an alternative explanation in terms of the underspecification of structures within the treebank. This hypothesis is explored by applying an algorithm that compacts the derived grammar by eliminating redundant rules, that is, rules whose right-hand sides can be parsed by other rules. The size of the resulting compacted grammar, which is significantly smaller than that of the full treebank grammar, is shown to approach a limit. However, such a compacted grammar does not yield very good performance figures. A version of the compaction algorithm that takes rule probabilities into account is proposed, which is argued to be more linguistically motivated. Combined with simple thresholding, this method can be used to give a 58% reduction in grammar size without significant change in parsing performance, and can produce a 69% reduction with some gain in recall, but a loss in precision.
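
The redundancy test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`parses`, `compact`), the longest-rule-first removal order, the toy grammar, and the assumption that no rule has an empty right-hand side are all illustrative choices made here. A rule is treated as redundant when its right-hand side, read as a sequence of category labels, can be derived from its left-hand side using the remaining rules.

```python
def parses(rules, target, seq):
    """Return True if `seq` (a tuple of category labels) can be derived
    from `target` using only `rules`. Bottom-up chart recognition;
    assumes no rule has an empty right-hand side."""
    n = len(seq)
    unary = [(lhs, rhs[0]) for lhs, rhs in rules if len(rhs) == 1]
    longer = [(lhs, rhs) for lhs, rhs in rules if len(rhs) >= 2]
    chart = {}  # (start, end) -> set of categories deriving seq[start:end]

    def close_unary(cats):
        # Add every category reachable through chains of unary rules.
        changed = True
        while changed:
            changed = False
            for lhs, child in unary:
                if child in cats and lhs not in cats:
                    cats.add(lhs)
                    changed = True
        return cats

    def match(rhs, k, start, end):
        # Can rhs[k:] cover seq[start:end], one non-empty span per symbol?
        if k == len(rhs) - 1:
            return rhs[k] in chart[(start, end)]
        remaining = len(rhs) - k - 1
        for mid in range(start + 1, end - remaining + 1):
            if rhs[k] in chart[(start, mid)] and match(rhs, k + 1, mid, end):
                return True
        return False

    for width in range(1, n + 1):
        for start in range(n - width + 1):
            end = start + width
            # A single label trivially derives itself.
            cats = {seq[start]} if width == 1 else set()
            for lhs, rhs in longer:
                if len(rhs) <= width and match(rhs, 0, start, end):
                    cats.add(lhs)
            chart[(start, end)] = close_unary(cats)
    return target in chart[(0, n)]


def compact(rules):
    """Drop each rule whose right-hand side the remaining rules can parse.
    Visiting longer rules first is an assumption of this sketch."""
    kept = set(rules)
    for rule in sorted(rules, key=lambda r: (-len(r[1]), r)):
        lhs, rhs = rule
        if parses(kept - {rule}, lhs, rhs):
            kept.discard(rule)
    return kept


# Hypothetical toy grammar: each rule is (lhs, rhs-tuple).
toy_grammar = {
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "PP")),
    ("PP", ("IN", "NP")),
    ("NP", ("DT", "NN", "PP")),        # flat variant of NP -> NP PP
    ("NP", ("DT", "NN", "IN", "NP")),  # flatter still
}
compacted = compact(toy_grammar)
```

In this toy case the two flat NP rules are removed because the three shorter rules can already parse their right-hand sides. The probabilistic variant and the thresholding step mentioned in the abstract are not covered by this sketch.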