Using an annotated language corpus as a virtual stochastic grammar

Authors:
Rens Bod
Affiliations:
Department of Computational Linguistics, University of Amsterdam, Amsterdam
Venue:
AAAI'93 Proceedings of the eleventh national conference on Artificial intelligence
Year:
1993

Citing 6
Cited 1

The ATIS spoken language systems pilot corpus

HLT '90 Proceedings of the workshop on Speech and Natural Language
Inside-outside reestimation from partially bracketed corpora

ACL '92 Proceedings of the 30th annual meeting on Association for Computational Linguistics
Probabilistic tree-adjoining grammar as a framework for statistical natural language processing

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
Stochastic lexicalized tree-adjoining grammars

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
A computational model of language performance: Data Oriented Parsing

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 3
Very large annotated database of American English

HLT '91 Proceedings of the workshop on Speech and Natural Language

Inducing Tree-Substitution Grammars

The Journal of Machine Learning Research

Quantified Score

Hi-index	0.00

Visualization

Abstract

In Data Oriented Parsing (DOP), an annotated language corpus is used as a virtual stochastic grammar. An input string is parsed by combining subtrees from the corpus. As a consequence, one parse tree can usually be generated by several derivations that involve different subtrees. This leads to a statistics where the probability of a parse is equal to the sum of the probabilities of all its derivations. In (Scha, 1990) an informal introduction to DOP is given, while (Bod, 1992) provides a formalization of the theory. In this paper we show that the maximum probability parse can be estimated in polynomial time by applying Monte Carlo techniques. The model was tested on a set of hand-parsed strings from the Air Travel Information System (ATIS) corpus. Preliminary experiments yield 96% test set parsing accuracy.