What are the productive units of natural language grammar?: a DOP approach to the automatic identification of constructions

Authors:
Willem Zuidema
Affiliations:
University of Amsterdam, Amsterdam, The Netherlands
Venue:
CoNLL-X '06 Proceedings of the Tenth Conference on Computational Natural Language Learning
Year:
2006

Citing 10
Cited 1

The ATIS spoken language systems pilot corpus

HLT '90 Proceedings of the workshop on Speech and Natural Language
Multiword Expressions: A Pain in the Neck for NLP

CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Data-Oriented Parsing

Data-Oriented Parsing
Using an annotated corpus as a stochastic grammar

EACL '93 Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics
Probabilistic tree-adjoining grammar as a framework for statistical natural language processing

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
Stochastic lexicalized tree-adjoining grammars

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
An efficient implementation of a new DOP model

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Efficient parsing of highly ambiguous context-free grammars with bit vectors

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Theoretical evaluation of estimation methods for data-oriented parsing

EACL '06 Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations

An all-subtrees approach to unsupervised parsing

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics

Quantified Score

Hi-index	0.00

Visualization

Abstract

We explore a novel computational approach to identifying "constructions" or "multi-word expressions" (MWEs) in an annotated corpus. In this approach, MWEs have no special status, but emerge in a general procedure for finding the best statistical grammar to describe the training corpus. The statistical grammar formalism used is that of stochastic tree substitution grammars (STSGs), such as used in Data-Oriented Parsing. We present an algorithm for calculating the expected frequencies of arbitrary subtrees given the parameters of an STSG, and a method for estimating the parameters of an STSG given observed frequencies in a tree bank. We report quantitative results on the ATIS corpus of phrase-structure annotated sentences, and give examples of the MWEs extracted from this corpus.