What are the productive units of natural language grammar? A DOP approach to the automatic identification of constructions

  • Authors:
  • Willem Zuidema

  • Affiliations:
  • University of Amsterdam, Amsterdam, The Netherlands

  • Venue:
  • CoNLL-X '06 Proceedings of the Tenth Conference on Computational Natural Language Learning
  • Year:
  • 2006

Abstract

We explore a novel computational approach to identifying "constructions" or "multi-word expressions" (MWEs) in an annotated corpus. In this approach, MWEs have no special status, but emerge from a general procedure for finding the best statistical grammar to describe the training corpus. The statistical grammar formalism used is that of stochastic tree substitution grammars (STSGs), as used in Data-Oriented Parsing. We present an algorithm for calculating the expected frequencies of arbitrary subtrees given the parameters of an STSG, and a method for estimating the parameters of an STSG given observed frequencies in a treebank. We report quantitative results on the ATIS corpus of phrase-structure annotated sentences, and give examples of the MWEs extracted from this corpus.
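For readers unfamiliar with STSGs, the following minimal sketch illustrates the standard definition of a parse tree's probability under such a grammar: the sum, over all derivations, of the product of the weights of the elementary trees used. It is not the paper's expected-frequency algorithm or its estimator. The tuple-based tree encoding, the function names `match` and `tree_probability`, and the toy grammar and its weights are all assumptions made for illustration.

```python
from math import prod  # Python 3.8+

# Trees are nested tuples: (label, child, ...). A bare string is a word.
# A one-element tuple such as ("NP",) marks a substitution site in a fragment.

def match(fragment, node):
    """Return the tree nodes under the fragment's substitution sites,
    or None if the fragment does not fit at this node."""
    if isinstance(fragment, str) or isinstance(node, str):
        return [] if fragment == node else None
    if fragment[0] != node[0]:
        return None
    if len(fragment) == 1:          # substitution site: defer to another fragment
        return [node]
    if len(fragment) != len(node):  # different number of children
        return None
    sites = []
    for f_child, n_child in zip(fragment[1:], node[1:]):
        sub = match(f_child, n_child)
        if sub is None:
            return None
        sites.extend(sub)
    return sites

def tree_probability(node, grammar):
    """Sum over all derivations of the product of fragment weights."""
    total = 0.0
    for fragment, weight in grammar:
        sites = match(fragment, node)
        if sites is not None:
            total += weight * prod(tree_probability(s, grammar) for s in sites)
    return total

# Hypothetical toy grammar; weights sum to 1 for each root label.
grammar = [
    (("S", ("NP",), ("VP",)), 0.7),     # ordinary rule-sized fragment
    (("S", ("NP", "I"), ("VP",)), 0.3), # larger, partly lexicalized fragment
    (("NP", "I"), 0.6),
    (("NP", "flights"), 0.4),
    (("VP", "left"), 1.0),
]

tree = ("S", ("NP", "I"), ("VP", "left"))
p = tree_probability(tree, grammar)
```

In this toy grammar the tree has two derivations, one built from three small fragments and one that uses the larger fragment with "I" already attached, and both contribute to `p`. Larger fragments that carry substantial weight of this kind are what the abstract refers to as candidate constructions.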