Natural language grammar induction with a generative constituent-context model

  • Authors: Dan Klein; Christopher D. Manning

  • Affiliations: Computer Science Department, Stanford University, 353 Serra Mall, Room 418, Stanford, CA 94305-9040, USA (both authors)

  • Venue: Pattern Recognition
  • Year: 2005

Abstract

We present a generative probabilistic model for the unsupervised learning of hierarchical natural language syntactic structure. Unlike most previous work, we do not learn a context-free grammar, but rather induce a distributional model of constituents which explicitly relates constituent yields and their linear contexts. Parameter search with EM produces higher quality analyses for human language data than those previously exhibited by unsupervised systems, giving the best published unsupervised parsing results on the ATIS corpus. Experiments on Penn Treebank sentences of comparable length show an even higher constituent F1 of 71% on non-trivial brackets. We compare distributionally induced and actual part-of-speech tags as input data, and examine extensions to the basic model. We discuss errors made by the system, compare the system to previous models, and discuss upper bounds, lower bounds, and stability for this task.
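To make the yield/context idea concrete, here is a minimal brute-force sketch of an EM loop in that spirit: every non-trivial span of a POS sequence is scored by multinomials over its yield (the tag subsequence) and its linear context (the tags immediately to its left and right), as a constituent if it is bracketed and as a distituent otherwise. Everything below is an illustrative assumption rather than the authors' implementation: the toy corpus, the smoothing floor, and the exhaustive enumeration of binary bracketings (the paper uses an efficient dynamic program over much larger data).

```python
# Toy constituent-context-style EM: brute-force over binary bracketings.
from collections import defaultdict
import math

def bracketings(i, j):
    """Yield the span sets of all binary trees over positions [i, j)."""
    if j - i <= 1:
        yield frozenset()
        return
    for k in range(i + 1, j):
        for left in bracketings(i, k):
            for right in bracketings(k, j):
                yield frozenset({(i, j)}) | left | right

def span_features(tags, a, b):
    """Yield (tag subsequence) and context (left tag, right tag) of span [a, b)."""
    yld = " ".join(tags[a:b])
    left = tags[a - 1] if a > 0 else "<s>"
    right = tags[b] if b < len(tags) else "</s>"
    return yld, left + "_" + right

def log_score(tags, spans, p_yield, p_context):
    """Log-probability of one bracketing: each non-trivial span is scored as a
    constituent ("C") if bracketed, otherwise as a distituent ("D")."""
    logp, n = 0.0, len(tags)
    for a in range(n):
        for b in range(a + 2, n + 1):
            label = "C" if (a, b) in spans else "D"
            yld, ctx = span_features(tags, a, b)
            logp += math.log(p_yield[label].get(yld, 1e-6))   # ad-hoc floor
            logp += math.log(p_context[label].get(ctx, 1e-6))
    return logp

def normalize(counts):
    return {label: {k: v / (sum(tbl.values()) or 1.0) for k, v in tbl.items()}
            for label, tbl in counts.items()}

def em_step(corpus, p_yield, p_context):
    """One EM pass: posterior-weight all bracketings, re-estimate multinomials."""
    y_counts = {"C": defaultdict(float), "D": defaultdict(float)}
    c_counts = {"C": defaultdict(float), "D": defaultdict(float)}
    for tags in corpus:
        cands = list(bracketings(0, len(tags)))
        logs = [log_score(tags, s, p_yield, p_context) for s in cands]
        m = max(logs)
        weights = [math.exp(l - m) for l in logs]
        z = sum(weights)
        n = len(tags)
        for spans, w in zip(cands, weights):
            w /= z
            for a in range(n):
                for b in range(a + 2, n + 1):
                    label = "C" if (a, b) in spans else "D"
                    yld, ctx = span_features(tags, a, b)
                    y_counts[label][yld] += w
                    c_counts[label][ctx] += w
    return normalize(y_counts), normalize(c_counts)

# Hypothetical toy POS corpus; empty initial tables fall back to the smoothing
# floor, giving a uniform posterior over bracketings on the first pass.
corpus = [["DT", "NN", "VBD", "DT", "NN"],
          ["DT", "JJ", "NN", "VBD"],
          ["NNS", "VBD", "DT", "NN"]]
p_yield = {"C": {}, "D": {}}
p_context = {"C": {}, "D": {}}
for _ in range(10):
    p_yield, p_context = em_step(corpus, p_yield, p_context)
```

The exhaustive enumeration here grows with the Catalan numbers, so it only serves to show the shape of the E-step; a practical system would sum over bracketings with an inside-outside-style dynamic program rather than listing them.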