From Baby Steps to Leapfrog: How "Less is More" in Unsupervised Dependency Parsing

  • Authors:
  • Valentin I. Spitkovsky; Hiyan Alshawi; Daniel Jurafsky

  • Affiliations:
  • Stanford University and Google Inc.; Google Inc., Mountain View, CA; Stanford University, Stanford, CA

  • Venue:
  • HLT '10: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
  • Year:
  • 2010

Abstract

We present three approaches for unsupervised grammar induction that are sensitive to data complexity and apply them to Klein and Manning's Dependency Model with Valence. The first, Baby Steps, bootstraps itself via iterated learning of increasingly longer sentences and requires no initialization. This method substantially exceeds Klein and Manning's published scores and achieves 39.4% accuracy on Section 23 (all sentences) of the Wall Street Journal corpus. The second, Less is More, uses a low-complexity subset of the available data: sentences up to length 15. Focusing on fewer but simpler examples trades off quantity against ambiguity; it attains 44.1% accuracy, using the standard linguistically informed prior and batch training, beating the state of the art. Leapfrog, our third heuristic, combines Less is More with Baby Steps by mixing their models of shorter sentences, then rapidly ramping up exposure to the full training set, driving accuracy up to 45.0%. These trends generalize to the Brown corpus; awareness of data complexity may improve other parsing models and unsupervised algorithms.
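
The abstract describes the three training regimes concretely enough that a short outline can make the curriculum logic explicit. Below is a minimal Python sketch under stated assumptions: `em_train`, `uniform_model`, `harmonic_init`, and `mix_models` are hypothetical stand-ins for the DMV trainer, an uninformed initializer, Klein and Manning's linguistically informed prior, and model interpolation; the length-15 cutoff comes from the abstract, while the length-45 ceiling and the ramp-up step size are illustrative choices, not the authors' exact schedule.

```python
"""Sketch of the three curriculum strategies described in the abstract.

The DMV trainer is abstracted away: em_train, uniform_model,
harmonic_init, and mix_models are hypothetical placeholders,
not the authors' implementation.
"""

def em_train(model, sentences):
    """Placeholder: run EM for the DMV on `sentences`, seeded by `model`."""
    raise NotImplementedError

def uniform_model():
    """Placeholder: uninformed (uniform) DMV parameters."""
    raise NotImplementedError

def harmonic_init(sentences):
    """Placeholder: Klein and Manning's linguistically informed prior."""
    raise NotImplementedError

def mix_models(model_a, model_b):
    """Placeholder: interpolate two sets of DMV parameters."""
    raise NotImplementedError

def up_to(corpus, k):
    """The low-complexity subset: sentences of at most k tokens."""
    return [s for s in corpus if len(s) <= k]

def baby_steps(corpus, max_len=45):
    """Iterated learning on increasingly longer sentences; no special init."""
    model = uniform_model()
    for k in range(1, max_len + 1):
        model = em_train(model, up_to(corpus, k))
    return model

def less_is_more(corpus, cutoff=15):
    """One batch EM run on short sentences, seeded with the standard prior."""
    subset = up_to(corpus, cutoff)
    return em_train(harmonic_init(subset), subset)

def leapfrog(corpus, cutoff=15, max_len=45, step=15):
    """Mix the two short-sentence models, then ramp up in coarse jumps."""
    model = mix_models(
        baby_steps(corpus, max_len=cutoff),
        less_is_more(corpus, cutoff=cutoff),
    )
    for k in range(cutoff + step, max_len + 1, step):
        model = em_train(model, up_to(corpus, k))
    return model
```

The design point the sketch highlights is the schedule: Baby Steps anneals complexity one token at a time, Less is More stops at a fixed cutoff, and Leapfrog pays the cost of gradual annealing only up to that cutoff before exposing the model to the full training set in larger steps.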