Unsupervised dependency parsing without gold part-of-speech tags

  • Authors:
  • Valentin I. Spitkovsky, Hiyan Alshawi, Angel X. Chang, Daniel Jurafsky

  • Affiliations:
  • Valentin I. Spitkovsky: Stanford University, Stanford, CA, and Google Research, Google Inc., Mountain View, CA; Hiyan Alshawi: Google Research, Google Inc., Mountain View, CA; Angel X. Chang: Stanford University, Stanford, CA, and Google Research, Google Inc., Mountain View, CA; Daniel Jurafsky: Stanford University, Stanford, CA

  • Venue:
  • EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2011

Abstract

We show that categories induced by unsupervised word clustering can surpass the performance of gold part-of-speech tags in dependency grammar induction. Unlike classic clustering algorithms, our method allows a word to have different tags in different contexts. In an ablative analysis, we first demonstrate that this context-dependence is crucial to the superior performance of gold tags --- requiring a word to always have the same part-of-speech significantly degrades the performance of manual tags in grammar induction, eliminating the advantage that human annotation has over unsupervised tags. We then introduce a sequence modeling technique that combines the output of a word clustering algorithm with context-colored noise, to allow words to be tagged differently in different contexts. With these new induced tags as input, our state-of-the-art dependency grammar inducer achieves 59.1% directed accuracy on Section 23 (all sentences) of the Wall Street Journal (WSJ) corpus --- 0.7% higher than using gold tags.
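The key idea in the abstract is that the same word type should be allowed different induced tags in different contexts, which a plain one-cluster-per-word-type scheme forbids. A minimal way to see this effect (a toy sketch, not the paper's actual sequence modeling technique; all states, words, and probabilities below are made up for illustration) is a two-state HMM whose Viterbi decode tags a word by its context:

```python
# Toy illustration (NOT the paper's method): a two-state HMM over induced
# clusters C0/C1. Because decoding depends on neighboring words, the word
# "run" can receive different tags in different contexts, unlike a
# one-tag-per-word-type clustering.
import math

STATES = ["C0", "C1"]  # induced word clusters standing in for POS tags

# Hand-set toy parameters in log space; all values are illustrative.
start = {"C0": math.log(0.5), "C1": math.log(0.5)}
trans = {
    ("C0", "C0"): math.log(0.2), ("C0", "C1"): math.log(0.8),
    ("C1", "C0"): math.log(0.7), ("C1", "C1"): math.log(0.3),
}
emit = {
    ("C0", "the"): math.log(0.4), ("C0", "they"): math.log(0.1),
    ("C0", "run"): math.log(0.5),
    ("C1", "the"): math.log(0.1), ("C1", "they"): math.log(0.4),
    ("C1", "run"): math.log(0.5),
}

def viterbi(words):
    """Return the most probable state (tag) sequence for `words`."""
    # Initialize with start * emission scores for the first word.
    V = [{s: start[s] + emit[(s, words[0])] for s in STATES}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for reaching `s` at this position.
            prev = max(STATES, key=lambda p: V[-1][p] + trans[(p, s)])
            col[s] = V[-1][prev] + trans[(prev, s)] + emit[(s, w)]
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Backtrace from the best final state.
    last = max(STATES, key=lambda s: V[-1][s])
    tags = [last]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return list(reversed(tags))

print(viterbi(["the", "run"]))   # "run" after "the"
print(viterbi(["they", "run"]))  # "run" after "they" gets a different tag
```

Here "run" is emitted equally well by both states, so its tag is decided entirely by context, which is the property the abstract argues gold tags owe their advantage to.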