Improving generative statistical parsing with semi-supervised word clustering

  • Authors:
  • Marie Candito;Benoît Crabbé

  • Affiliations:
  • Université Paris/INRIA (Alpage), Paris;Université Paris/INRIA (Alpage), Paris

  • Venue:
  • IWPT '09 Proceedings of the 11th International Conference on Parsing Technologies
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present a semi-supervised method to improve statistical parsing performance. We focus on the well-known problem of lexical data sparseness and present experiments of word clustering prior to parsing. We use a combination of lexicon-aided morphological clustering that preserves tagging ambiguity, and unsupervised word clustering, trained on a large unannotated corpus. We apply these clusterings to the French Treebank, and we train a parser with the PCFG-LA unlexicalized algorithm of (Petrov et al., 2006). We find a gain in French parsing performance: from a baseline of F1=86.76% to F1=87.37% using morphological clustering, and up to F1=88.29% using further unsupervised clustering. This is the best known score for French probabilistic parsing. These preliminary results are encouraging for statistically parsing morphologically rich languages, and languages with small amount of annotated data.