Parsing word clusters

  • Authors:
  • Marie Candito;Djamé Seddah

  • Affiliations:
  • Alpage (Université Paris), Paris, France;Alpage (Université Paris), Paris, France and Université Paris-Sorbonne, Paris, France

  • Venue:
  • SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present and discuss experiments in statistical parsing of French, where terminal forms used during training and parsing are replaced by more general symbols, particularly clusters of words obtained through un-supervised linear clustering. We build on the work of Candito and Crabbé (2009) who proposed to use clusters built over slightly coarsened French inflected forms. We investigate the alternative method of building clusters over lemma/part-of-speech pairs, using a raw corpus automatically tagged and lemmatized. We find that both methods lead to comparable improvement over the baseline (we obtain F1=86.20% and F1=86.21% respectively, compared to a baseline of F1=84.10%). Yet, when we replace gold lemma/POS pairs with their corresponding cluster, we obtain an upper bound (F1=87.80) that suggests room for improvement for this technique, should tag-ging/lemmatisation performance increase for French. We also analyze the improvement in performance for both techniques with respect to word frequency. We find that replacing word forms with clusters improves attachment performance for words that are originally either unknown or low-frequency, since these words are replaced by cluster symbols that tend to have higher frequencies. Furthermore, clustering also helps significantly for medium to high frequency words, suggesting that training on word clusters leads to better probability estimates for these words.