Lemmatization and lexicalized statistical parsing of morphologically rich languages: the case of French

  • Authors:
  • Djamé Seddah;Grzegorz Chrupała;Özlem Çetinoğlu;Josef van Genabith;Marie Candito

  • Affiliations:
  • Alpage Inria & Univ. Paris-Sorbonne, Paris, France;Saarland Univ., Saarbrücken, Germany;Dublin City Univ., Dublin, Ireland;Dublin City Univ., Dublin, Ireland;Alpage Inria & Univ., Paris, France

  • Venue:
  • SPMRL '10 Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages
  • Year:
  • 2010

Abstract

This paper shows that training a lexicalized parser on a lemmatized morphologically-rich treebank such as the French Treebank slightly improves parsing results. We also show that lemmatizing a similarly sized subset of the English Penn Treebank has almost no effect on parsing performance with gold lemmas and leads to a small drop in performance when automatically assigned lemmas and POS tags are used. This highlights two facts: (i) lemmatization helps to reduce lexicon data-sparseness issues for French, and (ii) it also makes the parsing process sensitive to the correct assignment of POS tags to unknown words.
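The data-sparseness point in (i) can be illustrated with a minimal sketch. It is not the authors' pipeline: the lemma table, the toy sentences, and the helper names `lemmatize` and `lexicon_stats` are all hypothetical, chosen only to show how mapping inflected forms to lemmas shrinks the lexicon and lowers the share of test tokens unseen in training.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption, not data from the paper):
# inflected French word forms mapped to a single lemma.
TOY_LEMMAS = {
    "mange": "manger", "mangent": "manger", "mangeons": "manger",
    "chat": "chat", "chats": "chat",
    "le": "le", "la": "le", "les": "le",
    "souris": "souris",
}

def lemmatize(tokens, table):
    """Replace each token with its lemma; tokens missing from the table are kept as-is."""
    return [table.get(tok, tok) for tok in tokens]

def lexicon_stats(train, test):
    """Return (number of distinct training tokens, fraction of test tokens unseen in training)."""
    vocab = Counter(train)
    unseen = sum(1 for tok in test if tok not in vocab)
    return len(vocab), unseen / len(test)

# Toy "training" and "test" token streams (illustrative only).
train = "le chat mange la souris les chats mangent".split()
test = "nous mangeons les souris".split()

raw_size, raw_oov = lexicon_stats(train, test)
lem_size, lem_oov = lexicon_stats(
    lemmatize(train, TOY_LEMMAS), lemmatize(test, TOY_LEMMAS)
)

print(f"word forms: lexicon={raw_size}, unseen rate={raw_oov:.2f}")
print(f"lemmas:     lexicon={lem_size}, unseen rate={lem_oov:.2f}")
```

In this toy setting the unseen form "mangeons" collapses onto the lemma "manger", which does occur in training, while "nous" stays unknown in both conditions. Point (ii) concerns exactly such remaining unknown words, whose POS tags must then be assigned correctly.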