Influence of treebank design on representation of multiword expressions

  • Authors:
  • Eduard Bejček;Pavel Straňák;Daniel Zeman

  • Affiliations:
  • Charles University in Prague, Institute of Formal and Applied Linguistics;Charles University in Prague, Institute of Formal and Applied Linguistics;Charles University in Prague, Institute of Formal and Applied Linguistics

  • Venue:
  • CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Multiword Expressions (MWEs) are important linguistic units that require special treatment in many NLP applications. It is thus desirable to be able to recognize them automatically. Semantically annotated corpora should mark MWEs in a clear way that facilitates development of automatic recognition tools. In the present paper we discuss various corpus design decisions from this perspective. We propose guidelines that should lead to MWE-friendly annotation and evaluate them on numerous sentence examples. Our experience of identifying MWEs in the Prague Dependency Treebank provides the base for the discussion and examples from other languages are added whenever appropriate.