Towards a bank of constituent parse trees for Polish

  • Authors:
  • Marek Świdziński; Marcin Woliński

  • Affiliations:
  • Institute of Polish, Warsaw University; Institute of Computer Science, Polish Academy of Sciences

  • Venue:
  • TSD'10: Proceedings of the 13th International Conference on Text, Speech and Dialogue
  • Year:
  • 2010

Abstract

We present a project aimed at constructing a bank of constituent parse trees for 20,000 Polish sentences taken from the balanced, hand-annotated subcorpus of the National Corpus of Polish (NKJP). The treebank is to be obtained by automatic parsing followed by manual disambiguation of the resulting trees. The grammar used in the project is a new version of Świdziński's formal definition of Polish. Each sentence is disambiguated independently by two linguists and, if needed, adjudicated by a supervisor. Feedback from this process is used to iteratively improve the grammar. In the paper, we describe both the linguistic and the technical decisions made in the project. We discuss the overall shape of the parse trees, including the extent of the grammatical information they encode. We also examine syntactic disambiguation, which poses a challenge for the project.
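The double-annotation workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the `Decision` record, the function names, and the idea of identifying each candidate tree by an integer index are all hypothetical, and the sketch assumes that "if needed" means the two linguists chose different trees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """Hypothetical record of the choices made for one sentence."""
    sentence_id: str
    first_choice: int                        # tree index picked by the first linguist
    second_choice: int                       # tree index picked by the second linguist
    supervisor_choice: Optional[int] = None  # filled in only after adjudication


def needs_adjudication(decision: Decision) -> bool:
    # Assumption: adjudication is required exactly when the linguists disagree
    # and the supervisor has not yet ruled.
    return (decision.first_choice != decision.second_choice
            and decision.supervisor_choice is None)


def accepted_tree(decision: Decision) -> Optional[int]:
    """Return the accepted tree index, or None while adjudication is pending."""
    if decision.first_choice == decision.second_choice:
        return decision.first_choice
    return decision.supervisor_choice


agreed = Decision("s1", first_choice=2, second_choice=2)
pending = Decision("s2", first_choice=0, second_choice=3)
settled = Decision("s3", first_choice=0, second_choice=3, supervisor_choice=3)
```

In this sketch a sentence enters the treebank only once `accepted_tree` returns an index. Sentences for which it returns `None` remain in the supervisor's queue.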