A dependency treebank of Urdu and its evaluation

  • Authors:
  • Riyaz Ahmad Bhat; Dipti Misra Sharma

  • Affiliations:
  • LTRC, IIIT Hyderabad; LTRC, IIIT Hyderabad

  • Venue:
  • LAW VI '12 Proceedings of the Sixth Linguistic Annotation Workshop
  • Year:
  • 2012

Abstract

In this paper we describe an ongoing treebanking effort for Urdu, a South Asian language. The treebank is built from a newspaper corpus and uses a Karaka-based grammatical framework inspired by Paninian grammatical theory. So far, 3366 sentences (0.1M words) have been annotated with linguistic information at the morpho-syntactic level (morphological, part-of-speech and chunk information) and the syntactico-semantic level (dependency). This work also aims to evaluate the correctness or reliability of this manually annotated dependency treebank. Evaluation is done by measuring inter-annotator agreement on a data set of 196 sentences (5600 words) manually annotated by two annotators. We present a qualitative analysis of the agreement statistics and identify possible reasons for disagreement between the annotators. We also show the syntactic annotation of some constructions specific to Urdu, such as Ezafe, and discuss the problem of word segmentation (tokenization).
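
The abstract does not say which agreement statistic is used. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected measure of agreement between two annotators, over dependency labels assigned to the same tokens. The function name `cohen_kappa` and the toy label sequences are assumptions made for this example and are not taken from the treebank or the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a and labels_b are equal-length sequences; position i holds
    the label each annotator gave to item i (for example, the dependency
    relation assigned to token i).
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: share of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each draws from their own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): Karaka-style relation labels for six tokens.
annotator_a = ["k1", "k2", "k1", "k7p", "k2", "k1"]
annotator_b = ["k1", "k2", "k2", "k7p", "k2", "k1"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

In a dependency setting, agreement can be measured separately on attachments (heads) and on relation labels; this sketch covers only the label case, where both annotators label the same set of tokens.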