A dependency treebank of Urdu and its evaluation

Authors:
Riyaz Ahmad Bhat;Dipti Misra Sharma
Affiliations:
LTRC, IIIT Hyderabad;LTRC, IIIT Hyderabad
Venue:
LAW VI '12 Proceedings of the Sixth Linguistic Annotation Workshop
Year:
2012


Abstract

In this paper we describe an ongoing treebanking effort for Urdu, a South Asian language. The treebank is built from a newspaper corpus and uses a Karaka-based grammatical framework inspired by Paninian grammatical theory. So far, 3366 sentences (0.1M words) have been annotated with linguistic information at the morpho-syntactic level (morphological, part-of-speech and chunk information) and the syntactico-semantic level (dependencies). This work also aims to evaluate the correctness and reliability of this manually annotated dependency treebank. The evaluation measures inter-annotator agreement on a data set of 196 sentences (5600 words) annotated manually by two annotators. We present a qualitative analysis of the agreement statistics and identify possible reasons for the disagreements between the annotators. We also show the syntactic annotation of some constructions specific to Urdu, such as Ezafe, and discuss the problem of word segmentation (tokenization).
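The abstract does not say which agreement statistic is used. As a minimal sketch of how agreement between two annotators could be measured on dependency labels, the example below computes Cohen's kappa, one common choice. The function name `cohen_kappa` and the toy label sequences are assumptions made for illustration only; they are not data or code from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper for illustration; the paper's own agreement
    measure is not stated in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of each
    # annotator's marginal label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy dependency labels for six tokens (illustrative values, not paper data).
annotator_a = ["k1", "k2", "k1", "k7", "k2", "k1"]
annotator_b = ["k1", "k2", "k2", "k7", "k2", "k1"]

print(round(cohen_kappa(annotator_a, annotator_b), 4))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few dependency labels dominate the data. For a treebank, attachment (head) agreement would be scored separately from label agreement.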