In this paper we describe an ongoing treebanking effort for Urdu, a South Asian language. The treebank is built from a newspaper corpus and uses a Karaka-based grammatical framework inspired by Paninian grammatical theory. Thus far, 3366 sentences (0.1M words) have been annotated with linguistic information at the morpho-syntactic (morphological, part-of-speech and chunk information) and syntactico-semantic (dependency) levels. This work also aims to evaluate the correctness, or reliability, of this manually annotated dependency treebank. Evaluation is done by measuring inter-annotator agreement on a data set of 196 sentences (5600 words) manually annotated by two annotators. We present a qualitative analysis of the agreement statistics and identify possible reasons for disagreement between the annotators. We also show the syntactic annotation of some constructions specific to Urdu, such as Ezafe, and discuss the problem of word segmentation (tokenization).
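Inter-annotator agreement of the kind described above is commonly quantified with a chance-corrected measure such as Cohen's kappa over the dependency labels the two annotators assign to the same tokens. The sketch below is a minimal illustration of that computation; the label sequences are hypothetical Karaka-style tags invented for the example, not data from the treebank.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    observed  = fraction of items on which the annotators agree
    expected  = agreement expected by chance, from each annotator's
                marginal label distribution
    kappa     = (observed - expected) / (1 - expected)
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b.get(lab, 0)
                   for lab in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical Karaka-style dependency labels from two annotators
# (k1 = karta, k2 = karma, k4 = sampradana, k7 = adhikarana, pof = part-of)
ann1 = ["k1", "k2", "k1", "k7", "k2", "k1", "pof", "k2"]
ann2 = ["k1", "k2", "k4", "k7", "k2", "k1", "k2",  "k2"]
print(round(cohens_kappa(ann1, ann2), 3))  # prints 0.644
```

A kappa near 1 indicates near-perfect agreement beyond chance, while values around 0 mean the annotators agree no more often than chance; disagreement cases, as in the qualitative analysis above, can then be inspected label by label.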