Building a hierarchical annotated corpus of Urdu: the URDU.KON-TB treebank

Authors:
Qaiser Abbas
Affiliations:
Department of Linguistics, University of Konstanz, Konstanz, Germany
Venue:
CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Year:
2012

References:
Partial parsing: a report on work in progress. HLT '91 Proceedings of the workshop on Speech and Natural Language
Studies in part of speech labelling. HLT '91 Proceedings of the workshop on Speech and Natural Language
Deducing linguistic structure from the statistics of large corpora. HLT '90 Proceedings of the workshop on Speech and Natural Language
Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics - Special issue on using large corpora: II
An annotation scheme for free word order languages. ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Discovering the lexical features of a language. ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics
Inside-outside reestimation from partially bracketed corpora. ACL '92 Proceedings of the 30th annual meeting on Association for Computational Linguistics
Probabilistic parse scoring based on prosodic phrasing. HLT '91 Proceedings of the workshop on Speech and Natural Language
Tagging Urdu text with parts of speech: a tagger comparison. EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Parsing a natural language using mutual information statistics. AAAI'90 Proceedings of the eighth National conference on Artificial intelligence - Volume 2

Abstract

This work aims to develop a representative treebank for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and the development of a reliable treebank for Urdu will have a significant impact on the state of the art in Urdu language processing. For the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset have been proposed. The construction of the treebank is based on an existing corpus of 19 million words for the Urdu language. Part-of-speech (POS) tagging and annotation of a selected set of sentences from different sub-domains of this corpus are being carried out manually, and the work completed to date is presented here. The hierarchical annotation scheme we adopted combines a phrase structure (PS) and a hybrid dependency structure (HDS).