Tagging Urdu text with parts of speech: a tagger comparison

Authors:
Hassan Sajjad;Helmut Schmid
Affiliations:
Universität Stuttgart, Stuttgart, Germany;Universität Stuttgart, Stuttgart, Germany
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009

Citing 7

A Computational Approach to Grammatical Coding of English Words
Journal of the ACM (JACM)

Building a large annotated corpus of English: the Penn Treebank
Computational Linguistics - Special issue on using large corpora: II

TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

A stochastic parts program and noun phrase parser for unrestricted text
ANLC '88 Proceedings of the second conference on Applied natural language processing

Acquiring disambiguation rules from text
ACL '89 Proceedings of the 27th annual meeting on Association for Computational Linguistics

Part-of-speech tagging with neural networks
COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1

Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1

Cited 3

Analysis and development of Urdu POS tagged corpus
ALR7 Proceedings of the 7th Workshop on Asian Language Resources

An Information-Extraction System for Urdu---A Resource-Poor Language
ACM Transactions on Asian Language Information Processing (TALIP)

Building a hierarchical annotated corpus of Urdu: the URDU.KON-TB treebank
CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I

Abstract

In this paper, four state-of-the-art probabilistic taggers, namely the TnT tagger, TreeTagger, the RF tagger and the SVM tool, are applied to the Urdu language. For the purpose of the experiment, a syntactic tagset is proposed. A training corpus of 100,000 tokens is used to train the models. Using the lexicon extracted from the training corpus, the SVM tool shows the best accuracy, 94.15%. When a separate lexicon of 70,568 types is provided, the SVM tool again shows the best accuracy, 95.66%.