Analysis and development of Urdu POS tagged corpus

Authors:
Ahmed Muaz;Aasim Ali;Sarmad Hussain
Affiliations:
NUCES, Pakistan;NUCES, Pakistan;NUCES, Pakistan
Venue:
ALR7 Proceedings of the 7th Workshop on Asian Language Resources
Year:
2009

Citing 2

TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

Tagging Urdu text with parts of speech: a tagger comparison
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics

Cited 4

An Information-Extraction System for Urdu---A Resource-Poor Language
ACM Transactions on Asian Language Information Processing (TALIP)

Sentiment analysis of Urdu language: handling phrase-level negation
MICAI'11 Proceedings of the 10th Mexican international conference on Advances in Artificial Intelligence - Volume Part I

Analyzing Urdu social media for sentiments using transfer learning with controlled translations
LSM '12 Proceedings of the Second Workshop on Language in Social Media

Associating targets with SentiUnits: a step forward in sentiment analysis of Urdu text
Artificial Intelligence Review

Abstract

In this paper, two Urdu corpora (of 110K and 120K words), tagged with different POS tagsets, are used to train the TnT and Tree taggers. Error analysis of both taggers is carried out to identify frequent tagging confusions. Based on this analysis and on the syntactic structure of Urdu, a more refined tagset is derived. The existing tagged corpora are re-tagged with the new tagset to produce a single corpus of 230K words, and the TnT tagger is retrained. The results show that tagging accuracy improves to 94.2% for the individual corpora and to 91% for the merged corpus. Implications of these results are discussed.
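
The error analysis mentioned in the abstract, finding which tags a tagger most often confuses, can be illustrated with a minimal sketch. This is not the authors' procedure. The function names `tag_confusions` and `accuracy`, the toy tag sequences, and the tag labels are all hypothetical and are not taken from the paper or its tagset.

```python
from collections import Counter


def tag_confusions(gold, predicted):
    """Count (gold_tag, predicted_tag) pairs where the two tags differ."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical toy data: tag labels are illustrative, not the paper's tagset.
gold = ["NN", "VB", "ADJ", "NN", "PN", "NN"]
pred = ["NN", "VB", "NN", "NN", "NN", "ADJ"]

confusions = tag_confusions(gold, pred)
score = accuracy(gold, pred)

# List confusions from most to least frequent.
for (gold_tag, pred_tag), count in confusions.most_common():
    print(f"gold={gold_tag} predicted={pred_tag} count={count}")
print(f"accuracy={score:.3f}")
```

Ranking the confusion pairs by frequency is one simple way to see which tag distinctions a tagger struggles with, which is the kind of evidence that could motivate refining a tagset.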