Urdu word segmentation

  • Authors:
  • Nadir Durrani; Sarmad Hussain

  • Affiliations:
  • Universität Stuttgart; National University of Computer and Emerging Sciences

  • Venue:
  • HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
  • Year:
  • 2010

Abstract

Word segmentation is the first obligatory task in almost all NLP applications, whose initial phase requires tokenizing the input into words. Urdu is among the Asian languages that face the word segmentation challenge. Unlike other Asian languages, however, word segmentation in Urdu suffers not only from space omission errors but also from space insertion errors. This paper discusses how orthographic and linguistic features of Urdu trigger these two problems, and reviews prior work on tokenizing input text. We employ a hybrid solution that performs n-gram ranking on top of a rule-based maximum-matching heuristic. Our best technique achieves an error-detection rate of 85.8% and an overall accuracy of 95.8%. Remaining issues and possible future directions are also discussed.
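The hybrid idea described in the abstract, lexicon-driven candidate generation followed by n-gram ranking, can be sketched as follows. This is not the authors' implementation: the lexicon and counts are toy placeholders (a real system would use an Urdu lexicon and corpus statistics), candidates are enumerated exhaustively in place of the paper's maximum-matching heuristic, and a simple smoothed unigram model stands in for the paper's n-gram ranking.

```python
# Minimal sketch of segment-then-rank word segmentation.
# LEXICON and UNIGRAM_COUNTS are hypothetical toy data, not from the paper.
from math import log

LEXICON = {"kitab", "ghar", "ka", "kitabghar"}            # toy word list
UNIGRAM_COUNTS = {"kitab": 50, "ghar": 40, "ka": 200, "kitabghar": 5}
TOTAL = sum(UNIGRAM_COUNTS.values())

def segmentations(text, max_len=10):
    """Enumerate all ways to split `text` into lexicon words."""
    if not text:
        return [[]]
    results = []
    for i in range(1, min(max_len, len(text)) + 1):
        word = text[:i]
        if word in LEXICON:
            for rest in segmentations(text[i:], max_len):
                results.append([word] + rest)
    return results

def score(words):
    """Log-probability under an add-one-smoothed unigram model."""
    return sum(log((UNIGRAM_COUNTS.get(w, 0) + 1) / (TOTAL + len(LEXICON)))
               for w in words)

def segment(text):
    """Pick the highest-scoring candidate; fall back to the raw string."""
    cands = segmentations(text)
    return max(cands, key=score) if cands else [text]

print(segment("kitabgharka"))  # → ['kitab', 'ghar', 'ka']
```

With the toy counts above, the frequent words "ghar" and "ka" outscore the rarer compound "kitabghar", so the ranker resolves the ambiguity in favor of the three-word split; this illustrates how frequency evidence disambiguates space omission, where spaces between Urdu words are missing from the input.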