A study in Urdu corpus construction

Authors:
Dara Becker;Kashif Riaz
Affiliations:
University of St. Thomas, St. Paul, MN;University of Minnesota-Twin Cities, Minneapolis, MN
Venue:
COLING '02 Proceedings of the 3rd workshop on Asian language resources and international standardization - Volume 12
Year:
2002

Citing 0
Cited 4

Concept search in Urdu
Proceedings of the 2nd PhD workshop on Information and knowledge management

Baseline for Urdu IR evaluation
Proceedings of the 2nd ACM workshop on Improving non english web searching

Rule-based named entity recognition in Urdu
NEWS '10 Proceedings of the 2010 Named Entities Workshop

Challenges in Urdu stemming: a progress report
FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access

Abstract

We are interested in contributing a small, publicly available Urdu corpus of written text to the natural language processing community. The Urdu text is stored in the Unicode character set, in its native Arabic script, and marked up according to the Corpus Encoding Standard (CES) XML Document Type Definition (DTD). All the tags and metadata are in English. To date, the corpus consists entirely of data from the British Broadcasting Corporation's (BBC) Urdu Web site, although we plan to add data from other Urdu newspapers. Upon completion, the corpus will consist mostly of raw Urdu text marked up only to the paragraph level so it can be used as input for natural language processing (NLP) tasks. In addition, it will be hand-tagged for parts of speech so the data can be used to train and test NLP tools.
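To make the paragraph-level markup concrete, the following is a minimal sketch of how Unicode Urdu text might be wrapped in CES-style XML with English tag names. It is an illustration only, not the authors' tooling: the helper names `build_ces_like_document` and `serialize` are hypothetical, the element names (`cesDoc`, `cesHeader`, `text`, `body`, `p`) are assumed from the general shape of CES documents, the `lang` attribute value is an assumption, and the output is not validated against the CES DTD. The Urdu sentences are placeholder sample text, not corpus data.

```python
import xml.etree.ElementTree as ET


def build_ces_like_document(title, paragraphs, lang="ur"):
    """Wrap plain Urdu paragraphs in a CES-style XML tree (hypothetical helper).

    Element names are assumed from the general shape of CES documents;
    nothing here is checked against the actual CES DTD.
    """
    root = ET.Element("cesDoc")

    # English-language metadata lives in the header.
    header = ET.SubElement(root, "cesHeader")
    ET.SubElement(header, "title").text = title

    # The text itself is marked up only to the paragraph level.
    text = ET.SubElement(root, "text", {"lang": lang})
    body = ET.SubElement(text, "body")
    for para in paragraphs:
        para = para.strip()
        if para:  # skip empty paragraphs
            ET.SubElement(body, "p").text = para
    return root


def serialize(root):
    """Return the tree as a Unicode string; Urdu characters are kept as-is."""
    return ET.tostring(root, encoding="unicode")


# Placeholder sample text (not taken from the corpus).
sample_paragraphs = ["یہ ایک مثال ہے۔", "   ", "یہ دوسرا پیراگراف ہے۔"]
document = build_ces_like_document("Sample news article", sample_paragraphs)
xml_string = serialize(document)
```

Serializing with `encoding="unicode"` keeps the Arabic-script characters literal rather than escaping them as numeric character references, which matches the abstract's description of storing the text in Unicode in its native script.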