A study in Urdu corpus construction

  • Authors:
  • Dara Becker; Kashif Riaz

  • Affiliations:
  • University of St. Thomas, St. Paul, MN; University of Minnesota-Twin Cities, Minneapolis, MN

  • Venue:
  • COLING '02 Proceedings of the 3rd workshop on Asian language resources and international standardization - Volume 12
  • Year:
  • 2002

Abstract

We are interested in contributing a small, publicly available Urdu corpus of written text to the natural language processing (NLP) community. The Urdu text is stored in the Unicode character set, in its native Arabic script, and marked up according to the Corpus Encoding Standard (CES) XML Document Type Definition (DTD). All tags and metadata are in English. To date, the corpus consists entirely of data from the British Broadcasting Corporation's (BBC) Urdu Web site, although we plan to add data from other Urdu newspapers. Upon completion, the corpus will consist mostly of raw Urdu text marked up only to the paragraph level so that it can be used as input for NLP tasks. In addition, it will be hand-tagged for parts of speech so that the data can be used to train and test NLP tools.
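
To make the paragraph-level markup concrete, the following is a minimal sketch of how Unicode Urdu text might be wrapped in CES-style XML with English tag names. It is an illustration only, not the authors' tooling: the function name `build_document` is hypothetical, the element layout (`cesDoc`, `cesHeader`, `text`, `body`, `p`) is a simplified assumption that has not been validated against the CES DTD, and the two Urdu sentences are made-up sample text rather than corpus data.

```python
import xml.etree.ElementTree as ET


def build_document(title, paragraphs):
    """Wrap Unicode paragraphs in a simplified, CES-style XML document.

    Assumption: the element names below mirror the general shape of a
    CES document but are NOT checked against the actual CES DTD.
    Returns the serialized document as UTF-8 encoded bytes.
    """
    root = ET.Element("cesDoc")

    # Metadata is kept in English, separate from the Urdu text.
    header = ET.SubElement(root, "cesHeader")
    ET.SubElement(header, "title").text = title

    # The text itself is marked up only to the paragraph level.
    text = ET.SubElement(root, "text")
    body = ET.SubElement(text, "body")
    for paragraph in paragraphs:
        ET.SubElement(body, "p").text = paragraph

    return ET.tostring(root, encoding="utf-8")


# Made-up sample sentences, not taken from the corpus.
sample_paragraphs = ["یہ ایک مثال ہے", "اردو ایک زبان ہے"]
document = build_document("Sample news article", sample_paragraphs)

# Reading the document back recovers the raw paragraphs for NLP input.
parsed = ET.fromstring(document)
recovered = [p.text for p in parsed.findall("./text/body/p")]
```

Because the Urdu text is stored as ordinary Unicode strings and serialized as UTF-8, the same paragraphs come back unchanged when the file is parsed, which is what lets the raw text serve directly as input to downstream tools.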