An Information-Extraction System for Urdu---A Resource-Poor Language

  • Authors:
  • Smruthi Mukund;Rohini Srihari;Erik Peterson

  • Affiliations:
  • State University of New York at Buffalo;State University of New York at Buffalo;Janya, Inc.

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

There has been an increase in the amount of multilingual text on the Internet due to the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as identifying topics, relationships and events, and sentiment analysis, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP includes analyzing data sources such as blogs and comments to news articles to provide insight into social and human behavior. All of this requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies have either met or exceeded the state of the art.