NE tagging for Urdu based on bootstrap POS learning

  • Authors:
  • Smruthi Mukund;Rohini K. Srihari

  • Affiliations:
  • University at Buffalo, SUNY, Amherst, NY;University at Buffalo, SUNY, Amherst, NY

  • Venue:
  • CLIAWS3 '09 Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies
  • Year:
  • 2009

Abstract

Part-of-Speech (POS) tagging and Named Entity (NE) tagging have become important components of effective text analysis. In this paper, we propose a bootstrapped model that involves four levels of text processing for Urdu. We show that increasing the training data for POS learning by applying bootstrapping techniques improves NE tagging results. Our model overcomes the limitation imposed by the scarcity of ground-truth data needed to train a learning model. Both our POS tagging and NE tagging models are based on the Conditional Random Field (CRF) learning approach. To further improve performance, grammar rules and lexicon lookups are applied to the final output to correct any spurious tag assignments. We also propose a model for word boundary segmentation in which a bigram hidden Markov model (HMM) is trained on character transitions across all positions in each word. The generated words are further processed using a probabilistic language model. All models use a hybrid approach that combines statistical models with hand-crafted grammar rules.
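
The bootstrapping idea in the abstract (growing the POS training set with automatically tagged text) can be illustrated with a minimal self-training sketch. This is not the authors' implementation: a most-frequent-tag baseline stands in for the CRF tagger, and the function names, the confidence threshold, the default tag, and the English toy data are all assumptions made for illustration. With such a simple stand-in, only sentences made entirely of known words are ever accepted; a contextual tagger such as a CRF could also generalize to unseen words.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count how often each word receives each tag (stand-in for CRF training)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts


def tag_with_confidence(counts, words, default_tag="NN"):
    """Tag each word with its most frequent tag.

    Confidence is the tag's relative frequency for that word,
    or 0.0 for words never seen in training.
    """
    tagged, confidences = [], []
    for word in words:
        if word in counts:
            tag, freq = counts[word].most_common(1)[0]
            conf = freq / sum(counts[word].values())
        else:
            tag, conf = default_tag, 0.0
        tagged.append((word, tag))
        confidences.append(conf)
    return tagged, confidences


def bootstrap(seed, unlabeled, threshold=0.9, max_rounds=5):
    """Self-training loop: retrain, tag the pool, keep confident sentences."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        remaining, added = [], 0
        for words in pool:
            tagged, confs = tag_with_confidence(model, words)
            # Accept a sentence only if every token is tagged confidently.
            if confs and min(confs) >= threshold:
                labeled.append(tagged)
                added += 1
            else:
                remaining.append(words)
        pool = remaining
        if added == 0:  # nothing new was learned, so stop early
            break
    return train(labeled), labeled


# Toy data (hypothetical, English for readability).
seed = [[("a", "DT"), ("dog", "NN")], [("a", "DT"), ("cat", "NN")]]
unlabeled = [["a", "cat"], ["a", "bird"]]
model, labeled = bootstrap(seed, unlabeled)
```

In this toy run, the sentence made only of known words is added to the training set, while the one containing an unseen word stays in the unlabeled pool. The threshold controls the trade-off between adding more data and adding noisy tags.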