NE tagging for Urdu based on bootstrap POS learning

Authors:
Smruthi Mukund;Rohini K. Srihari
Affiliations:
University at Buffalo, SUNY Amherst, NY;University at Buffalo, SUNY Amherst, NY
Venue:
CLIAWS3 '09 Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies
Year:
2009

Abstract

Part of Speech (POS) tagging and Named Entity (NE) tagging have become important components of effective text analysis. In this paper, we propose a bootstrapped model that involves four levels of text processing for Urdu. We show that increasing the training data for POS learning through bootstrapping improves NE tagging results. Our model overcomes the limitation imposed by the scarcity of ground truth data needed to train a learning model. Both our POS tagging and NE tagging models are based on the Conditional Random Field (CRF) learning approach. To further enhance performance, grammar rules and lexicon lookups are applied to the final output to correct spurious tag assignments. We also propose a model for word boundary segmentation in which a bigram HMM is trained on character transitions among all positions in each word. The generated words are further processed using a probabilistic language model. All models use a hybrid approach that combines statistical models with hand-crafted grammar rules.
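
As an illustration of the word boundary segmentation idea mentioned in the abstract, the following is a minimal sketch of a bigram HMM over character positions, decoded with Viterbi. It is not the authors' implementation: the two-state tag set (B for the first character of a word, I for any later character), the add-one smoothing, the function names train and segment, and the romanized toy data are all assumptions made for this example. The paper's model may use a different state inventory, and the language-model post-processing step is not shown.

```python
import math
from collections import Counter

# Assumed tag set: B = first character of a word, I = any later character.
STATES = ("B", "I")


def train(sentences):
    """Count tag bigrams and character emissions from pre-segmented sentences.

    `sentences` is a list of sentences, each a list of word strings.
    """
    trans, emit, tag_totals, alphabet = Counter(), Counter(), Counter(), set()
    for words in sentences:
        prev = "<s>"  # sentence-start marker
        for word in words:
            for i, ch in enumerate(word):
                tag = "B" if i == 0 else "I"
                trans[(prev, tag)] += 1
                emit[(tag, ch)] += 1
                tag_totals[tag] += 1
                alphabet.add(ch)
                prev = tag
    return trans, emit, tag_totals, alphabet


def log_trans(model, prev, tag):
    """Add-one smoothed log P(tag | prev)."""
    trans = model[0]
    total = sum(trans[(prev, s)] for s in STATES)
    return math.log((trans[(prev, tag)] + 1) / (total + len(STATES)))


def log_emit(model, tag, ch):
    """Add-one smoothed log P(ch | tag); one extra slot for unseen characters."""
    _, emit, tag_totals, alphabet = model
    return math.log((emit[(tag, ch)] + 1) / (tag_totals[tag] + len(alphabet) + 1))


def segment(model, text):
    """Viterbi-decode B/I tags for an unsegmented string and split it into words."""
    if not text:
        return []
    # Each column maps state -> (best log score, previous state).
    # The first character must start a word, so only B is allowed there.
    columns = [{"B": (log_emit(model, "B", text[0]), None)}]
    for ch in text[1:]:
        column = {}
        for s in STATES:
            score, prev = max(
                (p_score + log_trans(model, p, s), p)
                for p, (p_score, _) in columns[-1].items()
            )
            column[s] = (score + log_emit(model, s, ch), prev)
        columns.append(column)
    # Follow back-pointers from the best final state.
    state = max(columns[-1], key=lambda s: columns[-1][s][0])
    tags = []
    for column in reversed(columns):
        tags.append(state)
        state = column[state][1]
    tags.reverse()
    # Turn the tag sequence into words.
    words = []
    for ch, tag in zip(text, tags):
        if tag == "B":
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Toy, made-up training data (not Urdu, not from the paper).
toy_model = train([["ab", "cd"], ["ab", "ab"]])
print(segment(toy_model, "abcd"))
```

A two-state tag set keeps the sketch short; a richer inventory (for example, separate states for word-medial and word-final characters) would let the transition table capture more about position within a word, at the cost of needing more training data.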