An Information-Extraction System for Urdu---A Resource-Poor Language

Authors:
Smruthi Mukund;Rohini Srihari;Erik Peterson
Affiliations:
State University of New York at Buffalo;State University of New York at Buffalo;Janya, Inc.
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2010

Abstract

There has been an increase in the amount of multilingual text on the Internet due to the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as identifying topics, relationships, and events, as well as sentiment analysis, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis, and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments on news articles to provide insight into social and human behavior. All of this requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies have either met or exceeded the state of the art.
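
The module ordering described in the abstract (word segmentation, then part-of-speech tagging, then named entity tagging) can be pictured as a chain of replaceable stages. The following is only a minimal sketch of that idea, not the system described in the paper: the names `Document`, `Pipeline`, `segment`, `pos_tag`, `ne_tag`, and `build_pipeline` are hypothetical, the lookup tables are toy data, and whitespace splitting and dictionary lookups stand in for the trained models a real system would use. English placeholder text is used in place of Urdu for readability.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Raw text plus the annotation layers that stages attach to it."""
    text: str
    layers: Dict[str, list] = field(default_factory=dict)


Stage = Callable[[Document], Document]


class Pipeline:
    """Runs named stages in order; any stage can be swapped out by name."""

    def __init__(self) -> None:
        self._stages: List[Tuple[str, Stage]] = []

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self._stages.append((name, stage))
        return self

    def replace(self, name: str, stage: Stage) -> None:
        for i, (existing, _) in enumerate(self._stages):
            if existing == name:
                self._stages[i] = (name, stage)
                return
        raise KeyError(name)

    def run(self, text: str) -> Document:
        doc = Document(text)
        for _, stage in self._stages:
            doc = stage(doc)
        return doc


# Toy lookup tables (assumptions for illustration only).
TOY_POS = {"Lahore": "NNP", "is": "VB", "a": "DT", "city": "NN"}
TOY_GAZETTEER = {"Lahore": "LOCATION"}


def segment(doc: Document) -> Document:
    # Placeholder: splitting on whitespace is not a real Urdu segmenter.
    doc.layers["tokens"] = doc.text.split()
    return doc


def pos_tag(doc: Document) -> Document:
    # Placeholder: dictionary lookup instead of a trained tagger.
    doc.layers["pos"] = [TOY_POS.get(t, "UNK") for t in doc.layers["tokens"]]
    return doc


def ne_tag(doc: Document) -> Document:
    # Placeholder: gazetteer lookup instead of a trained NE model.
    doc.layers["ne"] = [TOY_GAZETTEER.get(t, "O") for t in doc.layers["tokens"]]
    return doc


def build_pipeline() -> Pipeline:
    return Pipeline().add("segment", segment).add("pos", pos_tag).add("ne", ne_tag)


if __name__ == "__main__":
    result = build_pipeline().run("Lahore is a city")
    print(result.layers)
```

Keeping each stage behind the same `Document -> Document` signature is one way to get the customizability the abstract mentions: a later module reads only the layers written by earlier ones, so a placeholder stage can be swapped for a trained model with `replace` without touching the rest of the chain.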