The amount of multilingual text on the Internet has increased with the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as sentiment analysis and the identification of topics, relationships, and events, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis, and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments on news articles to provide insight into social and human behavior, all of which requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies either meet or exceed the state of the art.