Learning-based named entity recognition for morphologically-rich, resource-scarce languages

Authors:
Kazi Saidul Hasan; Altaf ur Rahman; Vincent Ng
Affiliations:
University of Texas at Dallas, Richardson, TX; University of Texas at Dallas, Richardson, TX; University of Texas at Dallas, Richardson, TX
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009


Abstract

Named entity recognition for morphologically rich, case-insensitive languages, including the majority of Semitic languages, Iranian languages, and Indian languages, is inherently more difficult than its English counterpart. Worse still, progress on machine learning approaches to named entity recognition for many of these languages is currently hampered by the scarcity of annotated data and the lack of an accurate part-of-speech tagger. While it is possible to rely on manually-constructed gazetteers to combat data scarcity, this gazetteer-centric approach has the potential weakness of creating irreproducible results, since these name lists are not publicly available in general. Motivated in part by this concern, we present a learning-based named entity recognizer that does not rely on manually-constructed gazetteers, using Bengali as our representative resource-scarce, morphologically-rich language. Our recognizer achieves a relative improvement of 7.5% in F-measure over a baseline recognizer. Improvements arise from (1) using induced affixes, (2) extracting information from online lexical databases, and (3) jointly modeling part-of-speech tagging and named entity recognition.