Feature-rich part-of-speech tagging for morphologically complex languages: application to Bulgarian

Authors:
Georgi Georgiev;Valentin Zhikov;Petya Osenova;Kiril Simov;Preslav Nakov
Affiliations:
Ontotext AD, Tsarigradsko Sh., Sofia, Bulgaria;Ontotext AD, Tsarigradsko Sh., Sofia, Bulgaria;IICT, Bulgarian Academy of Sciences, Sofia, Bulgaria;IICT, Bulgarian Academy of Sciences, Sofia, Bulgaria;Qatar Computing Research Institute, Qatar Foundation Tornado Tower, Doha, Qatar
Venue:
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2012

Citing 20
Cited 2

Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging

Computational Linguistics
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
TnT: a statistical part-of-speech tagger

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Development of corpora within the CLaRK system: the BulTreeBank project experience

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2
Shallow language processing architecture for Bulgarian

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Serial combination of rules and statistics: a case study in Czech tagging

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Feature-rich part-of-speech tagging with a cyclic dependency network

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Incremental parsing with the perceptron algorithm

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Bidirectional inference with the easiest-first strategy for tagging sequence data

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Context-based morphological disambiguation with random fields

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Icelandic data driven part of speech tagging

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Alternating projections for learning with expectation constraints

UAI '09 Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence
Cross-lingual adaptation as a baseline: adapting maximum entropy models to Bulgarian

AdaptLRTtoND '09 Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains
A multi-domain web-based algorithm for POS tagging of unknown words

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Web-scale features for full-scale parsing

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Semisupervised condensed nearest neighbor for part-of-speech tagging

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Learning with lookahead: can history-based models rival globally optimized models?

CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning

Linguistically-augmented Bulgarian-to-English statistical machine translation model

EACL 2012 Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
Linguistically-enriched models for Bulgarian-to-English machine translation

SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work, which has used a small number of grammatical categories, we work with 680 morpho-syntactic tags. We combine a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, achieving accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian.