Cross-lingual adaptation as a baseline: adapting maximum entropy models to Bulgarian

  • Authors:
  • Georgi Georgiev;Preslav Nakov;Petya Osenova;Kiril Simov

  • Affiliations:
  • Ontotext AD, Sofia, Bulgaria;National University of Singapore, Singapore;Bulgarian Academy of Sciences, Sofia, Bulgaria;Bulgarian Academy of Sciences, Sofia, Bulgaria

  • Venue:
  • AdaptLRTtoND '09 Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains
  • Year:
  • 2009


Abstract

We describe our efforts in adapting five basic natural language processing components to Bulgarian: a sentence splitter, a tokenizer, a part-of-speech tagger, a chunker, and a syntactic parser. The components were originally developed for English within OpenNLP, an open-source maximum-entropy-based machine learning toolkit, and were retrained on manually annotated training data from the BulTreeBank. The evaluation results show an F1 score of 92.54% for the sentence splitter, 98.49% for the tokenizer, 94.43% for the part-of-speech tagger, 84.60% for the chunker, and 77.56% for the syntactic parser, which should be interpreted as a baseline for Bulgarian.
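The components above are all maximum entropy (log-linear) classifiers: each context is encoded as a set of features, and a conditional distribution over labels is obtained by a softmax over feature weights learned from the annotated data. The following is a minimal, self-contained sketch of that modeling idea on a toy POS-style task; the feature names, labels, and training data are purely illustrative (not from OpenNLP or the BulTreeBank), and plain gradient ascent stands in for OpenNLP's GIS training procedure.

```python
import math
from collections import defaultdict

# Toy training data: (feature dict, label) pairs for a tiny POS-style task.
# Feature names and labels are hypothetical, chosen only for illustration.
TRAIN = [
    ({"suffix=a": 1.0, "prev=DET": 1.0}, "NOUN"),
    ({"suffix=a": 1.0, "prev=NOUN": 1.0}, "VERB"),
    ({"suffix=e": 1.0, "prev=DET": 1.0}, "NOUN"),
    ({"suffix=e": 1.0, "prev=NOUN": 1.0}, "VERB"),
]
LABELS = sorted({y for _, y in TRAIN})

def probs(weights, feats):
    """Maxent conditional distribution: softmax over log-linear scores."""
    scores = {y: sum(weights[(y, f)] * v for f, v in feats.items())
              for y in LABELS}
    m = max(scores.values())                      # for numerical stability
    exp = {y: math.exp(scores[y] - m) for y in LABELS}
    z = sum(exp.values())
    return {y: exp[y] / z for y in LABELS}

def train(data, epochs=100, lr=0.5):
    """Gradient ascent on the conditional log-likelihood."""
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in data:
            p = probs(w, feats)
            for f, v in feats.items():
                for y in LABELS:
                    # Observed minus expected feature count.
                    grad = (1.0 if y == gold else 0.0) - p[y]
                    w[(y, f)] += lr * grad * v
    return w

weights = train(TRAIN)
pred = max(probs(weights, {"suffix=a": 1.0, "prev=DET": 1.0}).items(),
           key=lambda kv: kv[1])[0]
```

In this sketch the `prev=` feature is the informative one, so the trained model tags the `prev=DET` context as `NOUN`. The real components differ only in scale: richer feature sets per task (token shapes, surrounding words, tag histories) and far more training data.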