A low-budget tagger for Old Czech

Authors:
Jirka Hana;Anna Feldman;Katsiaryna Aharodnik
Affiliations:
Charles University, MFF, Czech Republic;Montclair State University;Montclair State University
Venue:
LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Year:
2011

Citing 5
Cited 0

Tagging English text with a probabilistic model

Computational Linguistics
TnT: a statistical part-of-speech tagger

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Inducing multilingual text analysis tools via robust projection across aligned corpora

HLT '01 Proceedings of the first international conference on Human language technology research
Language independent, minimally supervised induction of lexical probabilities

ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
Bootstrapping a multilingual part-of-speech tagger in one person-day

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20

Quantified Score

Hi-index	0.00

Visualization

Abstract

The paper describes a tagger for Old Czech (1200-1500 AD), a fusional language with rich morphology. The practical restrictions (no native speakers, limited corpora and lexicons, limited funding) make Old Czech an ideal candidate for a resource-light cross-lingual method that we have been developing (e.g. Hana et al., 2004; Feldman and Hana, 2010). We use a traditional supervised tagger. However, instead of spending years of effort to create a large annotated corpus of Old Czech, we approximate it by a corpus of Modern Czech. We perform a series of simple transformations to make a modern text look more like a text in Old Czech and vice versa. We also use a resource-light morphological analyzer to provide candidate tags. The results are worse than the results of traditional taggers, but the amount of language-specific work needed is minimal.