Probabilistic and rule-based tagger of an inflective language: a comparison

Authors:
Jan Hajič;Barbora Hladká
Affiliations:
Institute of Formal and Applied Linguistics, Prague;Institute of Formal and Applied Linguistics, Prague
Venue:
ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Year:
1997

Abstract

We present results of probabilistic tagging of Czech texts to show how these techniques perform on a highly morphologically ambiguous inflective language. After describing the tag system used, we report four experiments that tag Czech texts with a simple probabilistic model (one unigram, two bigram, and one trigram experiment). For comparison, we applied the same code and settings to an English text (another four experiments), using training and test data of the same size so that the comparison is valid. The experiments use the source-channel model with maximum likelihood training on a hand-tagged Czech corpus and on the tagged Wall Street Journal (WSJ) corpus from the LDC collection. They show, not surprisingly, that the more training data is used, the better the success rate. The results also indicate that inflective languages with 1000+ tags require a more sophisticated approach to get closer to an acceptable error rate. To compare two different approaches to text tagging, statistical and rule-based, we modified Eric Brill's rule-based part-of-speech tagger and carried out two more experiments on the Czech data, obtaining similar error rates. We also ran three more experiments with a greatly reduced tagset to obtain a further comparison based on similar tagset sizes.
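
As a rough illustration of the bigram variant of the source-channel model with maximum likelihood training mentioned in the abstract, the sketch below estimates tag-transition and word-emission probabilities from relative frequencies and decodes with the Viterbi algorithm. It is not the authors' code: the function names, the `<s>` start pseudo-tag, the three-tag tagset, and the toy corpus are all assumptions made for this example, and it omits smoothing and unknown-word handling, which any realistic tagger would need.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag (an assumption)


def train(tagged_sentences):
    """Collect the counts needed for maximum likelihood estimates."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def mle(counter, key):
    """Relative frequency of key in counter; 0.0 if the counter is empty."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0


def viterbi(words, trans, emit):
    """Return the most probable tag sequence under the bigram model."""
    tags = list(emit)
    # best[t] = (probability, tag path) of the best path ending in tag t
    best = {
        t: (mle(trans[START], t) * mle(emit[t], words[0]), [t]) for t in tags
    }
    for word in words[1:]:
        new_best = {}
        for t in tags:
            e = mle(emit[t], word)
            prob, path = max(
                ((best[p][0] * mle(trans[p], t) * e, best[p][1]) for p in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (prob, path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy hand-tagged corpus (invented for illustration only).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("walk", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("walk", "VERB")],
]
trans, emit = train(corpus)
print(viterbi(["the", "walk", "ends"], trans, emit))
print(viterbi(["dogs", "walk"], trans, emit))
```

In this toy corpus the word "walk" is ambiguous between NOUN and VERB, and the transition probabilities pick the reading that fits the preceding tag. With a tagset of 1000+ tags, as the abstract describes for Czech, many transition and emission counts would be zero, which is one reason unsmoothed estimates like these would be inadequate in practice.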