Probabilistic and rule-based tagger of an inflective language: a comparison

  • Authors:
  • Jan Hajič;Barbora Hladká

  • Affiliations:
  • Institute of Formal and Applied Linguistics, Prague;Institute of Formal and Applied Linguistics, Prague

  • Venue:
  • ANLC '97 Proceedings of the fifth conference on Applied natural language processing
  • Year:
  • 1997

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present results of probabilistic tagging of Czech texts in order to show how these techniques work for one of the highly morphologically ambiguous inflective languages. After description of the tag system used, we show the results of four experiments using a simple probabilistic model to tag Czech texts (unigram, two bigram experiments, and a trigram one). For comparison, we have applied the same code and settings to tag an English text (another four experiments) using the same size of training and test data in the experiments in order to avoid any doubt concerning the validity of the comparison. The experiments use the source channel model and maximum likelihood training on a Czech hand-tagged corpus and on tagged Wall Street Journal (WSJ) from the LDC collection. The experiments show (not surprisingly) that the more training data, the better is the success rate. The results also indicate that for inflective languages with 1000+ tags we have to develop a more sophisticated approach in order to get closer to an acceptable error rate. In order to compare two different approaches to text tagging---statistical and rule-based --- we modified Eric Brill's rule-based part of speech tagger and carried out two more experiments on the Czech data, obtaining similar results in terms of the error rate. We have also run three more experiments with greatly reduced tagset to get another comparison based on similar tagset size.