Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

  • Authors:
  • Pascal Denis; Benoît Sagot

  • Affiliations:
  • Alpage, INRIA Paris-Rocquencourt & Université Paris 7, Domaine de Voluceau, Rocquencourt, Le Chesnay Cedex, France 78153; Alpage, INRIA Paris-Rocquencourt & Université Paris 7, Domaine de Voluceau, Rocquencourt, Le Chesnay Cedex, France 78153

  • Venue:
  • Language Resources and Evaluation
  • Year:
  • 2012

Abstract

This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help to explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
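
The reported error reduction can be unpacked with a little arithmetic. The sketch below assumes that "error reduction" means a relative reduction in the error rate (1 minus accuracy), which is the usual reading but is not spelled out in the abstract. The function name `baseline_error` is illustrative, and the resulting baseline accuracy is only implied by the abstract's rounded figures, not quoted from the paper.

```python
def baseline_error(new_accuracy: float, relative_reduction: float) -> float:
    """Error rate before an improvement, given the accuracy after it
    and the relative error reduction it produced (assumed definition)."""
    new_error = 1.0 - new_accuracy
    # new_error = old_error * (1 - relative_reduction), solved for old_error
    return new_error / (1.0 - relative_reduction)


# Figures taken from the abstract: 97.75% accuracy, 25% error reduction.
err = baseline_error(0.9775, 0.25)
implied_baseline_accuracy = 1.0 - err
print(f"{implied_baseline_accuracy:.2%}")  # prints 97.00%
```

Under this reading, the tagger without lexical information would sit at roughly 97.0% accuracy, so the lexicon accounts for about 0.75 percentage points on the French Treebank.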