Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

Authors:
Pascal Denis;Benoît Sagot
Affiliations:
Alpage, INRIA Paris-Rocquencourt & Universitéé Paris 7, Domaine de Voluceau, Rocquencourt, Le Chesnay Cedex, France 78153;Alpage, INRIA Paris-Rocquencourt & Universitéé Paris 7, Domaine de Voluceau, Rocquencourt, Le Chesnay Cedex, France 78153
Venue:
Language Resources and Evaluation
Year:
2012

Citing 13
Cited 0

A maximum entropy approach to natural language processing

Computational Linguistics
Foundations of statistical natural language processing

Foundations of statistical natural language processing
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
Tagging English text with a probabilistic model

Computational Linguistics
Morphological tagging: data vs. dictionaries

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Enriching the knowledge sources used in a maximum entropy part-of-speech tagger

EMNLP '00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13
A comparison of algorithms for maximum entropy parameter estimation

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Contrastive estimation: training log-linear models on unlabeled data

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Semi-supervised training for the averaged perceptron POS tagger

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Minimized models for unsupervised part-of-speech tagging

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
On statistical parsing of French with supervised and semi-supervised strategies

CLAGI '09 Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference
Automatic acquisition of a slovak lexicon from a raw corpus

TSD'05 Proceedings of the 8th international conference on Text, Speech and Dialogue

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper investigates how to best couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system gives a 97.75 % accuracy on the French Treebank, an error reduction of 25 % (38 % on unknown words) over the same tagger without lexical information. We perform a series of experiments that help understanding how this lexical information helps improving tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data versus developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.