This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system enriched with information extracted from a morphological resource. This system achieves 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves tagger quality at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
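To illustrate the idea of coupling a tagger with an external lexical resource, here is a minimal sketch (not the authors' implementation) of how the feature function of a maximum entropy Markov model tagger can be enriched with a morphological lexicon: alongside standard contextual features, each word found in the lexicon contributes one feature per tag the lexicon licenses for it, while out-of-lexicon words fire a dedicated feature. The toy lexicon entries and feature names are purely illustrative assumptions.

```python
# Toy morphological lexicon: word -> set of licensed POS tags (illustrative).
TOY_LEXICON = {
    "porte": {"NC", "V"},    # French: "door" (noun) or "carries" (verb)
    "la": {"DET", "PRO"},
    "belle": {"ADJ", "NC"},
}

def features(words, i, prev_tag):
    """Binary feature dict for position i, conditioned on the previous tag."""
    w = words[i].lower()
    feats = {
        f"word={w}": 1.0,            # lexical identity
        f"prev_tag={prev_tag}": 1.0, # Markov dependency on previous tag
        f"suffix3={w[-3:]}": 1.0,    # simple morphological cue
    }
    lex_tags = TOY_LEXICON.get(w)
    if lex_tags is None:
        feats["not_in_lexicon"] = 1.0   # signal for unknown words
    else:
        for t in sorted(lex_tags):      # one feature per licensed tag
            feats[f"lex_tag={t}"] = 1.0
    return feats
```

A trained maximum entropy classifier over such feature dicts then scores each candidate tag; the lexicon features constrain ambiguous known words and, together with the `not_in_lexicon` signal, are the kind of information that drives the reported error reduction on unknown words.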