Tagging with Small Training Corpora

Authors:
Nuno M. C. Marques;José Gabriel Pereira Lopes
Venue:
IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
Year:
2001

Abstract

The analysis of textual data may start by classifying words using a predefined tag set. However, assigning part-of-speech tags to words in unrestricted text (POS-tagging) remains an open problem for natural language text understanding. Most current taggers require huge amounts of hand-tagged text for training (on the order of 10^5 pretagged words). Producing this text demands linguistically highly trained manpower for a highly repetitive and boring job, and the results obtained are not of optimal quality. Moreover, when one wants to move to another text genre, the same problem must be faced again. Our proposal goes in another direction. By carefully combining a large lexicon with an efficient neural-network-based generator of taggers, we can generate POS-taggers using no more than 10^4 hand-corrected tagged words for training, an amount of text that can feasibly be corrected by hand. Experimental results are presented and discussed for the SUSANNE Corpus. Results on three additional Portuguese corpora are also discussed. Precision rates of 96% are obtained when unknown words occur in the test set, and 98% when every word in the test set is known.