Tagging with Small Training Corpora

  • Authors:
  • Nuno M. C. Marques; José Gabriel Pereira Lopes

  • Venue:
  • IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
  • Year:
  • 2001

Abstract

The analysis of textual data may start by classifying words using a predefined tag set. However, assigning part-of-speech tags to words in unrestricted text (POS-tagging) remains a problem for natural language text understanding. Most current taggers require huge amounts of hand-tagged text for training (on the order of 10^5 pretagged words). Producing such text requires linguistically highly trained manpower for a highly repetitive and boring job, and the results obtained are not of optimal quality. Moreover, when one wants to move to another text genre, the same problem must be faced again. Our proposal goes in another direction. By carefully combining a large lexicon with an efficient neural-network-based generator of taggers, we can generate POS-taggers using no more than 10^4 hand-corrected tagged words for training. A training text of this size can feasibly be hand corrected. Experimental results are presented and discussed for the SUSANNE Corpus. Results on three additional Portuguese corpora are also discussed. Precision rates of 96% are obtained when unknown words occur in the test set, and 98% when every word in the test set is known.
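
As a rough illustration of why a lexicon can reduce the amount of hand-tagged text needed, the sketch below restricts each word to the tags its lexicon entry allows and lets a small trained scorer choose among them. This is not the authors' system: the toy lexicon, the tag names, the open-class fallback for unknown words, and the bigram-count scorer (standing in for the neural network) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon: word -> set of admissible tags.
LEXICON = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "walks": {"VERB", "NOUN"},
    "park": {"NOUN", "VERB"},
    "in": {"ADP"},
}

# Assumed fallback: unknown words may only take open-class tags.
OPEN_CLASS = {"NOUN", "VERB"}


def train(tagged_sentences):
    """Count (previous tag, tag) pairs in a small hand-corrected corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for _word, tag in sentence:
            counts[prev][tag] += 1
            prev = tag
    return counts


def tag(words, counts):
    """Greedy left-to-right tagging restricted to lexicon candidates."""
    prev, output = "<s>", []
    for word in words:
        candidates = LEXICON.get(word.lower(), OPEN_CLASS)
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda t: counts[prev][t])
        output.append(best)
        prev = best
    return output


def precision(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    corpus = [[("the", "DET"), ("dog", "NOUN"), ("walks", "VERB"),
               ("in", "ADP"), ("the", "DET"), ("park", "NOUN")]]
    model = train(corpus)
    # "cat" is not in the lexicon, so it falls back to OPEN_CLASS.
    print(tag(["The", "cat", "walks", "in", "the", "park"], model))
```

Because the lexicon already rules out most tags for each word, the trained component only has to resolve the remaining ambiguity, which is one plausible reading of how a smaller training corpus could suffice. Unknown words get a wider candidate set, which is consistent with the lower precision reported when they occur in the test set.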