Bootstrapping POS taggers using unlabelled data

  • Authors:
  • Stephen Clark; James R. Curran; Miles Osborne

  • Affiliations:
  • University of Edinburgh, Edinburgh; University of Edinburgh, Edinburgh; University of Edinburgh, Edinburgh

  • Venue:
  • CoNLL '03: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4
  • Year:
  • 2003

Abstract

This paper investigates bootstrapping part-of-speech taggers using co-training, in which two taggers are iteratively re-trained on each other's output. Since the output of the taggers is noisy, there is a question of which newly labelled examples to add to the training set. We investigate selecting examples by directly maximising tagger agreement on unlabelled data, a method which has been theoretically and empirically motivated in the co-training literature. Our results show that agreement-based co-training can significantly improve tagging performance for small seed datasets. Further results show that this form of co-training considerably outperforms self-training. However, we find that simply re-training on all the newly labelled data can, in some cases, yield comparable results to agreement-based co-training, with only a fraction of the computational cost.
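
To make the selection step concrete, here is a minimal Python sketch of agreement-based co-training under stated assumptions: the two taggers are generic objects exposing hypothetical `train(pairs)` and `tag(words)` methods (the abstract specifies no API), and candidate subsets of the freshly labelled cache are searched at random rather than by the paper's exact procedure. Each round, the subset that maximises inter-tagger agreement on held-out unlabelled data is added to both training sets.

```python
import random

def agreement(tagger_a, tagger_b, sentences):
    """Per-token agreement rate of two taggers on unlabelled sentences."""
    same = total = 0
    for words in sentences:
        for ta, tb in zip(tagger_a.tag(words), tagger_b.tag(words)):
            same += int(ta == tb)
            total += 1
    return same / total if total else 0.0

def cotrain(tagger_a, tagger_b, seed, unlabelled, held_out,
            rounds=50, cache_size=500, n_candidates=10):
    """Agreement-based co-training (simplified sketch).

    seed       -- list of (words, tags) pairs of hand-labelled sentences
    unlabelled -- list of word lists, consumed cache_size at a time
    held_out   -- list of word lists used only to score agreement
    """
    train_a, train_b = list(seed), list(seed)
    tagger_a.train(train_a)
    tagger_b.train(train_b)
    pool = list(unlabelled)
    for _ in range(rounds):
        cache, pool = pool[:cache_size], pool[cache_size:]
        if not cache:
            break
        # Each tagger labels the cache; its output is candidate
        # training material for the *other* tagger.
        labelled_a = [(w, tagger_a.tag(w)) for w in cache]
        labelled_b = [(w, tagger_b.tag(w)) for w in cache]
        best = (agreement(tagger_a, tagger_b, held_out), train_a, train_b)
        for _ in range(n_candidates):
            idx = random.sample(range(len(cache)), len(cache) // 2)
            cand_a = train_a + [labelled_b[i] for i in idx]
            cand_b = train_b + [labelled_a[i] for i in idx]
            tagger_a.train(cand_a)
            tagger_b.train(cand_b)
            score = agreement(tagger_a, tagger_b, held_out)
            if score > best[0]:
                best = (score, cand_a, cand_b)
        _, train_a, train_b = best
        # Leave both taggers trained on the winning sets for the next round.
        tagger_a.train(train_a)
        tagger_b.train(train_b)
    return tagger_a, tagger_b
```

Replacing the subset search with `train_a + labelled_b` and `train_b + labelled_a` in full gives the naive variant the abstract mentions: re-training on all newly labelled data, which can match agreement-based selection at a fraction of the computational cost.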