A multi-domain web-based algorithm for POS tagging of unknown words

  • Authors:
  • Shulamit Umansky-Pesin;Roi Reichart;Ari Rappoport

  • Affiliations:
  • The Hebrew University;The Hebrew University;The Hebrew University

  • Venue:
  • COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present a web-based algorithm for the task of POS tagging of unknown words (words appearing only a small number of times in the training data of a supervised POS tagger). When a sentence s containing an unknown word u is to be tagged by a trained POS tagger, our algorithm collects from the web contexts that are partially similar to the context of u in s, which are then used to compute new tag assignment probabilities for u. Our algorithm enables fast multi-domain unknown word tagging, since, unlike previous work, it does not require a corpus from the new domain. We integrate our algorithm into the MXPOST POS tagger (Ratnaparkhi, 1996) and experiment with three languages (English, German and Chinese) in seven in-domain and domain adaptation scenarios. Our algorithm provides an error reduction of up to 15.63% (English), 18.09% (German) and 13.57% (Chinese) over the original tagger.