Toward unsupervised whole-corpus tagging

  • Authors: Dayne Freitag
  • Affiliations: HNC Software, LLC, San Diego, CA
  • Venue: COLING '04: Proceedings of the 20th International Conference on Computational Linguistics
  • Year: 2004

Abstract

We present a system for unsupervised tagging of words into classes produced by a distributional clustering technique called co-clustering. A hidden Markov model (HMM), trained on the high-frequency terms in the lexicon, is used to "tag" occurrences of low-frequency terms. In experiments using the Wall Street Journal portion of the Penn Treebank, we show that previously reported problems in using Baum-Welch estimation for part-of-speech tagging do not occur in this context. We also show how state-level term emission models can be augmented to account for morphological patterns using features automatically derived from the output of co-clustering. Finally, we consider an alternative means of extending the coverage of the lexicon, in which low-frequency terms are added to the lexicon as types, and compare this approach with the token-level assignments made by the HMM.
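
To make the token-level tagging step concrete, here is a minimal sketch of Viterbi decoding over cluster states. It is not the paper's implementation. The function name `viterbi_tag`, the two toy states `C0` and `C1`, and every probability are invented for illustration. Two simplifying assumptions: parameters are hand-set rather than estimated with Baum-Welch, and a term outside the high-frequency lexicon receives the same emission score in every state, so only the surrounding context decides its class. The morphological emission features mentioned in the abstract are omitted.

```python
import math

def viterbi_tag(tokens, states, start_p, trans_p, emit_p, oov_p=1.0, floor=1e-12):
    """Return the most likely state (cluster) sequence for `tokens`.

    Assumption for this sketch: a word absent from every emission table is
    treated as low-frequency and scored `oov_p` in all states, so its tag is
    determined by the transition probabilities alone.
    """
    if not tokens:
        return []

    lexicon = set()
    for s in states:
        lexicon.update(emit_p[s])

    def emit(state, word):
        if word not in lexicon:
            return oov_p                      # same score in every state
        return emit_p[state].get(word, floor)  # known word, maybe unseen in this state

    # scores[t][s]: best log-probability of any path ending in state s at position t
    scores = [{s: math.log(start_p[s]) + math.log(emit(s, tokens[0])) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        scores.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, scores[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[t][s] = best + math.log(emit(s, tokens[t]))
            back[t][s] = prev

    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy model (all values hypothetical): C0 behaves like a determiner cluster,
# C1 like a noun cluster.
STATES = ["C0", "C1"]
START_P = {"C0": 0.8, "C1": 0.2}
TRANS_P = {"C0": {"C0": 0.1, "C1": 0.9}, "C1": {"C0": 0.6, "C1": 0.4}}
EMIT_P = {"C0": {"the": 0.9, "a": 0.1}, "C1": {"market": 0.5, "stock": 0.5}}

if __name__ == "__main__":
    print(viterbi_tag(["the", "zyzzyva"], STATES, START_P, TRANS_P, EMIT_P))
```

In this toy model the out-of-lexicon token "zyzzyva" is assigned to `C1` because `C0` (the state that emits "the") transitions to `C1` with high probability. A type-level alternative, as contrasted in the abstract, would instead add the term to the lexicon once and give every occurrence the same class.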