Unsupervised part-of-speech tagging in noisy and esoteric domains with a syntactic-semantic Bayesian HMM

  • Authors:
  • William M. Darling;Michael J. Paul;Fei Song

  • Affiliations:
  • University of Guelph;Johns Hopkins University;University of Guelph

  • Venue:
  • Proceedings of the Workshop on Semantic Analysis in Social Media
  • Year:
  • 2012

Abstract

Unsupervised part-of-speech (POS) tagging has recently been shown to benefit greatly from Bayesian approaches in which HMM parameters are integrated out, leading to significant increases in tagging accuracy. These improvements in unsupervised methods are especially important in specialized social media domains such as Twitter, where little training data is available. Here, we take the Bayesian approach one step further by integrating semantic information from an LDA-like topic model with an HMM. Specifically, we present Part-of-Speech LDA (POSLDA), a syntactically and semantically consistent generative probabilistic model. This model discovers POS-specific topics from an unlabelled corpus. We show that, with varying amounts of side information, this model consistently improves on the Bayesian HMM approach in both unsupervised POS tagging and language modeling in the noisy and esoteric domain of Twitter.
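
The abstract does not spell out the generative process, so the following is only a minimal sketch of one plausible reading of "an LDA-like topic model with an HMM", not the authors' implementation. It assumes that word classes (HMM states) follow a first-order Markov chain, that each document has its own topic mixture, and that classes are split into "semantic" classes, whose word distributions depend on the topic, and "syntactic" classes, whose word distributions do not. All function names, the split into `semantic_classes`, and the concentration value 0.5 are hypothetical choices made for illustration.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(rng, probs):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def build_model(rng, n_classes, n_topics, vocab_size, semantic_classes):
    # One transition row per class, plus a final row for the start state.
    transitions = [sample_dirichlet(rng, 0.5, n_classes)
                   for _ in range(n_classes + 1)]
    emissions = {}
    for c in range(n_classes):
        if c in semantic_classes:
            # Assumption: semantic classes get one word distribution per topic.
            emissions[c] = [sample_dirichlet(rng, 0.5, vocab_size)
                            for _ in range(n_topics)]
        else:
            # Assumption: syntactic classes share one distribution across topics.
            emissions[c] = sample_dirichlet(rng, 0.5, vocab_size)
    return transitions, emissions


def generate_document(rng, model, n_words, n_topics, semantic_classes):
    transitions, emissions = model
    theta = sample_dirichlet(rng, 0.5, n_topics)  # per-document topic mixture
    prev = len(transitions) - 1  # index of the start-state row
    tokens = []
    for _ in range(n_words):
        c = sample_index(rng, transitions[prev])  # HMM step: next word class
        z = sample_index(rng, theta)              # LDA step: topic for this token
        if c in semantic_classes:
            w = sample_index(rng, emissions[c][z])
        else:
            w = sample_index(rng, emissions[c])   # topic is drawn but ignored
        tokens.append((c, z, w))
        prev = c
    return tokens


rng = random.Random(0)
semantic = {0, 1}  # hypothetical: classes 0 and 1 play the role of content-word tags
model = build_model(rng, n_classes=4, n_topics=3, vocab_size=20,
                    semantic_classes=semantic)
doc = generate_document(rng, model, n_words=10, n_topics=3,
                        semantic_classes=semantic)
```

Each element of `doc` is a `(class, topic, word_id)` triple. Under these assumptions, unsupervised tagging amounts to inferring the hidden class of every token from the observed word ids alone; the sketch covers only the forward (generative) direction and does not include the inference procedure.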