Unsupervised part-of-speech tagging in noisy and esoteric domains with a syntactic-semantic Bayesian HMM

Authors:
William M. Darling;Michael J. Paul;Fei Song
Affiliations:
University of Guelph;Johns Hopkins University;University of Guelph
Venue:
Proceedings of the Workshop on Semantic Analysis in Social Media
Year:
2012

Abstract

Unsupervised part-of-speech (POS) tagging has recently been shown to benefit greatly from Bayesian approaches in which the HMM parameters are integrated out, leading to significant increases in tagging accuracy. These improvements in unsupervised methods are especially important in specialized social media domains such as Twitter, where little training data is available. Here, we take the Bayesian approach one step further by integrating semantic information from an LDA-like topic model with an HMM. Specifically, we present Part-of-Speech LDA (POSLDA), a syntactically and semantically consistent generative probabilistic model that discovers POS-specific topics from an unlabelled corpus. We show that, with varying amounts of side information, this model consistently improves on the Bayesian HMM approach in both unsupervised POS tagging and language modeling in the noisy and esoteric domain of Twitter.
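The abstract does not spell out the generative process, so the following is only a minimal sketch of how an HMM over word classes might be combined with an LDA-like topic draw. The split into "content" classes (which draw a topic) and "function" classes (which do not), and all names, shapes, and toy sizes (`sample_document`, `is_content`, `phi_content`, `phi_function`), are assumptions made for illustration, not details taken from the abstract.

```python
import numpy as np

def sample_document(n_words, init, trans, is_content, theta,
                    phi_content, phi_function, rng):
    """Sample (classes, topics, words) from a toy class-HMM + topic model.

    init:         (C,)      distribution over the first word's class
    trans:        (C, C)    row-stochastic class transition matrix
    is_content:   (C,)      True if the class draws a topic (assumed split)
    theta:        (K,)      this document's topic distribution
    phi_content:  (C, K, V) word distribution per (class, topic) pair
    phi_function: (C, V)    word distribution per topic-free class
    """
    classes, topics, words = [], [], []
    prev = None
    for _ in range(n_words):
        # HMM step: the class depends only on the previous class.
        class_probs = init if prev is None else trans[prev]
        c = int(rng.choice(len(init), p=class_probs))
        if is_content[c]:
            # LDA-like step: draw a topic, then a class- and topic-specific word.
            z = int(rng.choice(len(theta), p=theta))
            w = int(rng.choice(phi_content.shape[2], p=phi_content[c, z]))
        else:
            # Topic-free class: the word depends on the class alone.
            z = -1
            w = int(rng.choice(phi_function.shape[1], p=phi_function[c]))
        classes.append(c)
        topics.append(z)
        words.append(w)
        prev = c
    return classes, topics, words

# Toy parameters drawn from symmetric Dirichlet priors (sizes are arbitrary).
rng = np.random.default_rng(0)
C, K, V = 4, 3, 20
init = rng.dirichlet(np.ones(C))
trans = rng.dirichlet(np.ones(C), size=C)
is_content = np.array([True, True, False, False])
theta = rng.dirichlet(np.ones(K))
phi_content = rng.dirichlet(np.ones(V), size=(C, K))
phi_function = rng.dirichlet(np.ones(V), size=C)

classes, topics, words = sample_document(
    12, init, trans, is_content, theta, phi_content, phi_function, rng)
```

In this sketch a tagger would infer the hidden `classes` (and `topics`) from observed `words`; the sampler only shows the forward direction. Setting every entry of `is_content` to `False` reduces it to a plain class HMM, which is the baseline the abstract compares against.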