Unsupervised part-of-speech (POS) tagging has recently been shown to benefit greatly from Bayesian approaches in which HMM parameters are integrated out, leading to significant increases in tagging accuracy. These improvements in unsupervised methods are especially important in specialized social media domains such as Twitter, where little training data is available. Here, we take the Bayesian approach one step further by integrating semantic information from an LDA-like topic model with an HMM. Specifically, we present Part-of-Speech LDA (POSLDA), a syntactically and semantically consistent generative probabilistic model. The model discovers POS-specific topics from an unlabelled corpus. We show that this model consistently achieves improvements in unsupervised POS tagging and language modeling over the Bayesian HMM approach with varying amounts of side information in the noisy and esoteric domain of Twitter.
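To make the combined syntactic-semantic structure concrete, the following is a minimal sketch of the *generative* side of such a model: an HMM chooses a POS class for each word, and, for classes treated as content-bearing, an LDA-style per-document topic mixture additionally selects a topic before the word is emitted. All sizes, the class/topic split, and the variable names here are illustrative assumptions, not values from the paper, and the Dirichlet-drawn parameters stand in for quantities the Bayesian treatment would integrate out during inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) sizes -- not taken from the paper.
C = 4          # number of POS classes (HMM states)
K = 3          # number of topics
V = 20         # vocabulary size
SEM = {0, 1}   # classes treated as "semantic" (content-word) classes

# Randomly initialized parameters; in the Bayesian model these would be
# drawn from Dirichlet priors and integrated out during inference.
trans = rng.dirichlet(np.ones(C), size=C)          # HMM transitions P(c_t | c_{t-1})
phi_sem = rng.dirichlet(np.ones(V), size=(C, K))   # word dists per (semantic class, topic)
phi_syn = rng.dirichlet(np.ones(V), size=C)        # word dists for purely syntactic classes

def generate_document(n_words: int) -> list[tuple[int, int, int]]:
    """Generate (class, topic, word) triples from a POSLDA-style process.

    Topic is -1 for words emitted by purely syntactic (function-word) classes.
    """
    theta = rng.dirichlet(np.ones(K))   # per-document topic mixture (LDA side)
    c = rng.integers(C)                 # initial POS class
    doc = []
    for _ in range(n_words):
        c = rng.choice(C, p=trans[c])   # syntactic transition (HMM side)
        if c in SEM:
            z = rng.choice(K, p=theta)            # semantic classes also pick a topic
            w = rng.choice(V, p=phi_sem[c, z])    # word depends on class AND topic
        else:
            z = -1                                 # function-word classes ignore topics
            w = rng.choice(V, p=phi_syn[c])        # word depends on class only
        doc.append((c, z, w))
    return doc

doc = generate_document(10)
```

The key design point this illustrates is that topics only refine the emission distributions of content-word classes, which is how the model can discover POS-specific topics while leaving function-word behavior purely syntactic.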