Improving LDA topic models for microblogs via tweet pooling and automatic labeling

Authors:
Rishabh Mehrotra;Scott Sanner;Wray Buntine;Lexing Xie
Affiliations:
BITS Pilani, Pilani, India;NICTA & ANU, Canberra, Australia;NICTA & ANU, Canberra, Australia;ANU & NICTA, Canberra, Australia
Venue:
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Year:
2013

Abstract

Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, the topics they learn are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets, in comparison to an unmodified LDA baseline and a variety of other pooling schemes. An additional contribution, automatic hashtag labeling, further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
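
The abstract describes hashtag pooling only at a high level, so the following is a minimal Python sketch of what such a preprocessing step might look like, not the authors' implementation. The function name `pool_by_hashtag`, the hashtag pattern, the case-folding of tags, and the handling of tweets with several hashtags (added to every matching pool) or with none (kept as single-tweet documents) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Merge tweets that share a hashtag into one pseudo-document per hashtag.

    Assumptions (not taken from the abstract):
    - hashtags are compared case-insensitively;
    - a tweet with several hashtags is added to every matching pool;
    - a tweet without hashtags stays a document of its own.
    """
    pools = defaultdict(list)
    untagged = []
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if not tags:
            untagged.append(text)
            continue
        for tag in tags:
            pools[tag].append(text)

    # One pooled document per hashtag, in order of tweet arrival.
    documents = {f"#{tag}": " ".join(texts) for tag, texts in pools.items()}
    # Hypothetical naming scheme for tweets that carry no hashtag.
    for i, text in enumerate(untagged):
        documents[f"untagged-{i}"] = text
    return documents


if __name__ == "__main__":
    sample = [
        "Great keynote on retrieval models #sigir",
        "Topic models on short text #SIGIR #lda",
        "Coffee first, then slides",
    ]
    for doc_id, doc in pool_by_hashtag(sample).items():
        print(doc_id, "->", doc)
```

The pooled documents, rather than individual tweets, would then be tokenized and passed unchanged to any standard LDA implementation, which is what allows the approach to leave the LDA machinery itself untouched.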