Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g., LDA and PLSA) to such short texts may not work well. The fundamental reason is that conventional topic models implicitly capture document-level word co-occurrence patterns to reveal topics, and thus suffer from the severe data sparsity of short documents. In this paper, we propose a novel way of modeling topics in short texts, referred to as the biterm topic model (BTM). Specifically, BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) over the whole corpus. The major advantages of BTM are that 1) it explicitly models word co-occurrence patterns to enhance topic learning, and 2) it aggregates these patterns over the whole corpus, alleviating the sparsity of word co-occurrences at the document level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach discovers more prominent and coherent topics and significantly outperforms baseline methods on several evaluation metrics. Furthermore, we find that BTM can outperform LDA even on normal-length texts, suggesting the generality and wider applicability of the new model.
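To make the notion of a biterm concrete, the following is a minimal sketch (not the authors' implementation) of the corpus-level biterm extraction step the abstract describes: each short document, assumed to be already tokenized, contributes every unordered pair of its words, and the pairs from all documents are pooled into one corpus-wide collection that the model is then fit on.

```python
from itertools import combinations
from collections import Counter

def extract_biterms(doc_tokens):
    """Return all unordered word pairs (biterms) from one tokenized short document.

    Pairs are sorted so that ('apple', 'fruit') and ('fruit', 'apple')
    count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

# Hypothetical toy corpus of pre-tokenized short documents.
corpus = [
    ["apple", "fruit", "sweet"],
    ["apple", "phone", "screen"],
]

# Pool biterms over the whole corpus, as BTM does, rather than per document.
corpus_biterms = Counter(b for doc in corpus for b in extract_biterms(doc))
print(corpus_biterms[("apple", "fruit")])
```

Pooling the biterms in this way is what lets the model sidestep per-document sparsity: even if each short text yields only a handful of pairs, the aggregated counts across the corpus are rich enough to estimate topic-word distributions.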