Scalable multi stage clustering of tagged micro-messages

Authors:
Oren Tsur; Adi Littman; Ari Rappoport
Affiliations:
The Hebrew University, Jerusalem, Israel; The Hebrew University, Jerusalem, Israel; The Hebrew University, Jerusalem, Israel
Venue:
Proceedings of the 21st international conference companion on World Wide Web
Year:
2012

Abstract

The growing popularity of microblogging, backed by services like Twitter, Facebook, Google+ and LinkedIn, raises the challenge of clustering short and extremely sparse documents. In this work we propose SMSC, a scalable, accurate and efficient multi-stage clustering algorithm. Our algorithm leverages users' practice of adding tags to some messages by bootstrapping over virtual non-sparse documents. We experiment on a large corpus of tweets from Twitter and evaluate results against a gold-standard classification validated by seven clustering evaluation measures (information-theoretic, paired and greedy). Results show that the algorithm is both accurate and efficient, significantly outperforming other algorithms. Under reasonable practical assumptions, our algorithm scales up sublinearly in time.
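
The abstract does not spell out the individual stages, so the following is only an illustrative sketch of the general idea of bootstrapping over tag-based virtual documents, not the SMSC algorithm itself. It assumes three stages that the abstract does not confirm: pooling the words of all messages that share a hashtag into one virtual document, greedily grouping those virtual documents by cosine similarity, and assigning any message to the most similar group. All function names, the greedy grouping step and the threshold value are hypothetical choices made for this example.

```python
import re
from collections import Counter, defaultdict
from math import sqrt

HASHTAG = re.compile(r"#(\w+)")

def tokens(text):
    """Lowercased word tokens; hashtags themselves are excluded."""
    return [w for w in re.findall(r"#?\w+", text.lower()) if not w.startswith("#")]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_virtual_docs(messages):
    """Assumed stage 1: pool the words of all messages sharing a hashtag."""
    docs = defaultdict(Counter)
    for msg in messages:
        for tag in HASHTAG.findall(msg.lower()):
            docs[tag].update(tokens(msg))
    return dict(docs)

def cluster_virtual_docs(docs, threshold=0.3):
    """Assumed stage 2: greedy single-pass grouping of virtual documents.

    The threshold is an arbitrary illustrative value, not taken from the paper.
    """
    clusters = []  # each cluster: {"tags": set of tags, "centroid": Counter}
    for tag in sorted(docs):
        vec = docs[tag]
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"tags": {tag}, "centroid": Counter(vec)})
        else:
            best["tags"].add(tag)
            best["centroid"].update(vec)
    return clusters

def assign(message, clusters):
    """Assumed stage 3: index of the most similar cluster, or None if no overlap."""
    vec = Counter(tokens(message))
    scored = [(cosine(vec, c["centroid"]), i) for i, c in enumerate(clusters)]
    best_sim, best_i = max(scored, default=(0.0, None))
    return best_i if best_sim > 0 else None
```

In this sketch, an untagged message such as "what a goal by that striker" would be placed by calling `assign(message, clusters)` after the clusters have been built from tagged messages. Pooling by tag is what makes the documents non-sparse: a single short message shares few words with any other, while a pooled document has enough vocabulary for similarity scores to be meaningful.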