Seeking stable clusters in the blogosphere

Authors:
Nilesh Bansal;Fei Chiang;Nick Koudas;Frank Wm. Tompa
Affiliations:
University of Toronto;University of Toronto;University of Toronto;University of Waterloo
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Abstract

The popularity of blogs has increased dramatically over the last couple of years. As topics evolve in the blogosphere, keywords align together and form the heart of various stories. Intuitively, we expect that in certain contexts, when there is a lot of discussion on a specific topic or event, a set of keywords will be correlated: the keywords in the set will frequently appear together (pair-wise or in conjunction), forming a cluster. Note that such keyword clusters are temporal (associated with specific time periods) and transient. As topics recede, the associated keyword clusters dissolve because their keywords no longer appear together frequently. In this paper, we formalize this intuition and present efficient algorithms to identify keyword clusters in large collections of blog posts for specific temporal intervals. We then formalize problems related to the temporal properties of such clusters. In particular, we present efficient algorithms to identify clusters that persist over time. Given the vast amounts of data involved, we present algorithms that are fast (they can efficiently process millions of blogs with multiple millions of posts) and take special care to make them efficiently realizable in secondary storage. Although we instantiate our techniques in the context of blogs, our methodology is generic enough to apply equally well to any temporally ordered text source. We present the results of an experimental study using both real and synthetic data sets, demonstrating the efficiency of our algorithms, both in terms of performance and in terms of the quality of the keyword clusters and associated temporal properties we identify.
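The intuition in the abstract can be illustrated with a small in-memory sketch. It is not the algorithm from the paper: the abstract does not specify the correlation measure, the clustering step, or the persistence criterion, so the sketch assumes a raw co-occurrence count threshold, connected components as clusters, and Jaccard overlap between clusters of consecutive intervals. The names `keyword_clusters`, `persistence_lengths`, `min_pair_count`, and `min_overlap` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def keyword_clusters(posts, min_pair_count=2):
    """Cluster keywords for one time interval.

    `posts` is a list of keyword lists, one per post. Two keywords are
    linked when they co-occur in at least `min_pair_count` posts
    (an assumed stand-in for a proper correlation test).
    """
    pair_counts = Counter()
    for post in posts:
        for pair in combinations(sorted(set(post)), 2):
            pair_counts[pair] += 1

    # Build an undirected keyword graph from the frequent pairs.
    neighbours = {}
    for (a, b), count in pair_counts.items():
        if count >= min_pair_count:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    # Assumption: each connected component is treated as one cluster.
    clusters, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters


def jaccard(a, b):
    """Jaccard similarity of two non-empty keyword sets."""
    return len(a & b) / len(a | b)


def persistence_lengths(clusters_per_interval, min_overlap=0.5):
    """For each interval, map every cluster to the number of consecutive
    intervals (ending at that interval) over which it has persisted.

    A cluster continues a cluster of the previous interval when their
    Jaccard similarity is at least `min_overlap` (an assumed criterion).
    """
    history, previous = [], {}
    for clusters in clusters_per_interval:
        current = {}
        for cluster in clusters:
            runs = [length for old, length in previous.items()
                    if jaccard(cluster, old) >= min_overlap]
            current[cluster] = 1 + max(runs, default=0)
        history.append(current)
        previous = current
    return history
```

Calling `keyword_clusters` once per interval (for example, per day) and passing the resulting list to `persistence_lengths` gives, for every cluster, how long it has survived so far. Everything here is held in memory, so the sketch ignores the secondary-storage concerns that the abstract emphasizes.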