Efficient community detection in large networks using content and links

Authors:
Yiye Ruan;David Fuhry;Srinivasan Parthasarathy
Affiliations:
The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA
Venue:
Proceedings of the 22nd international conference on World Wide Web
Year:
2013

Citing 21
Cited 0

Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs

SIAM Journal on Scientific Computing
Random Sampling in Cut, Flow, and Network Design Problems

Mathematics of Operations Research
Relevance score normalization for metasearch

Proceedings of the tenth international conference on Information and knowledge management
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Cluster ensembles --- a knowledge reuse framework for combining multiple partitions

The Journal of Machine Learning Research
Latent dirichlet allocation

The Journal of Machine Learning Research
Kernel k-means: spectral clustering and normalized cuts

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Probabilistic models for discovering e-communities

Proceedings of the 15th international conference on World Wide Web
Combining content and link for classification using matrix factorization

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient aggregation for graph summarization

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Joint latent topic models for text and citations

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Scalable graph clustering using stochastic flows: applications to community discovery

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Combining link and content for community detection: a discriminative approach

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Clustering with Multiple Graphs

ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Empirical comparison of algorithms for network community detection

Proceedings of the 19th international conference on World wide web
Sampling community structure

Proceedings of the 19th international conference on World wide web
Clustering Large Attributed Graphs: An Efficient Incremental Approach

ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Local graph sparsification for scalable clustering

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Outlier detection in graph streams

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks. Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient. We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering. Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.