Min-wise independent permutations (extended abstract)
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs
SIAM Journal on Scientific Computing
Random Sampling in Cut, Flow, and Network Design Problems
Mathematics of Operations Research
Relevance score normalization for metasearch
Proceedings of the tenth international conference on Information and knowledge management
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Cluster ensembles --- a knowledge reuse framework for combining multiple partitions
The Journal of Machine Learning Research
The Journal of Machine Learning Research
Kernel k-means: spectral clustering and normalized cuts
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Probabilistic models for discovering e-communities
Proceedings of the 15th international conference on World Wide Web
Combining content and link for classification using matrix factorization
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient aggregation for graph summarization
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Joint latent topic models for text and citations
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Scalable graph clustering using stochastic flows: applications to community discovery
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Combining link and content for community detection: a discriminative approach
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Clustering with Multiple Graphs
ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Empirical comparison of algorithms for network community detection
Proceedings of the 19th international conference on World wide web
Proceedings of the 19th international conference on World wide web
Clustering Large Attributed Graphs: An Efficient Incremental Approach
ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Local graph sparsification for scalable clustering
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Outlier detection in graph streams
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part I
Hi-index | 0.00 |
In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks. Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient. We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering. Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.