Sampling from large graphs

  • Authors: Jure Leskovec; Christos Faloutsos
  • Affiliations: Carnegie Mellon University, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA
  • Venue: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year: 2006

Abstract

Given a huge real graph, how can we derive a representative sample? There are many known algorithms to compute interesting measures (shortest paths, centrality, betweenness, etc.), but several of them become impractical for large graphs. Thus graph sampling is essential.

The natural questions to ask are (a) which sampling method to use, (b) how small the sample size can be, and (c) how to scale up the measurements of the sample (e.g., the diameter) to get estimates for the large graph. The deeper, underlying question is subtle: how do we measure success?

We answer the above questions and test our answers by thorough experiments on several diverse datasets, spanning thousands of nodes and edges. We consider several sampling methods, propose novel methods to check the goodness of sampling, and develop a set of scaling laws that describe relations between the properties of the original graph and of the sample.

In addition to the theoretical contributions, the practical conclusions from our work are: sampling strategies based on edge selection do not perform well, while simple uniform random node selection performs surprisingly well. Overall, the best-performing methods are those based on random walks and "forest fire"; they match both static and evolutionary graph patterns very accurately, with sample sizes down to about 15% of the original graph.
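
For readers who want to experiment, below is a minimal Python sketch (using networkx) of two of the sampling strategies the abstract names: uniform random node selection and random-walk sampling. It is not the authors' reference implementation; the 15% sample fraction, the 0.15 restart probability, and the synthetic test graph are illustrative assumptions.

    # Minimal sketch of two sampling strategies named in the abstract:
    # uniform random node selection and random-walk sampling. NOT the
    # authors' reference code; the 15% sample fraction, 0.15 restart
    # probability, and the synthetic test graph are illustrative assumptions.
    import random
    import networkx as nx

    def random_node_sample(G, fraction=0.15):
        """Keep a uniform random subset of nodes plus the edges among them."""
        k = max(1, int(fraction * G.number_of_nodes()))
        nodes = random.sample(list(G.nodes()), k)
        return G.subgraph(nodes).copy()

    def random_walk_sample(G, fraction=0.15, restart_prob=0.15, max_steps=200_000):
        """Grow a sample by walking the graph, jumping back to the start node
        with probability `restart_prob` so the walk cannot get stuck."""
        target = max(1, int(fraction * G.number_of_nodes()))
        start = random.choice(list(G.nodes()))
        visited = {start}
        current = start
        for _ in range(max_steps):
            if len(visited) >= target:
                break
            if random.random() < restart_prob:
                current = start
                continue
            neighbors = list(G.neighbors(current))
            if not neighbors:            # dead end: jump back to the start
                current = start
                continue
            current = random.choice(neighbors)
            visited.add(current)
        return G.subgraph(visited).copy()

    if __name__ == "__main__":
        G = nx.barabasi_albert_graph(10_000, 3)   # stand-in for a large real graph
        for name, S in [("random node", random_node_sample(G)),
                        ("random walk", random_walk_sample(G))]:
            print(f"{name}: {S.number_of_nodes()} nodes, {S.number_of_edges()} edges")

One way to judge the goodness of a sample, in the spirit of the paper's evaluation, is to compare distributions of graph properties (e.g., degree distributions via nx.degree_histogram) between the sample and the original graph.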