Online social networking platforms regularly support hundreds of millions of users, who in aggregate generate substantially more data than can be stored on any single physical server. As such, user data are distributed, or sharded, across many machines. A key requirement in this setting is rapid retrieval not only of a given user's information, but also of all data associated with his or her social contacts, suggesting that one should consider the topology of the social network in selecting a sharding policy. In this paper we formalize the problem of efficiently sharding large social network databases, and evaluate several sharding strategies, both analytically and empirically. We find that random sharding, the de facto standard, results in provably poor performance even when frequently accessed nodes are replicated to many shards. By contrast, we demonstrate that one can substantially reduce querying costs by identifying and assigning tightly knit communities to shards. In particular, our theoretical analysis motivates a novel, scalable sharding algorithm that outperforms both random and location-based sharding schemes.
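To make the contrast between sharding policies concrete, the following is a minimal sketch rather than the paper's algorithm or its formal cost model. It assumes a toy graph of two four-node cliques joined by one edge, and it assumes the cost of a query is the number of distinct shards that hold a user and his or her contacts. The topology-blind policy here assigns nodes by `node % 2`, a deterministic stand-in for random sharding; replication is not modeled. All names (`build_toy_graph`, `shards_touched`, `average_query_cost`) are hypothetical.

```python
from itertools import combinations


def build_toy_graph():
    """Two 4-node cliques (0-3 and 4-7) joined by a single bridge edge (3, 4)."""
    adjacency = {node: set() for node in range(8)}
    edges = (
        list(combinations(range(4), 2))
        + list(combinations(range(4, 8), 2))
        + [(3, 4)]
    )
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def shards_touched(adjacency, shard_of, user):
    """Assumed cost metric: distinct shards holding the user and all contacts."""
    shards = {shard_of[user]}
    shards.update(shard_of[contact] for contact in adjacency[user])
    return len(shards)


def average_query_cost(adjacency, shard_of):
    """Mean number of shards contacted when querying each user once."""
    total = sum(shards_touched(adjacency, shard_of, u) for u in adjacency)
    return total / len(adjacency)


graph = build_toy_graph()

# Community-aware policy: each clique lives on its own shard.
community_sharding = {node: 0 if node < 4 else 1 for node in graph}

# Topology-blind policy (stand-in for random sharding): alternate shards by id.
blind_sharding = {node: node % 2 for node in graph}

community_cost = average_query_cost(graph, community_sharding)
blind_cost = average_query_cost(graph, blind_sharding)

print("community-aware:", community_cost)
print("topology-blind: ", blind_cost)
```

On this toy graph, only the two bridge endpoints need a second shard under the community-aware assignment, while every query touches both shards under the topology-blind one. Real social graphs are far less cleanly separable, so the gap here is illustrative only.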