Serving large-scale batch computed data with Project Voldemort

Authors:
Roshan Sumbaly;Jay Kreps;Lei Gao;Alex Feinberg;Chinmay Soman;Sam Shah
Affiliations:
LinkedIn;LinkedIn;LinkedIn;LinkedIn;LinkedIn;LinkedIn
Venue:
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Year:
2012

Abstract

Current serving systems lack the ability to bulk load massive immutable data sets without affecting serving performance. The performance degradation is largely due to index creation and modification, because CPU and memory resources are shared with request serving. We have extended Project Voldemort, a general-purpose distributed storage and serving system inspired by Amazon's Dynamo, to support bulk loading terabytes of read-only data. This extension constructs the index offline by leveraging the fault tolerance and parallelism of Hadoop. Compared to MySQL, our compact storage format and data deployment pipeline scale to twice the request throughput while maintaining sub-5 ms median latency. At LinkedIn, the largest professional social network, this system has been running in production for more than two years and serves many of the data-intensive social features on the site.
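
The idea of building the index offline, so that serving nodes only read it, can be illustrated with a minimal single-machine sketch. The abstract does not describe the on-disk layout, so everything below is an assumption made for illustration: the fixed-width index entries (a 16-byte MD5 digest of the key plus a 4-byte offset), the length-prefixed data file, and the function names `build_read_only_store` and `lookup` are hypothetical, and the Hadoop job that would parallelize the build step is omitted.

```python
import hashlib
import struct

DIGEST_SIZE = 16               # MD5 digest length in bytes
ENTRY_SIZE = DIGEST_SIZE + 4   # digest + 4-byte big-endian offset (assumed layout)


def build_read_only_store(records):
    """Offline step: turn a dict of bytes -> bytes into (index, data) blobs.

    Entries are sorted by key digest so the serving side can binary search
    the index without ever modifying it. Digest collisions are ignored here.
    """
    entries = sorted((hashlib.md5(key).digest(), value)
                     for key, value in records.items())
    index = bytearray()
    data = bytearray()
    for digest, value in entries:
        # Each index entry points at the start of a length-prefixed value.
        index += digest + struct.pack(">I", len(data))
        data += struct.pack(">I", len(value)) + value
    return bytes(index), bytes(data)


def lookup(index, data, key):
    """Serving step: binary search the prebuilt index, then read the value."""
    digest = hashlib.md5(key).digest()
    lo, hi = 0, len(index) // ENTRY_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        probe = index[mid * ENTRY_SIZE: mid * ENTRY_SIZE + DIGEST_SIZE]
        if probe < digest:
            lo = mid + 1
        else:
            hi = mid
    start = lo * ENTRY_SIZE
    if start >= len(index) or index[start:start + DIGEST_SIZE] != digest:
        return None  # key was not part of the bulk-loaded data set
    (offset,) = struct.unpack(">I", index[start + DIGEST_SIZE:start + ENTRY_SIZE])
    (length,) = struct.unpack(">I", data[offset:offset + 4])
    return data[offset + 4:offset + 4 + length]
```

In this sketch the serving path performs only reads, which is the property the abstract relies on: the cost of index construction is paid once, away from the machines that answer requests.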