Fast crash recovery in RAMCloud

  • Authors: Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, Mendel Rosenblum
  • Affiliations: Stanford University (all authors)
  • Venue: SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
  • Year: 2011

Abstract

RAMCloud is a DRAM-based storage system that provides inexpensive durability and availability by recovering quickly after crashes, rather than storing replicas in DRAM. RAMCloud scatters backup data across hundreds or thousands of disks, and it harnesses hundreds of servers in parallel to reconstruct lost data. The system uses a log-structured approach for all its data, in DRAM as well as on disk: this provides high performance both during normal operation and during recovery. RAMCloud employs randomized techniques to manage the system in a scalable and decentralized fashion. In a 60-node cluster, RAMCloud recovers 35 GB of data from a failed server in 1.6 seconds. Our measurements suggest that the approach will scale to recover larger memory sizes (64 GB or more) in less time with larger clusters.
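To put the reported figures in perspective, the following back-of-the-envelope sketch converts "35 GB in 1.6 seconds on a 60-node cluster" into an aggregate recovery rate and a rough per-node share. The function name `recovery_throughput` and the even split across all 60 nodes are assumptions made for illustration only. The abstract does not say how many nodes take part in a given recovery or how the work is divided among them.

```python
def recovery_throughput(data_gb, seconds, nodes):
    """Return (aggregate GB/s, per-node GB/s) for a recovery.

    Assumes, for illustration only, that the work is spread evenly
    over `nodes` machines; the abstract does not state this.
    """
    aggregate = data_gb / seconds   # total data recovered per second
    per_node = aggregate / nodes    # hypothetical even share per machine
    return aggregate, per_node


# Figures taken from the abstract: 35 GB, 1.6 s, 60-node cluster.
aggregate, per_node = recovery_throughput(35.0, 1.6, 60)
print(f"aggregate: {aggregate:.1f} GB/s, per node: {per_node:.2f} GB/s")
```

Under these assumptions the aggregate rate is roughly 21.9 GB/s, or about 0.36 GB/s per node. This is consistent with the abstract's point that the system relies on many servers and disks working in parallel rather than on any single fast machine.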