Decoupling storage and computation in Hadoop with SuperDataNodes

Authors:
George Porter
Affiliations:
UC San Diego, La Jolla, CA
Venue:
ACM SIGOPS Operating Systems Review
Year:
2010

Abstract

The rise of ad-hoc data-intensive computing has led to the development of data-parallel programming systems such as Map/Reduce and Hadoop, which achieve scalability by tightly coupling storage and computation. This coupling can be limiting when the ratio of computation to storage is not known in advance or changes over time. In this work, we examine decoupling storage and computation in Hadoop through SuperDataNodes, which are servers that contain an order of magnitude more disks than traditional Hadoop nodes. We found that SuperDataNodes are not only capable of supporting workloads with high storage-to-processing ratios, but in some cases can outperform traditional Hadoop deployments through better management of a large centralized pool of disks.