Scale-up vs scale-out for Hadoop: time to rethink?

Authors:
Raja Appuswamy;Christos Gkantsidis;Dushyanth Narayanan;Orion Hodson;Antony Rowstron
Affiliations:
Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Citing 22
Cited 0

Parallel database systems: the future of high performance database systems

Communications of the ACM
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Evaluating MapReduce for Multi-core and Multiprocessor Systems

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
FAWN: a fast array of wimpy nodes

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Web search using mobile cores: quantifying and mitigating the price of efficiency

Proceedings of the 37th annual international symposium on Computer architecture
Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Wimpy node clusters: what about non-wimpy workloads?

Proceedings of the Sixth International Workshop on Data Management on New Hardware
Piccolo: building fast, distributed programs with partitioned tables

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Mesos: a platform for fine-grained resource sharing in the data center

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Phoenix++: modular MapReduce for shared-memory systems

Proceedings of the second international workshop on MapReduce and its applications
Windows Azure Storage: a highly available cloud storage service with strong consistency

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Mahout in Action

Mahout in Action
Nobody ever got fired for using Hadoop on a cluster

Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
PACMan: coordinated memory caching for parallel jobs

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads

Proceedings of the VLDB Endowment
GraphChi: large-scale graph computation on just a PC

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Omega: flexible, scalable schedulers for large compute clusters

Proceedings of the 8th ACM European Conference on Computer Systems
Piranha: optimizing short jobs in Hadoop

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach? Our measurements as well as other recent work shows that the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing. We claim that a single "scale-up" server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density. We present an evaluation across 11 representative Hadoop jobs that shows scale-up to be competitive in all cases and significantly better in some cases, than scale-out. To achieve that performance, we describe several modifications to the Hadoop runtime that target scale-up configuration. These changes are transparent, do not require any changes to application code, and do not compromise scale-out performance; at the same time our evaluation shows that they do significantly improve Hadoop's scale-up performance.