Nobody ever got fired for using Hadoop on a cluster

Authors:
Antony Rowstron;Dushyanth Narayanan;Austin Donnelly;Greg O'Shea;Andrew Douglas
Affiliations:
Microsoft Research, Cambridge;Microsoft Research, Cambridge;Microsoft Research, Cambridge;Microsoft Research, Cambridge;Microsoft Research, Cambridge
Venue:
Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing
Year:
2012

Abstract

The norm for data analytics is now to run jobs on commodity clusters with MapReduce-like abstractions. One only needs to read the popular blogs to see the evidence of this. We believe that we could now say that "nobody ever got fired for using Hadoop on a cluster"! We completely agree that Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger. However, in this position paper we ask whether this is the right path for general-purpose data analytics. Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB). Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with hundreds of GB of DRAM. We therefore ask: should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.