Parallel database systems: the future of high performance database systems
Communications of the ACM
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Evaluating MapReduce for Multi-core and Multiprocessor Systems
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
SCOPE: easy and efficient parallel processing of massive data sets
Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
FAWN: a fast array of wimpy nodes
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Web search using mobile cores: quantifying and mitigating the price of efficiency
Proceedings of the 37th annual international symposium on Computer architecture
Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Wimpy node clusters: what about non-wimpy workloads?
Proceedings of the Sixth International Workshop on Data Management on New Hardware
Piccolo: building fast, distributed programs with partitioned tables
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Mesos: a platform for fine-grained resource sharing in the data center
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Phoenix++: modular MapReduce for shared-memory systems
Proceedings of the second international workshop on MapReduce and its applications
Windows Azure Storage: a highly available cloud storage service with strong consistency
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Mahout in Action
Nobody ever got fired for using Hadoop on a cluster
Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
PACMan: coordinated memory caching for parallel jobs
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads
Proceedings of the VLDB Endowment
GraphChi: large-scale graph computation on just a PC
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Omega: flexible, scalable schedulers for large compute clusters
Proceedings of the 8th ACM European Conference on Computer Systems
Piranha: optimizing short jobs in Hadoop
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach? Our measurements as well as other recent work shows that the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing. We claim that a single "scale-up" server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density. We present an evaluation across 11 representative Hadoop jobs that shows scale-up to be competitive in all cases and significantly better in some cases, than scale-out. To achieve that performance, we describe several modifications to the Hadoop runtime that target scale-up configuration. These changes are transparent, do not require any changes to application code, and do not compromise scale-out performance; at the same time our evaluation shows that they do significantly improve Hadoop's scale-up performance.