Only aggressive elephants are fast elephants

Authors:
Jens Dittrich;Jorge-Arnulfo Quiané-Ruiz;Stefan Richter;Stefan Schuh;Alekh Jindal;Jörg Schad
Affiliations:
Information Systems Group, Saarland University;Information Systems Group, Saarland University;Information Systems Group, Saarland University;Information Systems Group, Saarland University;Information Systems Group, Saarland University;Information Systems Group, Saarland University
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Citing 27
Cited 7

Making B+-trees cache conscious in main memory

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Chord: A scalable peer-to-peer lookup service for internet applications

Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Weaving Relations for Cache Performance

Proceedings of the 27th International Conference on Very Large Data Bases
Index Selection for Databases: A Hardness Study and a Principled Heuristic Solution

IEEE Transactions on Knowledge and Data Engineering
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce: a flexible data processing tool

Communications of the ACM - Amir Pnueli: Ahead of His Time
Constrained physical design tuning

The VLDB Journal — The International Journal on Very Large Data Bases
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Proceedings of the 5th European conference on Computer systems
A comparison of join algorithms for log processing in MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Data warehousing and analytics infrastructure at facebook

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Manimal: relational optimization for data-intensive programs

Proceedings of the 13th International Workshop on the Web and Databases
Energy management for MapReduce clusters

Proceedings of the VLDB Endowment
Runtime measurements in the cloud: observing, analyzing, and reducing variance

Proceedings of the VLDB Endowment
The performance of MapReduce: an in-depth study

Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
Cheetah: a high performance, custom data warehouse on top of MapReduce

Proceedings of the VLDB Endowment
Hadoop: The Definitive Guide

Hadoop: The Definitive Guide
CoPhy: a scalable, portable, and interactive index advisor for large workloads

Proceedings of the VLDB Endowment
Automatic optimization for MapReduce programs

Proceedings of the VLDB Endowment
Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
Full-text indexing for optimizing selection operations in large-scale data analytics

Proceedings of the second international workshop on MapReduce and its applications
In-situ MapReduce for log processing

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
CoHadoop: flexible data placement and its exploitation in Hadoop

Proceedings of the VLDB Endowment
RAFTing MapReduce: Fast recovery on the RAFT

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Trojan data layouts: right shoes for a running elephant

Proceedings of the 2nd ACM Symposium on Cloud Computing
Efficient big data processing in Hadoop MapReduce

Proceedings of the VLDB Endowment

Efficient big data processing in Hadoop MapReduce

Proceedings of the VLDB Endowment
Eagle-eyed elephant: split-oriented indexing in Hadoop

Proceedings of the 16th International Conference on Extending Database Technology
CARTILAGE: adding flexibility to the Hadoop skeleton

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
HadoopProv: towards provenance as a first class citizen in MapReduce

TaPP'13 Proceedings of the 5th USENIX conference on Theory and Practice of Provenance
HadoopProv: towards provenance as a first class citizen in MapReduce

Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance
Mosquito: another one bites the data upload stream

Proceedings of the VLDB Endowment
Instant loading for main memory databases

Proceedings of the VLDB Endowment

Abstract

Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
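
The core idea in the abstract, a different clustered index on each replica of a data block, can be illustrated with a small in-memory sketch. This is not HAIL's implementation and uses no HDFS or Hadoop API: a "block" is assumed to be a list of records, each "replica" is a copy of that block sorted on a different column, and the names `build_replicas` and `range_query` are hypothetical. A selection on an indexed column is answered by binary search on the matching replica; any other column falls back to a full scan.

```python
from bisect import bisect_left, bisect_right
from operator import itemgetter


def build_replicas(block, sort_columns):
    """Create one replica per column, each sorted (clustered) on that column.

    Returns {column: (sorted_keys, sorted_rows)}. The logical content of every
    replica is identical; only the physical order differs.
    """
    replicas = {}
    for col in sort_columns:
        rows = sorted(block, key=itemgetter(col))
        keys = [row[col] for row in rows]
        replicas[col] = (keys, rows)
    return replicas


def range_query(replicas, col, low, high):
    """Return all rows with low <= row[col] <= high."""
    if col in replicas:
        # A replica clustered on this column exists: binary search its keys.
        keys, rows = replicas[col]
        return rows[bisect_left(keys, low):bisect_right(keys, high)]
    # No matching index: scan an arbitrary replica in full.
    _, rows = next(iter(replicas.values()))
    return [row for row in rows if low <= row[col] <= high]


# Illustrative data only; the column names are assumptions, not from the paper.
block = [
    {"url": "a.com", "visits": 7, "revenue": 12.5},
    {"url": "b.com", "visits": 2, "revenue": 40.0},
    {"url": "c.com", "visits": 9, "revenue": 3.0},
    {"url": "d.com", "visits": 4, "revenue": 21.0},
]

# Replication factor three: three replicas, three different sort orders.
replicas = build_replicas(block, ["url", "visits", "revenue"])
by_visits = range_query(replicas, "visits", 3, 8)
by_revenue = range_query(replicas, "revenue", 10.0, 30.0)
```

With the default replication factor of three, this layout lets selections on any of three attributes use an index, whereas identical replicas could support at most one sort order.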