Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
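The idea of keeping a different clustered index on each block replica can be illustrated with a minimal sketch. This is not HAIL's implementation: in-memory Python lists stand in for HDFS block replicas, and the names `BLOCK`, `build_replicas`, and `select_equal` as well as the sample rows are hypothetical. Each replica holds the same rows sorted on a different column, and a selection uses binary search on the replica whose sort column matches the filter, falling back to a full scan otherwise.

```python
from bisect import bisect_left, bisect_right
from operator import itemgetter

# Hypothetical toy block: rows of (user_id, age, country_code).
BLOCK = [
    (3, 41, "DE"),
    (1, 25, "US"),
    (2, 33, "FR"),
    (5, 25, "DE"),
    (4, 58, "US"),
]


def build_replicas(block, sort_columns):
    """Build one sorted copy of the block per column (one copy per replica)."""
    replicas = {}
    for col in sort_columns:
        rows = sorted(block, key=itemgetter(col))  # clustered on this column
        keys = [row[col] for row in rows]          # sorted keys for binary search
        replicas[col] = (keys, rows)
    return replicas


def select_equal(replicas, block, col, value):
    """Return (matching rows, access path) for the predicate row[col] == value."""
    if col in replicas:
        keys, rows = replicas[col]
        lo = bisect_left(keys, value)
        hi = bisect_right(keys, value)
        return rows[lo:hi], "index"
    # No replica is sorted on this column: read every row.
    return [row for row in block if row[col] == value], "scan"


# Three replicas (the default replication factor), each sorted differently.
replicas = build_replicas(BLOCK, sort_columns=[0, 1, 2])
by_age, path = select_equal(replicas, BLOCK, col=1, value=25)
```

Because the data is replicated anyway, sorting each copy on a different column lets selections on any of those columns avoid a full scan without storing extra copies.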