Eagle-eyed elephant: split-oriented indexing in Hadoop

Authors:
Mohamed Y. Eltabakh;Fatma Özcan;Yannis Sismanis;Peter J. Haas;Hamid Pirahesh;Jan Vondrak
Affiliations:
Worcester Polytechnic Institute, Worcester, MA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
Proceedings of the 16th International Conference on Extending Database Technology
Year:
2013

Abstract

An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. In such scenarios it has been increasingly recognized that, as in classical databases, techniques for avoiding access to irrelevant data can dramatically improve query performance. Prior work on Hadoop, however, has simply ported classical techniques to the MapReduce setting, focusing on record-level indexing and key-based partition elimination. Unfortunately, record-level indexing only slightly improves overall query performance, because it does not minimize the number of mapper "waves", which is determined by the number of processed splits. Moreover, key-based partitioning requires data reorganization, which is usually impractical in Hadoop settings. We therefore need to re-envision how data access mechanisms are defined and implemented. To this end, we introduce the Eagle-Eyed Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data. Our experiments show that E3 can achieve up to 20x cost savings with small to moderate storage overheads.
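The split-elimination idea in the abstract can be illustrated with a minimal sketch. This is not the E3 implementation: the function names (`build_split_index`, `relevant_splits`, `mapper_waves`), the toy records, and the slot count are all hypothetical, and the sketch assumes equality predicates only. It builds an inverted index from (field, value) pairs to split ids, intersects the posting sets to find candidate splits, and shows how fewer splits can mean fewer mapper waves.

```python
import math
from collections import defaultdict


def build_split_index(splits):
    """Map each (field, value) pair to the set of split ids that contain it."""
    index = defaultdict(set)
    for split_id, records in splits.items():
        for record in records:
            for field, value in record.items():
                index[(field, value)].add(split_id)
    return index


def relevant_splits(index, all_split_ids, predicates):
    """Return the splits that may hold a record matching every equality predicate.

    The index works at split granularity, so the answer is conservative:
    a returned split may still contain no matching record.
    """
    candidates = set(all_split_ids)
    for field, value in predicates.items():
        candidates &= index.get((field, value), set())
    return candidates


def mapper_waves(num_splits, map_slots):
    """Number of waves needed when each map slot processes one split at a time."""
    return math.ceil(num_splits / map_slots)


# Hypothetical toy dataset: four splits, each holding a few records.
splits = {
    0: [{"city": "Boston", "year": 2011}, {"city": "Austin", "year": 2012}],
    1: [{"city": "Denver", "year": 2011}],
    2: [{"city": "Boston", "year": 2012}],
    3: [{"city": "Austin", "year": 2010}],
}

index = build_split_index(splits)
hits = relevant_splits(index, splits.keys(), {"city": "Boston", "year": 2012})

print(sorted(hits))                    # candidate splits for the query
print(mapper_waves(len(splits), 2))    # waves when scanning every split
print(mapper_waves(len(hits), 2))      # waves after split elimination
```

In this toy data, split 0 is returned even though no single record in it matches both predicates: "Boston" and 2012 occur in different records of the same split. A split-level index of this kind can therefore only rule splits out, never confirm a match, so the mappers still have to apply the predicate to each record they read.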