Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Authors:
Jens Dittrich;Jorge-Arnulfo Quiané-Ruiz;Alekh Jindal;Yagiz Kargin;Vinay Setty;Jörg Schad
Affiliations:
Saarland University;Saarland University;Saarland University and International Max Planck Research School for Computer Science;International Max Planck Research School for Computer Science;International Max Planck Research School for Computer Science;Saarland University
Venue:
Proceedings of the VLDB Endowment
Year:
2010



Abstract

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from the inside. This has three important consequences. First, Hadoop++ significantly outperforms Hadoop. Second, any future changes to Hadoop can be used directly with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.