Parallel data processing with MapReduce: a survey

Authors:
Kyong-Ha Lee;Yoon-Joon Lee;Hyunsik Choi;Yon Dohn Chung;Bongki Moon
Affiliations:
KAIST;KAIST;Korea University;Korea University;University of Arizona
Venue:
ACM SIGMOD Record
Year:
2012

Abstract

MapReduce, a prominent parallel data processing tool, is gaining significant momentum in both industry and academia as the volume of data to analyze grows rapidly. Although MapReduce is used in many areas that require massive data analysis, its performance, its efficiency per node, and its simple abstraction are still debated. This survey is intended to help the database and open source communities understand various technical aspects of the MapReduce framework. We characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce optimization strategies reported in the recent literature. Finally, we discuss open issues and challenges in parallel data analysis with MapReduce.