Experiences with MapReduce, an abstraction for large-scale computation

Authors:
Jeffrey Dean
Affiliations:
Google, Inc., Mountain View, CA
Venue:
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Year:
2006

Citing 0
Cited 27

SQL TVF Controlling Forms - Express Structured Parallel Data Intensive Computing

DEXA '08 Proceedings of the 19th international conference on Database and Expert Systems Applications
User Defined Partitioning - Group Data Based on Computation Model

DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
Using realistic simulation for performance analysis of mapreduce setups

Proceedings of the 1st ACM workshop on Large-Scale system and application performance
Scaling-Up and Speeding-Up Video Analytics Inside Database Engine

DEXA '09 Proceedings of the 20th International Conference on Database and Expert Systems Applications
Extend UDF Technology for Integrated Analytics

DaWaK '09 Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery
Efficiently support MapReduce-like computation models inside parallel DBMS

IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
FlumeJava: easy, efficient data-parallel pipelines

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Making cloud intermediate data fault-tolerant

Proceedings of the 1st ACM symposium on Cloud computing
Assigning tasks for efficiency in Hadoop: extended abstract

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
MRAP: a novel MapReduce-based framework to support HPC analytics applications with access patterns

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
On availability of intermediate data in cloud computations

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Generalized UDF for analytics inside database engine

WAIM'10 Proceedings of the 11th international conference on Web-age information management
Scale out parallel and distributed CDR stream analytics

Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Scalable information extraction for web queries

International Journal of Computational Science and Engineering
Data stream analytics as cloud service for mobile applications

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems: Part II
Continuous mapreduce for In-DB stream analytics

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Experience in Continuous analytics as a Service (CaaaS)

Proceedings of the 14th International Conference on Extending Database Technology
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
HiTune: dataflow-based performance analysis for big data cloud

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Query engine grid for executing SQL streaming process

Globe'11 Proceedings of the 4th international conference on Data management in grid and peer-to-peer systems
On the duality of data-intensive file system design: reconciling HDFS and PVFS

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
SQL streaming process in query engine net

OTM'11 Proceedings of the 2011th Confederated international conference on On the move to meaningful internet systems - Volume Part I
A survey of emerging approaches to spam filtering

ACM Computing Surveys (CSUR)
HiTune: dataflow-based performance analysis for big data cloud

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Halt or continue: estimating progress of queries in the cloud

DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part II
Understanding the effects and implications of compute node related failures in hadoop

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
SymGrid: a framework for symbolic computation on the grid

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.

The MapReduce run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: thousands of MapReduce programs have been implemented and several thousand MapReduce jobs are executed on Google's clusters every day.

In this talk I'll describe the basic programming model, discuss our experience using it in a variety of domains, and talk about the implications of programming models like MapReduce as one paradigm to simplify development of parallel software for multi-core microprocessors.
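
The programming model described in the abstract can be illustrated with a minimal single-process sketch in Python. This is not Google's implementation or API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, the word-count task is chosen only as an example, and the sketch omits everything the run-time system is said to handle (partitioning, scheduling, failure handling, inter-machine communication).

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: for one input (document name, text) pair, emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one intermediate key."""
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical sequential driver; a real system would distribute this work."""
    # Group intermediate values by intermediate key.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Call reduce once per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Here the user supplies only `map_fn` and `reduce_fn`; the driver owns the grouping step. That separation is what would let a run-time system parallelize the same two functions across many machines without changes to user code.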