MapReduce online

Authors:
Tyson Condie;Neil Conway;Peter Alvaro;Joseph M. Hellerstein;Khaled Elmeleegy;Russell Sears
Affiliations:
UC Berkeley;UC Berkeley;UC Berkeley;UC Berkeley;Yahoo! Research;Yahoo! Research
Venue:
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Year:
2010


Abstract

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
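
To make the idea of "early returns" concrete, the following is a minimal single-process sketch in Python, not HOP's implementation or API. It assumes a toy word-count job. The names `map_words`, `online_word_count`, and `snapshot_every` are hypothetical and chosen only for illustration. The reduce side consumes map output as it is produced, rather than waiting for the full map output to be materialized, and periodically emits a snapshot of the running aggregate together with the fraction of input consumed so far.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: lazily emit (word, 1) pairs instead of materializing them."""
    for line in lines:
        for word in line.split():
            yield word, 1


def online_word_count(
    lines: List[str], snapshot_every: int
) -> Iterator[Tuple[float, Counter]]:
    """Reduce step that folds in map output as it arrives.

    Yields (fraction_of_input_lines_consumed, copy_of_running_counts)
    after every `snapshot_every` input lines and once more at the end.
    `snapshot_every` is a hypothetical knob for this sketch only.
    """
    counts: Counter = Counter()
    total = len(lines)
    for i, line in enumerate(lines, start=1):
        # Pipelining: each pair is reduced as soon as the mapper emits it.
        for word, one in map_words([line]):
            counts[word] += one
        if i % snapshot_every == 0 or i == total:
            # Copy so that earlier snapshots are not mutated by later input.
            yield i / total, Counter(counts)


if __name__ == "__main__":
    data = ["a b a", "b c", "a"]
    for progress, snapshot in online_word_count(data, snapshot_every=2):
        print(f"{progress:.0%} of input seen: {dict(snapshot)}")
```

Each intermediate snapshot is only an estimate based on the input seen so far; the final snapshot equals the result a batch run would produce. A real system would also have to handle partitioning, network transfer, and fault tolerance, none of which this sketch attempts.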