Online aggregation and continuous query support in MapReduce

Authors:
Tyson Condie;Neil Conway;Peter Alvaro;Joseph M. Hellerstein;John Gerth;Justin Talbot;Khaled Elmeleegy;Russell Sears
Affiliations:
University of California at Berkeley, Berkeley, CA, USA;University of California at Berkeley, Berkeley, CA, USA;University of California at Berkeley, Berkeley, CA, USA;University of California at Berkeley, Berkeley, CA, USA;Stanford University, Stanford, CA, USA;Stanford University, Stanford, CA, USA;Yahoo! Research, Sunnyvale, CA, USA;Yahoo! Research, Sunnyvale, CA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Citing 10

Online aggregation
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data

Eddies: continuously adaptive query processing
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data

Interactive Data Analysis: The Control Project
Computer

Interpreting the data: Parallel analysis with Sawzall
Scientific Programming - Dynamic Grids and Worldwide Computing

MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Detecting large-scale system problems by mining console logs
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles

Protovis: A Graphical Toolkit for Visualization
IEEE Transactions on Visualization and Computer Graphics

Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment

MapReduce online
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation

Cited 11

Building wavelet histograms on large data in MapReduce
Proceedings of the VLDB Endowment

Improving online aggregation performance for skewed data distribution
DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part I

Cost models for view materialization in the cloud
Proceedings of the 2012 Joint EDBT/ICDT Workshops

Early accurate results for advanced analytics on MapReduce
Proceedings of the VLDB Endowment

The only constant is change: incorporating time-varying network reservations in data centers
Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication

The only constant is change: incorporating time-varying network reservations in data centers
ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12

You can stop early with COLA: online processing of aggregate queries in the cloud
Proceedings of the 21st ACM international conference on Information and knowledge management

Processing online aggregation on skewed data in mapreduce
Proceedings of the fifth international workshop on Cloud data management

The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)

Parallel computation of skyline and reverse skyline queries using mapreduce
Proceedings of the VLDB Endowment

ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms
Data & Knowledge Engineering

Abstract

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, the output of each MapReduce task and job is materialized to disk before it is consumed. In this demonstration, we describe a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We demonstrate a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop, and can run unmodified user-defined MapReduce programs.
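
The idea of online aggregation described in the abstract can be illustrated with a minimal, single-process sketch. It is not HOP's implementation or API: the function names (map_words, online_word_count) and the snapshot_every parameter are hypothetical, and a plain Python loop stands in for pipelined map tasks. The sketch only shows the general pattern of a reducer consuming map output as it is produced and publishing partial results, each tagged with the fraction of input processed so far.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical): emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1


def online_word_count(
    lines: List[str], snapshot_every: int
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Fold map output into running counts as it arrives.

    Yields (progress, partial_counts) snapshots instead of waiting for
    all input to be processed. snapshot_every is an assumed knob that
    controls how many input lines pass between snapshots.
    """
    counts: Dict[str, int] = defaultdict(int)
    total = len(lines)
    for i, line in enumerate(lines, start=1):
        # Pipelining: each map output record is reduced immediately.
        for word, n in map_words(line):
            counts[word] += n
        # Publish an "early return" periodically and at the very end.
        if i % snapshot_every == 0 or i == total:
            yield i / total, dict(counts)


lines = ["a b a", "b c", "a c c", "a"]
snapshots = list(online_word_count(lines, snapshot_every=2))
for progress, partial in snapshots:
    print(f"{progress:.0%} of input processed: {partial}")
```

The last snapshot, taken at 100% progress, equals the result a batch job would produce; earlier snapshots are estimates that a user can inspect while the job is still running.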