Comet: batched stream processing for data intensive distributed computing

Authors:
Bingsheng He;Mao Yang;Zhenyu Guo;Rishan Chen;Bing Su;Wei Lin;Lidong Zhou
Affiliations:
Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Beijing University, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 1st ACM symposium on Cloud computing
Year:
2010

Citing 28
Cited 22

Efficiently updating materialized views

SIGMOD '86 Proceedings of the 1986 ACM SIGMOD international conference on Management of data
Multiple-query optimization

ACM Transactions on Database Systems (TODS)
Parallel database systems: the future of high performance database systems

Communications of the ACM
Continuous queries over append-only databases

SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Query evaluation techniques for large databases

ACM Computing Surveys (CSUR)
A categorized bibliography on incremental computation

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Efficient and extensible algorithms for multi query optimization

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
NiagaraCQ: a scalable continuous query system for Internet databases

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Models and issues in data stream systems

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Access path selection in a relational database management system

SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Prototyping Bubba, A Highly Parallel Database System

IEEE Transactions on Knowledge and Data Engineering
The Gamma Database Machine Project

IEEE Transactions on Knowledge and Data Engineering
Automated Selection of Materialized Views and Indexes in SQL Databases

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Redbrick Vista: Aggregate Computation and Management

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
Efficient exploitation of similar subexpressions for query processing

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Cooperative scans: dynamic bandwidth sharing in a DBMS

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Automatic optimization of parallel dataflow programs

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Scheduling shared scans of large data files

Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Parallel Evaluation of Composite Aggregate Queries

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
DryadInc: reusing work in large-scale computations

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Wave computing in the cloud

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Nectar: automatic management of data and computation in datacenters

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Nova: continuous Pig/Hadoop workflows

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
The case for being lazy: how to leverage lazy evaluation in MapReduce

Proceedings of the 2nd international workshop on Scientific cloud computing
In-situ MapReduce for log processing

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Incoop: MapReduce for incremental computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
PrIter: a distributed framework for prioritized iterative computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
Geostreaming in cloud

Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoStreaming
An overview of CMPI: network performance aware MPI in the cloud

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Kineograph: taking the pulse of a fast-changing and connected world

Proceedings of the 7th ACM european conference on Computer Systems
In-situ MapReduce for log processing

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Shredder: GPU-accelerated incremental storage and computation

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Optimizing data shuffling in data-parallel computation by understanding user-defined functions

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Maestro: Replica-Aware Map Scheduling for MapReduce

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
The seven deadly sins of cloud computing research

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Facilitating real-time graph mining

Proceedings of the fourth international workshop on Cloud data management
Improving large graph processing on partitioned graphs in the cloud

Proceedings of the Third ACM Symposium on Cloud Computing
Streaming big data with self-adjusting computation

DDFP '13 Proceedings of the 2013 workshop on Data driven functional programming
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Discretized streams: fault-tolerant streaming computation at scale

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
MROrder: flexible job ordering optimization for online mapreduce workloads

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Batched stream processing is a new distributed data processing paradigm that models recurring batch computations on incrementally bulk-appended data streams. The model is inspired by our empirical study on a trace from a large-scale production data-processing cluster; it allows a set of effective query optimizations that are not possible in a traditional batch processing model. We have developed a query processing system called Comet that embraces batched stream processing and integrates with DryadLINQ. We used two complementary methods to evaluate the effectiveness of optimizations that Comet enables. First, a prototype system deployed on a 40-node cluster shows an I/O reduction of over 40% using our benchmark. Second, when applied to a real production trace covering over 19 million machine-hours, our simulator shows an estimated I/O saving of over 50%.