A code generation approach to optimizing high-performance distributed data stream processing

Authors:
Buğra Gedik; Henrique Andrade; Kun-Lung Wu
Affiliations:
IBM Research, Hawthorne, NY, USA; IBM Research, Hawthorne, NY, USA; IBM Research, Hawthorne, NY, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

We present a code-generation-based optimization approach for bringing performance and scalability to distributed stream processing applications. We express these applications in SPADE, an operator-based, stream-centric language that supports composing distributed data flow graphs from toolkits of type-generic operators. A major challenge in building such applications is finding an effective and flexible way to map the logical graph of operators onto a physical graph that can be deployed on a set of distributed nodes. This involves deciding how best to map operators to processes and processes to computing nodes. In this paper, we take a two-stage optimization approach. In the first stage, the SPADE compiler generates an instrumented version of the application, which is used to profile the operators and collect statistics about their processing and communication characteristics. In the second stage, the profiling information is fed to an optimizer, which produces a physical data flow graph that is deployable across the nodes of a computing cluster. This approach not only creates highly optimized applications tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target an application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications from diverse domains such as finance and radio astronomy, we demonstrate the effectiveness of our approach on System S, a large-scale, distributed stream processing platform.
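
The mapping problem described in the abstract can be illustrated with a small Python sketch. It is not the optimizer from the paper: it assumes a hypothetical `Profile` record standing in for the stage-one statistics, fuses operators into processes with a simple greedy heuristic (busiest streams first, subject to a CPU budget), and then places processes on nodes with first-fit decreasing bin packing. All names, the cost model, and the sample numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stage-one output (illustrative, not SPADE's format)."""
    cpu: dict      # operator name -> fraction of one core it consumes
    traffic: dict  # (producer, consumer) -> tuples per second on that stream


def fuse_operators(profile, max_load=1.0):
    """Greedily fuse operators connected by the busiest streams.

    Two groups are merged only if their combined CPU load fits in max_load.
    Returns a dict: process (frozenset of operator names) -> total CPU load.
    """
    group_of = {op: frozenset([op]) for op in profile.cpu}
    load = {group: profile.cpu[op] for op, group in group_of.items()}
    busiest_first = sorted(profile.traffic.items(), key=lambda kv: -kv[1])
    for (src, dst), _rate in busiest_first:
        a, b = group_of[src], group_of[dst]
        if a == b or load[a] + load[b] > max_load:
            continue  # already fused, or the fused process would be overloaded
        merged = a | b
        load[merged] = load.pop(a) + load.pop(b)
        for op in merged:
            group_of[op] = merged
    return load


def place_processes(process_load, num_nodes, node_capacity=1.0):
    """Assign processes to nodes using first-fit decreasing bin packing."""
    free = [node_capacity] * num_nodes
    placement = {}
    heaviest_first = sorted(
        process_load.items(), key=lambda kv: (-kv[1], sorted(kv[0]))
    )
    for process, cpu in heaviest_first:
        node = next((i for i, f in enumerate(free) if f >= cpu - 1e-9), None)
        if node is None:
            raise ValueError("not enough node capacity for all processes")
        free[node] -= cpu
        placement[process] = node
    return placement


# Made-up profile for a four-operator pipeline (assumed numbers).
profile = Profile(
    cpu={"source": 0.2, "parse": 0.5, "aggregate": 0.4, "sink": 0.1},
    traffic={
        ("source", "parse"): 1000,
        ("parse", "aggregate"): 800,
        ("aggregate", "sink"): 50,
    },
)
processes = fuse_operators(profile)
placement = place_processes(processes, num_nodes=2)
for process, node in placement.items():
    print(sorted(process), "-> node", node)
```

Re-targeting to different hardware in this sketch amounts to calling the two functions again with a different `max_load`, `num_nodes`, or `node_capacity`, which mirrors the idea of repeating only the optimization step while leaving the logical graph unchanged.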