Parallelizing stateful operators in a distributed stream processing system: how, should you and how much?

Authors:
Sai Wu;Vibhore Kumar;Kun-Lung Wu;Beng Chin Ooi
Affiliations:
Zhejiang University, Hangzhou, P. R. China;Thomas J. Watson Research Center, IBM Research, Hawthorne, NY;Thomas J. Watson Research Center, IBM Research, Hawthorne, NY;National University of Singapore, Singapore
Venue:
Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems
Year:
2012

Abstract

We consider a distributed stream processing application, expressed as a data-flow graph with operators as vertices connected by streams and deployed over a cluster of compute nodes, in which a small subset of the operators is often the performance bottleneck for the entire application. When a bottleneck operator is stateless, parallelization clearly helps: splitting the incoming stream among multiple parallel operators deployed on different nodes can improve performance. The benefit is less obvious when the bottleneck operator is stateful. In that case, parallelization is much more challenging because it often requires a state-sharing mechanism for the parallel operators. It also incurs additional overhead, since the parallel operators must access the shared state and use synchronization constructs. In this paper, we propose a parallelization framework for stateful stream processing operators. The framework not only addresses the system model and the support needed for operator parallelization, but also develops a theoretical model of when parallelization is suitable and what the optimal degree of parallelism is. We have implemented and evaluated our framework in the context of IBM's System S distributed stream processing middleware. Microbenchmarks validate the proposed theoretical model, and a parallelized implementation of a moving KNN application is used for evaluation.
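The contrast the abstract draws between stateless and stateful bottleneck operators can be illustrated with a minimal Python sketch. It is not the paper's framework and does not use any System S API: the names `split_round_robin`, `stateless_op`, `SharedCounts`, `run_stateless`, and `run_stateful` are hypothetical, threads stand in for operators on different nodes, and a lock-guarded dictionary stands in for a state-sharing mechanism. The stateless replicas never coordinate, while every update by a stateful replica goes through the shared lock, which is the kind of access and synchronization overhead the abstract refers to.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(stream, n):
    """Split a list of tuples into n sub-streams, one per replica."""
    return [stream[i::n] for i in range(n)]


def stateless_op(t):
    # Hypothetical stateless operator: output depends only on the input tuple.
    return t * 2


class SharedCounts:
    """Hypothetical shared state: per-key counts guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def increment(self, key):
        # Every replica must take the lock; this is the synchronization cost.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1

    def snapshot(self):
        with self._lock:
            return dict(self._counts)


def run_stateless(stream, n):
    """Run n independent replicas; no coordination is needed."""
    parts = split_round_robin(stream, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda part: [stateless_op(t) for t in part], parts))
    # Sort because splitting does not preserve the original tuple order.
    return sorted(x for part in results for x in part)


def run_stateful(stream, n, state):
    """Run n replicas that all update the same shared state."""
    parts = split_round_robin(stream, n)

    def replica(part):
        for key in part:
            state.increment(key)

    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(replica, parts))  # consume to wait for all replicas
    return state.snapshot()


if __name__ == "__main__":
    print(run_stateless([1, 2, 3, 4, 5], 2))
    print(run_stateful(["a", "b", "a", "c", "a"], 3, SharedCounts()))
```

In this sketch the stateful result is correct for any number of replicas only because each update is serialized through the lock; how much that serialization costs relative to the work saved by splitting the stream is the question the abstract's theoretical model is said to address.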