Integrating scale out and fault tolerance in stream processing using operator state management

Authors:
Raul Castro Fernandez;Matteo Migliavacca;Evangelia Kalyvianaki;Peter Pietzuch
Affiliations:
Imperial College London, London, United Kingdom;University of Kent, Canterbury, United Kingdom;Imperial College London, London, United Kingdom;Imperial College London, London, United Kingdom
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Abstract

As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) because failures are common in deployments on hundreds of VMs, they must be fault-tolerant, with fast recovery times yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on these primitives, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
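
The checkpoint, partition, and restore-and-replay steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CountingOperator`, the `Checkpoint` record, the functions `restore` and `partition`, the per-key counting state, and the CRC32 hash partitioning are all assumptions chosen only to make the idea concrete.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Checkpoint:
    """Hypothetical snapshot of externalised operator state."""
    state: Dict[str, int]
    last_ts: int  # timestamp of the last tuple reflected in `state`


class CountingOperator:
    """Hypothetical stateful operator that counts tuples per key."""

    def __init__(self, state=None, last_ts=-1):
        self.state = dict(state or {})
        self.last_ts = last_ts

    def process(self, ts: int, key: str) -> None:
        # Tuples already reflected in the state are skipped, so replaying
        # a buffer that overlaps the checkpoint does not double count.
        if ts <= self.last_ts:
            return
        self.state[key] = self.state.get(key, 0) + 1
        self.last_ts = ts

    def checkpoint(self) -> Checkpoint:
        # Copy the state so later processing does not mutate the snapshot.
        return Checkpoint(dict(self.state), self.last_ts)


def restore(cp: Checkpoint, buffered: List[Tuple[int, str]]) -> CountingOperator:
    """Recover on a 'new VM': load the checkpoint, then replay buffered tuples."""
    op = CountingOperator(cp.state, cp.last_ts)
    for ts, key in buffered:
        op.process(ts, key)
    return op


def partition(cp: Checkpoint, n: int) -> List[Checkpoint]:
    """Scale out: split checkpointed state by key hash into n partitions."""
    parts: List[Dict[str, int]] = [{} for _ in range(n)]
    for key, value in cp.state.items():
        # CRC32 is used because it is stable across processes (assumption:
        # any deterministic hash of the key would serve the same purpose).
        parts[zlib.crc32(key.encode()) % n][key] = value
    return [Checkpoint(p, cp.last_ts) for p in parts]
```

In this sketch the upstream buffer is a plain list of `(timestamp, key)` tuples, and the timestamp stored in the checkpoint decides which buffered tuples still need processing after a failure. Each partition returned by `partition` could seed a new `CountingOperator`, provided incoming tuples are routed by the same hash.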