Toward Predictive Failure Management for Distributed Stream Processing Systems

Authors:
Xiaohui Gu;Spiros Papadimitriou;Philip S. Yu;Shu-Ping Chang
Venue:
Proceedings of the 28th International Conference on Distributed Computing Systems (ICDCS '08)
Year:
2008

Cited by:

Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Self-correlating predictive information tracking for large-scale production systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing

Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing

Temporal data mining approaches for sustainable chiller management in data centers
ACM Transactions on Intelligent Systems and Technology (TIST)

Flow: A Stream Processing System Simulator
PADS '10 Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation

Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing

Towards flexible exascale stream processing system simulation
Simulation

Abstract

Distributed stream processing systems (DSPSs) have many important applications, such as sensor data analysis, network security, and business intelligence. Failure management is essential for DSPSs, which often require highly available system operation. In this paper, we explore a new predictive failure management approach that employs online failure prediction to achieve more efficient failure management than previous reactive or proactive approaches. We employ lightweight stream-based classification methods to perform online failure forecasting. Based on the prediction results, the system can apply differentiated failure prevention actions to abnormal components only. Our failure prediction model is tunable: it can achieve a desired tradeoff between failure penalty reduction and prevention cost based on a user-defined reward function. To achieve low-overhead online learning, we propose adaptive data stream sampling schemes that adjust measurement sampling rates based on the states of monitored components and that maintain a limited amount of historical training data using reservoir sampling. We have implemented an initial prototype of the predictive failure management framework within the IBM System S distributed stream processing system. Experimental results show that our system achieves more efficient failure management than conventional reactive and proactive approaches while imposing low overhead on the DSPS.
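The two sampling ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not say which reservoir variant or which rate-adjustment policy the authors use, so the code below assumes the classic Algorithm R for the reservoir and a deliberately simple two-level interval policy. The names `Reservoir`, `sampling_interval`, the state labels `"normal"` and `"alert"`, and the parameters `base_seconds` and `speedup` are all hypothetical and are not taken from the paper or from System S.

```python
import random


class Reservoir:
    """Fixed-size uniform random sample of a stream (classic Algorithm R).

    Assumption: the paper only says "reservoir sampling"; this is the
    textbook variant, not necessarily the one the authors implemented.
    """

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []   # bounded store of training measurements
        self.seen = 0     # total number of measurements offered so far
        self._rng = random.Random(seed)

    def add(self, item):
        """Offer one measurement; memory use never exceeds `capacity`."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Draw a slot index in [0, seen). The new item is kept only if the
        # index falls inside the reservoir, i.e. with probability
        # capacity / seen, and then overwrites the item in that slot.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


def sampling_interval(state, base_seconds=10.0, speedup=5.0):
    """Hypothetical policy: measure suspect components more often.

    A component in the "normal" state is sampled every `base_seconds`;
    any other state shortens the interval by the factor `speedup`.
    """
    if state == "normal":
        return base_seconds
    return base_seconds / speedup
```

For example, feeding 1,000 measurements into `Reservoir(5)` leaves exactly five of them in `items`, each retained with equal probability, while `sampling_interval("alert")` returns a shorter interval than `sampling_interval("normal")`. A real policy would presumably use more than two states and tie the rates to the classifier's output; the sketch only shows the shape of the idea.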