Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

Authors:
Dawei Sun;Guiran Chang;Changsheng Miao;Xingwei Wang
Affiliations:
School of Information Science and Engineering, Northeastern University, Shenyang, P.R. China 110819 and Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China 1000 ...;Computing Center, Northeastern University, Shenyang, P.R. China 110819;School of Information Science and Engineering, Northeastern University, Shenyang, P.R. China 110819;School of Information Science and Engineering, Northeastern University, Shenyang, P.R. China 110819
Venue:
The Journal of Supercomputing
Year:
2013


Abstract

Failures are normal rather than exceptional in cloud computing environments. Because fault tolerance plays a key role in ensuring cloud serviceability, insufficient fault tolerance is one of the major obstacles to opening up a new era of high-serviceability cloud computing. Fault-tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet a high level of cloud SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed by referring to the fault tolerance theories suitable for large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two fault tolerance strategies, namely checkpointing and data replication; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results conclusively demonstrate that DAFT has high potential, as it provides efficient fault tolerance enhancements, significant improvement in cloud serviceability, and strong SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.
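The abstract does not give DAFT's equations, so the following is only an illustrative sketch of how failure rates can relate to the two strategies it names. It is not the authors' model. It assumes two textbook approximations instead: Young's first-order formula for the checkpoint interval, and the availability of n independent replicas. All function and parameter names are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumption (not from the paper): interval = sqrt(2 * C * MTBF), where C is
    the time to write one checkpoint and MTBF is the mean time between failures.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def replicas_needed(copy_availability: float, target_availability: float) -> int:
    """Smallest n such that 1 - (1 - a)^n >= target.

    Assumption (not from the paper): replicas fail independently, each with
    availability a strictly between 0 and 1.
    """
    if not (0.0 < copy_availability < 1.0):
        raise ValueError("copy availability must be in (0, 1)")
    if not (0.0 < target_availability < 1.0):
        raise ValueError("target availability must be in (0, 1)")
    n = 1
    # Add replicas until the probability that at least one is up meets the target.
    while 1.0 - (1.0 - copy_availability) ** n < target_availability:
        n += 1
    return n


if __name__ == "__main__":
    # A higher failure rate (shorter MTBF) shortens the checkpoint interval.
    for mtbf in (10_000.0, 2_500.0):
        print(mtbf, young_checkpoint_interval(50.0, mtbf))
    # A lower per-copy availability raises the replica count.
    for a in (0.9, 0.5):
        print(a, replicas_needed(a, 0.9995))
```

Under these assumptions, a rising failure rate shortens the checkpoint interval and increases the replica count. This is the kind of dependency an adaptive strategy would need to track at run time, although the paper's own models may take a different form.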