A survey on reliability in distributed systems

Authors:
Waseem Ahmed; Yong Wei Wu
Venue:
Journal of Computer and System Sciences
Year:
2013

Abstract

Software reliability in distributed systems has always been a major concern for all stakeholders, especially application vendors and their users. Various models have been produced to assess or predict the reliability of large-scale distributed applications, including e-government, e-commerce, multimedia services, and end-to-end automotive solutions, but reliability issues with these systems still exist. Ensuring a distributed system's reliability in turn requires examining the reliability of each individual component or factor involved in enterprise distributed applications before predicting or assessing the reliability of the whole system, and implementing transparent fault detection and fault recovery schemes to provide seamless interaction to end users. For this reason, we analyze in detail existing reliability methodologies from the viewpoint of examining the reliability of individual components, and explain why a comprehensive reliability model is still needed for applications running in distributed systems. In this paper we give a detailed technical overview, in four parts, of research done in recent years on analyzing and predicting the reliability of large-scale distributed applications. We first describe some pragmatic requirements for highly reliable systems and highlight the significance and various issues of reliability in different computing environments such as Cloud Computing, Grid Computing, and Service Oriented Architecture. We then elucidate possible factors and challenges that are nontrivial for highly reliable distributed systems, including fault detection, recovery, and removal through testing or various replication techniques. Next, we scrutinize various research models that synthesize significant solutions for tackling these factors and challenges in predicting as well as measuring the reliability of software applications in distributed systems. Finally, we discuss the limitations of existing models and propose future work on predicting and analyzing the reliability of distributed applications in real environments in light of our analysis.