Error Scope on a Computational Grid: Theory and Practice

Authors:
Douglas Thain; Miron Livny
Venue:
HPDC '02 Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing
Year:
2002



Abstract

Error propagation is a central problem in grid computing. We re-learned this while adding a Java feature to the Condor computational grid. Our initial experience with the system was negative, due to the large number of new ways in which the system could fail. To reason about this problem, we developed a theory of error propagation. Central to our theory is the concept of an error's scope, defined as the portion of a system that it invalidates. With this theory in hand, we recognized that the expanded system did not properly consider the scope of errors it discovered. We modified the system according to our theory, and succeeded in making it a more robust platform for distributed computing.
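The idea of an error's scope can be illustrated with a small sketch. This is not the Condor implementation; it is one possible reading of the abstract, in which each error carries the portion of the system it invalidates and is passed outward until it reaches a layer that manages that portion. The scope names, the layer names, and the `route` function are all hypothetical and chosen only for illustration.

```python
from enum import Enum, auto


class Scope(Enum):
    """Hypothetical scopes: the portion of the system an error invalidates."""
    PROGRAM = auto()   # only the running program's own result is affected
    MACHINE = auto()   # the execution machine cannot be used for this job
    JOB = auto()       # the job as submitted cannot run anywhere


class ScopedError(Exception):
    """An error annotated with the scope it invalidates."""

    def __init__(self, message: str, scope: Scope):
        super().__init__(message)
        self.scope = scope


# Hypothetical layers, innermost first, each with the scopes it manages.
LAYERS = [
    ("wrapper", {Scope.PROGRAM}),
    ("scheduler", {Scope.MACHINE}),
    ("user", {Scope.JOB}),
]


def route(error: ScopedError) -> str:
    """Return the name of the first layer, innermost first, that manages
    the error's scope. Re-raise the error if no layer claims it, so that
    it is never silently absorbed by the wrong layer."""
    for name, managed in LAYERS:
        if error.scope in managed:
            return name
    raise error


if __name__ == "__main__":
    err = ScopedError("execution machine is misconfigured", Scope.MACHINE)
    print(route(err))
```

In this sketch an error whose scope is the machine skips the innermost layer and stops at the layer assumed to manage machines, rather than being reported as if the program itself had failed.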