Error propagation is a central problem in grid computing. We re-learned this while adding a Java feature to the Condor computational grid. Our initial experience with the system was negative, due to the large number of new ways in which the system could fail. To reason about this problem, we developed a theory of error propagation. Central to our theory is the concept of an error's scope, defined as the portion of a system that it invalidates. With this theory in hand, we recognized that the expanded system did not properly consider the scope of errors it discovered. We modified the system according to our theory, and succeeded in making it a more robust platform for distributed computing.